Science.gov

Sample records for algorithm random forests

  1. Fault detection of aircraft system with random forest algorithm and similarity measure.

    PubMed

    Lee, Sanghyuk; Park, Wookje; Jung, Sikhang

    2014-01-01

    A fault detection algorithm was developed using a similarity measure and the random forest algorithm. The resulting algorithm was applied to an unmanned aerial vehicle (UAV) prepared by the authors. The similarity measure was designed with the help of distance information, and its usefulness was verified by proof. The fault decision was carried out by calculating a weighted similarity measure, using twelve available coefficients from the healthy-status and faulty-status data groups. The weights of the similarity measure were obtained through the random forest algorithm (RFA), which provides a priority ranking of the data. To obtain a fast decision response, a limited number of coefficients was also considered, and the relation between detection rate and amount of feature data was analyzed and illustrated. The useful amount of data was determined by repeated trials of the similarity calculation. PMID:25057508
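
    As a hedged illustration of the weighting idea, the sketch below uses scikit-learn feature importances as "data priority" weights in a distance-based similarity; the data, the 12-coefficient layout, and the similarity form are stand-ins, not the paper's:

```python
# Hypothetical sketch: weighting a similarity measure by random forest
# feature importances (not the authors' code; data is synthetic).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_healthy = rng.normal(0.0, 1.0, size=(200, 12))   # 12 coefficients, healthy
X_faulty = rng.normal(0.5, 1.2, size=(200, 12))    # 12 coefficients, faulty
X = np.vstack([X_healthy, X_faulty])
y = np.array([0] * 200 + [1] * 200)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
w = rf.feature_importances_                         # data-priority weights

def weighted_similarity(a, b, w):
    """Distance-based similarity, weighted per coefficient."""
    return np.exp(-np.sum(w * np.abs(a - b)))

print(weighted_similarity(X_healthy.mean(axis=0), X_faulty.mean(axis=0), w))
```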

  2. Tissue segmentation of computed tomography images using a Random Forest algorithm: a feasibility study.

    PubMed

    Polan, Daniel F; Brady, Samuel L; Kaufman, Robert A

    2016-09-01

    There is a need for robust, fully automated whole-body organ segmentation for diagnostic CT. This study investigates and optimizes a Random Forest algorithm for automated organ segmentation; explores the limitations of a Random Forest algorithm applied to the CT environment; and demonstrates segmentation accuracy in a feasibility study of pediatric and adult patients. To the best of our knowledge, this is the first study to investigate a trainable Weka segmentation (TWS) implementation using Random Forest machine learning as a means to develop a fully automated tissue segmentation tool designed specifically for pediatric and adult examinations in a diagnostic CT environment. Current innovation in computed tomography (CT) is focused on radiomics, patient-specific radiation dose calculation, and image quality improvement using iterative reconstruction, all of which require specific knowledge of tissue and organ systems within a CT image. The purpose of this study was to develop a fully automated Random Forest classifier algorithm for segmentation of neck-chest-abdomen-pelvis CT examinations based on pediatric and adult CT protocols. Seven materials were classified: background, lung/internal air or gas, fat, muscle, solid organ parenchyma, blood/contrast-enhanced fluid, and bone tissue, using MATLAB and the TWS plugin of FIJI. The following classifier feature filters of TWS were investigated: minimum, maximum, mean, and variance, each evaluated over a voxel radius of 2^n (n from 0 to 4), along with noise-reduction and edge-preserving filters: Gaussian, bilateral, Kuwahara, and anisotropic diffusion. The Random Forest algorithm used 200 trees with 2 features randomly selected per node. The optimized auto-segmentation algorithm resulted in 16 image features, including features derived from the maximum, mean, variance, Gaussian, and Kuwahara filters. Dice similarity coefficient (DSC) calculations between manually segmented and Random Forest algorithm segmented images from 21 …
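
    A minimal sketch of the filter-feature approach described above, assuming scikit-learn and SciPy; it mimics TWS-style per-voxel features (mean, variance, Gaussian) on a synthetic slice rather than the authors' FIJI/MATLAB pipeline:

```python
# Illustrative voxel-wise tissue classification from filter features,
# in the spirit of Trainable Weka Segmentation (not the authors' pipeline).
import numpy as np
from scipy import ndimage
from sklearn.ensemble import RandomForestClassifier

def feature_stack(img, radii=(1, 2, 4)):
    """Per-pixel features: raw intensity plus mean/variance/Gaussian filters."""
    feats = [img]
    for r in radii:
        feats.append(ndimage.uniform_filter(img, size=2 * r + 1))   # mean
        mean_sq = ndimage.uniform_filter(img ** 2, size=2 * r + 1)
        feats.append(mean_sq - feats[-1] ** 2)                      # variance
        feats.append(ndimage.gaussian_filter(img, sigma=r))         # Gaussian
    return np.stack([f.ravel() for f in feats], axis=1)

img = np.random.rand(64, 64)                 # stand-in for a CT slice
labels = (img > 0.5).astype(int).ravel()     # stand-in for manual labels

# 200 trees, 2 features tried per node, matching the abstract's settings.
rf = RandomForestClassifier(n_estimators=200, max_features=2, random_state=0)
rf.fit(feature_stack(img), labels)
segmented = rf.predict(feature_stack(img)).reshape(img.shape)
```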

  3. High-resolution climate data over conterminous US using random forest algorithm

    NASA Astrophysics Data System (ADS)

    Hashimoto, H.; Nemani, R. R.; Wang, W.

    2014-12-01

    We developed a new methodology to create high-resolution precipitation data using the random forest algorithm. Two approaches have been used conventionally: physical downscaling from GCM data using a regional climate model, and interpolation from ground observation data. The physical downscaling method can be applied only to a small region because it is computationally expensive and complex to deploy. On the other hand, interpolation schemes from ground observations do not consider physical processes. In this study, we utilized the random forest algorithm to integrate atmospheric reanalysis data, satellite data, topography data, and ground observation data. First we considered situations where precipitation is the same across the domain, largely dominated by storm-like systems. We then picked several points to train the random forest algorithm. The random forest algorithm estimates out-of-bag errors spatially and produces the relative importance of each input variable. This methodology has the following advantages. (1) It can ingest any spatial dataset to improve downscaling; even non-precipitation datasets can be ingested, such as satellite cloud cover data, radar reflectivity images, or modeled convective available potential energy. (2) It is purely statistical, so physical assumptions are not required, whereas most interpolation schemes assume an empirical relationship between precipitation and elevation for orographic precipitation. (3) Low-quality values in the ingested data do not cause critical bias in the results because of the ensemble nature of random forests; users therefore do not need to pay special attention to quality control of input data compared to other interpolation methodologies. (4) The same methodology can be applied to produce other high-resolution climate datasets, such as wind and cloud cover, variables that are usually hard to interpolate with conventional algorithms. In conclusion, the proposed methodology can produce reasonable …
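
    A small sketch of the core mechanics mentioned above (out-of-bag error and variable importance from a random forest regressor), with synthetic stand-ins for the reanalysis, satellite, and topography co-variates:

```python
# Illustrative sketch (not the authors' code): a RandomForestRegressor that
# relates co-variates to gauge observations and reports out-of-bag error
# and variable importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 500                                            # hypothetical gauge points
covariates = {
    "reanalysis_precip": rng.gamma(2.0, 2.0, n),
    "cloud_cover": rng.uniform(0, 1, n),
    "elevation_m": rng.uniform(0, 3000, n),
}
X = np.column_stack(list(covariates.values()))
y = covariates["reanalysis_precip"] * (1 + 2e-4 * covariates["elevation_m"]) \
    + rng.normal(0, 0.5, n)                        # synthetic "observed" precip

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=1)
rf.fit(X, y)
print("OOB R^2:", rf.oob_score_)                   # spatial skill estimate
for name, imp in zip(covariates, rf.feature_importances_):
    print(name, round(imp, 3))                     # relative importance
```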

  4. Autoclassification of the Variable 3XMM Sources Using the Random Forest Machine Learning Algorithm

    NASA Astrophysics Data System (ADS)

    Farrell, Sean A.; Murphy, Tara; Lo, Kitty K.

    2015-11-01

    In the current era of large surveys and massive data sets, autoclassification of astrophysical sources using intelligent algorithms is becoming increasingly important. In this paper we present the catalog of variable sources in the Third XMM-Newton Serendipitous Source catalog (3XMM) autoclassified using the Random Forest machine learning algorithm. We used a sample of manually classified variable sources from the second data release of the XMM-Newton catalogs (2XMMi-DR2) to train the classifier, obtaining an accuracy of ∼92%. We also evaluated the effectiveness of identifying spurious detections using a sample of spurious sources, achieving an accuracy of ∼95%. Manual investigation of a random sample of classified sources confirmed these accuracy levels and showed that the Random Forest machine learning algorithm is highly effective at automatically classifying 3XMM sources. Here we present the catalog of classified 3XMM variable sources. We also present three previously unidentified unusual sources that were flagged as outlier sources by the algorithm: a new candidate supergiant fast X-ray transient, a 400 s X-ray pulsar, and an eclipsing 5 hr binary system coincident with a known Cepheid.

  5. Combining Spectral and Texture Features Using Random Forest Algorithm: Extracting Impervious Surface Area in Wuhan

    NASA Astrophysics Data System (ADS)

    Shao, Zhenfeng; Zhang, Yuan; Zhang, Lei; Song, Yang; Peng, Minjun

    2016-06-01

    Impervious surface area (ISA) is one of the most important indicators of urban environments. At present, based on multi-resolution remote sensing images, numerous approaches have been proposed to extract impervious surfaces, using statistical estimation, sub-pixel classification, and spectral mixture analysis. Through these methods, impervious surface estimates can be effectively applied to regional-scale planning and management. For large regions, however, high-resolution remote sensing images provide more detail and are therefore more useful for environmental monitoring and urban management. Since the purpose of this study is to map impervious surfaces more effectively, three classification algorithms (random forests, decision trees, and artificial neural networks) were tested for their ability to map impervious surface. Random forests outperformed decision trees and artificial neural networks in precision. Combining spectral indices and texture features, random forest achieved a producer's accuracy of 0.98, a user's accuracy of 0.97, an overall accuracy of 0.98, and a kappa coefficient of 0.97 for impervious surface extraction.
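
    The reported metrics can be reproduced from a confusion matrix; a brief sketch with hypothetical labels, assuming scikit-learn:

```python
# Sketch of the reported accuracy metrics (producer's, user's, overall,
# kappa) computed from a confusion matrix; labels here are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])   # 1 = impervious surface
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)        # rows: truth, cols: predicted
producers = np.diag(cm) / cm.sum(axis=1)     # 1 - omission error per class
users = np.diag(cm) / cm.sum(axis=0)         # 1 - commission error per class
overall = np.diag(cm).sum() / cm.sum()
kappa = cohen_kappa_score(y_true, y_pred)    # chance-corrected agreement
print(producers, users, overall, kappa)
```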

  6. Fault diagnosis in spur gears based on genetic algorithm and random forest

    NASA Astrophysics Data System (ADS)

    Cerrada, Mariela; Zurita, Grover; Cabrera, Diego; Sánchez, René-Vinicio; Artés, Mariano; Li, Chuan

    2016-03-01

    There are growing demands for condition-based monitoring of gearboxes, and therefore new methods to improve the reliability, effectiveness, and accuracy of gear fault detection ought to be evaluated. Feature selection is still an important aspect of machine learning-based diagnosis for reaching good performance of the diagnostic models. On the other hand, random forest classifiers are suitable models for industrial environments where large data samples are not usually available for training such diagnostic models. The main aim of this research is to build a robust system for multi-class fault diagnosis in spur gears by selecting the best set of condition parameters in the time, frequency, and time-frequency domains, extracted from vibration signals. The diagnosis is performed using genetic algorithms and a random-forest-based classifier in a supervised setting. Genetic algorithms reduced the original set of condition parameters by around 66% of its initial size while still obtaining an acceptable classification precision above 97%. The approach is tested on real vibration signals considering several fault classes, one of them an incipient fault, under different running conditions of load and velocity.
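
    A toy sketch of the GA-plus-random-forest wrapper idea, assuming scikit-learn and synthetic data; the paper's GA configuration and vibration-derived condition parameters are not reproduced:

```python
# Toy GA-driven feature selection wrapped around a random forest
# (illustrative only; binary masks encode which features are kept).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)

def fitness(mask):
    """Cross-validated accuracy of an RF trained on the masked features."""
    if not mask.any():
        return 0.0
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(rf, X[:, mask], y, cv=3).mean()

pop = rng.random((20, X.shape[1])) < 0.5            # 20 random feature masks
for gen in range(10):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]          # keep the best half
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10, size=2)]
        child = np.where(rng.random(X.shape[1]) < 0.5, a, b)  # crossover
        child ^= rng.random(X.shape[1]) < 0.02                # mutation
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```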

  7. Urban Road Detection in Airborne Laser Scanning Point Cloud Using Random Forest Algorithm

    NASA Astrophysics Data System (ADS)

    Kaczałek, B.; Borkowski, A.

    2016-06-01

    The objective of this research is to detect points that describe a road surface in an unclassified point cloud from airborne laser scanning (ALS). For this purpose we use the Random Forest learning algorithm. The proposed methodology consists of two stages: preparation of features and supervised point cloud classification. In this approach we consider only ALS points representing the last echo. For these points, RGB values, intensity, the normal vectors, their mean values, and the standard deviations are provided. Moreover, local and global height variations are taken into account as components of the feature vector. The feature vectors are calculated on the basis of a 3D Delaunay triangulation. The proposed methodology was tested on point clouds with an average point density of 12 pts/m2 representing a large urban scene. A significance level of 15% was set for the decision trees of the learning algorithm. As a result of the Random Forest classification we obtained two subsets of ALS points, one of which represents points belonging to the road network. The classification evaluation showed an overall classification accuracy of approximately 90%. Finally, the ALS points representing roads were merged and simplified into road network polylines using morphological operations.

  8. Early Seizure Detection Algorithm Based on Intracranial EEG and Random Forest Classification.

    PubMed

    Donos, Cristian; Dümpelmann, Matthias; Schulze-Bonhage, Andreas

    2015-08-01

    The goal of this study is to provide a seizure detection algorithm that is relatively simple to implement on a microcontroller, so that it can be used in an implantable closed-loop stimulation device. We propose a set of 11 simple time-domain and power-band features, computed from one intracranial EEG contact located in the seizure onset zone. The features are classified using a random forest classifier. Depending on the training datasets and the optimization preferences, the performance of the algorithm was: 93.84% mean sensitivity (100% median sensitivity), 3.03 s mean (1.75 s median) detection delay, and 0.33 mean (0.07 median) false detections per hour. PMID:26022388
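
    A hedged sketch of the pipeline shape (simple time-domain and band-power features from one channel, classified with a random forest); the feature set and synthetic signals are illustrative, not the paper's exact 11 features:

```python
# Illustrative window features + random forest for seizure detection;
# synthetic noise stands in for intracranial EEG.
import numpy as np
from scipy.signal import welch
from sklearn.ensemble import RandomForestClassifier

fs = 256                                            # assumed sampling rate, Hz

def features(window):
    f, pxx = welch(window, fs=fs, nperseg=len(window))
    def band(lo, hi):
        return pxx[(f >= lo) & (f < hi)].sum()      # band power
    return [window.mean(), window.std(), np.abs(np.diff(window)).mean(),
            band(1, 4), band(4, 8), band(8, 13), band(13, 30)]

rng = np.random.default_rng(0)
interictal = [rng.normal(0, 1, fs) for _ in range(100)]
seizure = [rng.normal(0, 3, fs) for _ in range(100)]   # higher power (toy)
X = np.array([features(w) for w in interictal + seizure])
y = np.array([0] * 100 + [1] * 100)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```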

  9. Enhanced cancer recognition system based on random forests feature elimination algorithm.

    PubMed

    Ozcift, Akin

    2012-08-01

    Accurate classifiers are vital to the design of precise computer-aided diagnosis (CADx) systems. The classification performance of machine learning algorithms is sensitive to the characteristics of the data. In this respect, determining the relevant and discriminative features is a key step for improving the performance of CADx. There are various feature selection methods in the literature, but no universal variable selection algorithm performs well in every data analysis scheme. Random Forests (RF), an ensemble of trees, has been used successfully in classification studies, and this success makes RF eligible to serve as the kernel of a wrapper feature subset evaluator. We used a best-first search RF wrapper algorithm to select optimal features of four medical datasets: colon cancer, leukemia cancer, breast cancer, and lung cancer. We compared the accuracies of 15 widely used classifiers trained with all features versus the selected features of each dataset. The experimental results demonstrated the efficiency of the proposed feature selection strategy, with increases in most of the classification accuracies of the algorithms. PMID:21567124
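
    A sketch of an RF wrapper selector, assuming scikit-learn; greedy forward selection stands in for the paper's best-first search, and the bundled breast cancer dataset stands in for the four datasets:

```python
# Wrapper feature selection with a random forest as the evaluator
# (greedy forward search as an approximation of best-first search).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

selector = SequentialFeatureSelector(rf, n_features_to_select=5, cv=3)
selector.fit(X, y)
X_sel = selector.transform(X)

print("all features :", cross_val_score(rf, X, y, cv=3).mean())
print("5 selected   :", cross_val_score(rf, X_sel, y, cv=3).mean())
```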

  10. Automatic classification of endogenous seismic sources within a landslide body using random forest algorithm

    NASA Astrophysics Data System (ADS)

    Provost, Floriane; Hibert, Clément; Malet, Jean-Philippe; Stumpf, André; Doubre, Cécile

    2016-04-01

    Different studies have shown the presence of microseismic activity in soft-rock landslides. The seismic signals exhibit significantly different features in the time and frequency domains, which allows their classification and interpretation. Most of the classes could be associated with different mechanisms of deformation occurring within and at the surface (e.g. rockfall, slide-quake, fissure opening, fluid circulation). However, some signals remain not fully understood, and some classes contain too few examples to permit interpretation. To move toward a more complete interpretation of the links between the dynamics of soft-rock landslides and the physical processes controlling their behaviour, a complete catalog of the endogenous seismicity is needed. We propose a multi-class detection method based on the random forests algorithm to automatically classify the source of seismic signals. Random forests is a supervised machine learning technique based on the computation of a large number of decision trees. The multiple decision trees are constructed from training sets in which each target class is described by a set of attributes; in the case of seismic signals, these attributes may encompass spectral features but also waveform characteristics, multi-station observations, and other relevant information. The Random Forest classifier is used because it provides state-of-the-art performance when compared with other machine learning techniques (e.g. SVM, Neural Networks) and requires no fine tuning. Furthermore, it is relatively fast, robust, easy to parallelize, and inherently suitable for multi-class problems. In this work, we present the first results of the classification method applied to the seismicity recorded at the Super-Sauze landslide between 2013 and 2015. We selected a dozen seismic signal features that characterize precisely the spectral content (e.g. central frequency, spectrum width, energy in several frequency bands, spectrogram shape, spectrum local and global maxima) …

  11. A Cascade Random Forests Algorithm for Predicting Protein-Protein Interaction Sites.

    PubMed

    Wei, Zhi-Sen; Yang, Jing-Yu; Shen, Hong-Bin; Yu, Dong-Jun

    2015-10-01

    Protein-protein interactions exist ubiquitously and play important roles in the life cycles of living cells. The interaction sites (residues) are essential to understanding the underlying mechanisms of protein-protein interactions. Previous research has demonstrated that the accurate identification of protein-protein interaction sites (PPIs) is helpful for developing new therapeutic drugs because many drugs will interact directly with those residues. Because of its significant potential in biological research and drug development, the prediction of PPIs has become an important topic in computational biology. However, a severe data imbalance exists in the PPIs prediction problem, where the number of the majority class samples (non-interacting residues) is far larger than that of the minority class samples (interacting residues). Thus, we developed a novel cascade random forests algorithm (CRF) to address the serious data imbalance that exists in the PPIs prediction problem. The proposed CRF resolves the negative effect of data imbalance by connecting multiple random forests in a cascade-like manner, each of which is trained with a balanced training subset that includes all minority samples and a subset of majority samples using an effective ensemble protocol. Based on the proposed CRF, we implemented a new sequence-based PPIs predictor, called CRF-PPI, which takes the combined features of position-specific scoring matrices, averaged cumulative hydropathy, and predicted relative solvent accessibility as model inputs. Benchmark experiments on both the cross validation and independent validation datasets demonstrated that the proposed CRF-PPI outperformed the state-of-the-art sequence-based PPIs predictors. The source code for CRF-PPI and the benchmark datasets are available online at http://csbio.njust.edu.cn/bioinf/CRF-PPI for free academic use. PMID:26441427
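
    A simplified reading of the cascade idea: several forests, each trained on all minority samples plus a fresh balanced subset of majority samples, combined here by probability averaging (the authors' ensemble protocol is not reproduced):

```python
# Illustrative cascade of random forests for imbalanced data
# (a simplified sketch of the CRF idea, not the authors' implementation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]

stages = []
for k in range(5):                                   # 5 cascaded forests
    maj_sub = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, maj_sub])        # balanced training subset
    rf = RandomForestClassifier(n_estimators=100, random_state=k)
    stages.append(rf.fit(X[idx], y[idx]))

# Combine the stages by averaging predicted probabilities.
proba = np.mean([rf.predict_proba(X)[:, 1] for rf in stages], axis=0)
pred = (proba >= 0.5).astype(int)
print("recall on minority class:", (pred[y == 1] == 1).mean())
```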

  12. Characterizing stand-level forest canopy cover and height using Landsat time series, samples of airborne LiDAR, and the Random Forest algorithm

    NASA Astrophysics Data System (ADS)

    Ahmed, Oumer S.; Franklin, Steven E.; Wulder, Michael A.; White, Joanne C.

    2015-03-01

    Many forest management activities, including the development of forest inventories, require spatially detailed forest canopy cover and height data. Among the various remote sensing technologies, LiDAR (Light Detection and Ranging) offers the most accurate and consistent means for obtaining reliable canopy structure measurements. A potential solution to reduce the cost of LiDAR data is to integrate transects (samples) of LiDAR data with frequently acquired and spatially comprehensive optical remotely sensed data. Although multiple regression is commonly used for such modeling, often it does not fully capture the complex relationships between forest structure variables. This study investigates the potential of Random Forest (RF), a machine learning technique, to estimate LiDAR-measured canopy structure using a time series of Landsat imagery. The study is implemented over a 2600 ha area of industrially managed coastal temperate forests on Vancouver Island, British Columbia, Canada. We implemented a trajectory-based approach to time series analysis that generates time since disturbance (TSD) and disturbance intensity information for each pixel, and we used this information to stratify the forest land base into two strata: mature forests and young forests. Canopy cover and height for three forest classes (i.e., mature, young, and mature and young combined) were modeled separately using multiple regression and Random Forest (RF) techniques. For all forest classes, the RF models provided improved estimates relative to the multiple regression models. The lowest validation error was obtained for the mature forest stratum in a RF model (R2 = 0.88, RMSE = 2.39 m and bias = -0.16 for canopy height; R2 = 0.72, RMSE = 0.068% and bias = -0.0049 for canopy cover). This study demonstrates the value of using disturbance and successional history to inform estimates of canopy structure and obtain improved estimates of forest canopy cover and height using the RF algorithm.

  13. [Segmentation of Winter Wheat Canopy Image Based on Visual Spectral and Random Forest Algorithm].

    PubMed

    Liu, Ya-dong; Cui, Ri-xian

    2015-12-01

    Digital image analysis has been widely used in non-destructive monitoring of crop growth and nitrogen nutrition status due to its simplicity and efficiency. It is necessary to segment the winter wheat plant from the soil background in order to assess canopy cover, the intensity levels of the visible spectrum (R, G, and B), and other color indices derived from RGB. In the present study, based on the variation in the R, G, and B components of the sRGB color space and the L*, a*, and b* components of the CIEL*a*b* color space between wheat plants and the soil background, wheat plants were segmented from the soil background using Otsu's method based on the a* component of the CIEL*a*b* color space, an RGB-based random forest method, and a CIEL*a*b*-based random forest method, respectively. The ability of each method to segment wheat plants from the soil background was evaluated by its segmentation accuracy. The results showed that all three methods segmented wheat plants from the soil background well; Otsu's method had the lowest segmentation accuracy of the three, and there was little difference in segmentation error between the two random forest methods. In conclusion, the random forest method demonstrated its capacity to segment wheat plants from the soil background using only the visible spectral information of the canopy image, without any color component combinations or color space transformations. PMID:26964234

  14. Forest Fires in a Random Forest

    NASA Astrophysics Data System (ADS)

    Leuenberger, Michael; Kanevski, Mikhaïl; Vega Orozco, Carmen D.

    2013-04-01

    Forest fires in Canton Ticino (Switzerland) are very complex phenomena. Meteorological data can explain some occurrences of fires in time, but not necessarily in space. Using anthropogenic and geographical feature data with the random forest algorithm, this study tries to highlight the factors that most influence fire ignition and to identify areas at risk. The fundamental scientific problem considered in the present research is the application of random forest algorithms to the analysis and modeling of forest fire patterns in a high-dimensional input feature space. This study focuses on the 2,224 anthropogenic forest fires among the 2,401 forest fire ignition points that occurred in Canton Ticino from 1969 to 2008. Provided by the Swiss Federal Institute for Forest, Snow and Landscape Research (WSL), the database characterizes each fire by its location (x,y coordinates of the ignition point), start date, duration, burned area, and other information such as ignition cause and topographic features (slope, aspect, altitude, etc.). In addition, the VECTOR25 database from SwissTopo was used to extract the distances between fire ignition points and anthropogenic structures such as buildings, the road network, the rail network, etc. Developed by L. Breiman and A. Cutler, the Random Forests (RF) algorithm provides an ensemble of classification and regression trees. By pseudo-random variable selection at each split node, the method grows a variety of decision trees that do not return the same results, and through a committee system returns a value with better accuracy than other machine learning methods. The algorithm also directly measures variable importance, which is used to identify the factors affecting forest fires. With this measure, several models can be fit, and predictions can be made throughout the validity domain of Canton Ticino. A comprehensive RF analysis was carried out in order to …

  15. Estimation of spatial variability in humidity, wind, and solar radiation using the random forest algorithm for the conterminous USA

    NASA Astrophysics Data System (ADS)

    Hashimoto, H.; Nemani, R. R.

    2015-12-01

    Regional-scale ecosystem modeling requires high-resolution data on surface climate variables. Spatial variability in temperature and precipitation has been well studied over the past two decades, resulting in several sophisticated algorithms. However, other surface climate variables, such as humidity, solar radiation, and wind speed, are not as readily available, even though those data are equally important for ecosystem modeling. The main reasons are the lack of governing physical equations for interpolating observations and the lack of comparable satellite observations. Therefore, scientists have been using reanalysis data, or simply interpolated data, for ecosystem modeling, though these are too coarse for regional-scale ecosystem analysis. In this study, we developed a method to spatially map daily climate variables, including humidity, solar radiation, wind, precipitation, and temperature, and applied it to the conterminous USA from 1980 to 2015. Previously, we successfully developed a precipitation interpolation method using the random forest algorithm, and here we extend it to the other variables. Because the method does not require any assumptions about physical equations, it is potentially applicable to any climate variable for which measured data are available. The method requires point data along with a host of spatial datasets: satellite data, reanalysis data, and radar data were used, and the importance of each dataset was analyzed using the random forest algorithm. The only parameter we need to adjust is the radius around the target point within which statistically meaningful relationships between observed and spatial covariate data are calculated. The radius was optimized using mean absolute error and bias. We also analyzed the temporal consistency and spatial patterns of the results. Because it is relatively easy to customize the setup depending on the user's needs, the resulting datasets may be useful for …

  16. Gray level co-occurrence and random forest algorithm-based gender determination with maxillary tooth plaster images.

    PubMed

    Akkoç, Betül; Arslan, Ahmet; Kök, Hatice

    2016-06-01

    Gender is one of the intrinsic properties of identity, and determining it improves identification performance by reducing the cluster that must be searched. Teeth have a durable and resistant structure and as such are important sources of identification in disasters (accidents, fires, etc.). In this study, gender determination is performed on maxillary tooth plaster models from 40 people (20 males and 20 females). Images of the tooth plaster models were taken with a lighting set-up. A gray-level co-occurrence matrix is formed from the segmented image, and pertinent features extracted from the matrix are classified via a Random Forest (RF) algorithm. Automatic gender determination achieved a 90% success rate, yielding an applicable system for determining gender from maxillary tooth plaster images. PMID:27104495
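
    A minimal sketch of the GLCM-plus-RF pipeline, assuming scikit-image and scikit-learn; synthetic images stand in for the tooth plaster photographs:

```python
# Hedged sketch: GLCM texture features classified with a random forest
# (synthetic images stand in for the segmented tooth plaster images).
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier

def glcm_features(img):
    """Haralick-style statistics from a gray-level co-occurrence matrix."""
    glcm = graycomatrix(img, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    return [graycoprops(glcm, p).mean()
            for p in ("contrast", "homogeneity", "energy", "correlation")]

rng = np.random.default_rng(0)
imgs_a = [rng.integers(0, 128, (64, 64), dtype=np.uint8) for _ in range(20)]
imgs_b = [rng.integers(64, 256, (64, 64), dtype=np.uint8) for _ in range(20)]
X = np.array([glcm_features(i) for i in imgs_a + imgs_b])
y = np.array([0] * 20 + [1] * 20)                  # toy male/female labels

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("training accuracy:", rf.score(X, y))
```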

  17. Land cover and land use mapping of the iSimangaliso Wetland Park, South Africa: comparison of oblique and orthogonal random forest algorithms

    NASA Astrophysics Data System (ADS)

    Bassa, Zaakirah; Bob, Urmilla; Szantoi, Zoltan; Ismail, Riyad

    2016-01-01

    In recent years, the popularity of tree-based ensemble methods for land cover classification has increased significantly. Using WorldView-2 image data, we evaluate the potential of the oblique random forest algorithm (oRF) to classify a highly heterogeneous protected area. In contrast to the random forest (RF) algorithm, the oRF algorithm builds multivariate trees by learning the optimal split using a supervised model. The oRF binary algorithm is adapted to a multiclass land cover and land use application using both the "one-against-one" and "one-against-all" combination approaches. Results show that the oRF algorithms are capable of achieving high classification accuracies (>80%). However, there was no statistical difference in classification accuracies obtained by the oRF algorithms and the more popular RF algorithm. For all the algorithms, user accuracies (UAs) and producer accuracies (PAs) >80% were recorded for most of the classes. Both the RF and oRF algorithms poorly classified the indigenous forest class as indicated by the low UAs and PAs. Finally, the results from this study advocate and support the utility of the oRF algorithm for land cover and land use mapping of protected areas using WorldView-2 image data.

  18. RANDOM FORESTS FOR PHOTOMETRIC REDSHIFTS

    SciTech Connect

    Carliles, Samuel; Szalay, Alexander S.; Budavari, Tamas; Heinis, Sebastien; Priebe, Carey

    2010-03-20

    The main challenge today in photometric redshift estimation is not in the accuracy but in understanding the uncertainties. We introduce an empirical method based on Random Forests to address these issues. The training algorithm builds a set of optimal decision trees on subsets of the available spectroscopic sample, which provide independent constraints on the redshift of each galaxy. The combined forest estimates have intriguing statistical properties, notable among which are Gaussian errors. We demonstrate the power of our approach on multi-color measurements of the Sloan Digital Sky Survey.

  19. Comparison between WorldView-2 and SPOT-5 images in mapping the bracken fern using the random forest algorithm

    NASA Astrophysics Data System (ADS)

    Odindi, John; Adam, Elhadi; Ngubane, Zinhle; Mutanga, Onisimo; Slotow, Rob

    2014-01-01

    Plant species invasion is known to be a major threat to socioeconomic and ecological systems. Due to the high cost and limited extent of urban green spaces, high mapping accuracy is necessary to optimize the management of such spaces. We compare the performance of the new-generation WorldView-2 (WV-2) and SPOT-5 images in mapping the bracken fern [Pteridium aquilinum (L.) Kuhn] in a conserved urban landscape. Using the random forest algorithm, grid-search approaches based on the out-of-bag estimate of error were used to determine the optimal ntree and mtry combinations. Variable importance and backward feature elimination techniques were further used to determine the influence of the image bands on mapping accuracy. Additionally, the value of the commonly used vegetation indices in enhancing the classification accuracy was tested on the better-performing image data. Results show that the performance of the new WV-2 bands was better than that of the traditional bands. Overall classification accuracies of 84.72 and 72.22% were achieved for the WV-2 and SPOT images, respectively. Use of selected indices from the WV-2 bands increased the overall classification accuracy to 91.67%. The findings of this study show the suitability of the new-generation sensor for mapping the bracken fern within often vulnerable urban natural vegetation cover types.
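
    A sketch of an out-of-bag grid search over ntree and mtry (n_estimators and max_features in scikit-learn terms), on synthetic data:

```python
# Grid search over forest size and per-node feature count using the
# out-of-bag error, as described above (data here is synthetic).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           random_state=0)

best = None
for ntree in (100, 250, 500):
    for mtry in (1, 2, 4, 8):
        rf = RandomForestClassifier(n_estimators=ntree, max_features=mtry,
                                    oob_score=True, random_state=0).fit(X, y)
        oob_error = 1.0 - rf.oob_score_          # internal error estimate
        if best is None or oob_error < best[0]:
            best = (oob_error, ntree, mtry)

print("lowest OOB error %.3f at ntree=%d, mtry=%d" % best)
```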

  20. PRBP: Prediction of RNA-Binding Proteins Using a Random Forest Algorithm Combined with an RNA-Binding Residue Predictor.

    PubMed

    Ma, Xin; Guo, Jing; Xiao, Ke; Sun, Xiao

    2015-01-01

    The prediction of RNA-binding proteins is an incredibly challenging problem in computational biology. Although great progress has been made using various machine learning approaches with numerous features, the problem is still far from being solved. In this study, we attempt to predict RNA-binding proteins directly from amino acid sequences. A novel approach, PRBP, predicts RNA-binding proteins using information from predicted RNA-binding residues in conjunction with a random forest based method. For a given protein, we first predict its RNA-binding residues and then judge whether the protein binds RNA or not based on information from that prediction. If the protein cannot be identified by the information associated with its predicted RNA-binding residues, then a novel random forest predictor is used to determine whether the query protein is an RNA-binding protein. We incorporated evolutionary information combined with physicochemical features (EIPP) and an amino acid composition feature to establish the random forest predictor. Feature analysis showed that EIPP contributed the most to the prediction of RNA-binding proteins. The results also showed that information from the RNA-binding residue prediction improved the overall performance of our RNA-binding protein prediction. It is anticipated that the PRBP method will become a useful tool for identifying RNA-binding proteins. A PRBP Web server implementation is freely available at http://www.cbi.seu.edu.cn/PRBP/. PMID:26671809

  1. A tale of two "forests": random forest machine learning aids tropical forest carbon mapping.

    PubMed

    Mascaro, Joseph; Asner, Gregory P; Knapp, David E; Kennedy-Bowdoin, Ty; Martin, Roberta E; Anderson, Christopher; Higgins, Mark; Chadwick, K Dana

    2014-01-01

    Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, both with and without spatial contextual modeling by including, in the latter case, x and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (i.e., called "out-of-bag"), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best performing run of Random Forest, and explained 59% of LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha(-1) when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation. PMID:24489686
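
    A toy sketch of "spatial context" in this sense: adding x and y coordinates as predictors next to spectral-style covariates (synthetic data, not the paper's LiDAR workflow):

```python
# Comparing a random forest regressor with and without x, y coordinates
# as extra predictors (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
xy = rng.uniform(0, 100, size=(n, 2))              # map coordinates
covs = rng.normal(size=(n, 3))                     # e.g. reflectance bands
carbon = (50 + 10 * np.sin(xy[:, 0] / 15)          # spatial trend
          + 5 * covs[:, 0] + rng.normal(0, 2, n))  # covariate signal + noise

for name, X in (("without xy", covs), ("with xy", np.hstack([covs, xy]))):
    X_tr, X_te, y_tr, y_te = train_test_split(X, carbon, random_state=0)
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X_tr, y_tr)
    print(name, "R^2 =", round(rf.score(X_te, y_te), 2))
```

    A purely random split like this can flatter spatial models; the paper's held-out 8-million-hectare block is a far stricter test.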

  2. Detecting targets hidden in random forests

    NASA Astrophysics Data System (ADS)

    Kouritzin, Michael A.; Luo, Dandan; Newton, Fraser; Wu, Biao

    2009-05-01

    Military tanks, cargo or troop carriers, missile carriers, and rocket launchers often hide from detection in forests, which complicates the problem of locating these hidden targets. An electro-optic camera mounted on a surveillance aircraft or unmanned aerial vehicle is used to capture images of forests with possible hidden targets, e.g., rocket launchers. We consider random forests with longitudinal and latitudinal correlations: foliage coverage is encoded with a binary representation (i.e., foliage or no foliage) and is correlated in adjacent regions. We address the detection of camouflaged targets hidden in random forests by building memory into the observations. In particular, we propose an efficient algorithm to generate random forests, ground, and camouflage of hidden targets with two-dimensional correlations. The observations are a sequence of snapshots consisting of foliage-obscured ground or target. Theoretically, detection is possible because there are subtle differences in the correlations of the ground and the camouflage of the rocket launcher; however, these differences are well beyond human perception. To detect the presence of hidden targets automatically, we develop a Markov representation for these sequences and modify the classical filtering equations to allow the Markov chain observation. Particle filters are used to estimate the position of the targets in combination with a novel random weighting technique. Furthermore, we give positive proof-of-concept simulations.

  3. Automated classification of seismic sources in large database using random forest algorithm: First results at Piton de la Fournaise volcano (La Réunion).

    NASA Astrophysics Data System (ADS)

    Hibert, Clément; Provost, Floriane; Malet, Jean-Philippe; Stumpf, André; Maggi, Alessia; Ferrazzini, Valérie

    2016-04-01

    In the past decades, the increasing quality of seismic sensors and the capability to transfer large quantities of data remotely have led to a fast densification of local, regional, and global seismic networks for near real-time monitoring. This technological advance permits the use of seismology to document geological and natural/anthropogenic processes (volcanoes, ice calving, landslides, snow and rock avalanches, geothermal fields), but also leads to an ever-growing quantity of seismic data. This wealth of data makes the construction of complete seismicity catalogs that include earthquakes but also other sources of seismic waves more challenging and very time-consuming, as this critical pre-processing stage is classically done by human operators. To overcome this issue, the development of automatic methods for processing continuous seismic data appears to be a necessity. The classification algorithm should satisfy the need for a method that is robust, precise, and versatile enough to be deployed for monitoring seismicity in very different contexts. We propose a multi-class detection method based on the random forests algorithm to automatically classify the source of seismic signals. Random forests is a supervised machine learning technique based on the computation of a large number of decision trees. The multiple decision trees are constructed from training sets in which each target class is described by a set of attributes; in the case of seismic signals, these attributes may encompass spectral features but also waveform characteristics, multi-station observations, and other relevant information. The Random Forests classifier is used because it provides state-of-the-art performance when compared with other machine learning techniques (e.g. SVM, Neural Networks) and requires no fine tuning. Furthermore, it is relatively fast, robust, easy to parallelize, and inherently suitable for multi-class problems. In this work, we present the first results of the classification method applied …

  4. Weighted Hybrid Decision Tree Model for Random Forest Classifier

    NASA Astrophysics Data System (ADS)

    Kulkarni, Vrushali Y.; Sinha, Pradeep K.; Petare, Manisha C.

    2016-06-01

    Random Forest is an ensemble, supervised machine learning algorithm. An ensemble generates many classifiers and combines their results by majority voting. Random forest uses the decision tree as its base classifier. In decision tree induction, an attribute split/evaluation measure is used to decide the best split at each node of the decision tree. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation among them. The work presented in this paper relates to attribute split measures and is a two-step process: first, a theoretical study of five selected split measures is carried out and a comparison matrix is generated to understand the pros and cons of each measure. These theoretical results are then verified by empirical analysis, in which a random forest is generated using each of the five selected split measures, one at a time (i.e., a random forest using information gain, a random forest using gain ratio, etc.). Next, based on the theoretical and empirical analysis, a new hybrid decision tree model for the random forest classifier is proposed, in which the individual decision trees in the Random Forest are generated using different split measures. This model is augmented by weighted voting based on the strength of the individual trees. The new approach has shown a notable increase in the accuracy of random forest.
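
    A hedged sketch of the hybrid idea: trees grown with different split measures (gini and entropy here, since scikit-learn does not ship gain ratio) and votes weighted by each tree's out-of-bag strength:

```python
# Hybrid forest sketch: alternating split criteria across trees, with
# each tree's vote weighted by its out-of-bag accuracy (illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees, weights = [], []
for i in range(50):
    crit = ("gini", "entropy")[i % 2]                # alternate split measures
    boot = rng.integers(0, len(X_tr), len(X_tr))     # bootstrap sample
    oob = np.setdiff1d(np.arange(len(X_tr)), boot)   # out-of-bag rows
    t = DecisionTreeClassifier(criterion=crit, max_features="sqrt",
                               random_state=i).fit(X_tr[boot], y_tr[boot])
    trees.append(t)
    weights.append(t.score(X_tr[oob], y_tr[oob]))    # tree strength as weight

votes = np.array([w * t.predict_proba(X_te) for t, w in zip(trees, weights)])
pred = votes.sum(axis=0).argmax(axis=1)
print("weighted-vote accuracy:", (pred == y_te).mean())
```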

  5. Multivariable integration method for estimating sea surface salinity in coastal waters from in situ data and remotely sensed data using random forest algorithm

    NASA Astrophysics Data System (ADS)

    Liu, Meiling; Liu, Xiangnan; Liu, Da; Ding, Chao; Jiang, Jiale

    2015-02-01

    A random forest (RF) model was created to estimate sea surface salinity (SSS) in the Hong Kong Sea, China, by integrating in situ and remotely sensed data. Optical remotely sensed data from China's HJ-1 satellite and in situ data were collected. The salinity prediction model was developed from in situ environmental variables, namely sea surface temperature (SST), pH, total inorganic nitrogen (TIN), and Chl-a, which are strongly related to SSS according to Pearson's correlation analysis. Large-scale SSS was estimated using the established salinity model with the same input parameters; ordinary kriging interpolation of the in situ data and a retrieval model based on the remotely sensed data were developed to obtain the model's large-scale input parameters. The number of trees in the forest (ntree) and the number of features tried at each node (mtry) were tuned in the RF model. The results showed that an optimum RF model was obtained with mtry=32 and ntree=2000, and the most important variable of the model for SSS prediction was SST, followed by TIN, Chl-a, and pH. This RF model successfully captured the temporal-spatial distribution of SSS and had a relatively low estimation error: the root mean square error (RMSE) was less than 2.0 psu, the mean absolute error (MAE) was below 1.5 psu, and the absolute percent error (APE) was lower than 5%. The final RF salinity model was then compared with a multiple linear regression (MLR) model, a back-propagation artificial neural network model, and a classification and regression trees (CART) model; the RF had a lower estimation error than the other three models. In addition, the RF model performed well across different periods, suggesting it could be applied generally. This demonstrates that the RF algorithm has the capability to estimate SSS in coastal waters by integrating in situ and remotely sensed data.

  6. Random Forest Classification for Surficial Material Mapping in Northern Canada

    NASA Astrophysics Data System (ADS)

    Parkinson, William

    There is a need at the Geological Survey of Canada to apply improved accuracy assessments of satellite image classification and to support remote predictive mapping techniques for geological map production and field operations. Most existing image classification algorithms, however, lack robust capabilities for assessing classification accuracy and its variability throughout the landscape. In this study, a random forest classification workflow is introduced to improve understanding of overall image classification accuracy and to better describe its spatial variability across a heterogeneous landscape in Northern Canada. The Random Forest model is a stochastic implementation of classification and regression trees that is computationally efficient, effectively handles outlier bias, and can be used on non-parametric data sources. A variable selection methodology and a stochastic accuracy assessment for Random Forest are introduced. Random forest provided an enhanced classification compared to standard maximum likelihood algorithms, improving the predictive capacity of satellite imagery for surficial material mapping.

  7. Evaluating total inorganic nitrogen in coastal waters through fusion of multi-temporal RADARSAT-2 and optical imagery using random forest algorithm

    NASA Astrophysics Data System (ADS)

    Liu, Meiling; Liu, Xiangnan; Li, Jin; Ding, Chao; Jiang, Jiale

    2014-12-01

    Satellites routinely provide frequent, large-scale, near-surface views of many oceanographic variables pertinent to plankton ecology. However, the nutrient fertility of water can be challenging to detect accurately using remote sensing technology. This research explored an approach to estimating nutrient fertility in coastal waters through the fusion of synthetic aperture radar (SAR) images and optical images using the random forest (RF) algorithm. The estimation of total inorganic nitrogen (TIN) in the Hong Kong Sea, China, was used as a case study. In March 2009 and May and August 2010, a sequence of multi-temporal in situ data, CCD images from China's HJ-1 satellite, and RADARSAT-2 images was acquired. Four sensitive parameters were selected as input variables to evaluate TIN: single-band reflectance, a normalized difference spectral index (NDSI), and the HV and VH polarizations. The RF algorithm was used to merge the different input variables from the SAR and optical imagery to generate a new dataset (i.e., the TIN outputs). The results showed the temporal-spatial distribution of TIN: values decreased from coastal waters to the open water areas, values in the northeast were higher than those in the southwest of the study area, and the maximum values occurred in May. Additionally, the accuracy of the TIN estimates improved significantly when the SAR and optical data were used in combination rather than either data type alone. This study suggests that this method of estimating nutrient fertility in coastal waters by effectively fusing data from multiple sensors is very promising.

  8. Mapping the distributions of C3 and C4 grasses in the mixed-grass prairies of southwest Oklahoma using the Random Forest classification algorithm

    NASA Astrophysics Data System (ADS)

    Yan, Dong; de Beurs, Kirsten M.

    2016-05-01

    The objective of this paper is to demonstrate a new method to map the distributions of C3 and C4 grasses at 30 m resolution and over a 25-year period of time (1988-2013) by combining the Random Forest (RF) classification algorithm and patch stable areas identified using the spatial pattern analysis software FRAGSTATS. Predictor variables for RF classifications consisted of ten spectral variables, four soil edaphic variables and three topographic variables. We provided a confidence score in terms of obtaining pure land cover at each pixel location by retrieving the classification tree votes. Classification accuracy assessments and predictor variable importance evaluations were conducted based on a repeated stratified sampling approach. Results show that patch stable areas obtained from larger patches are more appropriate to be used as sample data pools to train and validate RF classifiers for historical land cover mapping purposes and it is more reasonable to use patch stable areas as sample pools to map land cover in a year closer to the present rather than years further back in time. The percentage of obtained high confidence prediction pixels across the study area ranges from 71.18% in 1988 to 73.48% in 2013. The repeated stratified sampling approach is necessary in terms of reducing the positive bias in the estimated classification accuracy caused by the possible selections of training and validation pixels from the same patch stable areas. The RF classification algorithm was able to identify the important environmental factors affecting the distributions of C3 and C4 grasses in our study area such as elevation, soil pH, soil organic matter and soil texture.

  9. A Tale of Two “Forests”: Random Forest Machine Learning Aids Tropical Forest Carbon Mapping

    PubMed Central

    Mascaro, Joseph; Asner, Gregory P.; Knapp, David E.; Kennedy-Bowdoin, Ty; Martin, Roberta E.; Anderson, Christopher; Higgins, Mark; Chadwick, K. Dana

    2014-01-01

    Accurate and spatially-explicit maps of tropical forest carbon stocks are needed to implement carbon offset mechanisms such as REDD+ (Reduced Deforestation and Degradation Plus). The Random Forest machine learning algorithm may aid carbon mapping applications using remotely-sensed data. However, Random Forest has never been compared to traditional and potentially more reliable techniques such as regionally stratified sampling and upscaling, and it has rarely been employed with spatial data. Here, we evaluated the performance of Random Forest in upscaling airborne LiDAR (Light Detection and Ranging)-based carbon estimates compared to the stratification approach over a 16-million hectare focal area of the Western Amazon. We considered two runs of Random Forest, both with and without spatial contextual modeling by including, in the latter case, x and y position directly in the model. In each case, we set aside 8 million hectares (i.e., half of the focal area) for validation; this rigorous test of Random Forest went above and beyond the internal validation normally compiled by the algorithm (i.e., called "out-of-bag"), which proved insufficient for this spatial application. In this heterogeneous region of Northern Peru, the model with spatial context was the best performing run of Random Forest, and explained 59% of LiDAR-based carbon estimates within the validation area, compared to 37% for stratification or 43% by Random Forest without spatial context. With the 60% improvement in explained variation, RMSE against validation LiDAR samples improved from 33 to 26 Mg C ha−1 when using Random Forest with spatial context. Our results suggest that spatial context should be considered when using Random Forest, and that doing so may result in substantially improved carbon stock modeling for purposes of climate change mitigation. PMID:24489686

  10. Aggregated Recommendation through Random Forests

    PubMed Central

    Zhang, Heng-Ru; Min, Fan; He, Xu

    2014-01-01

    Aggregated recommendation refers to the process of suggesting one kind of item to a group of users. Compared to user-oriented or item-oriented approaches, it is more general and, therefore, more appropriate for cold-start recommendation. In this paper, we propose a random forest approach to creating aggregated recommender systems. The approach is used to predict the rating of a group of users for a kind of item. In the preprocessing stage, we merge user, item, and rating information to construct an aggregated decision table, where rating information serves as the decision attribute. We also model the data conversion process corresponding to the new-user, new-item, and both-new problems. In the training stage, a forest is built for the aggregated training set, where each leaf is assigned a distribution of discrete ratings. In the testing stage, we present four prediction approaches that compute evaluation values based on the distribution of each tree. Experimental results on the well-known MovieLens dataset show that the aggregated approach maintains an acceptable level of accuracy. PMID:25180204

  11. Aggregated recommendation through random forests.

    PubMed

    Zhang, Heng-Ru; Min, Fan; He, Xu

    2014-01-01

    Aggregated recommendation refers to the process of suggesting one kind of item to a group of users. Compared to user-oriented or item-oriented approaches, it is more general and, therefore, more appropriate for cold-start recommendation. In this paper, we propose a random forest approach to creating aggregated recommender systems. The approach is used to predict the rating of a group of users for a kind of item. In the preprocessing stage, we merge user, item, and rating information to construct an aggregated decision table, where rating information serves as the decision attribute. We also model the data conversion process corresponding to the new-user, new-item, and both-new problems. In the training stage, a forest is built for the aggregated training set, where each leaf is assigned a distribution of discrete ratings. In the testing stage, we present four prediction approaches that compute evaluation values based on the distribution of each tree. Experimental results on the well-known MovieLens dataset show that the aggregated approach maintains an acceptable level of accuracy. PMID:25180204

  12. A random forest classifier for lymph diseases.

    PubMed

    Azar, Ahmad Taher; Elshazly, Hanaa Ismail; Hassanien, Aboul Ella; Elkorany, Abeer Mohamed

    2014-02-01

    Machine learning-based classification techniques provide support for the decision-making process in many areas of health care, including diagnosis, prognosis, screening, etc. Feature selection (FS) is expected to improve classification performance, particularly in situations characterized by the high data dimensionality problem caused by relatively few training examples compared to a large number of measured features. In this paper, a random forest classifier (RFC) approach is proposed to diagnose lymph diseases. Focusing on feature selection, the first stage of the proposed system aims at constructing diverse feature selection algorithms such as genetic algorithm (GA), Principal Component Analysis (PCA), Relief-F, Fisher, Sequential Forward Floating Search (SFFS) and the Sequential Backward Floating Search (SBFS) for reducing the dimension of lymph diseases dataset. Switching from feature selection to model construction, in the second stage, the obtained feature subsets are fed into the RFC for efficient classification. It was observed that GA-RFC achieved the highest classification accuracy of 92.2%. The dimension of input feature space is reduced from eighteen to six features by using GA. PMID:24290902

  13. Random forests for classification in ecology

    USGS Publications Warehouse

    Cutler, D.R.; Edwards, T.C., Jr.; Beard, K.H.; Cutler, A.; Hess, K.T.; Gibson, J.; Lawler, J.J.

    2007-01-01

    Classification procedures are some of the most widely used statistical methods in ecology. Random forests (RF) is a new and powerful statistical classifier that is well established in other disciplines but is relatively unknown in ecology. Advantages of RF compared to other statistical classifiers include (1) very high classification accuracy; (2) a novel method of determining variable importance; (3) ability to model complex interactions among predictor variables; (4) flexibility to perform several types of statistical data analysis, including regression, classification, survival analysis, and unsupervised learning; and (5) an algorithm for imputing missing values. We compared the accuracies of RF and four other commonly used statistical classifiers using data on invasive plant species presence in Lava Beds National Monument, California, USA, rare lichen species presence in the Pacific Northwest, USA, and nest sites for cavity nesting birds in the Uinta Mountains, Utah, USA. We observed high classification accuracy in all applications as measured by cross-validation and, in the case of the lichen data, by independent test data, when comparing RF to other common classification methods. We also observed that the variables that RF identified as most important for classifying invasive plant species coincided with expectations based on the literature. © 2007 by the Ecological Society of America.

  14. Robust automated lymph node segmentation with random forests

    NASA Astrophysics Data System (ADS)

    Allen, David; Lu, Le; Yao, Jianhua; Liu, Jiamin; Turkbey, Evrim; Summers, Ronald M.

    2014-03-01

    Enlarged lymph nodes may indicate the presence of illness. Therefore, identification and measurement of lymph nodes provide essential biomarkers for diagnosing disease. Accurate automatic detection and measurement of lymph nodes can assist radiologists with better repeatability and quality assurance, but is challenging because lymph nodes are often very small and have highly variable shapes. In this paper, we propose to tackle this problem via supervised statistical learning-based robust voxel labeling, specifically the random forest algorithm. Random forest employs an ensemble of decision trees that are trained on labeled multi-class data to recognize the data features, and it is adopted here to handle low-level image features sampled and extracted from 3D medical scans. We exploit three types of image features (intensity, order-1 contrast, and order-2 contrast) and evaluate their effectiveness in a random forest feature selection setting. The trained forest can then be applied to unseen data by voxel scanning via sliding windows (11×11×11) to assign a class label and class-conditional probability to each unlabeled voxel at the center of the window. Voxels from the manually annotated lymph nodes in a CT volume are treated as the positive class; background non-lymph-node voxels as negatives. We show that the random forest algorithm can be adapted to perform the voxel labeling task accurately and efficiently. The experimental results are very promising, with AUCs (area under curve) of the training and validation ROC (receiver operating characteristic) of 0.972 and 0.959, respectively. The visualized voxel labeling results also confirm the validity.

  15. Randomized Algorithms for Matrices and Data

    NASA Astrophysics Data System (ADS)

    Mahoney, Michael W.

    2012-03-01

    This chapter reviews recent work on randomized matrix algorithms. By “randomized matrix algorithms,” we refer to a class of recently developed random sampling and random projection algorithms for ubiquitous linear algebra problems such as least-squares (LS) regression and low-rank matrix approximation. These developments have been driven by applications in large-scale data analysis—applications which place very different demands on matrices than traditional scientific computing applications. Thus, in this review, we will focus on highlighting the simplicity and generality of several core ideas that underlie the usefulness of these randomized algorithms in scientific applications such as genetics (where these algorithms have already been applied) and astronomy (where, hopefully, in part due to this review they will soon be applied). The work we will review here had its origins within theoretical computer science (TCS). An important feature in the use of randomized algorithms in TCS more generally is that one must identify and then algorithmically deal with relevant “nonuniformity structure” in the data. For the randomized matrix algorithms to be reviewed here and that have proven useful recently in numerical linear algebra (NLA) and large-scale data analysis applications, the relevant nonuniformity structure is defined by the so-called statistical leverage scores. Defined more precisely below, these leverage scores are basically the diagonal elements of the projection matrix onto the dominant part of the spectrum of the input matrix. As such, they have a long history in statistical data analysis, where they have been used for outlier detection in regression diagnostics. More generally, these scores often have a very natural interpretation in terms of the data and processes generating the data. For example, they can be interpreted in terms of the leverage or influence that a given data point has on, say, the best low-rank matrix approximation; and this
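
    The statistical leverage scores described here are straightforward to compute; the sketch below (assuming numpy and an arbitrary random matrix) takes them as the squared row norms of the top-k left singular vectors, i.e. the diagonal of the projection onto the dominant rank-k subspace.

        # Statistical leverage scores of the rows of A relative to rank k.
        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.standard_normal((1000, 20))
        k = 5                                    # target rank (illustrative)

        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        leverage = (U[:, :k] ** 2).sum(axis=1)   # l_i = ||U_k[i, :]||^2

        print("sum of scores (= k):", leverage.sum().round(6))
        print("most influential rows:", np.argsort(leverage)[::-1][:5])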

  16. Phenotype Recognition for RNAi Screening by Random Projection Forest

    NASA Astrophysics Data System (ADS)

    Zhang, Bailing

    2011-06-01

    High-content screening is important in drug discovery. The use of images of living cells as the basic unit for molecule discovery can aid the identification of small compounds altering cellular phenotypes. As such, efficient computational methods are required for the rate-limiting task of cellular phenotype identification. In this paper we first investigate the effectiveness of a feature description approach combining Haralick texture analysis with the Curvelet transform, and then propose a new ensemble approach for classification. The ensemble contains a set of base classifiers which are trained using random projections (RP) of the original features onto higher-dimensional spaces. With Classification and Regression Tree (CART) as the base classifier, it has been empirically demonstrated that the proposed Random Projection Forest ensemble gives better classification results than those achieved by the Boosting, Bagging and Rotation Forest algorithms, offering a classification rate of ~88% with the smallest standard deviation, which compares favorably with the published result of 82%.
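
    The core idea, training each base tree on a different random projection of the features, might be sketched as follows; dimensions, ensemble size, and data are illustrative assumptions, not the paper's settings.

        # Random Projection Forest: one random projection per CART base tree.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        X, y = make_classification(n_samples=600, n_features=30, random_state=0)
        Xtr, ytr, Xte, yte = X[:500], y[:500], X[500:], y[500:]

        d_proj, n_trees = 60, 25                 # project 30 -> 60 dims per tree
        ensemble = []
        for _ in range(n_trees):
            R = rng.standard_normal((X.shape[1], d_proj))
            tree = DecisionTreeClassifier().fit(Xtr @ R, ytr)
            ensemble.append((R, tree))

        # Majority vote over the projected-feature trees.
        votes = np.array([t.predict(Xte @ R) for R, t in ensemble])
        pred = (votes.mean(axis=0) > 0.5).astype(int)
        print("accuracy:", (pred == yte).mean())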

  17. A random search algorithm for laboratory computers

    NASA Technical Reports Server (NTRS)

    Curry, R. E.

    1975-01-01

    The small laboratory computer is ideal for experimental control and data acquisition. Postexperimental data processing is often performed on large computers because of the availability of sophisticated programs, but costs and data compatibility are negative factors. Parameter optimization can be accomplished on the small computer, offering ease of programming, data compatibility, and low cost. A previously proposed random-search algorithm ('random creep') was found to be very slow in convergence. A method is proposed (the 'random leap' algorithm) which starts in a global search mode and automatically adjusts step size to speed convergence. A FORTRAN executive program for the random-leap algorithm is presented which calls a user-supplied function subroutine. An example of a function subroutine is given which calculates maximum-likelihood estimates of receiver operating-characteristic parameters from binary response data. Other applications in parameter estimation, generalized least squares, and matrix inversion are discussed.
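
    A minimal sketch of a 'random leap'-style search, in Python rather than FORTRAN: the step size grows on success and shrinks on failure, so the search starts global and automatically tightens. The adjustment factors here are illustrative assumptions, not the report's values.

        # Random-leap search: adaptive-step global random minimization.
        import numpy as np

        def random_leap(f, x0, step=1.0, iters=2000, seed=0):
            rng = np.random.default_rng(seed)
            x, fx = np.asarray(x0, float), f(x0)
            for _ in range(iters):
                cand = x + step * rng.standard_normal(x.size)
                fc = f(cand)
                if fc < fx:
                    x, fx = cand, fc
                    step *= 2.0          # reward success with a longer leap
                else:
                    step *= 0.95         # shrink toward a local creep
            return x, fx

        # Example: minimize a simple quadratic bowl centered at 3.
        x, fx = random_leap(lambda v: ((v - 3.0) ** 2).sum(), np.zeros(4))
        print(x.round(3), fx)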

  18. Wildfire smoke detection using temporospatial features and random forest classifiers

    NASA Astrophysics Data System (ADS)

    Ko, Byoungchul; Kwak, Joon-Young; Nam, Jae-Yeal

    2012-01-01

    We propose a wildfire smoke detection algorithm that uses temporospatial visual features and random forest (RF) classifiers, i.e., ensembles of decision trees. In general, wildfire smoke detection is particularly important for early warning systems because smoke is usually generated before flames; in addition, smoke can be detected from a long distance owing to its diffusion characteristics. In order to detect wildfire smoke using a video camera, temporospatial characteristics such as color, wavelet coefficients, motion orientation, and a histogram of oriented gradients are extracted from the preceding 100 corresponding frames and the current keyframe. Two RFs are then trained using independent temporal and spatial feature vectors. Finally, a candidate block is declared a smoke block if the average probability of the two RFs is highest for the smoke class. The proposed algorithm was successfully applied to various wildfire-smoke and smoke-colored videos and performed better than other related algorithms.
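
    The final decision rule, averaging the class probabilities of the temporal and spatial forests, might look like the following sketch (scikit-learn, with random placeholder vectors in place of the extracted temporospatial features).

        # Average the smoke-class probabilities of a temporal and a spatial RF.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        X_temporal = rng.random((400, 16))
        X_spatial = rng.random((400, 24))
        y = rng.integers(0, 2, 400)          # 1 = smoke block, 0 = non-smoke

        rf_t = RandomForestClassifier(n_estimators=100,
                                      random_state=0).fit(X_temporal, y)
        rf_s = RandomForestClassifier(n_estimators=100,
                                      random_state=1).fit(X_spatial, y)

        # Declare a smoke block if the averaged smoke probability wins.
        p = (rf_t.predict_proba(X_temporal[:1]) +
             rf_s.predict_proba(X_spatial[:1])) / 2
        print("smoke block" if p[0].argmax() == 1 else "non-smoke block", p[0])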

  19. Random rotation survival forest for high dimensional censored data.

    PubMed

    Zhou, Lifeng; Wang, Hong; Xu, Qingsong

    2016-01-01

    Recently, rotation forest has been extended to regression and survival analysis problems. However, due to the intensive computation incurred by principal component analysis, rotation forest often fails when high-dimensional or big data are confronted. In this study, we extend rotation forest to high-dimensional censored time-to-event data analysis by combining random subspace, bagging and rotation forest. Supported by proper statistical analysis, we show that the proposed method, random rotation survival forest, outperforms state-of-the-art survival ensembles such as random survival forest and popular regularized Cox models. PMID:27625979

  20. RandomForest4Life: a Random Forest for predicting ALS disease progression.

    PubMed

    Hothorn, Torsten; Jung, Hans H

    2014-09-01

    We describe a method for predicting disease progression in amyotrophic lateral sclerosis (ALS) patients. The method was developed as a submission to the DREAM Phil Bowen ALS Prediction Prize4Life Challenge of summer 2012. Based on repeated patient examinations over a three-month period, we used a random forest algorithm to predict future disease progression. The procedure was set up and internally evaluated using data from 1197 ALS patients. External validation by an expert jury was based on undisclosed information of an additional 625 patients; all patient data were obtained from the PRO-ACT database. In terms of prediction accuracy, the approach described here ranked third best. Our interpretation of the prediction model confirmed previous reports suggesting that past disease progression is a strong predictor of future disease progression measured on the ALS functional rating scale (ALSFRS). We also found that larger variability in initial ALSFRS scores is linked to faster future disease progression. The results reported here furthermore suggested that approaches taking the multidimensionality of the ALSFRS into account promise some potential for improved ALS disease prediction. PMID:25141076

  1. Random Bits Forest: a Strong Classifier/Regressor for Big Data.

    PubMed

    Wang, Yi; Li, Yi; Pu, Weilin; Wen, Kathryn; Shugart, Yin Yao; Xiong, Momiao; Jin, Li

    2016-01-01

    Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed well in testing with an independent data set, a real psoriasis genome-wide association study (GWAS). PMID:27444562

  2. Random Bits Forest: a Strong Classifier/Regressor for Big Data

    PubMed Central

    Wang, Yi; Li, Yi; Pu, Weilin; Wen, Kathryn; Shugart, Yin Yao; Xiong, Momiao; Jin, Li

    2016-01-01

    Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed well in testing with an independent data set, a real psoriasis genome-wide association study (GWAS). PMID:27444562

  3. A random walk approach to quantum algorithms.

    PubMed

    Kendon, Vivien M

    2006-12-15

    The development of quantum algorithms based on quantum versions of random walks is placed in the context of the emerging field of quantum computing. Constructing a suitable quantum version of a random walk is not trivial; pure quantum dynamics is deterministic, so randomness only enters during the measurement phase, i.e. when converting the quantum information into classical information. The outcome of a quantum random walk is very different from the corresponding classical random walk owing to the interference between the different possible paths. The upshot is that quantum walkers find themselves further from their starting point than a classical walker on average, and this forms the basis of a quantum speed up, which can be exploited to solve problems faster. Surprisingly, the effect of making the walk slightly less than perfectly quantum can optimize the properties of the quantum walk for algorithmic applications. Looking to the future, even with a small quantum computer available, the development of quantum walk algorithms might proceed more rapidly than it has, especially for solving real problems. PMID:17090467
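
    The spreading claim can be checked numerically. The sketch below (plain numpy, not from the paper) simulates a discrete-time Hadamard walk on a line: the quantum walker's standard deviation grows linearly with the number of steps, versus the square root for the classical walk.

        # Discrete-time Hadamard quantum walk on a line, ballistic spreading.
        import numpy as np

        steps, N = 100, 201                      # positions -N//2 .. N//2
        H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
        psi = np.zeros((N, 2), complex)
        psi[N // 2] = [1 / np.sqrt(2), 1j / np.sqrt(2)]  # symmetric initial coin

        for _ in range(steps):
            psi = psi @ H.T                      # coin toss on every site
            shifted = np.zeros_like(psi)
            shifted[1:, 0] = psi[:-1, 0]         # coin 0 moves right
            shifted[:-1, 1] = psi[1:, 1]         # coin 1 moves left
            psi = shifted

        prob = (abs(psi) ** 2).sum(axis=1)
        pos = np.arange(N) - N // 2
        sigma_q = np.sqrt((prob * pos ** 2).sum())
        print(f"quantum spread ~ {sigma_q:.1f} vs classical ~ {np.sqrt(steps):.1f}")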

  4. Abdominal lymphadenopathy detection using random forest

    NASA Astrophysics Data System (ADS)

    Cherry, Kevin M.; Wang, Shijun; Turkbey, Evrim B.; Summers, Ronald M.

    2014-03-01

    We propose a new method for detecting abdominal lymphadenopathy by utilizing a random forest statistical classifier to create voxel-level lymph node predictions, i.e. initial detection of enlarged lymph nodes. The framework permits the combination of multiple statistical lymph node descriptors and appropriate feature selection in order to improve lesion detection beyond traditional enhancement filters. We show that Hessian blobness measurements alone are inadequate for detecting lymph nodes in the abdominal cavity. Of the features tested here, intensity proved to be the most important predictor for lymph node classification. For initial detection, candidate lesions were extracted from the 3D prediction map generated by random forest. Statistical features describing intensity distribution, shape, and texture were calculated from each enlarged lymph node candidate. In the last step, a support vector machine (SVM) was trained and tested based on the calculated features from candidates and labels determined by two experienced radiologists. The computer-aided detection (CAD) system was tested on a dataset containing 30 patients with 119 enlarged lymph nodes. Our method achieved an AUC of 0.762 ± 0.022 and a sensitivity of 79.8% with 15 false positives, suggesting it can aid radiologists in finding enlarged lymph nodes.

  5. Improving protein fold recognition by random forest

    PubMed Central

    2014-01-01

    Background Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be defined as a binary classification problem of predicting whether or not the unknown fold of a target protein is similar to an already known template protein structure in a library, machine learning methods have been effectively applied to tackle this problem. In our work, we developed RF-Fold that uses random forest - one of the most powerful and scalable machine learning classification methods - to recognize protein folds. Results RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold on the standard Lindahl's benchmark dataset comprised of 976 × 975 target-template protein pairs through cross-validation. Compared with 17 different fold recognition methods, the performance of RF-Fold is generally comparable to the best performance in fold recognition of different difficulty ranging from the easiest family level, the medium-hard superfamily level, and to the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at family, superfamily, and fold levels, respectively. Based on the top-five template protein folds ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3% and 58.3% at family, superfamily, and fold levels. Conclusions The good performance achieved by the RF-Fold demonstrates the random forest's effectiveness for protein fold recognition. PMID:25350499

  6. Photometric classification of quasars from RCS-2 using Random Forest

    NASA Astrophysics Data System (ADS)

    Carrasco, D.; Barrientos, L. F.; Pichara, K.; Anguita, T.; Murphy, D. N. A.; Gilbank, D. G.; Gladders, M. D.; Yee, H. K. C.; Hsieh, B. C.; López, S.

    2015-12-01

    The classification and identification of quasars is fundamental to many astronomical research areas. Given the large volume of photometric survey data available in the near future, automated methods for doing so are required. In this article, we present a new quasar candidate catalog from the Red-Sequence Cluster Survey 2 (RCS-2), identified solely from photometric information using an automated algorithm suitable for large surveys. The algorithm performance is tested using a well-defined SDSS spectroscopic sample of quasars and stars. The Random Forest algorithm constructs the catalog from RCS-2 point sources using SDSS spectroscopically-confirmed stars and quasars. The algorithm identifies putative quasars from broadband magnitudes (g, r, i, z) and colors. Exploiting NUV GALEX measurements for a subset of the objects, we refine the classifier by adding new information. An additional subset of the data with WISE W1 and W2 bands is also studied. Upon analyzing 542 897 RCS-2 point sources, the algorithm identified 21 501 quasar candidates with a training-set-derived precision (the fraction of true positives within the group assigned quasar status) of 89.5% and recall (the fraction of true positives relative to all sources that actually are quasars) of 88.4%. These performance metrics improve for the GALEX subset: 6529 quasar candidates are identified from 16 898 sources, with a precision and recall of 97.0% and 97.5%, respectively. Algorithm performance is further improved when WISE data are included, with precision and recall increasing to 99.3% and 99.1%, respectively, for 21 834 quasar candidates from 242 902 sources. We compiled our final catalog (38 257) by merging these samples and removing duplicates. An observational follow-up of 17 bright (r < 19) candidates with long-slit spectroscopy at the DuPont telescope (LCO) yielded 14 confirmed quasars. The results signal encouraging progress in the classification of point sources with Random Forest algorithms to search

  7. Random forest automated supervised classification of Hipparcos periodic variable stars

    NASA Astrophysics Data System (ADS)

    Dubath, P.; Rimoldini, L.; Süveges, M.; Blomme, J.; López, M.; Sarro, L. M.; De Ridder, J.; Cuypers, J.; Guy, L.; Lecoeur, I.; Nienartowicz, K.; Jan, A.; Beck, M.; Mowlavi, N.; De Cat, P.; Lebzelter, T.; Eyer, L.

    2011-07-01

    We present an evaluation of the performance of an automated classification of the Hipparcos periodic variable stars into 26 types. The sub-sample with the most reliable variability types available in the literature is used to train supervised algorithms to characterize the type dependencies on a number of attributes. The most useful attributes evaluated with the random forest methodology include, in decreasing order of importance, the period, the amplitude, the V-I colour index, the absolute magnitude, the residual around the folded light-curve model, the magnitude distribution skewness and the amplitude of the second harmonic of the Fourier series model relative to that of the fundamental frequency. Random forests and a multi-stage scheme involving Bayesian network and Gaussian mixture methods lead to statistically equivalent results. In standard 10-fold cross-validation (CV) experiments, the rate of correct classification is between 90 and 100 per cent, depending on the variability type. The main mis-classification cases, up to a rate of about 10 per cent, arise due to confusion between SPB and ACV blue variables and between eclipsing binaries, ellipsoidal variables and other variability types. Our training set and the predicted types for the other Hipparcos periodic stars are available online.

  8. Performance of random forest when SNPs are in linkage disequilibrium

    PubMed Central

    Meng, Yan A; Yu, Yi; Cupples, L Adrienne; Farrer, Lindsay A; Lunetta, Kathryn L

    2009-01-01

    Background Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build a RF. Results We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetics models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype. Conclusion Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the

  9. Ancestry assessment using random forest modeling.

    PubMed

    Hefner, Joseph T; Spradley, M Kate; Anderson, Bruce

    2014-05-01

    A skeletal assessment of ancestry relies on morphoscopic traits and skeletal measurements. Using a sample of American Black (n = 38), American White (n = 39), and Southwest Hispanics (n = 72), the present study investigates whether these data provide similar biological information and combines both data types into a single classification using a random forest model (RFM). Our results indicate that both data types provide similar information concerning the relationships among population groups. Also, by combining both in an RFM, the correct allocation of ancestry for an unknown cranium increases. The distribution of cross-validated grouped cases correctly classified using discriminant analyses and RFMs ranges between 75.4% (discriminant function analysis, morphoscopic data only) and 89.6% (RFM). Unlike the traditional, experience-based approach using morphoscopic traits, the inclusion of both data types in a single analysis is a quantifiable approach accounting for more variation within and between groups, reducing misclassification rates, and capturing aspects of cranial shape, size, and morphology. PMID:24502438

  10. A parallel algorithm for random searches

    NASA Astrophysics Data System (ADS)

    Wosniack, M. E.; Raposo, E. P.; Viswanathan, G. M.; da Luz, M. G. E.

    2015-11-01

    We discuss a parallelization procedure for a two-dimensional random search of a single individual, a typical sequential process. To assure the same features of the sequential random search in the parallel version, we analyze the former spatial patterns of the encountered targets for different search strategies and densities of homogeneously distributed targets. We identify a lognormal tendency for the distribution of distances between consecutively detected targets. Then, by assigning the distinct mean and standard deviation of this distribution for each corresponding configuration in the parallel simulations (constituted by parallel random walkers), we are able to recover important statistical properties, e.g., the target detection efficiency, of the original problem. The proposed parallel approach presents a speedup of nearly one order of magnitude compared with the sequential implementation. This algorithm can be easily adapted to different instances, such as searches in three dimensions. Its possible range of applicability covers problems in areas as diverse as automated computer searches in high-capacity databases and animal foraging.

  11. Patch forest: a hybrid framework of random forest and patch-based segmentation

    NASA Astrophysics Data System (ADS)

    Xie, Zhongliu; Gillies, Duncan

    2016-03-01

    The development of an accurate, robust and fast segmentation algorithm has long been a research focus in medical computer vision. State-of-the-art practices often involve non-rigidly registering a target image with a set of training atlases for label propagation over the target space to perform segmentation, a.k.a. multi-atlas label propagation (MALP). In recent years, the patch-based segmentation (PBS) framework has gained wide attention due to its advantage of relaxing the strict voxel-to-voxel correspondence to a series of pair-wise patch comparisons for contextual pattern matching. Despite a high accuracy reported in many scenarios, computational efficiency has consistently been a major obstacle for both approaches. Inspired by recent work on random forest, in this paper we propose a patch forest approach, which by equipping the conventional PBS with a fast patch search engine, is able to boost segmentation speed significantly while retaining an equal level of accuracy. In addition, a fast forest training mechanism is also proposed, with the use of a dynamic grid framework to efficiently approximate data compactness computation and a 3D integral image technique for fast box feature retrieval.

  12. Random Forest (RF) Wrappers for Waveband Selection and Classification of Hyperspectral Data.

    PubMed

    Poona, Nitesh Keshavelal; van Niekerk, Adriaan; Nadel, Ryan Leslie; Ismail, Riyad

    2016-02-01

    Hyperspectral data collected using a field spectroradiometer was used to model asymptomatic stress in Pinus radiata and Pinus patula seedlings infected with the pathogen Fusarium circinatum. Spectral data were analyzed using the random forest algorithm. To improve the classification accuracy of the model, subsets of wavebands were selected using three feature selection algorithms: (1) Boruta; (2) recursive feature elimination (RFE); and (3) area under the receiver operating characteristic curve of the random forest (AUC-RF). Results highlighted the robustness of the above feature selection methods when used in conjunction with the random forest algorithm for analyzing hyperspectral data. Overall, the Boruta feature selection algorithm provided the best results. When discriminating F. circinatum stress in Pinus radiata seedlings, Boruta selected wavebands (n = 69) yielded the best overall classification accuracies (training error of 17.00%, independent test error of 17.00% and an AUC value of 0.91). Classification results were, however, significantly lower for P. patula seedlings, with a training error of 24.00%, independent test error of 38.00%, and an AUC value of 0.65. A hybrid selection method that utilizes combinations of wavebands selected from the three feature selection algorithms was also tested. The hybrid method showed an improvement in classification accuracies for P. patula, and no improvement for P. radiata. The results of this study provide impetus towards implementing a hyperspectral framework for detecting stress within nursery environments. PMID:26903567

  13. Using Random Forest Models to Predict Organizational Violence

    NASA Technical Reports Server (NTRS)

    Levine, Burton; Bobashev, Georgly

    2012-01-01

    We present a methodology to assess the proclivity of an organization to commit violence against nongovernment personnel. We fitted a Random Forest model using the Minority at Risk Organizational Behavior (MAROS) dataset. The MAROS data is longitudinal, so individual observations are not independent. We propose a modification to the standard Random Forest methodology to account for the violation of the independence assumption. We present the results of the model fit, an example of predicting violence for an organization, and finally a summary of the forest in a "meta-tree."

  14. A Very Simple Safe-Bayesian Random Forest.

    PubMed

    Quadrianto, Novi; Ghahramani, Zoubin

    2015-06-01

    A random forest works by averaging the predictions of several de-correlated trees. We show a conceptually radical approach to generating a random forest: randomly sampling many trees from a prior distribution, and subsequently performing a weighted ensemble of predictive probabilities. Our approach uses priors that allow sampling of decision trees even before looking at the data, and a power likelihood that explores the space spanned by combinations of decision trees. While each tree performs Bayesian inference to compute its predictions, our aggregation procedure uses the power likelihood rather than the likelihood and is therefore, strictly speaking, not Bayesian. Nonetheless, we refer to it as a Bayesian random forest, but with a built-in safety: it retains good predictive performance even if the underlying probabilistic model is wrong. We demonstrate empirically that our Safe-Bayesian random forest outperforms MCMC- or SMC-based Bayesian decision trees in terms of speed and accuracy, and achieves performance competitive with entropy- or Gini-optimised random forests, yet is very simple to construct. PMID:26357350

  15. Random forest construction with robust semisupervised node splitting.

    PubMed

    Liu, Xiao; Song, Mingli; Tao, Dacheng; Liu, Zicheng; Zhang, Luming; Chen, Chun; Bu, Jiajun

    2015-01-01

    Random forest (RF) is a very important classifier with applications in various machine learning tasks, but its promising performance heavily relies on the size of labeled training data. In this paper, we investigate constructing of RFs with a small size of labeled data and find that the performance bottleneck is located in the node splitting procedures; hence, existing solutions fail to properly partition the feature space if there are insufficient training data. To achieve robust node splitting with insufficient data, we present semisupervised splitting to overcome this limitation by splitting nodes with the guidance of both labeled and abundant unlabeled data. In particular, an accurate quality measure of node splitting is obtained by carrying out the kernel-based density estimation, whereby a multiclass version of asymptotic mean integrated squared error criterion is proposed to adaptively select the optimal bandwidth of the kernel. To avoid the curse of dimensionality, we project the data points from the original high-dimensional feature space onto a low-dimensional subspace before estimation. A unified optimization framework is proposed to select a coupled pair of subspace and separating hyperplane such that the smoothness of the subspace and the quality of the splitting are guaranteed simultaneously. Our algorithm efficiently avoids overfitting caused by bad initialization and local maxima when compared with conventional margin maximization-based semisupervised methods. We demonstrate the effectiveness of the proposed algorithm by comparing it with state-of-the-art supervised and semisupervised algorithms for typical computer vision applications, such as object categorization, face recognition, and image segmentation, on publicly available data sets. PMID:25494503

  16. CloudForest: A Scalable and Efficient Random Forest Implementation for Biological Data

    PubMed Central

    Bressler, Ryan; Kreisberg, Richard B.; Bernard, Brady; Niederhuber, John E.; Vockley, Joseph G.; Shmulevich, Ilya; Knijnenburg, Theo A.

    2015-01-01

    Random Forest has become a standard data analysis tool in computational biology. However, extensions to existing implementations are often necessary to handle the complexity of biological datasets and their associated research questions. The growing size of these datasets requires high performance implementations. We describe CloudForest, a Random Forest package written in Go, which is particularly well suited for large, heterogeneous, genetic and biomedical datasets. CloudForest includes several extensions, such as dealing with unbalanced classes and missing values. Its flexible design enables users to easily implement additional extensions. CloudForest achieves fast running times by effective use of the CPU cache, optimizing for different classes of features and efficiently multi-threading. https://github.com/ilyalab/CloudForest. PMID:26679347

  17. CloudForest: A Scalable and Efficient Random Forest Implementation for Biological Data.

    PubMed

    Bressler, Ryan; Kreisberg, Richard B; Bernard, Brady; Niederhuber, John E; Vockley, Joseph G; Shmulevich, Ilya; Knijnenburg, Theo A

    2015-01-01

    Random Forest has become a standard data analysis tool in computational biology. However, extensions to existing implementations are often necessary to handle the complexity of biological datasets and their associated research questions. The growing size of these datasets requires high performance implementations. We describe CloudForest, a Random Forest package written in Go, which is particularly well suited for large, heterogeneous, genetic and biomedical datasets. CloudForest includes several extensions, such as dealing with unbalanced classes and missing values. Its flexible design enables users to easily implement additional extensions. CloudForest achieves fast running times by effective use of the CPU cache, optimizing for different classes of features and efficiently multi-threading. https://github.com/ilyalab/CloudForest. PMID:26679347

  18. Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias.

    PubMed

    Winham, Stacey J; Jenkins, Gregory D; Biernacka, Joanna M

    2016-02-01

    Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/). PMID:26639183

  19. Global patterns and predictions of seafloor biomass using random forests.

    PubMed

    Wei, Chih-Lin; Rowe, Gilbert T; Escobar-Briones, Elva; Boetius, Antje; Soltwedel, Thomas; Caley, M Julian; Soliman, Yousria; Huettmann, Falk; Qu, Fangyuan; Yu, Zishan; Pitcher, C Roland; Haedrich, Richard L; Wicksten, Mary K; Rex, Michael A; Baguley, Jeffrey G; Sharma, Jyotsna; Danovaro, Roberto; MacDonald, Ian R; Nunnally, Clifton C; Deming, Jody W; Montagna, Paul; Lévesque, Mélanie; Weslawski, Jan Marcin; Wlodarska-Kowalczuk, Maria; Ingole, Baban S; Bett, Brian J; Billett, David S M; Yool, Andrew; Bluhm, Bodil A; Iken, Katrin; Narayanaswamy, Bhavani E

    2010-01-01

    A comprehensive seafloor biomass and abundance database has been constructed from 24 oceanographic institutions worldwide within the Census of Marine Life (CoML) field projects. The machine-learning algorithm, Random Forests, was employed to model and predict seafloor standing stocks from surface primary production, water-column integrated and export particulate organic matter (POM), seafloor relief, and bottom water properties. The predictive models explain 63% to 88% of stock variance among the major size groups. Individual and composite maps of predicted global seafloor biomass and abundance are generated for bacteria, meiofauna, macrofauna, and megafauna (invertebrates and fishes). Patterns of benthic standing stocks were positive functions of surface primary production and delivery of the particulate organic carbon (POC) flux to the seafloor. At a regional scale, the census maps illustrate that integrated biomass is highest at the poles, on continental margins associated with coastal upwelling and with broad zones associated with equatorial divergence. Lowest values are consistently encountered on the central abyssal plains of major ocean basins. The shift of biomass dominance groups with depth is shown to be affected by the decrease in average body size rather than abundance, presumably due to a decrease in the quantity and quality of food supply. This biomass census and associated maps are vital components of mechanistic deep-sea food web models and global carbon cycling, and as such provide fundamental information that can be incorporated into evidence-based management. PMID:21209928

  20. Random Forest for automatic assessment of heart failure severity in a telemonitoring scenario.

    PubMed

    Guidi, G; Pettenati, M C; Miniati, R; Iadanza, E

    2013-01-01

    In this study, we describe an automatic classifier of patients with Heart Failure designed for a telemonitoring scenario, improving the results obtained in our previous works. Our previous studies showed that the technique that better processes the heart failure typical telemonitoring-parameters is the Classification Tree. We therefore decided to analyze the data with its direct evolution that is the Random Forest algorithm. The results show an improvement both in accuracy and in limiting critical errors. PMID:24110416

  1. Genetic algorithms as global random search methods

    NASA Technical Reports Server (NTRS)

    Peck, Charles C.; Dhawan, Atam P.

    1995-01-01

    Genetic algorithm behavior is described in terms of the construction and evolution of the sampling distributions over the space of candidate solutions. This novel perspective is motivated by analysis indicating that schema theory is inadequate for completely and properly explaining genetic algorithm behavior. Based on the proposed theory, it is argued that the similarities of candidate solutions should be exploited directly, rather than encoding candidate solutions and then exploiting their similarities. Proportional selection is characterized as a global search operator, and recombination is characterized as the search process that exploits similarities. Sequential algorithms and many deletion methods are also analyzed. It is shown that by properly constraining the search breadth of recombination operators, convergence of genetic algorithms to a global optimum can be ensured.

  2. Genetic algorithms as global random search methods

    NASA Technical Reports Server (NTRS)

    Peck, Charles C.; Dhawan, Atam P.

    1995-01-01

    Genetic algorithm behavior is described in terms of the construction and evolution of the sampling distributions over the space of candidate solutions. This novel perspective is motivated by analysis indicating that the schema theory is inadequate for completely and properly explaining genetic algorithm behavior. Based on the proposed theory, it is argued that the similarities of candidate solutions should be exploited directly, rather than encoding candidate solutions and then exploiting their similarities. Proportional selection is characterized as a global search operator, and recombination is characterized as the search process that exploits similarities. Sequential algorithms and many deletion methods are also analyzed. It is shown that by properly constraining the search breadth of recombination operators, convergence of genetic algorithms to a global optimum can be ensured.

  3. Propensity score and proximity matching using random forest.

    PubMed

    Zhao, Peng; Su, Xiaogang; Ge, Tingting; Fan, Juanjuan

    2016-03-01

    In order to derive unbiased inference from observational data, matching methods are often applied to produce balanced treatment and control groups in terms of all background variables. Propensity score has been a key component in this research area. However, propensity score based matching methods in the literature have several limitations, such as model mis-specifications, categorical variables with more than two levels, difficulties in handling missing data, and nonlinear relationships. Random forest, averaging outcomes from many decision trees, is nonparametric in nature, straightforward to use, and capable of solving these issues. More importantly, the precision afforded by random forest (Caruana et al., 2008) may provide us with a more accurate and less model dependent estimate of the propensity score. In addition, the proximity matrix, a by-product of the random forest, may naturally serve as a distance measure between observations that can be used in matching. The proposed random forest based matching methods are applied to data from the National Health and Nutrition Examination Survey (NHANES). Our results show that the proposed methods can produce well balanced treatment and control groups. An illustration is also provided that the methods can effectively deal with missing data in covariates. PMID:26706666
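
    Both ingredients, an RF-based propensity score and the proximity matrix, are easy to obtain from a standard implementation. The sketch below uses scikit-learn with synthetic covariates in place of NHANES; in-sample probabilities are used for brevity, though out-of-bag estimates would be preferable.

        # RF propensity scores and shared-leaf proximity between observations.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        X = rng.standard_normal((300, 6))                 # covariates
        t = (X[:, 0] + rng.standard_normal(300) > 0).astype(int)  # treatment

        rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, t)
        propensity = rf.predict_proba(X)[:, 1]            # P(treated | X)

        leaves = rf.apply(X)                    # (n_samples, n_trees) leaf ids
        i, j = 0, 1
        proximity = (leaves[i] == leaves[j]).mean()       # shared-leaf fraction
        print(f"propensity[{i}]={propensity[i]:.2f}, "
              f"proximity({i},{j})={proximity:.2f}")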

  4. Random Forests for Global and Regional Crop Yield Predictions

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Traditional regression models have limitations when applied for predicting crop yield responses at multiple spatial scales. An alternative modeling method, Random Forest (RF) regression, was utilized to predict crop yield responses for wheat, maize, and potato at regional scales. This RF regressio...

  5. Using random forest to model the domain applicability of another random forest model.

    PubMed

    Sheridan, Robert P

    2013-11-25

    In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities. We will call this traditional type of QSAR model an "activity model". The activity model can be used to predict the activities of molecules not in the training set. A relatively new subfield for QSAR is domain applicability. The aim is to estimate the reliability of prediction of a specific molecule on a specific activity model. A number of different metrics have been proposed in the literature for this purpose. It is desirable to build a quantitative model of reliability against one or more of these metrics. We can call this an "error model". A previous publication from our laboratory (Sheridan J. Chem. Inf. Model., 2012, 52, 814-823.) suggested the simultaneous use of three metrics would be more discriminating than any one metric. An error model could be built in the form of a three-dimensional set of bins. When the number of metrics exceeds three, however, the bin paradigm is not practical. An obvious solution for constructing an error model using multiple metrics is to use a QSAR method, in our case random forest. In this paper we demonstrate the usefulness of this paradigm, specifically for determining whether a useful error model can be built and which metrics are most useful for a given problem. For the ten data sets and for the seven metrics we examine here, it appears that it is possible to construct a useful error model using only two metrics (TREE_SD and PREDICTED). These do not require calculating similarities/distances between the molecules being predicted and the molecules used to build the activity model, which can be rate-limiting. PMID:24152204
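
    Analogues of the two winning metrics can be computed from any bagged forest. The sketch below (scikit-learn, synthetic data, not the authors' QSAR descriptors) derives PREDICTED as the ensemble mean and TREE_SD as the across-tree standard deviation, then fits an RF "error model" on them; for brevity the error model is fit on the same held-out set it is queried on, whereas practice calls for a separate calibration set.

        # PREDICTED and TREE_SD analogues feeding an RF error model.
        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor

        X, y = make_regression(n_samples=500, n_features=20, noise=5.0,
                               random_state=0)
        Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]

        rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xtr, ytr)
        per_tree = np.array([t.predict(Xte) for t in rf.estimators_])
        predicted = per_tree.mean(axis=0)        # PREDICTED metric
        tree_sd = per_tree.std(axis=0)           # TREE_SD metric

        metrics = np.column_stack([predicted, tree_sd])
        abs_err = np.abs(predicted - yte)        # observed prediction error
        error_model = RandomForestRegressor(n_estimators=200,
                                            random_state=1).fit(metrics, abs_err)
        print("estimated error of first test molecule:",
              error_model.predict(metrics[:1])[0].round(2))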

  6. Parameter identification using a creeping-random-search algorithm

    NASA Technical Reports Server (NTRS)

    Parrish, R. V.

    1971-01-01

    A creeping-random-search algorithm is applied to different types of problems in the field of parameter identification. The studies are intended to demonstrate that a random-search algorithm can be applied successfully to these various problems, which often cannot be handled by conventional deterministic methods, and, also, to introduce methods that speed convergence to an extremal of the problem under investigation. Six two-parameter identification problems with analytic solutions are solved, and two application problems are discussed in some detail. Results of the study show that a modified version of the basic creeping-random-search algorithm chosen does speed convergence in comparison with the unmodified version. The results also show that the algorithm can successfully solve problems that contain limits on state or control variables, inequality constraints (both independent and dependent, and linear and nonlinear), or stochastic models.

  7. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.

    PubMed

    Shah, Anoop D; Bartlett, Jonathan W; Carpenter, James; Nicholas, Owen; Hemingway, Harry

    2014-03-15

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data. PMID:24589914
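
    An analogous construction is available in Python via scikit-learn's experimental IterativeImputer with a random forest as the per-variable model; the paper itself used MICE in R, so the sketch below is a hedged stand-in with a synthetic nonlinear dependency.

        # Chained-equation imputation with a random forest per-variable model.
        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.standard_normal((200, 4))
        X[:, 3] = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)  # nonlinear
        X_miss = X.copy()
        X_miss[rng.random(200) < 0.3, 3] = np.nan   # 30% missing at random

        imp = IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=50, random_state=0),
            max_iter=10, random_state=0)
        X_filled = imp.fit_transform(X_miss)
        mask = np.isnan(X_miss[:, 3])
        print("imputation RMSE:",
              np.sqrt(((X_filled[mask, 3] - X[mask, 3]) ** 2).mean()).round(3))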

  8. Random Volumetric MRI Trajectories via Genetic Algorithms

    PubMed Central

    Curtis, Andrew Thomas; Anand, Christopher Kumar

    2008-01-01

    A pseudorandom, velocity-insensitive, volumetric k-space sampling trajectory is designed for use with balanced steady-state magnetic resonance imaging. Individual arcs are designed independently and do not fit together in the way that multishot spiral, radial or echo-planar trajectories do. Previously, it was shown that second-order cone optimization problems can be defined for each arc independent of the others, that nulling of zeroth and higher moments can be encoded as constraints, and that individual arcs can be optimized in seconds. For use in steady-state imaging, sampling duty cycles are predicted to exceed 95 percent. Using such pseudorandom trajectories, aliasing caused by under-sampling manifests itself as incoherent noise. In this paper, a genetic algorithm (GA) is formulated and numerically evaluated. A large set of arcs is designed using previous methods, and the GA chooses particularly fit subsets of a given size, corresponding to a desired acquisition time. Numerical simulations of 1 second acquisitions show good detail and acceptable noise for large-volume imaging with 32 coils. PMID:18604305

  9. Random Forest and Rotation Forest for fully polarized SAR image classification using polarimetric and spatial features

    NASA Astrophysics Data System (ADS)

    Du, Peijun; Samat, Alim; Waske, Björn; Liu, Sicong; Li, Zhenhong

    2015-07-01

    Fully Polarimetric Synthetic Aperture Radar (PolSAR) has the advantages of all-weather, day-and-night observation and high-resolution capability. The collected data are usually organized in Sinclair, coherence, or covariance matrices, which are directly related to the physical properties of natural media and the backscattering mechanism. Additional information related to the nature of the scattering medium can be exploited through polarimetric decomposition theorems. Accordingly, PolSAR image classification has gained increasing attention from the remote sensing community in recent years. However, the above polarimetric measurements or parameters cannot provide sufficient information for accurate PolSAR image classification in some scenarios, e.g. in complex urban areas where different scattering media may exhibit similar PolSAR responses for several unavoidable reasons. Inspired by the complementarity between spectral and spatial features, which brings remarkable improvements in optical image classification, the complementary information between polarimetric and spatial features may also contribute to PolSAR image classification. Therefore, the roles of textural features such as contrast, dissimilarity, homogeneity, and local range, as well as morphological profiles (MPs), in PolSAR image classification are investigated using two advanced ensemble learning (EL) classifiers: Random Forest and Rotation Forest. The supervised Wishart classifier and support vector machines (SVMs) are used as benchmark classifiers for evaluation and comparison purposes. Experimental results with three Radarsat-2 images in quad polarization mode indicate that classification accuracies can be significantly increased by integrating spatial and polarimetric features using ensemble learning strategies. Rotation Forest achieves better accuracy than SVM and Random Forest, while Random Forest is much faster than Rotation Forest.

  10. Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

    PubMed Central

    Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi

    2015-01-01

    Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures. PMID:25879059

  11. Simple-random-sampling-based multiclass text classification algorithm.

    PubMed

    Liu, Wuying; Wang, Lin; Yi, Mianzhu

    2014-01-01

    Multiclass text classification (MTC) is a challenging issue and the corresponding MTC algorithms can be used in many applications. The space-time overhead of these algorithms is a major concern in the era of big data. Through an investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token-level memory to store labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. Experimental results on the TanCorp data set show that the SRSMTC algorithm can achieve state-of-the-art performance at greatly reduced space-time requirements. PMID:24778587

  12. Simple-Random-Sampling-Based Multiclass Text Classification Algorithm

    PubMed Central

    Liu, Wuying; Wang, Lin; Yi, Mianzhu

    2014-01-01

    Multiclass text classification (MTC) is a challenging issue and the corresponding MTC algorithms can be used in many applications. The space-time overhead of these algorithms is a major concern in the era of big data. Through an investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token-level memory to store labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. Experimental results on the TanCorp data set show that the SRSMTC algorithm can achieve state-of-the-art performance at greatly reduced space-time requirements. PMID:24778587

  13. A radiation and energy budget algorithm for forest canopies

    NASA Astrophysics Data System (ADS)

    Tunick, A.

    2006-01-01

    Previously, it was shown that a one-dimensional, physics-based (conservation-law) computer model can provide a useful mathematical representation of the wind flow, temperatures, and turbulence inside and above a uniform forest stand. A key element of this calculation was a radiation and energy budget algorithm (implemented to predict the heat source). However, to keep the earlier publication brief, a full description of the radiation and energy budget algorithm was not given. Hence, this paper presents our equation set for calculating the incoming total radiation at the canopy top as well as the transmission, reflection, absorption, and emission of the solar flux through a forest stand. In addition, example model output is presented from three interesting numerical experiments, which were conducted to simulate the canopy microclimate for a forest stand that borders the Blossom Point Field Test Facility (located near La Plata, Maryland along the Potomac River). It is anticipated that the current numerical study will be useful to researchers and experimental planners who will be collecting acoustic and meteorological data at the Blossom Point Facility in the near future.

  14. Mapping Soil Properties of Africa at 250 m Resolution: Random Forests Significantly Improve Current Predictions

    PubMed Central

    Hengl, Tomislav; Heuvelink, Gerard B. M.; Kempen, Bas; Leenaars, Johan G. B.; Walsh, Markus G.; Shepherd, Keith D.; Sila, Andrew; MacMillan, Robert A.; Mendes de Jesus, Jorge; Tamene, Lulseged; Tondoh, Jérôme E.

    2015-01-01

    80% of arable land in Africa has low soil fertility and suffers from physical soil problems. Additionally, significant amounts of nutrients are lost every year due to unsustainable soil management practices. This is partially the result of insufficient use of soil management knowledge. To help bridge the soil information gap in Africa, the Africa Soil Information Service (AfSIS) project was established in 2008. Over the period 2008–2014, the AfSIS project compiled two point data sets: the Africa Soil Profiles (legacy) database and the AfSIS Sentinel Site database. These data sets contain over 28 thousand sampling locations and represent the most comprehensive soil sample data sets of the African continent to date. Utilizing these point data sets in combination with a large number of covariates, we have generated a series of spatial predictions of soil properties relevant to the agricultural management—organic carbon, pH, sand, silt and clay fractions, bulk density, cation-exchange capacity, total nitrogen, exchangeable acidity, Al content and exchangeable bases (Ca, K, Mg, Na). We specifically investigate differences between two predictive approaches: random forests and linear regression. Results of 5-fold cross-validation demonstrate that the random forests algorithm consistently outperforms the linear regression algorithm, with average decreases of 15–75% in Root Mean Squared Error (RMSE) across soil properties and depths. Fitting and running random forests models takes an order of magnitude more time and the modelling success is sensitive to artifacts in the input data, but as long as quality-controlled point data are provided, an increase in soil mapping accuracy can be expected. Results also indicate that globally predicted soil classes (USDA Soil Taxonomy, especially Alfisols and Mollisols) help improve continental scale soil property mapping, and are among the most important predictors. This indicates a promising potential for transferring pedological

  16. Analysis of landslide hazard area in Ludian earthquake based on Random Forests

    NASA Astrophysics Data System (ADS)

    Xie, J.-C.; Liu, R.; Li, H.-W.; Lai, Z.-L.

    2015-04-01

    With the development of machine learning theory, more and more algorithms are being evaluated for seismic landslide assessment. After the Ludian earthquake, the research team combined knowledge of the special geological structure of the Ludian area with seismic field exploration results to select slope (PODU), river distance (HL), fault distance (DC), seismic intensity (LD), the digital elevation model (DEM), and the normalized difference vegetation index (NDVI) derived from remote sensing images as evaluation factors. Because the relationships among these factors are fuzzy and the data are noisy and high-dimensional, we introduce the random forest algorithm to tolerate these difficulties and obtain an evaluation of landslide-prone areas in Ludian. To verify the accuracy of the result, ROC graphs were used as the evaluation standard: the AUC reaches 0.918, and the random forest's generalization error rate decreases with an increasing number of classification trees to an ideal 0.08, as estimated with out-of-bag (OOB) samples. From the final landslide inversion results, the paper draws the statistical conclusion that nearly 80% of all landslides and dilapidations lie in areas of high or moderate susceptibility, showing that the forecast results are reasonable and can be adopted.
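
    The two diagnostics reported above, out-of-bag error as the forest grows and the ROC AUC, can be reproduced in miniature with scikit-learn; this is a hedged sketch on synthetic data, with six generated factors merely standing in for PODU, HL, DC, LD, DEM, and NDVI.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        # Six synthetic factors standing in for slope, river/fault distance, intensity, DEM, NDVI.
        X, y = make_classification(n_samples=3000, n_features=6, n_informative=4, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

        # The out-of-bag (OOB) error should fall toward a plateau as trees are added.
        for n_trees in (25, 100, 400):
            rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                        random_state=0).fit(X_tr, y_tr)
            print(f"{n_trees:4d} trees: OOB error = {1 - rf.oob_score_:.3f}")

        # ROC AUC of the final forest on held-out data (the abstract reports 0.918).
        auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
        print(f"held-out ROC AUC = {auc:.3f}")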

  17. RAQ-A Random Forest Approach for Predicting Air Quality in Urban Sensing Systems.

    PubMed

    Yu, Ruiyun; Yang, Yu; Yang, Leyou; Han, Guangjie; Move, Oguti Ann

    2016-01-01

    Air quality information such as the concentration of PM2.5 is of great significance for human health and city management. It affects ways of traveling, urban planning, government policies, and so on. However, in major cities there is typically only a limited number of air quality monitoring stations, while air quality varies across urban areas, with large differences possible even between closely neighboring regions. In this paper, a random forest approach for predicting air quality (RAQ) is proposed for urban sensing systems. The data generated by urban sensing include meteorology data, road information, real-time traffic status, and point of interest (POI) distribution. The random forest algorithm is exploited for data training and prediction. The performance of RAQ is evaluated with real city data. Compared with three other algorithms, this approach achieves better prediction precision. The experiments also yield the encouraging result that air quality can be inferred with high accuracy from the data obtained by urban sensing. PMID:26761008

  19. Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data.

    PubMed

    Stevens, Forrest R; Gaughan, Andrea E; Linard, Catherine; Tatem, Andrew J

    2015-01-01

    High resolution, contemporary data on human population distributions are vital for measuring impacts of population growth, monitoring human-environment interactions and for planning and policy development. Many methods are used to disaggregate census data and predict population densities for finer scale, gridded population data sets. We present a new semi-automated dasymetric modeling approach that incorporates detailed census and ancillary data in a flexible, "Random Forest" estimation technique. We outline the combination of widely available, remotely-sensed and geospatial data that contribute to the modeled dasymetric weights and then use the Random Forest model to generate a gridded prediction of population density at ~100 m spatial resolution. This prediction layer is then used as the weighting surface to perform dasymetric redistribution of the census counts at a country level. As a case study we compare the new algorithm and its products for three countries (Vietnam, Cambodia, and Kenya) with other common gridded population data production methodologies. We discuss the advantages of the new method and its gains in accuracy and flexibility over those previous approaches. Finally, we outline how this algorithm will be extended to provide freely-available gridded population data sets for Africa, Asia and Latin America. PMID:25689585
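
    The redistribution step lends itself to a compact sketch: a random forest predicts a density weight per grid cell, the weights are normalized within each census unit, and each unit's count is spread across its cells in proportion. The toy arrays below are illustrative assumptions, not the authors' implementation.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        n_cells, n_units = 10_000, 50
        unit_id = rng.integers(0, n_units, n_cells)     # census unit containing each grid cell
        covariates = rng.normal(size=(n_cells, 8))      # remotely-sensed / geospatial layers
        density_obs = np.exp(covariates[:, 0]) + 0.1    # toy training target (people per cell)

        # 1) Train the forest on cells with known density, then predict a weighting surface.
        rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(covariates, density_obs)
        weights = rf.predict(covariates)

        # 2) Dasymetric redistribution: split each unit's census count across its
        #    cells in proportion to the predicted weights (unit totals are preserved).
        census_counts = rng.integers(1_000, 100_000, n_units)
        unit_weight_sums = np.bincount(unit_id, weights=weights, minlength=n_units)
        population = census_counts[unit_id] * weights / unit_weight_sums[unit_id]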

  20. Exploiting SNP correlations within random forest for genome-wide association studies.

    PubMed

    Botta, Vincent; Louppe, Gilles; Geurts, Pierre; Wehenkel, Louis

    2014-01-01

    The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification. Variable importance results and T-Trees source code are all available at www.montefiore.ulg.ac.be/~botta/ttrees/ and github.com/0asa/TTree-source respectively. PMID:24695491

  2. Classification of acoustic emission signals using wavelets and Random Forests : Application to localized corrosion

    NASA Astrophysics Data System (ADS)

    Morizet, N.; Godin, N.; Tang, J.; Maillet, E.; Fregonese, M.; Normand, B.

    2016-03-01

    This paper proposes a novel approach to classify acoustic emission (AE) signals deriving from corrosion experiments, even when embedded in a noisy environment. To validate the new methodology, synthetic data are first used in an in-depth analysis comparing Random Forests (RF) to the k-Nearest Neighbor (k-NN) algorithm. Moreover, a new evaluation tool called the alter-class matrix (ACM) is introduced to simulate different degrees of uncertainty on labeled data for supervised classification. Then, tests on real cases involving noise and crevice corrosion are conducted by preprocessing the waveforms with wavelet denoising and extracting a rich set of features as input to the RF algorithm. To this end, a software package called RF-CAM has been developed. Results show that this approach is very efficient on ground-truth data and very promising on real data, especially for its reliability, performance, and speed, which are key criteria for the chemical industry.
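
    The preprocessing chain described above maps naturally onto a short script: wavelet-denoise each waveform, extract features, and train a random forest. This is a hedged sketch using PyWavelets and scikit-learn; the five features and the universal-threshold rule are illustrative choices, not the paper's full feature set.

        import numpy as np
        import pywt
        from sklearn.ensemble import RandomForestClassifier

        def denoise(waveform, wavelet="db4", level=4):
            # Soft-threshold the detail coefficients (universal threshold estimate).
            coeffs = pywt.wavedec(waveform, wavelet, level=level)
            sigma = np.median(np.abs(coeffs[-1])) / 0.6745
            thr = sigma * np.sqrt(2 * np.log(len(waveform)))
            coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
            return pywt.waverec(coeffs, wavelet)[: len(waveform)]

        def features(waveform):
            # A small illustrative feature set (amplitude, energy, duration proxies).
            return [waveform.max(), np.abs(waveform).mean(), waveform.std(),
                    float(np.argmax(np.abs(waveform))), float((waveform ** 2).sum())]

        rng = np.random.default_rng(0)
        waveforms = rng.normal(size=(200, 1024))    # placeholder AE hits
        labels = rng.integers(0, 2, 200)            # e.g. noise vs. crevice corrosion
        X = np.array([features(denoise(w)) for w in waveforms])
        clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, labels)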

  3. Fire detection system using random forest classification for image sequences of complex background

    NASA Astrophysics Data System (ADS)

    Kim, Onecue; Kang, Dong-Joong

    2013-06-01

    We present a fire alarm system based on image processing that detects fire accidents in various environments. To reduce false alarms that frequently appeared in earlier systems, we combined image features including color, motion, and blinking information. We specifically define the color conditions of fires in hue, saturation and value, and RGB color space. Fire features are represented as intensity variation, color mean and variance, motion, and image differences. Moreover, blinking fire features are modeled by using crossing patches. We propose an algorithm that classifies patches into fire or nonfire areas by using random forest supervised learning. We design an embedded surveillance device made with acrylonitrile butadiene styrene housing for stable fire detection in outdoor environments. The experimental results show that our algorithm works robustly in complex environments and is able to detect fires in real time.

  4. Estimating tropical forest structure using discrete return lidar data and a locally trained synthetic forest algorithm

    NASA Astrophysics Data System (ADS)

    Palace, M. W.; Sullivan, F. B.; Ducey, M.; Czarnecki, C.; Zanin Shimbo, J.; Mota e Silva, J.

    2012-12-01

    Forests are complex ecosystems with diverse species assemblages, crown structures, size class distributions, and historical disturbances. This complexity makes monitoring, understanding, and forecasting carbon dynamics difficult, yet it is also central to the carbon cycling of terrestrial vegetation. Lidar data are often used solely to associate plot-level biomass measurements with canopy height models, but much more may be gleaned from the full vertical profile. Using discrete return airborne light detection and ranging (lidar) data collected in 2009 by the Tropical Ecology Assessment and Monitoring Network (TEAM), we compared synthetic vegetation profiles to lidar-derived relative vegetation profiles (RVPs) in La Selva, Costa Rica. To accomplish this, we developed RVPs describing the vertical distribution of plant material on 20 plots at La Selva by transforming cumulative lidar observations to account for obscured plant material. Hundreds of synthetic profiles were developed for forests containing approximately 200,000 trees with random diameter at breast height (DBH), assuming a Weibull distribution with a shape of 1.0 and mean DBH ranging from 0 cm to 500 cm. For each tree in the synthetic forests, crown shape (width, depth) and total height were estimated using previously developed allometric equations for tropical forests. Profiles for each synthetic forest were generated and compared to TEAM lidar data to determine the best-fitting synthetic profile for each of the 20 field plots at La Selva. After determining the best-fit synthetic profile using the minimum sum of squared differences, we are able to estimate forest structure (diameter distribution, height, and biomass) and to compare our estimates to field data for each of the twenty field plots. Our preliminary results show promise for estimating forest structure and biomass using lidar data and computer modeling.
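
    The fitting step can be sketched compactly: draw DBHs from a Weibull distribution with shape 1.0, convert each tree to a crown span via allometry, accumulate a vertical profile, and keep the candidate forest that minimizes the sum of squared differences against the lidar profile. The allometric coefficients and the Gaussian stand-in for the lidar profile below are hypothetical.

        import numpy as np

        rng = np.random.default_rng(0)
        bins = np.arange(0, 61)                                    # 1-m height bins
        lidar_rvp = np.exp(-0.5 * ((bins[:-1] - 25) / 8.0) ** 2)   # stand-in lidar profile
        lidar_rvp /= lidar_rvp.sum()

        def synthetic_profile(mean_dbh, n_trees=5000):
            # Weibull shape 1.0 is the exponential; scale chosen so the mean equals mean_dbh.
            dbh = rng.weibull(1.0, n_trees) * mean_dbh
            height = 5.0 * dbh ** 0.6             # hypothetical allometric height (m)
            crown_depth = 0.4 * height            # hypothetical crown depth fraction
            profile = np.zeros(len(bins) - 1)
            for h, d in zip(height, crown_depth):
                lo, hi = np.searchsorted(bins, [max(h - d, 0.0), min(h, bins[-1])])
                profile[lo:max(hi, lo + 1)] += 1  # spread each crown over its height bins
            return profile / profile.sum()

        # Choose the synthetic forest whose profile best matches the lidar RVP (min SSD).
        candidates = np.linspace(5, 80, 40)
        ssd = [((synthetic_profile(m) - lidar_rvp) ** 2).sum() for m in candidates]
        best_mean_dbh = candidates[int(np.argmin(ssd))]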

  5. An iterative curvelet thresholding algorithm for seismic random noise attenuation

    NASA Astrophysics Data System (ADS)

    Wang, De-Li; Tong, Zhong-Fei; Tang, Chen; Zhu, Heng

    2010-12-01

    In this paper, we explore the use of iterative curvelet thresholding for seismic random noise attenuation. A new method combining the curvelet transform with iterative thresholding to suppress random noise is demonstrated, and the problem is posed as a linear inverse optimization problem using the L1 norm. Random noise suppression in seismic data is thus transformed into an L1-norm optimization problem based on the curvelet sparsity transform. Compared to conventional methods such as the median filter, FX deconvolution, and wavelet thresholding, the results of synthetic and field data processing show that the iterative curvelet thresholding proposed in this paper can substantially improve the signal-to-noise ratio (SNR) while giving higher signal fidelity. Furthermore, to make better use of the curvelet transform's multiple scales and multiple directions, we control the curvelet direction of the result after iterative curvelet thresholding to further improve the SNR.
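
    The scheme can be illustrated with an iterative soft-thresholding loop. Below is a minimal sketch, with an orthogonal wavelet transform standing in for the curvelet transform (which requires a dedicated library) and a random sampling mask serving as a generic linear operator; the geometrically decreasing threshold is a common choice in iterative thresholding schemes, not the paper's exact schedule.

        import numpy as np
        import pywt

        rng = np.random.default_rng(0)
        n = 1024
        clean = np.sin(2 * np.pi * np.arange(n) / 64.0)     # toy "seismic" trace
        mask = rng.random(n) < 0.7                          # A: a generic sampling operator
        y = mask * (clean + 0.2 * rng.normal(size=n))       # observed noisy data

        x, lam = np.zeros(n), 0.5
        for _ in range(50):
            grad = mask * (mask * x - y)                    # gradient of 0.5*||Ax - y||^2
            coeffs = pywt.wavedec(x - grad, "db8", level=5) # transform the gradient step
            coeffs[1:] = [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
            x = pywt.waverec(coeffs, "db8")[:n]             # back to the signal domain
            lam *= 0.9                                      # shrink the threshold (L1 proxy)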

  6. Random Matrix Approach to Quantum Adiabatic Evolution Algorithms

    NASA Technical Reports Server (NTRS)

    Boulatov, Alexei; Smelyanskiy, Vadier N.

    2004-01-01

    We analyze the power of quantum adiabatic evolution algorithms (QAA) for solving random NP-hard optimization problems within a theoretical framework based on random matrix theory (RMT). We present two types of driven RMT models. In the first model, the driving Hamiltonian is represented by Brownian motion in the matrix space. We use the Brownian motion model to obtain a description of multiple avoided crossing phenomena. We show that the failure mechanism of the QAA is due to the interaction of the ground state with the "cloud" formed by all the excited states, confirming that in the driven RMT models the Landau-Zener mechanism of dissipation is not important. We show that the QAA has a finite probability of success in a certain range of parameters, implying polynomial complexity of the algorithm. The second model corresponds to the standard QAA with the problem Hamiltonian taken from the Gaussian unitary RMT ensemble (GUE). We show that the level dynamics in this model can be mapped onto the dynamics in the Brownian motion model. However, the driven RMT model always leads to exponential complexity of the algorithm due to the presence of long-range intertemporal correlations of the eigenvalues. Our results indicate that the weakness of effective transitions is the leading effect that can make the Markovian-type QAA successful.

  7. Algorithmic randomness, physical entropy, measurements, and the second law

    SciTech Connect

    Zurek, W.H.

    1989-01-01

    Algorithmic information content is equal to the size -- in the number of bits -- of the shortest program for a universal Turing machine which can reproduce a state of a physical system. In contrast to the statistical Boltzmann-Gibbs-Shannon entropy, which measures ignorance, the algorithmic information content is a measure of the available information. It is defined without recourse to probabilities and can be regarded as a measure of randomness of a definite microstate. I suggest that the physical entropy S -- that is, the quantity which determines the amount of work ΔW which can be extracted in the cyclic isothermal expansion process through the equation ΔW = k_B T ΔS -- is a sum of two contributions: the missing information measured by the usual statistical entropy and the known randomness measured by the algorithmic information content. The sum of these two contributions is a "constant of motion" in the process of a dissipationless measurement on an equilibrium ensemble. This conservation under a measurement, which can be traced back to the noiseless coding theorem of Shannon, is necessary to rule out the existence of a successful Maxwell's demon. 17 refs., 3 figs.
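
    Restating the decomposition in the abstract as a display equation (the notation is ours, following the abstract's wording):

        S_{\mathrm{phys}}
          \;=\; \underbrace{H}_{\text{missing information (statistical entropy)}}
          \;+\; \underbrace{K}_{\text{known randomness (algorithmic content)}},
        \qquad
        \Delta W \;=\; k_B \, T \, \Delta S .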

  8. Combinatorial approximation algorithms for MAXCUT using random walks.

    SciTech Connect

    Seshadhri, Comandur; Kale, Satyen

    2010-11-01

    We give the first combinatorial approximation algorithm for MaxCut that beats the trivial 0.5 factor by a constant. The main partitioning procedure is very intuitive, natural, and easily described. It essentially performs a number of random walks and aggregates the information to provide the partition. We can control the running time to get an approximation-factor/running-time tradeoff. We show that for any constant b > 1.5, there is an Õ(n^b) algorithm that outputs a (0.5 + δ)-approximation for MaxCut, where δ = δ(b) is some positive constant. One of the components of our algorithm is a weak local graph partitioning procedure that may be of independent interest. Given a starting vertex i and a conductance parameter φ, unless a random walk of length ℓ = O(log n) starting from i mixes rapidly (in terms of φ and ℓ), we can find a cut of conductance at most φ close to the vertex. The work done per vertex found in the cut is sublinear in n.

  9. Recycling random numbers in the stochastic simulation algorithm.

    PubMed

    Yates, Christian A; Klingbeil, Guido

    2013-03-01

    The stochastic simulation algorithm (SSA) was introduced by Gillespie and, in a different form, by Kurtz. Since its original formulation there have been several attempts at improving the efficiency and hence the speed of the algorithm. We briefly discuss some of these methods before outlining our own simple improvement, the recycling direct method (RDM), and demonstrating that it is capable of increasing the speed of most stochastic simulations. The RDM involves the statistically acceptable recycling of random numbers in order to reduce the computational cost associated with their generation and is compatible with several of the pre-existing improvements on the original SSA. Our improvement is also sufficiently simple (one additional line of code) that we hope it will be adopted by both trained mathematical modelers and experimentalists wishing to simulate their model systems. PMID:23485273
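
    For context, a minimal sketch of the standard direct method follows; it spends two random draws per event (one for the waiting time, one for the reaction choice), which is exactly the cost the recycling variant targets. The birth-death example is a toy assumption, not taken from the paper.

        import numpy as np

        def gillespie_direct(x0, propensities, stoich, t_end, seed=0):
            # Standard (non-recycling) direct method of Gillespie.
            rng = np.random.default_rng(seed)
            t, x, history = 0.0, np.array(x0, dtype=float), []
            while t < t_end:
                a = propensities(x)                  # reaction propensities
                a0 = a.sum()
                if a0 == 0.0:
                    break
                t += rng.exponential(1.0 / a0)       # draw 1: exponential waiting time
                j = np.searchsorted(np.cumsum(a), rng.uniform(0.0, a0))  # draw 2: reaction
                x = x + stoich[j]
                history.append((t, x.copy()))
            return history

        # Toy birth-death process: 0 -> X at rate 1.0; X -> 0 at rate 0.1 per molecule.
        traj = gillespie_direct([10],
                                lambda x: np.array([1.0, 0.1 * x[0]]),
                                np.array([[1], [-1]]), t_end=50.0)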

  10. Gearbox fault diagnosis based on deep random forest fusion of acoustic and vibratory signals

    NASA Astrophysics Data System (ADS)

    Li, Chuan; Sanchez, René-Vinicio; Zurita, Grover; Cerrada, Mariela; Cabrera, Diego; Vásquez, Rafael E.

    2016-08-01

    Fault diagnosis is an effective tool to guarantee safe operation of gearboxes. Acoustic and vibratory measurements in such mechanical devices are all sensitive to the existence of faults. This work addresses the use of a deep random forest fusion (DRFF) technique to improve fault diagnosis performance for gearboxes, using measurements from an acoustic emission (AE) sensor and an accelerometer that monitor the gearbox condition simultaneously. Statistical parameters of the wavelet packet transform (WPT) are first produced from the AE signal and the vibratory signal, respectively. Two deep Boltzmann machines (DBMs) are then developed for deep representations of the WPT statistical parameters. A random forest is finally used to fuse the outputs of the two DBMs into the integrated DRFF model. The proposed DRFF technique is evaluated in gearbox fault diagnosis experiments under different operational conditions and achieves a classification rate of 97.68% across 11 different condition patterns. Compared to peer algorithms, the proposed method exhibits the best performance. The results indicate that deep learning fusion of acoustic and vibratory signals may improve fault diagnosis capabilities for gearboxes.

  11. Sequential Monte Carlo tracking of the marginal artery by multiple cue fusion and random forest regression.

    PubMed

    Cherry, Kevin M; Peplinski, Brandon; Kim, Lauren; Wang, Shijun; Lu, Le; Zhang, Weidong; Liu, Jianfei; Wei, Zhuoshi; Summers, Ronald M

    2015-01-01

    Given the potential importance of marginal artery localization in automated registration in computed tomography colonography (CTC), we have devised a semi-automated method of marginal vessel detection employing sequential Monte Carlo tracking (also known as particle filtering) with multiple-cue fusion based on intensity, vesselness, organ detection, and minimum spanning tree information for poorly enhanced vessel segments. We then employed a random forest algorithm for intelligent cue fusion and decision making, which achieved high sensitivity and robustness. After applying a vessel pruning procedure to the tracking results, we achieved a statistically significant improvement in precision over a baseline Hessian detection method (2.7% for the baseline versus 75.2% for our method, p<0.001). This method also showed a statistically significant improvement in recall compared to a 2-cue baseline method using fewer vessel cues (30.7% versus 67.7%, p<0.001). These results demonstrate that marginal artery localization on CTC is feasible by combining a discriminative classifier (i.e., a random forest) with a sequential Monte Carlo tracking mechanism. In so doing, we present the effective application of an anatomical probability map to vessel pruning, as well as a supplementary spatial coordinate system for colonic segmentation and registration when this task has been confounded by colon lumen collapse. PMID:25461335

  12. Predictive lithological mapping of Canada's North using Random Forest classification applied to geophysical and geochemical data

    NASA Astrophysics Data System (ADS)

    Harris, J. R.; Grunsky, E. C.

    2015-07-01

    A recent method for mapping lithology involving the Random Forest (RF) machine-learning classification algorithm is evaluated. Random Forest, a supervised classifier, requires training data representative of each lithology to produce a predictive or classified map. We use two training strategies: one based on the locations of lake sediment geochemical samples, where the rock type is recorded from a legacy geology map at each sample station, and a second based on lithology recorded at field stations during reconnaissance field mapping. We apply the classification to interpolated major and minor lake sediment geochemical data as well as airborne total-field magnetic and gamma-ray spectrometer data. Using this method we produce predictions of the lithology of a large section of the Hearne Archean-Paleoproterozoic tectonic domain in northern Canada. The results indicate that meaningful predictive lithologic maps can be produced using RF classification with both training strategies. The best results were achieved when all data were used; however, the geochemical and gamma-ray data were the strongest predictors of the various lithologies. The maps generated from this research can be used to complement field mapping activities by focusing field work on areas where the predicted geology and legacy geology do not match, and as first-order geological maps in poorly mapped areas.

  13. Polynomial iterative algorithms for coloring and analyzing random graphs.

    PubMed

    Braunstein, A; Mulet, R; Pagnani, A; Weigt, M; Zecchina, R

    2003-09-01

    We study the graph coloring problem over random graphs of finite average connectivity c. Given a number q of available colors, we find that graphs with low connectivity almost always admit a proper coloring, whereas graphs with high connectivity are uncolorable. Depending on q, we find with a one-step replica-symmetry-breaking approximation the precise value of the critical average connectivity c(q). Moreover, we show that below c(q) there exists a clustering phase c in [c(d),c(q)] in which ground states spontaneously divide into an exponential number of clusters. Furthermore, we extended our considerations to the case of single instances, showing consistent results. This leads us to propose a different algorithm that is able to color random graphs in polynomial time in the hard but colorable region, i.e., when c in [c(d),c(q)]. PMID:14524921

  14. Random search optimization based on genetic algorithm and discriminant function

    NASA Technical Reports Server (NTRS)

    Kiciman, M. O.; Akgul, M.; Erarslanoglu, G.

    1990-01-01

    The general problem of optimization with arbitrary merit and constraint functions, which could be convex, concave, monotonic, or non-monotonic, is treated using stochastic methods. To improve the efficiency of the random search methods, a genetic algorithm for the search phase and a discriminant function for the constraint-control phase were utilized. The validity of the technique is demonstrated by comparing the results to published test problem results. Numerical experimentation indicated that for cases where a quick near optimum solution is desired, a general, user-friendly optimization code can be developed without serious penalties in both total computer time and accuracy.

  15. Hyperspectral remote sensing algorithms for retrieving forest chlorophyll content

    NASA Astrophysics Data System (ADS)

    Zhang, Yongqin

    Quantitative estimates of forest chlorophyll content from hyperspectral remote sensing are of great use for terrestrial carbon cycle modeling and sustainable forest management. Open forest canopies present a major challenge for separating the effects of canopy structure from those of leaf optical properties, and thus for retrieving biochemical parameters. Process-based algorithms were developed to estimate the chlorophyll content of broadleaves and needleleaves from hyperspectral measurements. Field experiments were conducted from 2003 to 2004 near Sudbury and Haliburton, Ontario, to collect canopy structural, leaf biophysical, and biochemical data. The experiments show that the optical properties and biochemical contents of broadleaves change with the growing season and canopy height. Needleleaves from different sites, age classes, and branch orientations demonstrate different visible optical properties in relation to their chlorophyll contents. A process-based radiative transfer model, PROSPECT, was modified to retrieve leaf chlorophyll content from measured leaf spectra. For broadleaves, leaf thickness was introduced to account for the seasonal and canopy-gradient variation in light absorption. The accuracy of chlorophyll retrieval increased from 67% to 91%. For needleleaves, the effects of needleleaf width and thickness, and the geometrical effects of leaf-holding devices on spectral measurements, were taken into account. These modifications improve the accuracy of chlorophyll retrieval from 31% to 59%. Correct exposure for digital hemispherical photographs is crucial for estimating canopy structural parameters. A photographic exposure theory was tested for different forest types with various canopy closures and under different sky conditions. The exposure method improves the estimates of leaf area index by 40% in comparison with the commonly used automatic exposure. The effects of canopy structure on optical remote sensing signals were investigated using the geometrical

  16. Random Forest Classification of Depression Status Based On Subcortical Brain Morphometry Following Electroconvulsive Therapy

    PubMed Central

    Wade, Benjamin S.C.; Joshi, Shantanu H.; Pirnia, Tara; Leaver, Amber M.; Woods, Roger P.; Thompson, Paul M.; Espinoza, Randall; Narr, Katherine L.

    2015-01-01

    Disorders of the central nervous system are often accompanied by brain abnormalities detectable with MRI. Advances in biomedical imaging and pattern detection algorithms have led to classification methods that may help diagnose and track the progression of a brain disorder and/or predict successful response to treatment. These classification systems often use high-dimensional signals or images and must handle the computational challenges of high dimensionality as well as complex data types such as shape descriptors. Here, we used shape information from subcortical structures to test a recently developed feature-selection method based on regularized random forests to 1) classify depressed subjects versus controls, and 2) classify patients before and after treatment with electroconvulsive therapy. We subsequently compared the classification performance of high-dimensional shape features with traditional volumetric measures. Shape-based models outperformed simple volumetric predictors in several cases, highlighting their utility as potential automated alternatives for establishing diagnosis and predicting treatment response. PMID:26413200

  17. Prediction of Protein-Protein Interactions with Physicochemical Descriptors and Wavelet Transform via Random Forests.

    PubMed

    Jia, Jianhua; Xiao, Xuan; Liu, Bingxiang

    2016-06-01

    Protein-protein interactions (PPIs) provide valuable insight into the inner workings of cells, and it is important to study the network of PPIs. It is therefore valuable to develop an automated, high-throughput method for timely PPI prediction. Based on physicochemical descriptors, a protein was converted into several digital signals, which were then analyzed with the wavelet transform. With this formulation framework for representing protein sequence samples, the random forests algorithm was adopted to conduct prediction. The results on a large-scale independent test data set show that the proposed model can achieve good performance, with an accuracy of about 0.86 and a geometric mean of about 0.85. Therefore, it can serve as a useful supplementary tool for PPI prediction. The predictor used in this article is freely available at http://www.jci-bioinfo.cn/PPI_RF. PMID:25882187

  19. Continental-scale ICESat canopy height modelling sensitivity and random forest simulations in Australia and Canada

    NASA Astrophysics Data System (ADS)

    Hopkinson, C.; Mahoney, C.; Held, A. A.; Hall, R.

    2014-12-01

    The Geoscience Laser Altimeter System (GLAS), previously onboard the Ice, Cloud, and land Elevation Satellite (ICESat), uniquely offers near-global waveform LiDAR coverage; however, data quality is subject to system, temporal, and spatial issues. These subtleties are investigated here with respect to canopy height comparisons with 3 airborne LiDAR sites in Australia. Optimal GLAS results were obtained from high energy laser transmissions from laser 3 during leaf-on conditions; GLAS data best corresponded with 95th percentile heights from an all-return airborne LiDAR point cloud. In addition, the best GLAS results were obtained over relatively open canopies, where prominent ground returns can be retrieved. Optimized GLAS data within Australian forests were employed as canopy height observations and related to 6 predictor variables (landcover, cover fraction, elevation, slope, soils, and species) by random forest (RF) models. Fifty-seven RF models were trained, varying by binomial combinations of predictor data, from 2 to 6 inputs. Trained models were separately utilized to predict Australia-wide canopy heights; RF canopy height outputs were validated against spatially concurrent airborne LiDAR 95th percentile canopy heights from an all-return point cloud for 10 sites encompassing multiple ecosystems. The best RF output was obtained from the predictor inputs landcover, cover fraction, elevation, soils, and species, yielding an RMSE of 7.98 m and an R2 of 0.97. Results indicate inherent issues (noted in existing literature) in GLAS observations that propagate through RF algorithms, manifested as canopy height underestimations for taller vegetation (>45 m). To extend this research to the Canadian boreal forest context, research is also targeting canopy height model development in the Northwest Territories, allowing investigations of time-variant phenology and landcover sensitivity due to wetland extent and growth, snow cover and other land cover changes common within boreal

  20. Random Forests for Global and Regional Crop Yield Predictions

    PubMed Central

    Jeong, Jig Han; Resop, Jonathan P.; Mueller, Nathaniel D.; Fleisher, David H.; Yun, Kyungdahm; Butler, Ethan E.; Timlin, Dennis J.; Shim, Kyo-Moon; Gerber, James S.; Reddy, Vangimalla R.

    2016-01-01

    Accurate predictions of crop yield are critical for developing effective agricultural and food policies at the regional and global scales. We evaluated a machine-learning method, Random Forests (RF), for its ability to predict crop yield responses to climate and biophysical variables at global and regional scales in wheat, maize, and potato in comparison with multiple linear regressions (MLR) serving as a benchmark. We used crop yield data from various sources and regions for model training and testing: 1) gridded global wheat grain yield, 2) maize grain yield from US counties over thirty years, and 3) potato tuber and maize silage yield from the northeastern seaboard region. RF was found highly capable of predicting crop yields and outperformed MLR benchmarks in all performance statistics that were compared. For example, the root mean square errors (RMSE) ranged between 6 and 14% of the average observed yield with RF models in all test cases whereas these values ranged from 14% to 49% for MLR models. Our results show that RF is an effective and versatile machine-learning method for crop yield predictions at regional and global scales for its high accuracy and precision, ease of use, and utility in data analysis. RF may result in a loss of accuracy when predicting the extreme ends or responses beyond the boundaries of the training data. PMID:27257967

  1. Global Marine Gas Hydrate Occurrence Using Random Decision Forest Prediction

    NASA Astrophysics Data System (ADS)

    Wood, W. T.; Becker, J. J.; Martin, K. M.; Jung, W. Y.

    2014-12-01

    We have applied machine learning, specifically the technique of random decision forests (RDF), to predict densely spaced values of sparsely sampled seafloor sediment attributes relevant to gas hydrate occurrence. The results of global gas hydrate stability models using these new grids are similar to previously published predictions (the newly derived heat flow alone changes pore space volume in the global gas hydrate stability zone by ~3%), but our model inputs are statistically rigorous estimates (including uncertainties) of sub-seafloor sediment properties. Specifically we use as input recently updated, sparsely sampled, yet globally extensive datasets of seafloor temperature, salinity, porosity, organic carbon content, and fluid flux. The RDF estimate is based on empirical statistical relationships between the relevant parameters and other parameters for which we have more densely sampled estimates (e.g. water depth, seafloor temperature, mixed layer depth, sediment thickness, sediment grain type and crustal age). We create additional attributes by applying statistical analyses and physical models to existing densely sampled attributes. These statistics include mean, median, variance, and other parameters, over a variety of ranges from 5 to 500km. The physical models include established models of compaction, heat conduction, and diagenesis, as well as recently derived estimates of fluid flux at convergent margins. Over 600 densely sampled attributes are used in each prediction, and for each predicted grid, we calculate the relative importance of each input attribute. The RDF technique and resulting sediment model also show promise for global models outside the discipline of gas hydrates.
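
    The importance-ranking step reads directly off a fitted forest; below is a hedged scikit-learn sketch with made-up attribute names and a toy target standing in for the several hundred real attributes described above.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        names = ["water_depth", "seafloor_temp", "mixed_layer_depth",
                 "sediment_thickness", "crustal_age", "grain_type"]
        X = rng.normal(size=(8000, len(names)))                    # densely sampled predictors
        y = 2.0 * X[:, 0] - X[:, 4] + 0.3 * rng.normal(size=8000)  # toy sparsely sampled target

        rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0).fit(X, y)
        # Rank the input attributes by their relative importance in the prediction.
        for name, imp in sorted(zip(names, rf.feature_importances_), key=lambda p: -p[1]):
            print(f"{name:20s} {imp:.3f}")
        print("OOB R^2:", round(rf.oob_score_, 3))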

  2. Credit Risk Evaluation of Power Market Players with Random Forest

    NASA Astrophysics Data System (ADS)

    Umezawa, Yasushi; Mori, Hiroyuki

    A new method is proposed for credit risk evaluation in a power market. Credit risk evaluation measures the bankruptcy risk of a company. Power system liberalization has created a new environment that puts emphasis on profit maximization and risk minimization. There is a high probability that electricity transactions create risk between companies, so power market players are concerned with risk minimization. As a management strategy, a risk index is needed to evaluate the creditworthiness of business partners. This paper proposes a new method for evaluating credit risk with Random Forest (RF), which performs ensemble learning over decision trees. RF is an efficient data mining technique for clustering data and extracting relationships between input and output data. In addition, a method of generating pseudo-measurements is proposed to improve the performance of RF. The proposed method is successfully applied to real financial data of energy utilities in the power market. A comparison is made between the proposed and conventional methods.

  3. Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO.

    PubMed

    Zhu, Xiang-Wei; Xin, Yan-Jun; Ge, Hui-Lin

    2015-04-27

    Variable selection is of crucial significance in QSAR modeling since it increases the model's predictive ability and reduces noise. The selection of the right variables is far more complicated than the development of predictive models. In this study, eight continuous and categorical data sets were employed to explore the applicability of two distinct variable selection methods: random forests (RF) and the least absolute shrinkage and selection operator (LASSO). Variable selection was performed (1) by using recursive random forests to rule out a quarter of the least important descriptors at each iteration and (2) by using LASSO modeling with 10-fold inner cross-validation to tune its penalty λ for each data set. Along with regular statistical parameters of model performance, we propose the highest pairwise correlation rate, the average pairwise Pearson's correlation coefficient, and the Tanimoto coefficient to evaluate the variables selected by RF and LASSO in an extensive way. Results showed that variable selection can allow a tremendous reduction of noisy descriptors (at most 96% with the RF method in this study) and appreciably enhance the model's predictive performance. Furthermore, random forests showed the property of gathering important predictors without restricting their pairwise correlations, which is contrary to LASSO. The mutual exclusion of highly correlated variables in LASSO modeling tends to skip important variables that are highly related to response endpoints and thus undermines the model's predictive performance. The optimal variables selected by RF share low similarity with those selected by LASSO (e.g., the Tanimoto coefficients were smaller than 0.20 in seven out of eight data sets). We found that the differences between the RF and LASSO predictive performances mainly resulted from the variables selected by the different strategies rather than from the learning algorithms. Our study showed that the right selection of variables is more important than the choice of learning algorithm for modeling. We hope
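
    The iteration rule quoted above (discard the least important quarter of the surviving descriptors each round) reduces to a short loop; this sketch uses synthetic data and a simple size-based stopping rule, whereas a real study would stop on cross-validated performance.

        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import cross_val_score

        X, y = make_regression(n_samples=500, n_features=200, n_informative=15, random_state=0)
        keep = np.arange(X.shape[1])                 # indices of surviving descriptors

        while len(keep) > 10:
            rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:, keep], y)
            q2 = cross_val_score(rf, X[:, keep], y, cv=5, scoring="r2").mean()
            print(f"{len(keep):3d} descriptors, CV R^2 = {q2:.3f}")
            # Rule out the least important quarter of the remaining descriptors.
            order = np.argsort(rf.feature_importances_)
            keep = keep[order[len(keep) // 4:]]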

  4. An Individual Tree Detection Algorithm for Dense Deciduous Forests with Spreading Branches

    NASA Astrophysics Data System (ADS)

    Shao, G.

    2015-12-01

    Individual tree information derived from LiDAR may have the potential to assist forest inventory and improve the assessment of forest structure and composition for sustainable forest management. Algorithms developed for individual tree detection commonly focus on finding tree tops to locate tree positions. However, the spreading branches (cylindrical crowns) of deciduous trees make such algorithms less effective on dense canopy. This research applies a machine learning algorithm, mean shift, to position individual trees based on the density of the LiDAR point cloud instead of detecting tree tops. The study site is a dense oak forest in Indiana, US. The selection of mean shift kernels is discussed. Constant and dynamic bandwidths for the mean shift algorithm are applied and compared.
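
    A minimal mean-shift sketch on a toy point cloud shows the idea of positioning stems from point density rather than from local height maxima; the fixed bandwidth here plays the role of an expected crown radius, and a dynamic-bandwidth variant would vary it per point.

        import numpy as np
        from sklearn.cluster import MeanShift

        rng = np.random.default_rng(0)
        # Toy LiDAR returns (x, y) scattered around three crown centres.
        centres = np.array([[5.0, 5.0], [12.0, 7.0], [9.0, 14.0]])
        xy = np.vstack([c + rng.normal(scale=1.5, size=(300, 2)) for c in centres])

        ms = MeanShift(bandwidth=3.0).fit(xy)       # bandwidth ~ assumed crown radius
        print("detected stems:", len(ms.cluster_centers_))
        print(np.round(ms.cluster_centers_, 1))     # estimated tree positions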

  5. Prediction of O-glycosylation Sites Using Random Forest and GA-Tuned PSO Technique

    PubMed Central

    Hassan, Hebatallah; Badr, Amr; Abdelhalim, MB

    2015-01-01

    O-glycosylation is one of the main types of mammalian protein glycosylation; it occurs at particular sites of serine (S) or threonine (T). Several O-glycosylation site predictors have been developed; however, a need for even better prediction tools remains. One challenge in training the classifiers is that the available datasets are highly imbalanced, which makes the classification accuracy for the minority class unsatisfactory. In our previous work, we proposed a new classification approach based on particle swarm optimization (PSO) and random forest (RF); this approach considered the imbalanced dataset problem. The PSO parameter settings in the training process impact the classification accuracy. Thus, in this paper, we optimize the PSO parameters with a genetic algorithm in order to increase the classification accuracy. Our proposed genetic-algorithm-based approach has shown better performance in terms of area under the receiver operating characteristic curve against existing predictors. In addition, we implemented a glycosylation predictor tool based on this approach, and we demonstrated that the tool can successfully identify candidate glycosylation sites in a case-study protein. PMID:26244014

  7. Randomized tree construction algorithm to explore energy landscapes.

    PubMed

    Jaillet, Léonard; Corcho, Francesc J; Pérez, Juan-Jesús; Cortés, Juan

    2011-12-01

    In this work, a new method for exploring conformational energy landscapes is described. The method, called transition-rapidly exploring random tree (T-RRT), combines ideas from statistical physics and robot path planning algorithms. A search tree is constructed on the conformational space starting from a given state. The tree expansion is driven by a double strategy: on the one hand, it is naturally biased toward yet unexplored regions of the space; on the other, a Monte Carlo-like transition test guides the expansion toward energetically favorable regions. The balance between these two strategies is automatically achieved due to a self-tuning mechanism. The method is able to efficiently find both energy minima and transition paths between them. As a proof of concept, the method is applied to two academic benchmarks and the alanine dipeptide. PMID:21919017
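
    The core loop is compact enough to sketch: grow the tree toward uniform random samples, but gate each extension with a Metropolis-like test on the energy change. The 2-D energy surface and the fixed temperature below are toy assumptions (the actual method self-tunes the temperature).

        import numpy as np

        rng = np.random.default_rng(0)

        def energy(q):
            # Toy 2-D energy landscape standing in for a conformational energy.
            return np.sin(3 * q[0]) + np.cos(3 * q[1]) + 0.1 * float(q @ q)

        def t_rrt(q0, n_iter=2000, step=0.05, T=0.1):
            nodes = [np.asarray(q0, dtype=float)]
            for _ in range(n_iter):
                q_rand = rng.uniform(-2.0, 2.0, size=2)   # bias toward unexplored space
                near = min(nodes, key=lambda q: np.linalg.norm(q - q_rand))
                direction = q_rand - near
                q_new = near + step * direction / np.linalg.norm(direction)
                dE = energy(q_new) - energy(near)
                # Transition test: accept downhill moves; uphill with prob. exp(-dE/T).
                if dE <= 0 or rng.random() < np.exp(-dE / T):
                    nodes.append(q_new)
            return nodes

        tree = t_rrt([0.0, 0.0])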

  8. Random matrix approach to quantum adiabatic evolution algorithms

    SciTech Connect

    Boulatov, A.; Smelyanskiy, V.N.

    2005-05-15

    We analyze the power of the quantum adiabatic evolution algorithm (QAA) for solving random computationally hard optimization problems within a theoretical framework based on random matrix theory (RMT). We present two types of driven RMT models. In the first model, the driving Hamiltonian is represented by Brownian motion in the matrix space. We use the Brownian motion model to obtain a description of multiple avoided crossing phenomena. We show that nonadiabatic corrections in the QAA are due to the interaction of the ground state with the 'cloud' formed by most of the excited states, confirming that in driven RMT models, the Landau-Zener scenario of pairwise level repulsions is not relevant for the description of nonadiabatic corrections. We show that the QAA has a finite probability of success in a certain range of parameters, implying a polynomial complexity of the algorithm. The second model corresponds to the standard QAA with the problem Hamiltonian taken from the RMT Gaussian unitary ensemble (GUE). We show that the level dynamics in this model can be mapped onto the dynamics in the Brownian motion model. For this reason, the driven GUE model can also lead to polynomial complexity of the QAA. The main contribution to the failure probability of the QAA comes from the nonadiabatic corrections to the eigenstates, which only depend on the absolute values of the transition amplitudes. Due to the mapping between the two models, these absolute values are the same in both cases. Our results indicate that this 'phase irrelevance' is the leading effect that can make both the Markovian- and GUE-type QAAs successful.

  9. Global Crustal Heat Flow Using Random Decision Forest Prediction

    NASA Astrophysics Data System (ADS)

    Becker, J. J.; Wood, W. T.; Martin, K. M.

    2014-12-01

    We have applied supervised learning with random decision forests (RDF) to estimate, or predict, a global, densely spaced grid of crustal heat flow. The results are similar to global heat flow predictions that have been previously published but are more accurate and offer higher resolution. The training inputs are measurement values and uncertainties from existing sparsely sampled (~8,000 locations), geographically biased, yet globally extensive datasets of crustal heat flow. The RDF estimate is a highly non-linear empirical relationship between crustal heat flow and dozens of other parameters (attributes) for which we have densely sampled global estimates (e.g., crustal age, water depth, crustal thickness, seismic sound speed, seafloor temperature, sediment thickness, and sediment grain type). Synthetic attributes were key to obtaining good results with the RDF method. We created synthetic attributes by applying physical intuition and statistical analyses to the fundamental attributes. The statistics include median, kurtosis, and dozens of other functions, all calculated at every node and averaged over a variety of ranges from 5 to 500 km. Other synthetic attributes are simply plausible (e.g., distance from volcanoes, seafloor porosity, mean grain size). More than 600 densely sampled attributes are used in our prediction, and for each we estimated its relative importance. The important attributes included all those expected from geophysics (e.g., inverse square root of age, gradient of depth, crustal thickness, crustal density, sediment thickness, distance from trenches), and some unexpected but plausible attributes (e.g., seafloor temperature), but none that were unphysical. The simplicity of the RDF technique may also be of great interest beyond the discipline of crustal heat flow as it allows for more geologically intelligent predictions, decreasing the effect of sampling bias, and improving predictions in regions with little or no data, while rigorously

  10. Hydrologic Landscape Regionalisation Using Deductive Classification and Random Forests

    PubMed Central

    Brown, Stuart C.; Lester, Rebecca E.; Versace, Vincent L.; Fawcett, Jonathon; Laurenson, Laurie

    2014-01-01

    Landscape classification and hydrological regionalisation studies are being increasingly used in ecohydrology to aid in the management and research of aquatic resources. We present a methodology for classifying hydrologic landscapes based on spatial environmental variables by employing non-parametric statistics and hybrid image classification. Our approach differed from previous classifications, which have required the use of an a priori spatial unit (e.g. a catchment) that necessarily results in the loss of variability known to exist within those units. The use of a simple statistical approach to identify an appropriate number of classes eliminated the need for large amounts of post-hoc testing with different numbers of groups, or the selection and justification of an arbitrary number. Using statistical clustering, we identified 23 distinct groups within our training dataset. The use of a hybrid classification employing random forests extended this statistical clustering to an area of approximately 228,000 km2 of south-eastern Australia without the need to rely on catchments, landscape units or stream sections. This extension resulted in a highly accurate regionalisation at both 30-m and 2.5-km resolution, and a less-accurate 10-km classification that would be more appropriate for use at a continental scale. A smaller case study, of an area covering 27,000 km2, demonstrated that the method preserved the intra- and inter-catchment variability that is known to exist in local hydrology, based on previous research. Preliminary analysis linking the regionalisation to streamflow indices is promising, suggesting that the method could be used to predict streamflow behaviour in ungauged catchments. Our work therefore simplifies current classification frameworks that are becoming more popular in ecohydrology, while better retaining small-scale variability in hydrology, thus enabling future attempts to explain and visualise broad-scale hydrologic trends at the scale of

  11. Stochastic simulations of sediment connectivity using random forests

    NASA Astrophysics Data System (ADS)

    Masselink, Rens; Keesstra, Saskia; Temme, Arnaud

    2016-04-01

    Modelling sediment connectivity, i.e. determining sediment sources, sinks and pathways, has often been done by applying spatially explicit models to catchments and calibrating those models with data obtained at a catchment outlet. This means that the modelled locations of sediment sources, sinks and pathways are directly derived from the model's input data (especially the digital elevation model) and the calibration parameters. On the other hand, measured sediment transport data, e.g. from erosion plots or sediment tracers (e.g. Be-7, Cs-137, rare earth oxides), are often only available for small plots or hillslopes. Extrapolating these measured values often leads to an overestimation of erosion at the catchment scale. There is a need to link small-scale erosion/deposition measurements with catchment-scale sediment yield. In this study we propose using random forests (RF) for multivariable regression to determine the extent to which certain variables influence sediment transport. The independent variables for the RF are derivatives of a high-resolution digital elevation model and vegetation parameters, and the dependent variables are sediment erosion and deposition data. For the erosion and deposition data we use sediment tracers (rare earth oxides) applied on a single hillslope in the winter of 2014/2015. Subsequently, we will run stochastic simulations (e.g. sequential Gaussian simulation) for the entire catchment using the RF output and its residuals. These simulations will then be compared to the total suspended sediment output at the catchment outlet. In this way, we hope to get a better view of both sediment yield at the catchment scale and the locations of sediment sources, sinks and pathways.

  12. Inner Random Restart Genetic Algorithm for Practical Delivery Schedule Optimization

    NASA Astrophysics Data System (ADS)

    Sakurai, Yoshitaka; Takada, Kouhei; Onoyama, Takashi; Tsukamoto, Natsuki; Tsuruta, Setsuo

    A delivery route optimization that improves the efficiency of real-time delivery or a distribution network requires solving Traveling Salesman Problems (TSPs) of several tens to hundreds (but fewer than two thousand) cities within interactive response time (less than about 3 seconds) and with expert-level accuracy (less than about 3% error). To make things more difficult, the optimization is subject to the special requirements or preferences of the various delivery sites, persons, or societies involved. To meet these requirements, an Inner Random Restart Genetic Algorithm (Irr-GA) is proposed and developed. This method combines meta-heuristics, such as random restart, with a GA whose gene operations apply different types of simple heuristics, namely the 2-opt and NI (Nearest Insertion) methods. The proposed method is hierarchically structured, integrating meta-heuristics and heuristics that are multiple but simple. The method is elaborated so that field engineers as well as field experts can easily understand it, making the solution easy to customize and extend according to customers' needs or tastes. Comparison based on experimental results and analysis showed that the method meets the above requirements better than other methods, judging not only by optimality but also by the simplicity, flexibility, and expandability required for practical use.
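
    The full Irr-GA layers a genetic algorithm over these heuristics; the sketch below shows only the random-restart and 2-opt building blocks it draws on, for a symmetric distance matrix. It is an illustration of those components, not the published method.

```python
# Random restart + 2-opt local search, the simple heuristics Irr-GA builds on.
# Assumes a symmetric distance matrix `dist` (list of lists or 2D array).
import math
import random

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_opt(tour, dist):
    """Repeatedly reverse segments while doing so shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                a, b = tour[i - 1], tour[i]
                c, d = tour[j], tour[(j + 1) % len(tour)]
                if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d]:
                    tour[i:j + 1] = reversed(tour[i:j + 1])
                    improved = True
    return tour

def random_restart_two_opt(dist, restarts=20, seed=0):
    rng = random.Random(seed)
    best, best_len = None, math.inf
    for _ in range(restarts):
        tour = list(range(len(dist)))
        rng.shuffle(tour)                 # a fresh random starting tour
        tour = two_opt(tour, dist)
        length = tour_length(tour, dist)
        if length < best_len:
            best, best_len = tour, length
    return best, best_len
```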

  13. Applying a weighted random forests method to extract karst sinkholes from LiDAR data

    NASA Astrophysics Data System (ADS)

    Zhu, Junfeng; Pierskalla, William P.

    2016-02-01

    Detailed mapping of sinkholes provides critical information for mitigating sinkhole hazards and for understanding groundwater and surface water interactions in karst terrains. LiDAR (Light Detection and Ranging) measures the earth's surface at high resolution and high density and has shown great potential to drastically improve the locating and delineating of sinkholes. However, processing LiDAR data to extract sinkholes requires separating sinkholes from other depressions, which can be laborious because of the sheer number of depressions commonly generated from LiDAR data. In this study, we applied random forests, a machine learning method, to automatically separate sinkholes from other depressions in a karst region in central Kentucky. The sinkhole-extraction random forest was grown on a training dataset built from an area where LiDAR-derived depressions were manually classified through a visual inspection and field verification process. Based on the geometry of depressions, as well as natural and human factors related to sinkholes, 11 parameters were selected as predictive variables to form the dataset. Because the training dataset was imbalanced, with the majority of depressions being non-sinkholes, a weighted random forests method was used to improve the accuracy of predicting sinkholes. The weighted random forest achieved an average accuracy of 89.95% for the training dataset, demonstrating that the random forest can be an effective sinkhole classifier. Testing of the random forest in another area, however, resulted in moderate success, with an average accuracy rate of 73.96%. This study suggests that an automatic sinkhole extraction procedure like the random forest classifier can significantly reduce time and labor costs and make it more tractable to map sinkholes from LiDAR data for large areas. However, the random forests method cannot totally replace manual procedures, such as visual inspection and field verification.
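
    As a rough analogue of the weighted random forest used above, the sketch below uses scikit-learn's class weighting to up-weight the rare sinkhole class. The file and depression features are hypothetical stand-ins for the paper's 11 predictive variables.

```python
# Class-weighted random forest for imbalanced depression data (sketch).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("depressions.csv")       # hypothetical: one row per depression
features = ["depth", "area", "circularity", "dist_to_road", "slope"]
X, y = df[features], df["is_sinkhole"]    # 1 = sinkhole, 0 = other depression

# 'balanced' weights classes inversely to their frequency, so the scarce
# sinkhole class is not swamped by the majority of non-sinkhole depressions.
clf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                             random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy").mean())
```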

  14. TPZ: photometric redshift PDFs and ancillary information by using prediction trees and random forests

    NASA Astrophysics Data System (ADS)

    Carrasco Kind, Matias; Brunner, Robert J.

    2013-06-01

    With the growth of large photometric surveys, accurately estimating photometric redshifts, preferably as a probability density function (PDF), and fully understanding the implicit systematic uncertainties in this process, has become increasingly important. In this paper, we present a new, publicly available, parallel, machine learning algorithm that generates photometric redshift PDFs by using prediction trees and random forest techniques, which we have named TPZ. This new algorithm incorporates measurement errors into the calculation while also dealing efficiently with missing values in the data. In addition, our implementation of this algorithm provides supplementary information regarding the data being analysed, including unbiased estimates of the accuracy of the technique without resorting to a validation data set, identification of poor photometric redshift areas within the parameter space occupied by the spectroscopic training data, a quantification of the relative importance of the variables used to construct the PDF, and a robust identification of outliers. This extra information can be used to optimally target new spectroscopic observations and to improve the overall efficacy of the redshift estimation. We have tested TPZ on galaxy samples drawn from the Sloan Digital Sky Survey (SDSS) main galaxy sample and from the Deep Extragalactic Evolutionary Probe-2 (DEEP2) survey, obtaining excellent results in each case. We have also tested our implementation by participating in the PHAT1 project, a blind photometric redshift contest, finding that TPZ performs comparably to, if not better than, other empirical photometric redshift algorithms. Finally, we discuss the various parameters that control the operation of TPZ, the specific limitations of this approach and an application of photometric redshift PDFs.
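
    TPZ's PDF construction also propagates measurement errors and handles missing values; the generic idea of reading a redshift PDF off the spread of per-tree predictions can nevertheless be illustrated on mock data, as below.

```python
# Generic illustration (not TPZ itself): derive a crude photo-z PDF from the
# distribution of individual tree predictions in a random forest regressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(18, 25, size=(5000, 5))            # mock magnitudes
z_train = 0.1 * X_train.mean(axis=1) - 1.8 + rng.normal(0, 0.05, 5000)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, z_train)

galaxy = X_train[:1]                                     # one test object
per_tree = np.array([tree.predict(galaxy)[0] for tree in rf.estimators_])
pdf, bin_edges = np.histogram(per_tree, bins=30, density=True)
print(f"z_mean={per_tree.mean():.3f}, z_std={per_tree.std():.3f}")
```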

  15. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes

    PubMed Central

    2013-01-01

    Motivation Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forest (RF) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarity measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. Results We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated with this trait, and produces pair-wise measures of genetic proximity

  16. Improving the chances of successful protein structure determination with a random forest classifier

    SciTech Connect

    Jahandideh, Samad; Jaroszewski, Lukasz; Godzik, Adam

    2014-03-01

    Using an extended set of protein features calculated separately for protein surface and interior, a new version of XtalPred based on a random forest classifier achieves a significant improvement in predicting the success of structure determination from the primary amino-acid sequence. Obtaining diffraction quality crystals remains one of the major bottlenecks in structural biology. The ability to predict the chances of crystallization from the amino-acid sequence of the protein can, at least partly, address this problem by allowing a crystallographer to select homologs that are more likely to succeed and/or to modify the sequence of the target to avoid features that are detrimental to successful crystallization. In 2007, the now widely used XtalPred algorithm [Slabinski et al. (2007), Protein Sci. 16, 2472–2482] was developed. XtalPred classifies proteins into five ‘crystallization classes’ based on a simple statistical analysis of the physicochemical features of a protein. Here, towards the same goal, advanced machine-learning methods are applied and, in addition, the predictive potential of additional protein features such as predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues and amino-acid composition of the predicted protein surface are tested. The new XtalPred-RF (random forest) achieves significant improvement of the prediction of crystallization success over the original XtalPred. To illustrate this, XtalPred-RF was tested by revisiting target selection from 271 Pfam families targeted by the Joint Center for Structural Genomics (JCSG) in PSI-2, and it was estimated that the number of targets entered into the protein-production and crystallization pipeline could have been reduced by 30% without lowering the number of families for which the first structures were solved. The prediction improvement depends on the subset of targets used as a testing set and reaches 100% (i.e. twofold) for the top class of predicted

  17. Multivariate classification with random forests for gravitational wave searches of black hole binary coalescence

    NASA Astrophysics Data System (ADS)

    Baker, Paul T.; Caudill, Sarah; Hodge, Kari A.; Talukder, Dipongkar; Capano, Collin; Cornish, Neil J.

    2015-03-01

    Searches for gravitational waves produced by coalescing black hole binaries with total masses ≳25 M⊙ use matched filtering with templates of short duration. Non-Gaussian noise bursts in gravitational wave detector data can mimic short signals and limit the sensitivity of these searches. Previous searches have relied on empirically designed statistics incorporating signal-to-noise ratio and signal-based vetoes to separate gravitational wave candidates from noise candidates. We report on sensitivity improvements achieved using a multivariate candidate ranking statistic derived from a supervised machine learning algorithm. We apply the random forest of bagged decision trees technique to two separate searches in the high mass (≳25 M⊙ ) parameter space. For a search which is sensitive to gravitational waves from the inspiral, merger, and ringdown of binary black holes with total mass between 25 M⊙ and 100 M⊙ , we find sensitive volume improvements as high as 70±13%-109±11% when compared to the previously used ranking statistic. For a ringdown-only search which is sensitive to gravitational waves from the resultant perturbed intermediate mass black hole with mass roughly between 10 M⊙ and 600 M⊙ , we find sensitive volume improvements as high as 61±4%-241±12% when compared to the previously used ranking statistic. We also report how sensitivity improvements can differ depending on mass regime, mass ratio, and available data quality information. Finally, we describe the techniques used to tune and train the random forest classifier that can be generalized to its use in other searches for gravitational waves.

  18. RFMirTarget: Predicting Human MicroRNA Target Genes with a Random Forest Classifier

    PubMed Central

    Mendoza, Mariana R.; da Fonseca, Guilherme C.; Loss-Morais, Guilherme; Alves, Ronnie; Margis, Rogerio; Bazzan, Ana L. C.

    2013-01-01

    MicroRNAs are key regulators of eukaryotic gene expression whose fundamental role has already been identified in many cell pathways. The correct identification of miRNA targets is still a major challenge in bioinformatics and has motivated the development of several computational methods to overcome inherent limitations of experimental analysis. Indeed, the best results reported so far in terms of specificity and sensitivity are associated with machine learning-based methods for microRNA-target prediction. Following this trend, in the current paper we discuss and explore a microRNA-target prediction method based on a random forest classifier, namely RFMirTarget. Despite their well-known robustness in general classification tasks, to the best of our knowledge, random forests have not been deeply explored in the specific context of predicting microRNA targets. Our framework first analyzes alignments between candidate microRNA-target pairs and extracts a set of structural, thermodynamics, alignment, seed and position-based features, upon which classification is performed. Experiments have shown that RFMirTarget outperforms several well-known classifiers with statistical significance, and that its performance is not impaired by the class imbalance problem or feature correlation. Moreover, comparing it against other algorithms for microRNA target prediction using independent test data sets from TarBase and starBase, we observe a very promising performance, with higher sensitivity than other methods. Finally, tests performed with RFMirTarget show the benefits of feature selection even for a classifier with embedded feature importance analysis, and the consistency between the relevant features identified and important biological properties for effective microRNA-target gene alignment. PMID:23922946

  19. Random forests, a novel approach for discrimination of fish populations using parasites as biological tags.

    PubMed

    Perdiguero-Alonso, Diana; Montero, Francisco E; Kostadinova, Aneta; Raga, Juan Antonio; Barrett, John

    2008-10-01

    Due to the complexity of host-parasite relationships, discrimination between fish populations using parasites as biological tags is difficult. This study introduces, to our knowledge for the first time, random forests (RF) as a new modelling technique in the application of parasite community data as biological markers for population assignment of fish. This novel approach is applied to a dataset with a complex structure comprising 763 parasite infracommunities in population samples of Atlantic cod, Gadus morhua, from the spawning/feeding areas in five regions in the North East Atlantic (Baltic, Celtic, Irish and North seas and Icelandic waters). The learning behaviour of RF is evaluated in comparison with two other algorithms applied to class assignment problems, linear discriminant function analysis (LDA) and artificial neural networks (ANN). The three algorithms are used to develop predictive models applying three cross-validation procedures in a series of experiments (252 models in total). The comparative approach to the RF, LDA and ANN algorithms applied to the same datasets demonstrates the competitive potential of RF for developing predictive models, since RF exhibited better prediction accuracy and outperformed LDA and ANN in the assignment of fish to their regions of sampling using parasite community data. The comparative analyses and the validation experiment with a 'blind' sample confirmed that RF models performed more effectively with a large and diverse training set and a large number of variables. The discrimination results obtained for a migratory fish species with largely overlapping parasite communities reflect the high potential of RF for developing predictive models using data that are both complex and noisy, and indicate that it is a promising tool for parasite tag studies. Our results suggest that parasite community data can be used successfully to discriminate individual cod from the five different regions of the North East Atlantic studied.

  20. Forest Height Retrieval Algorithm Using a Complex Visibility Function Approach

    NASA Astrophysics Data System (ADS)

    Chu, T.; Zebker, H. A.

    2011-12-01

    Vegetation structure and biomass on earth's terrestrial surface are critical parameters that influence the global carbon cycle, habitat, climate, and resources of economic value. Space-borne and air-borne remote sensing instruments are the most practical means of obtaining information such as tree height and biomass on a large scale. Synthetic aperture radar (SAR), and especially interferometric SAR (InSAR), has been utilized in recent years to quantify vegetation parameters such as height and biomass; however, methods used to quantify global vegetation have yet to produce accurate results. The goal of this study is to develop, through simulation, a signal-processing algorithm for determining vegetation heights that would lead to accurate height and biomass retrievals. A standard SAR image represents a projection of the 3D distributed backscatter onto a 2D plane. InSAR is capable of determining topography or the height of vegetation: vegetation height is determined from the mean scattering phase center of all scatterers within a resolution cell. InSAR can generate a 3D height surface, but the distribution of scatterers in height is under-determined and cannot be resolved by a single-baseline measurement. One interferogram is therefore insufficient to uniquely determine the vertical characteristics of even a simple 3D forest. An aperture synthesis technique in the height (vertical) dimension would improve the ability to distinguish scatterers at different locations in the vertical dimension. Repeat-pass observations allow us to use differential interferometry to populate the frequency domain, from which we can use the Fourier transform relation to reach the brightness, or backscatter, domain. Ryle and Hewish first introduced this technique of aperture synthesis in the 1960s for large radio telescope arrays. This technique allows us to focus the antenna beam pattern in the vertical direction and increase vertical resolving power. It enables us to

  1. Precipitation estimates from MSG SEVIRI daytime, night-time and twilight data with random forests

    NASA Astrophysics Data System (ADS)

    Kühnlein, Meike; Appelhans, Tim; Thies, Boris; Nauss, Thomas

    2014-05-01

    We introduce a new rainfall retrieval technique based on MSG SEVIRI data which aims to retrieve rainfall rates in a continuous manner (day, twilight and night) at high temporal resolution. Due to the deficiencies of existing optical rainfall retrievals, the focus of this technique is on assigning rainfall rates to precipitating cloud areas in connection with extra-tropical cyclones in the mid-latitudes, including both convective and advective-stratiform precipitating cloud areas. The technique is realized in three steps: (i) precipitating cloud areas are identified; (ii) the precipitating cloud areas are separated into convective and advective-stratiform precipitating areas; (iii) rainfall rates are assigned to the convective and advective-stratiform precipitating areas, respectively. Considering the dominant precipitation processes of convective and advective-stratiform precipitation areas within extra-tropical cyclones, satellite-based information on cloud top height, cloud top temperature, cloud phase and cloud water path is used to retrieve information about precipitation. The approach uses the ensemble classification and regression technique random forests to develop the prediction algorithms. Random forest models contain a combination of characteristics that make them well suited for application in precipitation remote sensing. One of the key advantages is the ability to capture non-linear associations between predictors and response, which becomes important when dealing with complex non-linear events like precipitation. The use of a machine learning approach differentiates the proposed technique from most state-of-the-art satellite-based rainfall retrievals, which generally use conventional parametric approaches. To train and validate the model, the radar-based RADOLAN RW product from the German Weather Service (DWD) is used, which provides area-wide gauge-adjusted hourly precipitation information. Besides the overall performance of the
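
    A hedged sketch of that three-step scheme with random forests: one classifier for the process-type separation and a separate regressor per process class for rate assignment. The match-up file and predictor names are illustrative assumptions, not the operational configuration.

```python
# Two-stage RF scheme: classify precipitation process, then assign rain rates.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

df = pd.read_csv("seviri_radolan_matchups.csv")    # hypothetical match-up table
predictors = ["ctt", "cth", "cwp", "cloud_phase"]  # cloud-top temp/height etc.

# Steps (i)-(ii): no-rain / convective / advective-stratiform per pixel
area_clf = RandomForestClassifier(n_estimators=500, random_state=0)
area_clf.fit(df[predictors], df["process_class"])

# Step (iii): a dedicated rain-rate regressor for each precipitating class
rate_models = {}
for cls in ("convective", "advective_stratiform"):
    sub = df[df["process_class"] == cls]
    rate_models[cls] = RandomForestRegressor(n_estimators=500, random_state=0)
    rate_models[cls].fit(sub[predictors], sub["radolan_rate_mm_h"])
```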

  2. Biased Randomized Algorithm for Fast Model-Based Diagnosis

    NASA Technical Reports Server (NTRS)

    Williams, Colin; Vartan, Farrokh

    2005-01-01

    A biased randomized algorithm has been developed to enable the rapid computational solution of a propositional-satisfiability (SAT) problem equivalent to a diagnosis problem. The closest competing methods of automated diagnosis are described in the preceding articles "Fast Algorithms for Model-Based Diagnosis" and "Two Methods of Efficient Solution of the Hitting-Set Problem" (NPO-30584), which appear elsewhere in this issue. It is necessary to recapitulate some of the information from the cited articles as a prerequisite to a description of the present method. As used here, "diagnosis" signifies, more precisely, a type of model-based diagnosis in which one explores any logical inconsistencies between the observed and expected behaviors of an engineering system. The function of each component and the interconnections among all the components of the engineering system are represented as a logical system. Hence, the expected behavior of the engineering system is represented as a set of logical consequences. Faulty components lead to inconsistency between the observed and expected behaviors of the system, represented by logical inconsistencies. Diagnosis - the task of finding the faulty components - reduces to finding the components, the abnormalities of which could explain all the logical inconsistencies. One seeks a minimal set of faulty components (denoted a minimal diagnosis), because the trivial solution, in which all components are deemed to be faulty, always explains all inconsistencies. In the methods of the cited articles, the minimal-diagnosis problem is treated as equivalent to a minimal-hitting-set problem, which is translated from a combinatorial to a computational problem by mapping it onto the Boolean-satisfiability and integer-programming problems. The integer-programming approach taken in one of the prior methods is complete (in the sense that it is guaranteed to find a solution if one exists) and slow and yields a lower bound on the size of the
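
    As a minimal illustration of the hitting-set framing recapitulated above (not the article's biased SAT algorithm), a biased randomized greedy pass over conflict sets approximates a minimal diagnosis:

```python
# Each conflict is a set of components whose joint health contradicts the
# observations; a diagnosis must "hit" every conflict. A randomized greedy
# pass, biased toward components appearing in many open conflicts,
# approximates a minimal hitting set.
import random

def random_greedy_hitting_set(conflicts, n_restarts=200, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        unhit = [set(c) for c in conflicts]
        chosen = set()
        while unhit:
            # Bias: weight each component by how many open conflicts it hits
            counts = {}
            for conflict in unhit:
                for comp in conflict:
                    counts[comp] = counts.get(comp, 0) + 1
            comps, weights = zip(*counts.items())
            pick = rng.choices(comps, weights=weights)[0]
            chosen.add(pick)
            unhit = [c for c in unhit if pick not in c]
        if best is None or len(chosen) < len(best):
            best = chosen
    return best

conflicts = [{"A", "B"}, {"B", "C"}, {"C", "D"}]
print(random_greedy_hitting_set(conflicts))   # e.g. {'B', 'C'}
```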

  3. Urban land cover thematic disaggregation, employing datasets from multiple sources and RandomForests modeling

    NASA Astrophysics Data System (ADS)

    Gounaridis, Dimitrios; Koukoulas, Sotirios

    2016-09-01

    Urban land cover mapping has lately attracted a vast amount of attention, as it relates closely to a broad scope of scientific and management applications. Recent methodological and technological advancements facilitate the development of datasets with improved accuracy. However, the thematic resolution of urban land cover has received much less attention so far, a fact that hampers the utility of the produced datasets. This paper seeks to provide insights towards the improvement of the thematic resolution of urban land cover classification. We integrate existing, readily available datasets of acceptable accuracy from multiple sources with remote sensing techniques. The study site is Greece, and the urban land cover is classified nationwide into five classes using the RandomForests algorithm. The results allowed us to quantify, for the first time with good accuracy, the proportion occupied by each urban land cover class. The total area covered by urban land cover is 2280 km2 (1.76% of the total terrestrial area), the dominant class is discontinuous dense urban fabric (50.71% of urban land cover) and the least occurring class is discontinuous very low density urban fabric (2.06% of urban land cover).

  4. Automatic co-segmentation of lung tumor based on random forest in PET-CT images

    NASA Astrophysics Data System (ADS)

    Jiang, Xueqing; Xiang, Dehui; Zhang, Bin; Zhu, Weifang; Shi, Fei; Chen, Xinjian

    2016-03-01

    In this paper, a fully automatic method is proposed to segment the lung tumor in clinical 3D PET-CT images. The proposed method effectively combines PET and CT information to make full use of the high contrast of PET images and the superior spatial resolution of CT images. Our approach consists of three main parts: (1) initial segmentation, in which spines are removed in CT images and initial connected regions are obtained by thresholding-based segmentation in PET images; (2) coarse segmentation, in which a monotonic downhill function is applied to rule out structures which have standardized uptake values (SUV) similar to the lung tumor but do not satisfy a monotonic property in PET images; (3) fine segmentation, in which the random forests method is applied to accurately segment the lung tumor by extracting effective features from PET and CT images simultaneously. We validated our algorithm on a dataset consisting of 24 3D PET-CT images from different patients with non-small cell lung cancer (NSCLC). The average TPVF, FPVF and accuracy rate (ACC) were 83.65%, 0.05% and 99.93%, respectively. The correlation analysis shows our segmented lung tumor volumes have a strong correlation (average 0.985) with ground truth 1 and ground truth 2 labeled by a clinical expert.

  5. Automated segmentation of thyroid gland on CT images with multi-atlas label fusion and random classification forest

    NASA Astrophysics Data System (ADS)

    Liu, Jiamin; Chang, Kevin; Kim, Lauren; Turkbey, Evrim; Lu, Le; Yao, Jianhua; Summers, Ronald

    2015-03-01

    The thyroid gland plays an important role in clinical practice, especially for radiation therapy treatment planning. For patients with head and neck cancer, radiation therapy requires a precise delineation of the thyroid gland to be spared on the pre-treatment planning CT images to avoid thyroid dysfunction. In the current clinical workflow, the thyroid gland is normally manually delineated by radiologists or radiation oncologists, which is time consuming and error prone. Therefore, a system for automated segmentation of the thyroid is desirable. However, automated segmentation of the thyroid is challenging because the thyroid is inhomogeneous and surrounded by structures that have similar intensities. In this work, the thyroid gland segmentation is initially estimated by a multi-atlas label fusion algorithm. The segmentation is refined by supervised statistical learning based voxel labeling with a random forest algorithm. Multi-atlas label fusion (MALF) transfers expert-labeled thyroids from atlases to a target image using deformable registration. Errors produced by label transfer are reduced by label fusion that combines the results produced by all atlases into a consensus solution. Then, random forest (RF) employs an ensemble of decision trees that are trained on labeled thyroids to recognize features. The trained forest classifier is then applied to the thyroid estimated from the MALF by voxel scanning to assign the class-conditional probability. Voxels from the expert-labeled thyroids in CT volumes are treated as positive classes; background non-thyroid voxels as negatives. We applied this automated thyroid segmentation system to CT scans of 20 patients. The results showed that the MALF achieved an overall 0.75 Dice Similarity Coefficient (DSC) and the RF classification further improved the DSC to 0.81.

  6. A novel quantum random number generation algorithm used by smartphone camera

    NASA Astrophysics Data System (ADS)

    Wu, Nan; Wang, Kun; Hu, Haixing; Song, Fangmin; Li, Xiangdong

    2015-05-01

    We study an efficient algorithm to extract quantum random numbers (QRN) from the raw data obtained by charge-coupled device (CCD) or complementary metal-oxide semiconductor (CMOS) based sensors, like the camera used in a commercial smartphone. Based on the NIST statistical tests for random number generators, the proposed algorithm has a high QRN generation rate and high statistical randomness. This algorithm enables a simple, low-priced and reliable device to serve as a QRN generator for quantum key distribution (QKD) or other cryptographic applications.
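
    The abstract does not spell out the extraction algorithm, so the following is only a generic illustration of one standard approach: harvest the noisy least-significant bits of raw sensor frames and debias them with a von Neumann extractor before any statistical testing.

```python
# Generic randomness extraction from camera sensor noise (illustrative only).
import numpy as np

def lsb_stream(frame):
    """Least-significant bit of every pixel, flattened to a bit array."""
    return (frame.ravel() & 1).astype(np.uint8)

def von_neumann(bits):
    """Map bit pairs 01 -> 0 and 10 -> 1, discard 00 and 11, removing bias."""
    pairs = bits[: len(bits) // 2 * 2].reshape(-1, 2)
    keep = pairs[:, 0] != pairs[:, 1]
    return pairs[keep, 0]

# Mock 10-bit raw frame standing in for real CCD/CMOS sensor output
frame = np.random.randint(0, 1024, size=(480, 640), dtype=np.uint16)
random_bits = von_neumann(lsb_stream(frame))
print(len(random_bits), "debiased bits; validate with the NIST tests before use")
```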

  7. Random Forest and Object-Based Classification for Forest Pest Extraction from UAV Aerial Imagery

    NASA Astrophysics Data System (ADS)

    Yuan, Yi; Hu, Xiangyun

    2016-06-01

    Forest pests are among the most important factors affecting forest health. Because it is difficult to delimit pest-affected areas and to predict how infestations spread, efforts to partially control and exterminate pests have not been effective so far, and the infected areas continue to spread. The introduction of spatial information technology is therefore in high demand: periodically mapping the infected areas as early as possible and predicting the spread of infection make it possible to examine spatial distribution characteristics and establish timely, proper control strategies. With UAV photography becoming increasingly popular, it is now much cheaper and faster to acquire UAV images, which are well suited to monitoring forest health and detecting pests. This paper proposes a new method to effectively detect forest pest damage in UAV aerial imagery. For each image, we first segment it into superpixels and then compute a 12-dimensional statistical texture feature for each superpixel, which is used to train and classify the data. Finally, we refine the classification results with some simple rules. Experiments show that the method is effective for extracting forest pest areas in UAV images.
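
    A sketch of that pipeline: SLIC superpixels, a per-superpixel texture vector, and a random forest classifier. The paper's exact 12 texture features are not enumerated in the abstract, so common GLCM statistics stand in for them here.

```python
# Superpixel texture features + RF classification (sketch; assumes a uint8
# grayscale aerial image and per-superpixel pest/no-pest training labels).
import numpy as np
from skimage.segmentation import slic
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier

def superpixel_features(gray, labels):
    feats = []
    for lab in np.unique(labels):
        ys, xs = np.nonzero(labels == lab)
        patch = gray[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        glcm = graycomatrix(patch, distances=[1], angles=[0, np.pi / 2],
                            levels=256, symmetric=True, normed=True)
        feats.append([graycoprops(glcm, p).mean()
                      for p in ("contrast", "homogeneity", "energy",
                                "correlation")] + [patch.mean(), patch.std()])
    return np.asarray(feats)

# gray: uint8 image; train_labels: one 0/1 annotation per superpixel
# labels = slic(gray, n_segments=2000, channel_axis=None)
# X = superpixel_features(gray, labels)
# clf = RandomForestClassifier(n_estimators=300).fit(X, train_labels)
```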

  8. The Random Forests Statistical Technique: An Examination of Its Value for the Study of Reading

    ERIC Educational Resources Information Center

    Matsuki, Kazunaga; Kuperman, Victor; Van Dyke, Julie A.

    2016-01-01

    Studies investigating individual differences in reading ability often involve data sets containing a large number of collinear predictors and a small number of observations. In this article, we discuss the method of Random Forests and demonstrate its suitability for addressing the statistical concerns raised by such data sets. The method is…

  9. Variable selection with random forest: Balancing stability, performance, and interpretation in ecological and environmental modeling

    EPA Science Inventory

    Random forest (RF) is popular in ecological and environmental modeling, in part, because of its insensitivity to correlated predictors and resistance to overfitting. Although variable selection has been proposed to improve both performance and interpretation of RF models, it is u...

  10. Evolving forest fire burn severity classification algorithms for multispectral imagery

    NASA Astrophysics Data System (ADS)

    Brumby, Steven P.; Harvey, Neal R.; Bloch, Jeffrey J.; Theiler, James P.; Perkins, Simon J.; Young, Aaron C.; Szymanski, John J.

    2001-08-01

    Between May 6 and May 18, 2000, the Cerro Grande/Los Alamos wildfire burned approximately 43,000 acres (17,500 ha) and 235 residences in the town of Los Alamos, NM. Initial estimates of forest damage included 17,000 acres (6,900 ha) of 70-100% tree mortality. Restoration efforts following the fire were complicated by the large scale of the fire, and by the presence of extensive natural and man-made hazards. These conditions forced a reliance on remote sensing techniques for mapping and classifying the burn region. During and after the fire, remote-sensing data was acquired from a variety of aircraft-based and satellite-based sensors, including Landsat 7. We now report on the application of a machine learning technique, implemented in a software package called GENIE, to the classification of forest fire burn severity using Landsat 7 ETM+ multispectral imagery. The details of this automatic classification are compared to the manually produced burn classification, which was derived from field observations and manual interpretation of high-resolution aerial color/infrared photography.

  11. Source-enhanced coalescence of trees in a random forest

    NASA Astrophysics Data System (ADS)

    Lushnikov, A. A.

    2015-08-01

    The time evolution of a random graph with varying numbers of edges and vertices is considered. The edges and vertices are assumed to be added at random, one at a time, at different rates. A fresh edge either connects two linked components, forming a new component of larger order g (coalescence of graphs), or increases (by one) the number of edges in a given linked component (cycling). Assuming the vertices to have a finite valence (the number of edges connected with a given vertex is limited), the kinetic equation for the distribution of linked components of the graph over their orders and valences is formulated and solved exactly by applying the generating function method for the case of coalescence of trees. The evolution process is shown to reveal a phase transition: the emergence of a giant linked component whose order is comparable to the total order of the graph. The time dependencies of the moments of the distribution of linked components over their orders and valences are found explicitly for the pregelation period, and the critical behavior of the spectrum is analyzed. It is found that the linked components are γ distributed over g with the algebraic prefactor g^(-5/2). The coalescence process is shown to terminate in the formation of a steady-state γ spectrum with the same algebraic prefactor.

  12. Fault diagnosis of spur gearbox based on random forest and wavelet packet decomposition

    NASA Astrophysics Data System (ADS)

    Cabrera, Diego; Sancho, Fernando; Sánchez, René-Vinicio; Zurita, Grover; Cerrada, Mariela; Li, Chuan; Vásquez, Rafael E.

    2015-09-01

    This paper addresses the development of a random forest classifier for the multi-class fault diagnosis in spur gearboxes. The vibration signal's condition parameters are first extracted by applying the wavelet packet decomposition with multiple mother wavelets, and the coefficients' energy content for terminal nodes is used as the input feature for the classification problem. Then, a study through the parameters' space to find the best values for the number of trees and the number of random features is performed. In this way, the best set of mother wavelets for the application is identified and the best features are selected through the internal ranking of the random forest classifier. The results show that the proposed method reached 98.68% in classification accuracy, and high efficiency and robustness in the models.
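
    The feature pipeline described above translates naturally into code: decompose each vibration record with a wavelet packet transform and feed the terminal-node energies to a random forest. The wavelet choice and depth below are illustrative assumptions, not the paper's tuned settings.

```python
# Wavelet packet energy features + random forest fault classifier (sketch).
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wpd_energy_features(signal, wavelet="db4", level=3):
    """Relative energy of each terminal node of the wavelet packet tree."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="natural")
    energies = np.array([np.sum(node.data ** 2) for node in nodes])
    return energies / energies.sum()

# signals: (n_samples, n_points) vibration records; y: fault-class labels
# X = np.vstack([wpd_energy_features(s) for s in signals])
# clf = RandomForestClassifier(n_estimators=500, max_features="sqrt").fit(X, y)
```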

  13. Predicting host tropism of influenza A virus proteins using random forest

    PubMed Central

    2014-01-01

    Background The majority of influenza A viruses reside and circulate among animal populations, seldom infecting humans due to host range restriction. Yet when some avian strains do acquire the ability to overcome the species barrier, they may become adapted to humans, replicating efficiently and causing disease, leading to a potential pandemic. With the huge influenza A virus reservoir in wild birds, it is a cause for concern when a new influenza strain emerges with the ability to cross the host species barrier, as shown by the recent H7N9 outbreak in China. Several influenza proteins have been shown to be major determinants of host tropism. Further understanding and determining host tropism would be important in identifying zoonotic influenza virus strains capable of crossing the species barrier and infecting humans. Results In this study, computational models for 11 influenza proteins were constructed using the machine learning algorithm random forest for prediction of host tropism. The prediction models were trained on influenza protein sequences isolated from both avian and human samples, which were transformed into amino acid physicochemical property feature vectors. The results were highly accurate prediction models (ACC > 96.57%; AUC > 0.980; MCC > 0.916) capable of determining the host tropism of individual influenza proteins. In addition, features from all 11 proteins were used to construct a combined model to predict the host tropism of influenza virus strains, which would help assess a novel influenza strain's host range capability. Conclusions All the prediction models constructed achieved high prediction performance, indicating clear distinctions between avian and human proteins. When used together as a host tropism prediction system, zoonotic strains could potentially be identified based on the different protein prediction results. Understanding and predicting the host tropism of influenza proteins lays an important foundation for future work in constructing computation

  14. Relevant feature set estimation with a knock-out strategy and random forests.

    PubMed

    Ganz, Melanie; Greve, Douglas N; Fischl, Bruce; Konukoglu, Ender

    2015-11-15

    Group analysis of neuroimaging data is a vital tool for identifying anatomical and functional variations related to diseases as well as normal biological processes. The analyses are often performed on a large number of highly correlated measurements using a relatively smaller number of samples. Despite the correlation structure, the most widely used approach is to analyze the data using univariate methods followed by post-hoc corrections that try to account for the data's multivariate nature. Although widely used, this approach may fail to recover from the adverse effects of the initial analysis when local effects are not strong. Multivariate pattern analysis (MVPA) is a powerful alternative to the univariate approach for identifying relevant variations. Jointly analyzing all the measures, MVPA techniques can detect global effects even when individual local effects are too weak to detect with univariate analysis. Current approaches are successful in identifying variations that yield highly predictive and compact models. However, they suffer from lessened sensitivity and instabilities in identification of relevant variations. Furthermore, current methods' user-defined parameters are often unintuitive and difficult to determine. In this article, we propose a novel MVPA method for group analysis of high-dimensional data that overcomes the drawbacks of the current techniques. Our approach explicitly aims to identify all relevant variations using a "knock-out" strategy and the Random Forest algorithm. In evaluations with synthetic datasets the proposed method achieved substantially higher sensitivity and accuracy than the state-of-the-art MVPA methods, and outperformed the univariate approach when the effect size is low. In experiments with real datasets the proposed method identified regions beyond the univariate approach, while other MVPA methods failed to replicate the univariate results. More importantly, in a reproducibility study with the well-known ADNI dataset
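
    A conceptual sketch of the knock-out strategy with random forests: fit, record the most important features, remove ("knock out") them, and repeat until the classifier no longer beats chance. The published method's statistical details differ; this conveys only the overall loop.

```python
# Iterative knock-out feature selection with a random forest (conceptual).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def knock_out_selection(X, y, batch=10, chance_level=0.5, n_trees=500):
    """X: (n_samples, n_features) array; chance_level assumes balanced classes."""
    remaining = list(range(X.shape[1]))
    relevant = []
    while remaining:
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        acc = cross_val_score(rf, X[:, remaining], y, cv=5).mean()
        if acc <= chance_level:          # no detectable signal left: stop
            break
        rf.fit(X[:, remaining], y)
        top = np.argsort(rf.feature_importances_)[-batch:]
        relevant += [remaining[i] for i in top]      # record, then knock out
        remaining = [f for i, f in enumerate(remaining) if i not in set(top)]
    return relevant
```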

  15. Simple randomized algorithms for online learning with kernels.

    PubMed

    He, Wenwu; Kwok, James T

    2014-12-01

    In online learning with kernels, it is vital to control the size (budget) of the support set because of the curse of kernelization. In this paper, we propose two simple and effective stochastic strategies for controlling the budget. Both algorithms have an expected regret that is sublinear in the horizon. Experimental results on a number of benchmark data sets demonstrate encouraging performance in terms of both efficacy and efficiency. PMID:25108150
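
    The paper's exact update rules are not given in the abstract; the sketch below conveys the budget problem with the simplest stochastic strategy, a kernel perceptron that evicts a uniformly random support vector whenever the budget is exceeded.

```python
# Budgeted online kernel perceptron with random eviction (illustrative only).
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def budgeted_kernel_perceptron(stream, budget=50, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    support, alphas = [], []
    for x, y in stream:                      # labels y in {-1, +1}
        f = sum(a * rbf(s, x, gamma) for s, a in zip(support, alphas))
        if y * f <= 0:                       # mistake: add a support vector
            support.append(x)
            alphas.append(y)
            if len(support) > budget:        # enforce budget: random eviction
                j = int(rng.integers(len(support)))
                support.pop(j)
                alphas.pop(j)
    return support, alphas
```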

  16. High-order local spatial context modeling by spatialized random forest.

    PubMed

    Ni, Bingbing; Yan, Shuicheng; Wang, Meng; Kassim, Ashraf A; Tian, Qi

    2013-02-01

    In this paper, we propose a novel method for spatial context modeling toward boosting visual discriminating power. We are particularly interested in how to model high-order local spatial contexts instead of the intensively studied second-order spatial contexts, i.e., co-occurrence relations. Motivated by the recent success of random forest in learning discriminative visual codebook, we present a spatialized random forest (SRF) approach, which can encode an unlimited length of high-order local spatial contexts. By spatially random neighbor selection and random histogram-bin partition during the tree construction, the SRF can explore much more complicated and informative local spatial patterns in a randomized manner. Owing to the discriminative capability test for the random partition in each tree node's split process, a set of informative high-order local spatial patterns are derived, and new images are then encoded by counting the occurrences of such discriminative local spatial patterns. Extensive comparison experiments on face recognition and object/scene classification clearly demonstrate the superiority of the proposed spatial context modeling method over other state-of-the-art approaches for this purpose. PMID:23060330

  17. Evaluation of Algorithms for Calculating Forest Micrometeorological Variables Using an Extensive Dataset of Paired Station Recordings

    NASA Astrophysics Data System (ADS)

    Garvelmann, J.; Pohl, S.; Warscher, M.; Mair, E.; Marke, T.; Strasser, U.; Kunstmann, H.

    2015-12-01

    Forests represent significant areas of subalpine environments, and their influence is crucial for snow cover dynamics on the ground. Since measurements of the major micrometeorological variables are usually lacking for forested sites, physically based or empirical parameterizations are applied to calculate beneath-canopy micrometeorological conditions for snow-hydrological modeling. Most of these parameterizations were developed from observations at selected long-term climate stations; consequently, the high spatial variability of the micrometeorological variables is usually not taken into account. The goal of this study is to evaluate existing approaches using an extensive dataset collected during five winter seasons with a stratified sampling design, using pairs of snow monitoring stations (SnoMoS) at open/forested sites in three study areas (the Black Forest region of SW Germany, the Brixenbach catchment in the Austrian Alps, and the Berchtesgadener Ache catchment in the Berchtesgaden Alps of SE Germany). In total, recordings from 110 station pairs were available for analysis. The measurements of air temperature, relative humidity, wind speed and global radiation from the open field sites were used to calculate the adjacent inside-forest conditions, and the calculated results were compared to the respective beneath-canopy measurements in order to evaluate the applied model algorithms. The results reveal that the algorithms reproduced the inside-canopy conditions for wind speed and global radiation surprisingly well; air temperature and relative humidity, however, are not well reproduced. Our study proposes a modification of the two respective parameterizations, developed from the paired measurements.

  18. Space resection model calculation based on Random Sample Consensus algorithm

    NASA Astrophysics Data System (ADS)

    Liu, Xinzhu; Kang, Zhizhong

    2016-03-01

    Resection is one of the most important problems in photogrammetry: it aims to recover the position and attitude of the camera at the moment of exposure. In some cases, however, the observations used in the calculation contain gross errors. This paper presents a robust algorithm that uses the RANSAC method with a DLT model, which effectively avoids the difficulty of determining initial values when using the collinearity equations. The results show that this strategy can exclude gross errors and leads to an accurate and efficient way to obtain the elements of exterior orientation.
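
    A generic RANSAC skeleton of the kind applied here; for space resection one would plug in a DLT estimator as the model-fitting callable and the reprojection error as the per-point error. Both are left as placeholders below.

```python
# Generic RANSAC loop; fit_model and point_error are user-supplied callables.
import numpy as np

def ransac(data, fit_model, point_error, n_min, n_iter=1000, tol=1.0, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.array([], dtype=int)
    for _ in range(n_iter):
        sample = rng.choice(len(data), size=n_min, replace=False)
        model = fit_model(data[sample])            # minimal-sample hypothesis
        errors = np.array([point_error(model, d) for d in data])
        inliers = np.flatnonzero(errors < tol)     # consensus set
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < n_min:
        raise RuntimeError("no consensus set found")
    # Final refit on all inliers of the best hypothesis: gross errors excluded
    return fit_model(data[best_inliers]), best_inliers
```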

  19. Improved progressive TIN densification filtering algorithm for airborne LiDAR data in forested areas

    NASA Astrophysics Data System (ADS)

    Zhao, Xiaoqian; Guo, Qinghua; Su, Yanjun; Xue, Baolin

    2016-07-01

    Filtering of light detection and ranging (LiDAR) data into ground and non-ground points is a fundamental step in processing raw airborne LiDAR data. This paper proposes an improved progressive triangulated irregular network (TIN) densification (IPTD) filtering algorithm that can cope with a variety of forested landscapes, particularly regions that are both topographically and environmentally complex. The IPTD filtering algorithm consists of three steps: (1) acquiring potential ground seed points using the morphological method; (2) obtaining accurate ground seed points; and (3) building a TIN-based model and iteratively densifying the TIN. The IPTD filtering algorithm was tested in 15 forested sites with various terrains (i.e., elevation and slope) and vegetation conditions (i.e., canopy cover and tree height), and was compared with seven other commonly used filtering algorithms (including morphology-based, slope-based, and interpolation-based filtering algorithms). Results show that the IPTD achieves the highest filtering accuracy for nine of the 15 sites. In general, it outperforms the other filtering algorithms, yielding the lowest average total error of 3.15% and the highest average kappa coefficient of 89.53%.

  20. Inference of biological networks using Bi-directional Random Forest Granger causality.

    PubMed

    Furqan, Mohammad Shaheryar; Siyal, Mohammad Yakoob

    2016-01-01

    The standard ordinary least squares based Granger causality is one of the widely used methods for detecting causal interactions between time series data. However, recent developments in technology limit the utilization of some existing implementations due to the availability of high dimensional data. In this paper, we are proposing a technique called Bi-directional Random Forest Granger causality. This technique uses the random forest regularization together with the idea of reusing the time series data by reversing the time stamp to extract more causal information. We have demonstrated the effectiveness of our proposed method by applying it to simulated data and then applied it to two real biological datasets, i.e., fMRI and HeLa cell. fMRI data was used to map brain network involved in deductive reasoning while HeLa cell dataset was used to map gene network involved in cancer. PMID:27186478

  1. RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest

    PubMed Central

    Ismail, Hamid D.; Jones, Ahoi; Kim, Jung H.; Newman, Robert H.; KC, Dukka B.

    2016-01-01

    Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 2.0 (RF-Phos 2.0), to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular mammalian phosphosite prediction methods, such as PhosphoSVM, GPS2.1, and Musite. PMID:27066500

  2. A U-Statistic-based random Forest approach for genetic association study.

    PubMed

    Li, Ming; Peng, Ruo-Sin; Wei, Changshuai; Lu, Qing

    2012-01-01

    Variations in complex traits are influenced by multiple genetic variants, environmental risk factors, and their interactions. Though substantial progress has been made in identifying single genetic variants associated with complex traits, detecting gene-gene and gene-environment interactions remains a great challenge. When a large number of genetic variants and environmental risk factors are involved, the search is limited to pair-wise interactions due to the exponentially increased feature space and computational intensity. Alternatively, recursive partitioning approaches, such as random forests, have gained popularity in high-dimensional genetic association studies. In this article, we propose a U-Statistic-based random forest approach, referred to as Forest U-Test, for genetic association studies with quantitative traits. Through simulation studies, we showed that the Forest U-Test outperformed existing methods. The proposed method was also applied to study Cannabis Dependence (CD), using three independent datasets from the Study of Addiction: Genetics and Environment. A significant joint association was detected with an empirical p-value less than 0.001. The finding was also replicated in two independent datasets with p-values of 5.93e-19 and 4.70e-17, respectively. PMID:22652671

  3. A Copula Based Approach for Design of Multivariate Random Forests for Drug Sensitivity Prediction

    PubMed Central

    Haider, Saad; Rahman, Raziur; Ghosh, Souparno; Pal, Ranadip

    2015-01-01

    Modeling sensitivity to drugs based on genetic characterizations is a significant challenge in the area of systems medicine. Ensemble-based approaches such as Random Forests have been shown to perform well in both individual sensitivity prediction studies and team-science-based prediction challenges. However, Random Forests generate a deterministic predictive model for each drug based on the genetic characterization of the cell lines and ignore the relationships between different drug sensitivities during model generation. This motivates the need for multivariate ensemble learning techniques that can increase prediction accuracy and improve variable importance ranking by incorporating the relationships between different output responses. In this article, we propose a novel cost criterion that captures the dissimilarity in the output response structure between the training data and node samples as the difference between the two empirical copulas. We illustrate that copulas are suitable for capturing the multivariate structure of output responses independent of the marginal distributions, and that the copula-based multivariate random forest framework can provide higher prediction accuracy and improved variable selection. The proposed framework has been validated on the Genomics of Drug Sensitivity in Cancer and the Cancer Cell Line Encyclopedia databases. PMID:26658256

  4. An efficient algorithm for generating random number pairs drawn from a bivariate normal distribution

    NASA Technical Reports Server (NTRS)

    Campbell, C. W.

    1983-01-01

    An efficient algorithm for generating random number pairs from a bivariate normal distribution was developed. Any desired values of the two means, two standard deviations, and correlation coefficient can be selected. Theoretically the technique is exact, and in practice its accuracy is limited only by the quality of the uniform-distribution random number generator, inaccuracies in computer function evaluation, and arithmetic precision. A FORTRAN routine was written to check the algorithm, and good accuracy was obtained. Some small errors in the correlation coefficient were observed to vary in a surprisingly regular manner; a simple model was developed which explained the qualitative aspects of the errors.
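
    A standard exact construction consistent with what the abstract describes (the paper's FORTRAN routine may differ in detail): draw two independent standard normals, correlate them, then shift and scale to the target means and standard deviations.

```python
# Bivariate normal pairs with chosen means, standard deviations and correlation.
import numpy as np

def bivariate_normal_pairs(n, mu1, mu2, sigma1, sigma2, rho, seed=0):
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n)                    # independent N(0, 1) draws
    z2 = rng.standard_normal(n)
    x = mu1 + sigma1 * z1
    y = mu2 + sigma2 * (rho * z1 + np.sqrt(1.0 - rho**2) * z2)
    return x, y

x, y = bivariate_normal_pairs(100_000, 1.0, -2.0, 0.5, 2.0, rho=0.7)
print(np.corrcoef(x, y)[0, 1])                     # should be close to 0.7
```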

  5. Experimental implementation of a quantum random-walk search algorithm using strongly dipolar coupled spins

    SciTech Connect

    Lu Dawei; Peng Xinhua; Du Jiangfeng; Zhu Jing; Zou Ping; Yu Yihua; Zhang Shanmin; Chen Qun

    2010-02-15

    An important quantum search algorithm based on the quantum random walk performs an oracle search on a database of N items with O(√N) calls, yielding a speedup similar to the Grover quantum search algorithm. The algorithm was implemented on a quantum information processor of three-qubit liquid-crystal nuclear magnetic resonance (NMR) in the case of finding 1 out of 4, and the diagonal elements' tomography of all the final density matrices was completed with comprehensible one-dimensional NMR spectra. The experimental results agree well with the theoretical predictions.

  6. Experimental implementation of a quantum random-walk search algorithm using strongly dipolar coupled spins

    NASA Astrophysics Data System (ADS)

    Lu, Dawei; Zhu, Jing; Zou, Ping; Peng, Xinhua; Yu, Yihua; Zhang, Shanmin; Chen, Qun; Du, Jiangfeng

    2010-02-01

    An important quantum search algorithm based on the quantum random walk performs an oracle search on a database of N items with O(√N) calls, yielding a speedup similar to the Grover quantum search algorithm. The algorithm was implemented on a quantum information processor of three-qubit liquid-crystal nuclear magnetic resonance (NMR) in the case of finding 1 out of 4, and the diagonal elements’ tomography of all the final density matrices was completed with comprehensible one-dimensional NMR spectra. The experimental results agree well with the theoretical predictions.

  7. Study on 2D random medium inversion algorithm based on Fuzzy C-means Clustering theory

    NASA Astrophysics Data System (ADS)

    Xu, Z.; Zhu, P.; Gu, Y.; Yang, X.; Jiang, J.

    2015-12-01

    In seismic exploration for metal deposits, traditional seismic inversion methods based on layered homogeneous medium theory have difficulty inverting the small-scale inhomogeneity and spatial variation of the actual medium, because the physical properties of the actual medium are more likely randomly distributed than layered. Thus, it is necessary to investigate a random medium inversion algorithm. The velocity of a 2D random medium can be described as a function of five parameters: the background velocity (V0), the standard deviation of velocity (σ), the horizontal and vertical autocorrelation lengths (A and B), and the autocorrelation angle (θ). In this study, we propose an inversion algorithm for random media based on Fuzzy C-means Clustering (FCM) theory, whose basic idea is to use FCM to steer the inversion in the desired direction by clustering the estimated parameters into groups. Our method can be divided into three steps: first, the three parameters (A, B, θ) are estimated from 2D post-stack seismic data using a non-stationary random medium parameter estimation method, and the estimated parameters are clustered into groups according to FCM; second, the initial random medium model is constructed from the clustered groups and the remaining two parameters (V0 and σ) obtained from well logging data; finally, inversion of the random medium is conducted to obtain velocity, impedance and random medium parameters using the Conjugate Gradient Method. Inversion experiments on synthetic seismic data show that the velocity models inverted by our algorithm are close to the real velocity distribution and that the boundaries of different media can be distinguished clearly. Key words: random medium, inversion, FCM, parameter estimation

  8. Predicting local Soil- and Land-units with Random Forest in the Senegalese Sahel

    NASA Astrophysics Data System (ADS)

    Grau, Tobias; Brandt, Martin; Samimi, Cyrus

    2013-04-01

    MODIS (MCD12Q1) and Globcover are often the only available global land-cover products; however, ground-truthing in the Sahel of Senegal has shown that most classes show little to no agreement with actual land cover, making those products unusable in any local application. We suggest a methodology which models local Wolof land- and soil-types, at different scales, in an area of the Senegalese Ferlo around Linguère. In a first step, interviews with the local population were conducted to ascertain the local denotation of soil units, as well as their agricultural use and the woody vegetation mainly growing on them. "Ndjor" are soft sand soils with mainly Combretum glutinosum trees; they are suitable for groundnuts and beans, while millet is grown on hard sand soils ("Bardjen") dominated by Balanites aegyptiaca and Acacia tortilis. "Xur" are clayey depressions with a high diversity of tree species. Lateritic pasture sites with dense woody vegetation (mostly Pterocarpus lucens and Guiera senegalensis) have never been used for cropping and are called "All". In a second step, vegetation and soil parameters of 85 plots (~1 ha) were surveyed in the field. 28 different soil parameters are clustered into 4 classes using the WARD algorithm; here, 81% agree with the local classification. Then an ordination (NMDS) with 2 dimensions and a stress value of 9.13% was calculated using the 28 soil parameters. It shows several significant relationships between the soil classes and the fitted environmental parameters, which are derived from field data, a digital elevation model, Landsat and RapidEye imagery, and TRMM rainfall data. Landsat band 5 reflectance (1.55-1.75 µm) from the mean dry-season image (2000-2010) has an R² of 0.42 and is the most important of 9 significant variables (5% level). A random forest classifier is then used to extrapolate the 4 classes to the whole study area based on the 9 significant environmental parameters. At a resolution of 30 m the OOB (out-of-bag) error

  9. An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++

    PubMed Central

    Karpievitch, Yuliya V.; Hill, Elizabeth G.; Leclerc, Anthony P.; Dabney, Alan R.; Almeida, Jonas S.

    2009-01-01

    Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates, and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject's replicate sample set, reducing the dataset size and incurring a loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code and
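
    RF++ itself is implemented in C++, but the core idea of subject-level bootstrapping is easy to sketch. The helper below is a hypothetical illustration in Python: each tree would be grown on a bootstrap sample of whole subjects, so replicates of one subject never straddle the in-bag/out-of-bag split.

```python
import numpy as np

def subject_level_bootstrap(subject_ids, rng):
    """Resample whole subjects with replacement and return the row indices
    of all replicates belonging to the drawn subjects."""
    subjects = np.unique(subject_ids)
    drawn = rng.choice(subjects, size=subjects.size, replace=True)
    return np.concatenate([np.flatnonzero(subject_ids == s) for s in drawn])

rng = np.random.default_rng(0)
subject_ids = np.array([1, 1, 1, 2, 2, 3, 3, 3])   # replicate -> subject map
in_bag = subject_level_bootstrap(subject_ids, rng)
# a single tree of the forest would then be trained on X[in_bag], y[in_bag]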

  10. Comparative Analyses between Retained Introns and Constitutively Spliced Introns in Arabidopsis thaliana Using Random Forest and Support Vector Machine

    PubMed Central

    Mao, Rui; Raj Kumar, Praveen Kumar; Guo, Cheng; Zhang, Yang; Liang, Chun

    2014-01-01

    One of the important modes of pre-mRNA post-transcriptional modification is alternative splicing. Alternative splicing allows the creation of many distinct mature mRNA transcripts from a single gene by utilizing different splice sites. In plants like Arabidopsis thaliana, the most common type of alternative splicing is intron retention. Many past studies have focused on the positional distribution of retained introns (RIs) among different genic regions and their expression regulation, while little systematic classification of RIs from constitutively spliced introns (CSIs) has been conducted using machine learning approaches. We used random forest and a support vector machine (SVM) with a radial basis function (RBF) kernel to differentiate these two types of introns in Arabidopsis. By comparing the coordinates of introns of all annotated mRNAs from TAIR10, we obtained high-quality experimental data. To distinguish RIs from CSIs, we investigated the unique characteristics of RIs in comparison with CSIs and extracted 37 quantitative features: local and global nucleotide sequence features of introns, frequent motifs, the signal strength of splice sites, and the similarity between sequences of introns and their flanking regions. We demonstrated that our proposed feature extraction approach was more accurate in classifying RIs from CSIs than the other four approaches. The optimal penalty parameter C and the RBF kernel parameter γ in SVM were set based on a particle swarm optimization algorithm (PSOSVM). Our classification performance showed an F-measure of 80.8% (random forest) and 77.4% (PSOSVM). Not only were the basic sequence features and positional distribution characteristics of RIs obtained, but putative regulatory motifs in intron splicing were also predicted based on our feature extraction approach. Clearly, our study will facilitate a better understanding of the underlying mechanisms involved in intron retention. PMID:25110928

  11. SNRFCB: sub-network based random forest classifier for predicting chemotherapy benefit on survival for cancer treatment.

    PubMed

    Shi, Mingguang; He, Jianmin

    2016-04-22

    Adjuvant chemotherapy (CTX) should be individualized to provide potential survival benefit and avoid potential harm to cancer patients. Our goal was to establish a computational approach for making personalized estimates of the survival benefit from adjuvant CTX. We developed a Sub-Network based Random Forest classifier for predicting Chemotherapy Benefit (SNRFCB) based on gene expression datasets of lung cancer. The SNRFCB approach was then validated in independent test cohorts for identifying chemotherapy responder cohorts and chemotherapy non-responder cohorts. SNRFCB involved the pre-selection of gene sub-network signatures based on mutations and on protein-protein interaction data, as well as the application of the random forest algorithm to gene expression datasets. Adjuvant CTX was significantly associated with prolonged overall survival of lung cancer patients in the chemotherapy responder group (P = 0.008), but it was not beneficial to patients in the chemotherapy non-responder group (P = 0.657). Adjuvant CTX was significantly associated with prolonged overall survival of lung cancer squamous cell carcinoma (SQCC) subtype patients in the chemotherapy responder cohorts (P = 0.024), but it was not beneficial to patients in the chemotherapy non-responder cohorts (P = 0.383). SNRFCB improved prediction performance as compared to another machine learning method, the support vector machine (SVM). To test the general applicability of the predictive model, we further applied the SNRFCB approach to human breast cancer datasets and also observed superior performance. SNRFCB could provide recurrence probability for individual patients and identify which patients may benefit from adjuvant CTX in clinical trials. PMID:26864276

  12. Quantification of the heterogeneity of prognostic cellular biomarkers in Ewing sarcoma using automated image and random survival forest analysis.

    PubMed

    Bühnemann, Claudia; Li, Simon; Yu, Haiyue; Branford White, Harriet; Schäfer, Karl L; Llombart-Bosch, Antonio; Machado, Isidro; Picci, Piero; Hogendoorn, Pancras C W; Athanasou, Nicholas A; Noble, J Alison; Hassan, A Bassim

    2014-01-01

    Driven by genomic somatic variation, tumour tissues are typically heterogeneous, yet unbiased quantitative methods are rarely used to analyse heterogeneity at the protein level. Motivated by this problem, we developed automated image segmentation of images of multiple biomarkers in Ewing sarcoma to generate distributions of biomarkers between and within tumour cells. We further integrated high-dimensional data with patient clinical outcomes utilising random survival forest (RSF) machine learning. Using material from cohorts of genetically diagnosed Ewing sarcoma with EWSR1 chromosomal translocations, confocal images of tissue microarrays were segmented with level-set and watershed algorithms. Each cell nucleus and cytoplasm were identified in relation to DAPI and CD99, respectively, and protein biomarkers (e.g. Ki67, pS6, Foxo3a, EGR1, MAPK) were localised relative to the nuclear and cytoplasmic regions of each cell in order to generate image feature distributions. The image distribution features were analysed with RSF in relation to known overall patient survival from three separate cohorts (185 informative cases). Variation in pre-analytical processing resulted in the elimination of a high number of non-informative images that had poor DAPI localisation or biomarker preservation (67 cases, 36%). The distribution of image features for biomarkers in the remaining high-quality material (118 cases, 104 features per case) was analysed by RSF with feature selection, and performance was assessed using internal cross-validation rather than a separate validation cohort. A prognostic classifier for Ewing sarcoma with a low cross-validation error rate (0.36) comprised multiple features, including the Ki67 proliferative marker and a sub-population of cells with a low cytoplasmic/nuclear ratio of CD99. Through elimination of bias, the evaluation of high-dimensionality biomarker distribution within cell populations of a tumour using random forest analysis in quality controlled tumour

  13. An iterative algorithm for analysis of coupled structural-acoustic systems subject to random excitations

    NASA Astrophysics Data System (ADS)

    Zhao, Guo-Zhong; Chen, Gang; Kang, Zhan

    2012-04-01

    This paper analyzes the random response of structural-acoustic coupled systems. Most existing work on coupled structural-acoustic analysis is limited to systems under deterministic excitations, owing to the high computational cost of random response analysis. To reduce the computational burden involved in coupled random analysis, an iterative procedure based on the pseudo-excitation method has been developed. It is found that this algorithm has an overwhelming advantage in computational efficiency over traditional methods, as demonstrated by the numerical examples given in this paper.
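
    The pseudo-excitation idea reduces a stationary random response analysis to a series of harmonic analyses. Below is a minimal single-degree-of-freedom sketch (not the paper's coupled structural-acoustic solver; the system parameters and white-noise input are illustrative assumptions):

```python
import numpy as np

m, c, k = 1.0, 0.05, 4.0                 # illustrative mass, damping, stiffness
omega = np.linspace(0.1, 10.0, 500)      # frequency grid
S_ff = np.ones_like(omega)               # assumed white-noise force PSD

# pseudo excitation p = sqrt(S_ff) * exp(i*omega*t): solve a harmonic problem
H = 1.0 / (-m * omega**2 + 1j * c * omega + k)   # harmonic transfer function
y_pseudo = H * np.sqrt(S_ff)             # harmonic response to pseudo excitation

# the response PSD follows directly from the pseudo response
S_yy = np.abs(y_pseudo) ** 2             # equals |H|^2 * S_ff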

  14. Neural network based adaptive control of nonlinear plants using random search optimization algorithms

    NASA Technical Reports Server (NTRS)

    Boussalis, Dhemetrios; Wang, Shyh J.

    1992-01-01

    This paper presents a method for utilizing artificial neural networks for direct adaptive control of dynamic systems with poorly known dynamics. The neural network weights (controller gains) are adapted in real time using state measurements and a random search optimization algorithm. The results are demonstrated via simulation using two highly nonlinear systems.
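
    A bare-bones version of such a weight-adaptation loop might look like the following; the cost function here is a toy stand-in for the closed-loop tracking error computed from state measurements, and all names and defaults are illustrative.

```python
import numpy as np

def random_search(cost, w0, sigma=0.1, n_iter=500, seed=0):
    """Keep a random perturbation of the controller gains only if it
    lowers the measured cost; no gradient information is required."""
    rng = np.random.default_rng(seed)
    w, best = np.array(w0, dtype=float), cost(w0)
    for _ in range(n_iter):
        cand = w + sigma * rng.standard_normal(w.shape)
        c = cost(cand)
        if c < best:
            w, best = cand, c
    return w, best

cost = lambda w: float(np.sum((w - 0.5) ** 2))   # toy tracking-error surrogate
w_opt, err = random_search(cost, w0=np.zeros(4))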

  15. Application of random forest approach to QSAR prediction of aquatic toxicity.

    PubMed

    Polishchuk, Pavel G; Muratov, Eugene N; Artemenko, Anatoly G; Kolumbin, Oleg G; Muratov, Nail N; Kuz'min, Victor E

    2009-11-01

    This work is devoted to the application of the random forest approach to QSAR analysis of the aquatic toxicity of chemical compounds tested on Tetrahymena pyriformis. The simplex representation of molecular structure, implemented in HiT QSAR Software, was used for descriptor generation at the two-dimensional level. Adequate models based on simplex descriptors and the RF statistical approach were obtained on a modeling set of 644 compounds. Model predictivity was validated on two external test sets of 339 and 110 compounds. A high impact of the lipophilicity and polarizability of the investigated compounds on toxicity was determined. It was shown that RF models were tolerant to the insertion of irrelevant descriptors, as well as to randomization of part of the toxicity values, which represents "noise". A fast procedure for optimizing the number of trees in the random forest is proposed. The discussed RF model had comparable or better statistical characteristics than the corresponding PLS or KNN models. PMID:19860412
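
    One fast way to choose the number of trees, in the spirit of the procedure mentioned above, is to grow the forest incrementally and watch the out-of-bag score; this scikit-learn sketch on synthetic data is an assumption about the mechanics, not the authors' exact procedure.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=60, noise=1.0, random_state=0)

rf = RandomForestRegressor(warm_start=True, oob_score=True, random_state=0)
for n in (50, 100, 200, 400, 800):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)                              # warm_start only adds new trees
    print(n, "trees -> OOB R^2:", round(rf.oob_score_, 3))
# stop adding trees once the OOB score plateaus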

  16. Magnetic localization and orientation of the capsule endoscope based on a random complex algorithm

    PubMed Central

    He, Xiaoqi; Zheng, Zizhao; Hu, Chao

    2015-01-01

    The development of the capsule endoscope has made possible the examination of the whole gastrointestinal tract without much pain. However, some important problems remain to be solved, one of which is the localization of the capsule. Currently, magnetic positioning technology is a suitable method for capsule localization, and it depends on a reliable system and algorithm. In this paper, based on the magnetic dipole model and a magnetic sensor array, we propose a nonlinear optimization algorithm using a random complex method, applied to the nonlinear objective function of the dipole, to determine the three-dimensional position parameters and two-dimensional orientation parameters. The stability and noise tolerance of the algorithm are compared with those of the Levenberg-Marquardt algorithm. The simulation and experiment results show that, with respect to the error level of the initial guess of the magnet location, the random complex algorithm is more accurate, more stable, and has a higher "denoise" capacity, with a larger range of valid initial guess values. PMID:25914561

  17. Random forest-based similarity measures for multi-modal classification of Alzheimer's disease.

    PubMed

    Gray, Katherine R; Aljabar, Paul; Heckemann, Rolf A; Hammers, Alexander; Rueckert, Daniel

    2013-01-15

    Neurodegenerative disorders, such as Alzheimer's disease, are associated with changes in multiple neuroimaging and biological measures. These may provide complementary information for diagnosis and prognosis. We present a multi-modality classification framework in which manifolds are constructed based on pairwise similarity measures derived from random forest classifiers. Similarities from multiple modalities are combined to generate an embedding that simultaneously encodes information about all the available features. Multi-modality classification is then performed using coordinates from this joint embedding. We evaluate the proposed framework by application to neuroimaging and biological data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Features include regional MRI volumes, voxel-based FDG-PET signal intensities, CSF biomarker measures, and categorical genetic information. Classification based on the joint embedding constructed using information from all four modalities out-performs the classification based on any individual modality for comparisons between Alzheimer's disease patients and healthy controls, as well as between mild cognitive impairment patients and healthy controls. Based on the joint embedding, we achieve classification accuracies of 89% between Alzheimer's disease patients and healthy controls, and 75% between mild cognitive impairment patients and healthy controls. These results are comparable with those reported in other recent studies using multi-kernel learning. Random forests provide consistent pairwise similarity measures for multiple modalities, thus facilitating the combination of different types of feature data. We demonstrate this by application to data in which the number of features differs by several orders of magnitude between modalities. Random forest classifiers extend naturally to multi-class problems, and the framework described here could be applied to distinguish between multiple patient groups in the future
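
    A common way to obtain such pairwise similarities, and a plausible reading of the construction used here, is the random forest proximity: the fraction of trees in which two samples land in the same leaf. A scikit-learn sketch on synthetic data (the paper's exact construction may differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=25, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

leaves = rf.apply(X)                 # (n_samples, n_trees) leaf indices
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dist = np.sqrt(1.0 - prox)           # dissimilarity for a joint embedding (e.g. MDS)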

  1. Quantitative measurement of retinal ganglion cell populations via histology-based random forest classification.

    PubMed

    Hedberg-Buenz, Adam; Christopher, Mark A; Lewis, Carly J; Fernandes, Kimberly A; Dutca, Laura M; Wang, Kai; Scheetz, Todd E; Abràmoff, Michael D; Libby, Richard T; Garvin, Mona K; Anderson, Michael G

    2016-05-01

    The inner surface of the retina contains a complex mixture of neurons, glia, and vasculature, including retinal ganglion cells (RGCs), the final output neurons of the retina and primary neurons that are damaged in several blinding diseases. The goal of the current work was two-fold: to assess the feasibility of using computer-assisted detection of nuclei and random forest classification to automate the quantification of RGCs in hematoxylin/eosin (H&E)-stained retinal whole-mounts; and if possible, to use the approach to examine how nuclear size influences disease susceptibility among RGC populations. To achieve this, data from RetFM-J, a semi-automated ImageJ-based module that detects, counts, and collects quantitative data on nuclei of H&E-stained whole-mounted retinas, were used in conjunction with a manually curated set of images to train a random forest classifier. To test performance, computer-derived outputs were compared to previously published features of several well-characterized mouse models of ophthalmic disease and their controls: normal C57BL/6J mice; Jun-sufficient and Jun-deficient mice subjected to controlled optic nerve crush (CONC); and DBA/2J mice with naturally occurring glaucoma. The result of these efforts was development of RetFM-Class, a command-line-based tool that uses data output from RetFM-J to perform random forest classification of cell type. Comparative testing revealed that manual and automated classifications by RetFM-Class correlated well, with 83.2% classification accuracy for RGCs. Automated characterization of C57BL/6J retinas predicted 54,642 RGCs per normal retina, and identified a 48.3% Jun-dependent loss of cells at 35 days post CONC and a 71.2% loss of RGCs among 16-month-old DBA/2J mice with glaucoma. Output from automated analyses was used to compare nuclear area among large numbers of RGCs from DBA/2J mice (n = 127,361). In aged DBA/2J mice with glaucoma, RetFM-Class detected a decrease in median and mean nucleus size

  2. Identification of a potential fibromyalgia diagnosis using random forest modeling applied to electronic medical records

    PubMed Central

    Emir, Birol; Masters, Elizabeth T; Mardekian, Jack; Clair, Andrew; Kuhn, Max; Silverman, Stuart L

    2015-01-01

    Background: Diagnosis of fibromyalgia (FM), a chronic musculoskeletal condition characterized by widespread pain and a constellation of symptoms, remains challenging and is often delayed. Methods: Random forest modeling of electronic medical records was used to identify variables that may facilitate earlier FM identification and diagnosis. Subjects aged ≥18 years with two or more listings of the International Classification of Diseases, Ninth Revision (ICD-9) code for FM (ICD-9 729.1) ≥30 days apart during the 2012 calendar year were defined as cases; the study population comprised subjects associated with an integrated delivery network who had one or more health care provider encounters in the Humedica database in calendar years 2011 and 2012. Controls were subjects without the FM ICD-9 codes. Seventy-two demographic, clinical, and health care resource utilization variables were entered into a random forest model with downsampling to account for cohort imbalances (<1% of subjects had FM). The importance of the top ten variables was ranked by normalization to 100% for the variable with the largest loss in predictive performance upon its omission from the model. Since random forest is a complex prediction method, a set of simple rules was derived to help understand what factors drive individual predictions. Results: The ten variables identified by the model were: number of visits where laboratory/non-imaging diagnostic tests were ordered; number of outpatient visits excluding office visits; age; number of office visits; number of opioid prescriptions; number of medications prescribed; number of pain medications excluding opioids; number of medications administered/ordered; number of emergency room visits; and number of musculoskeletal conditions. A receiver operating characteristic curve confirmed the model's predictive accuracy using an independent test set (area under the curve, 0.810). To enhance interpretability, nine rules were developed that could be used with good predictive probability of
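
    The downsampling step can be sketched as follows; the arrays are synthetic placeholders for the 72 EMR-derived variables and the <1% case rate described above, and the parameter choices are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 72))          # 72 demographic/clinical/utilization variables
y = (rng.random(20000) < 0.01).astype(int)    # <1% cases, as in the cohort

cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=cases.size, replace=False)
idx = np.concatenate([cases, controls])       # balanced training subset

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X[idx], y[idx])
top10 = np.argsort(rf.feature_importances_)[::-1][:10]   # ranked predictors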

  3. A hybrid flower pollination algorithm based modified randomized location for multi-threshold medical image segmentation.

    PubMed

    Wang, Rui; Zhou, Yongquan; Zhao, Chengyan; Wu, Haizhou

    2015-01-01

    Multi-threshold image segmentation is a powerful image processing technique that is used for the preprocessing of pattern recognition and computer vision. However, traditional multilevel thresholding methods are computationally expensive because they involve exhaustively searching the optimal thresholds to optimize the objective functions. To overcome this drawback, this paper proposes a flower pollination algorithm with a randomized location modification. The proposed algorithm is used to find optimal threshold values for maximizing Otsu's objective functions with regard to eight medical grayscale images. When benchmarked against other state-of-the-art evolutionary algorithms, the new algorithm proves itself to be robust and effective through numerical experimental results including Otsu's objective values and standard deviations. PMID:26405895
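
    The objective being maximized has a compact form. Below is a sketch of Otsu's between-class variance for an arbitrary set of thresholds; in the method described above, the flower pollination algorithm's role is to search the threshold space for the maximizer of such a function. The histogram here is synthetic.

```python
import numpy as np

def between_class_variance(hist, thresholds):
    """Otsu's multilevel objective: sum of w_k * (mu_k - mu_total)^2 over
    the gray-level classes induced by the thresholds."""
    p = hist / hist.sum()
    levels = np.arange(p.size)
    mu_total = (p * levels).sum()
    bounds = [0, *sorted(int(t) for t in thresholds), p.size]
    var = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        w = p[lo:hi].sum()
        if w > 0:
            mu = (p[lo:hi] * levels[lo:hi]).sum() / w
            var += w * (mu - mu_total) ** 2
    return var

hist = np.bincount(np.random.default_rng(0).integers(0, 256, 10000), minlength=256)
print(between_class_variance(hist.astype(float), thresholds=[85, 170]))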

  4. Biased Random-Key Genetic Algorithms for the Winner Determination Problem in Combinatorial Auctions.

    PubMed

    de Andrade, Carlos Eduardo; Toso, Rodrigo Franco; Resende, Mauricio G C; Miyazawa, Flávio Keidi

    2015-01-01

    In this paper we address the problem of picking a subset of bids in a general combinatorial auction so as to maximize the overall profit using the first-price model. This winner determination problem assumes that a single bidding round is held to determine both the winners and prices to be paid. We introduce six variants of biased random-key genetic algorithms for this problem. Three of them use a novel initialization technique that makes use of solutions of intermediate linear programming relaxations of an exact mixed integer linear programming model as initial chromosomes of the population. An experimental evaluation compares the effectiveness of the proposed algorithms with the standard mixed linear integer programming formulation, a specialized exact algorithm, and the best-performing heuristics proposed for this problem. The proposed algorithms are competitive and offer strong results, mainly for large-scale auctions. PMID:25299242

  5. Breast segmentation in MRI using Poisson surface reconstruction initialized with random forest edge detection

    NASA Astrophysics Data System (ADS)

    Martel, Anne L.; Gallego-Ortiz, Cristina; Lu, YingLi

    2016-03-01

    Segmentation of breast tissue in MRI images is an important pre-processing step for many applications. We present a new method that uses a random forest classifier to identify candidate edges in the image and then applies a Poisson reconstruction step to define a 3D surface based on the detected edge points. Using leave-one-patient-out cross-validation, we achieve a Dice overlap score of 0.96 +/- 0.02 for T1-weighted non-fat-suppressed images in 8 patients. In a second dataset of 332 images acquired using a Dixon sequence, which was not used in training the random forest classifier, the mean Dice score was 0.90 +/- 0.03. Using this approach we have achieved accurate, robust segmentation results using a very small training set.
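
    For reference, the reported overlap metric is straightforward to compute; a minimal sketch for binary masks (the toy masks are illustrative):

```python
import numpy as np

def dice(seg, ref):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    seg, ref = seg.astype(bool), ref.astype(bool)
    return 2.0 * np.logical_and(seg, ref).sum() / (seg.sum() + ref.sum())

a = np.zeros((64, 64), bool); a[10:40, 10:40] = True   # toy segmentation
b = np.zeros((64, 64), bool); b[12:42, 12:42] = True   # toy reference
print(round(dice(a, b), 3))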

  6. Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction

    PubMed Central

    Rahman, Raziur; Haider, Saad; Ghosh, Souparno; Pal, Ranadip

    2015-01-01

    Random forests consisting of an ensemble of regression trees with equal weights are frequently used for the design of predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and tree weight optimization is expected to provide stricter CIs with comparable performance in mean error. We approached the prediction of an ensemble of probabilistic trees from two perspectives: as a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic data and the Cancer Cell Line Encyclopedia dataset and illustrated that tree weights can be selected to reduce the average length of the CI without an increase in mean error. PMID:27081304

  7. Security authentication with a three-dimensional optical phase code using random forest classifier.

    PubMed

    Markman, Adam; Carnicer, Artur; Javidi, Bahram

    2016-06-01

    An object with a unique three-dimensional (3D) optical phase mask attached is analyzed for security and authentication. These 3D optical phase masks are more difficult to duplicate, or to describe with a mathematical formulation, than 2D masks and thus have improved security capabilities. A quick response code was modulated using a random 3D optical phase mask, generating a 3D optical phase code (OPC). Due to the scattering of light through the 3D OPC, a unique speckle pattern based on the materials and structure in the 3D optical phase mask is generated and recorded on a CCD device. Feature extraction is performed by calculating the mean, variance, skewness, kurtosis, and entropy for each recorded speckle pattern. The random forest classifier is used for authentication. Optical experiments demonstrate the feasibility of the authentication scheme. PMID:27409445
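
    The five-statistic feature vector is simple to reproduce. This SciPy sketch uses a synthetic Rayleigh-distributed image as a stand-in for a recorded speckle pattern; the binning choice for the entropy estimate is an assumption.

```python
import numpy as np
from scipy import stats

def speckle_features(img, bins=256):
    """Mean, variance, skewness, kurtosis and histogram entropy of an image."""
    x = img.ravel().astype(float)
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    entropy = -np.sum(p * np.log2(p))
    return [x.mean(), x.var(), stats.skew(x), stats.kurtosis(x), entropy]

img = np.random.default_rng(0).rayleigh(1.0, size=(128, 128))  # toy speckle
features = speckle_features(img)   # one row of the classifier's training matrix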

  8. Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle.

    PubMed

    Yao, C; Spurlock, D M; Armentano, L E; Page, C D; VandeHaar, M J; Bickhart, D M; Weigel, K A

    2013-10-01

    Feed efficiency is an economically important trait in the beef and dairy cattle industries. Residual feed intake (RFI) is a measure of partial efficiency that is independent of production level per unit of body weight. The objective of this study was to identify significant associations between single nucleotide polymorphism (SNP) markers and RFI in dairy cattle using the Random Forests (RF) algorithm. Genomic data included 42,275 SNP genotypes for 395 Holstein cows, whereas phenotypic measurements were daily RFI from 50 to 150 d postpartum. Residual feed intake was defined as the difference between an animal's feed intake and the average intake of its cohort, after adjustment for year and season of calving, year and season of measurement, age at calving nested within parity, days in milk, milk yield, body weight, and body weight change. Random Forests is a widely used machine-learning algorithm that has been applied to classification and regression problems. By analyzing the tree structures produced within RF, the 25 most frequent pairwise SNP interactions were reported as possible epistatic interactions. The importance scores that are generated by RF take into account both main effects of variables and interactions between variables, and the most negative value of all importance scores can be used as the cutoff level for declaring SNP effects as significant. Ranking by importance scores, 188 SNP surpassed the threshold, among which 38 SNP were mapped to RFI quantitative trait loci (QTL) regions reported in a previous study in beef cattle, and 2 SNP were also detected by a genome-wide association study in beef cattle. The ratio of number of SNP located in RFI QTL to the total number of SNP in the top 188 SNP chosen by RF was significantly higher than in all 42,275 whole-genome markers. Pathway analysis indicated that many of the top 188 SNP are in genomic regions that contain annotated genes with biological functions that may influence RFI. Frequently occurring

  9. Effect of the driving algorithm on the turbulence generated by a random jet array

    NASA Astrophysics Data System (ADS)

    Pérez-Alvarado, Alejandro; Mydlarski, Laurent; Gaskin, Susan

    2016-02-01

    Different driving algorithms for a large random jet array (RJA) were tested and their performance characterized by comparing the statistics of the turbulence generated downstream of the RJA. Of particular interest was the spatial configuration of the jets operating at any given instant (an aspect that has not been documented in previous RJA studies), as well as the statistics of their respective on/off times. All algorithms generated flows with nonzero skewness of the velocity fluctuation normal to the plane of the RJA (identified as an inherent limitation of the system resulting from the unidirectional forcing imposed from only one side of the RJA), and slightly super-Gaussian kurtoses of the velocity fluctuations in all directions. It was observed that algorithms imposing spatial configurations generated the most isotropic flows; however, they suffered from high mean flows and low turbulent kinetic energies. The algorithm identified as RANDOM (also referred to as the "sunbathing algorithm") generated the flow that, on an overall basis, most closely approximated zero-mean-flow homogeneous isotropic turbulence, with variations in horizontal and vertical homogeneities of RMS velocities of no more than ±6 %, deviations from isotropy (w_RMS/u_RMS) in the range of 0.62-0.77, and mean flows on the order of 7 % of the RMS velocities (determined by averaging their absolute values over the three velocity components and three downstream distances). A relatively high turbulent Reynolds number (Re_T = u_T ℓ/ν = 2360, where ℓ is the integral length scale of the flow and u_T is a characteristic RMS velocity) was achieved using the RANDOM algorithm, and the integral length scale (ℓ = 11.5 cm) is the largest reported to date. The quality of the turbulence in our large facility demonstrates the ability of RJAs to be scaled up and to be the laboratory system most capable of generating large quasi-homogeneous isotropic turbulent regions with zero mean flow.

  10. Fast randomized Hough transformation track initiation algorithm based on multi-scale clustering

    NASA Astrophysics Data System (ADS)

    Wan, Minjie; Gu, Guohua; Chen, Qian; Qian, Weixian; Wang, Pengcheng

    2015-10-01

    A fast randomized Hough transformation track initiation algorithm based on multi-scale clustering is proposed to overcome problems in traditional infrared search and track (IRST) systems, which cannot provide movement information on the initial target or select the correlation threshold automatically when using a two-dimensional track association algorithm based on bearing-only information. The movements of all targets are presumed to be uniform rectilinear motion throughout this new algorithm. Concepts of spatial random sampling, a parameter-space dynamic linking table, and convergent mapping from image to parameter space are developed on the basis of the fast randomized Hough transformation. Because peak detection is built on a threshold-value method, peak values cluster, and accuracy can only be ensured when the parameter space has an obvious peak. A multi-scale idea is therefore added to the above algorithm. Firstly, a primary association is conducted to select several alternative tracks using a low threshold. Then, the alternative tracks are processed by multi-scale clustering methods, through which accurate numbers and parameters of tracks are determined automatically by transforming the scale parameters. The first three frames are processed by this algorithm to obtain the first three targets of the track, and two slightly different gate radii are worked out, the mean value of which is used as the global correlation threshold. Moreover, a new model for curvilinear equation correction is applied to the above track initiation algorithm to solve the problem of shape distortion when a three-dimensional space curve is mapped to a two-dimensional bearing-only space. Using sideways flight, launch, and landing as examples for modeling and simulation, the application of the proposed approach in simulation proves its effectiveness, accuracy, and adaptivity

  11. Nonconvergence of the Wang-Landau algorithms with multiple random walkers

    NASA Astrophysics Data System (ADS)

    Belardinelli, R. E.; Pereyra, V. D.

    2016-05-01

    This paper discusses some convergence properties of entropic sampling Monte Carlo methods with multiple random walkers, particularly the Wang-Landau (WL) and 1/t algorithms. The classical algorithms are modified by the use of m independent random walkers in the energy landscape to calculate the density of states (DOS). The Ising model is used to show the convergence properties in the calculation of the DOS, as well as the critical temperature, while the calculation of the number π by multidimensional integration is used in the continuum approximation. In each case, the error is obtained separately for each walker at a fixed time t; then, the average over m walkers is performed. It is observed that the error goes as 1/√m. However, if the number of walkers increases above a certain critical value m > m_x, the error reaches a constant value (i.e., it saturates). This occurs for both algorithms; however, it is shown that for a given system, the 1/t algorithm is more efficient and accurate than the similar version of the WL algorithm. It follows that it makes no sense to increase the number of walkers above the critical value m_x, since doing so does not reduce the error in the calculation. Therefore, the number of walkers does not guarantee convergence.

  12. Quantifying spatial distribution of snow depth errors from LiDAR using Random Forests

    NASA Astrophysics Data System (ADS)

    Tinkham, W.; Smith, A. M.; Marshall, H.; Link, T. E.; Falkowski, M. J.; Winstral, A. H.

    2013-12-01

    There is increasing need to characterize the distribution of snow in complex terrain using remote sensing approaches, especially in isolated mountainous regions that are often water-limited, the principal source of terrestrial freshwater, and sensitive to climatic shifts and variations. We apply intensive topographic surveys, multi-temporal LiDAR, and Random Forest modeling to quantify snow volume and characterize associated errors across seven land cover types in a semi-arid mountainous catchment at 1 and 4 m spatial resolutions. The LiDAR-based estimates of both snow-off surface topology and snow depths were validated against ground-based measurements across the catchment. Comparison of LiDAR-derived snow depths to manual snow depth surveys revealed that LiDAR-based estimates were more accurate in areas of low-lying vegetation such as shrubs (RMSE = 0.14 m) than in areas consisting of tree cover (RMSE = 0.20-0.35 m). The highest errors were found along the edge of conifer forests (RMSE = 0.35 m); however, a second conifer transect outside the catchment had much lower errors (RMSE = 0.21 m). This difference is attributed to the wind exposure of the first site, which led to highly variable snow depths over short spatial distances. The Random Forest modeled errors deviated from the field-measured errors with an RMSE of 0.09-0.34 m across the different cover types. Results show that snow drifts, which are important for maintaining spring and summer stream flows and establishing and sustaining water-limited plant species, contained 30 ± 5-6% of the snow volume while only occupying 10% of the catchment area, similar to findings from prior physically-based modeling approaches. This study demonstrates the potential utility of combining multi-temporal LiDAR with Random Forest modeling to quantify the distribution of snow depth with a reasonable degree of accuracy. Future work could explore the utility of Terrestrial LiDAR Scanners to produce validation of snow-on surface

  13. A Novel Hepatocellular Carcinoma Image Classification Method Based on Voting Ranking Random Forests

    PubMed Central

    Xia, Bingbing; Jiang, Huiyan; Liu, Huiling; Yi, Dehui

    2016-01-01

    This paper proposes a novel voting ranking random forests (VRRF) method for solving the hepatocellular carcinoma (HCC) image classification problem. Firstly, in the preprocessing stage, bilateral filtering is applied to the hematoxylin-eosin (HE) pathological images. Next, the bilaterally filtered image is segmented to obtain three different kinds of images: the single binary cell image, the single minimum exterior rectangle cell image, and the single cell image of size n×n. After that, atypia features are defined, including auxiliary circularity, amendment circularity, and cell symmetry. In addition, shape features, fractal dimension features, and several gray-level features such as the Local Binary Pattern (LBP) feature, the Gray Level Co-occurrence Matrix (GLCM) feature, and Tamura features are extracted. Finally, an HCC image classification model based on random forests is proposed and further optimized by the voting ranking method. The experimental results show that the proposed features combined with the VRRF method perform well on the HCC image classification problem. PMID:27293477

  14. A Two-Stage Random Forest-Based Pathway Analysis Method

    PubMed Central

    Chung, Ren-Hua; Chen, Ying-Erh

    2012-01-01

    Pathway analysis provides a powerful approach for identifying the joint effect of genes grouped into biologically-based pathways on disease. Pathway analysis is also an attractive approach for a secondary analysis of genome-wide association study (GWAS) data that may still yield new results from these valuable datasets. Most of the current pathway analysis methods focused on testing the cumulative main effects of genes in a pathway. However, for complex diseases, gene-gene interactions are expected to play a critical role in disease etiology. We extended a random forest-based method for pathway analysis by incorporating a two-stage design. We used simulations to verify that the proposed method has the correct type I error rates. We also used simulations to show that the method is more powerful than the original random forest-based pathway approach and the set-based test implemented in PLINK in the presence of gene-gene interactions. Finally, we applied the method to a breast cancer GWAS dataset and a lung cancer GWAS dataset and interesting pathways were identified that have implications for breast and lung cancers. PMID:22586488

  15. Modeling Urban Dynamics Using Random Forest: Implementing Roc and Toc for Model Evaluation

    NASA Astrophysics Data System (ADS)

    Ahmadlou, M.; Delavar, M. R.; Shafizadeh-Moghadam, H.; Tayyebi, A.

    2016-06-01

    The importance of the spatial accuracy of land use/cover change maps necessitates the use of high-performance models. To reach this goal, calibrating machine learning (ML) approaches to model land use/cover conversions has received increasing interest among scholars. This stems from the strength of these techniques, which powerfully account for the complex relationships underlying urban dynamics. Compared to other ML techniques, random forest has rarely been used for modeling urban growth. This paper, drawing on information from multi-temporal Landsat satellite images from 1985, 2000 and 2015, calibrates a random forest regression (RFR) model to quantify variable importance and simulate the spatial patterns of urban change. The results and performance of the RFR model were evaluated using two complementary tools, relative operating characteristic (ROC) and total operating characteristic (TOC) curves, by overlaying the map of observed change and the modeled suitability map for land use change (error map). The suitability map produced by the RFR model showed an area under the curve of 82.48% for ROC, which indicates very good performance and highlights its appropriateness for simulating urban growth.
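
    The ROC evaluation reduces to comparing the binary observed-change map with the continuous suitability scores; a minimal scikit-learn sketch with placeholder arrays:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
observed = rng.integers(0, 2, 5000)    # placeholder change/no-change per cell
suitability = rng.random(5000)         # placeholder RFR suitability scores

print(roc_auc_score(observed, suitability))   # the study reports AUC = 0.8248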

  16. Parallel Implementation of Fast Randomized Algorithms for Low Rank Matrix Decomposition

    SciTech Connect

    Lucas, Andrew J.; Stalizer, Mark; Feo, John T.

    2014-03-01

    We analyze the parallel performance of randomized interpolative decomposition by decomposing low-rank complex-valued Gaussian random matrices larger than 100 GB. We chose a Cray XMT supercomputer as it provides an almost ideal PRAM model, permitting quick investigation of parallel algorithms without obfuscation from hardware idiosyncrasies. We find that on non-square matrices, performance scales almost linearly, with runtime about 100 times faster on 128 processors. We also verify that numerically discovered error bounds still hold on matrices two orders of magnitude larger than those previously tested.
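
    The serial core of such randomized decompositions fits in a few lines. This NumPy sketch follows the standard randomized range-finder recipe rather than the paper's Cray XMT implementation; the matrix sizes and oversampling are illustrative.

```python
import numpy as np

def randomized_low_rank(A, k, oversample=10, seed=0):
    """Project onto k+p random directions, orthonormalize, and compress;
    returns Q, B with A ~= Q @ B."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)     # approximate orthonormal basis for range(A)
    return Q, Q.T @ A

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 40)) @ rng.standard_normal((40, 1000))  # rank 40
Q, B = randomized_low_rank(A, k=40)
print(np.linalg.norm(A - Q @ B) / np.linalg.norm(A))   # tiny relative error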

  17. Rotorcraft Blade Mode Damping Identification from Random Responses Using a Recursive Maximum Likelihood Algorithm

    NASA Technical Reports Server (NTRS)

    Molusis, J. A.

    1982-01-01

    An on-line technique is presented for the identification of rotor blade modal damping and frequency from rotorcraft random response test data. The identification technique is based upon a recursive maximum likelihood (RML) algorithm, which is demonstrated to have excellent convergence characteristics in the presence of random measurement noise and random excitation. The RML technique requires virtually no user interaction, provides accurate confidence bands on the parameter estimates, and can be used for continuous monitoring of modal damping during wind tunnel or flight testing. Results are presented from simulated random response data which quantify the identified parameter convergence behavior for various levels of random excitation. The data length required for acceptable parameter accuracy is shown to depend upon the amplitude of random response and the modal damping level. Random response amplitudes of 1.25 degrees to 0.05 degrees are investigated. The RML technique is applied to hingeless rotor test data. The inplane lag regressing mode is identified at different rotor speeds. The identification from the test data is compared with the simulation results and with other available estimates of frequency and damping.

  18. Enhancing network robustness against targeted and random attacks using a memetic algorithm

    NASA Astrophysics Data System (ADS)

    Tang, Xianglong; Liu, Jing; Zhou, Mingxing

    2015-08-01

    In the past decades, there has been much interest in the resilience of infrastructures to targeted and random attacks. In recent work by Schneider C. M. et al., Proc. Natl. Acad. Sci. U.S.A., 108 (2011) 3838, the authors proposed an effective measure (namely R; here we label it R_t to denote the measure for targeted attacks) to evaluate network robustness against targeted node attacks. Using a greedy algorithm, they found that the optimal structure is an onion-like one. However, real systems are often under threat of both targeted attacks and random failures, so enhancing network robustness against both is of great importance. In this paper, we first design a random-robustness index (R_r). We find that onion-like networks destroy the original strong ability of BA networks to resist random attacks. Moreover, the structure of an R_r-optimized network is found to be different from that of an onion-like network. To design robust scale-free networks (RSF) which are resistant to both targeted and random attacks (TRA) without changing the degree distribution, a memetic algorithm (MA) is proposed, labeled MA-RSF_TRA. In the experiments, both synthetic scale-free networks and real-world networks are used to validate the performance of MA-RSF_TRA. The results show that MA-RSF_TRA has a great ability to search for the most robust network structure that is resistant to both targeted and random attacks.
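
    Schneider et al.'s measure has a compact form: R = (1/N) Σ s(Q), where s(Q) is the largest-component fraction after removing Q nodes. The networkx sketch below computes the targeted (highest-degree-first) variant on a toy BA graph; a random-attack index along the lines of R_r would replace the degree-based choice with a uniformly random one, averaged over trials.

```python
import networkx as nx

def robustness_targeted(G):
    """R = (1/N) * sum of largest-component fractions while repeatedly
    removing the currently highest-degree node."""
    H = G.copy()
    n = H.number_of_nodes()
    total = 0.0
    for _ in range(n - 1):
        v = max(H.degree, key=lambda nd: nd[1])[0]   # degrees recomputed each step
        H.remove_node(v)
        total += max(len(c) for c in nx.connected_components(H)) / n
    return total / n

G = nx.barabasi_albert_graph(200, 2, seed=0)         # toy scale-free network
print(round(robustness_targeted(G), 3))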

  19. A large scale microwave emission model for forests. Contribution to the SMOS algorithm

    NASA Astrophysics Data System (ADS)

    Rahmoune, R.; Della Vecchia, A.; Ferrazzoli, P.; Guerriero, L.; Martin-Porqueras, F.

    2009-04-01

    are being considered. Also the effects of temperature gradients within the crown canopy are being considered. The model was tested against radiometric measurements carried out by towers and aircraft. A new test has been done using the brightness temperatures measured over some forests in Finland by the AMIRAS radiometer, which is an airborne demonstrator of the MIRAS imaging radiometer to be launched with SMOS. The outputs produced by the model are used to fit the parameters of the simple radiative transfer model which will be used in the Level 2 soil moisture retrieval algorithm. It is planned to compare model outputs with L1C data, which will be made available during the commissioning phase. To this end, a number of adequate extended forest sites are being selected: the Amazon rain forest, the Zaire Basin, the Argentinian Chaco forest, and the Finnish forests. 2. PARAMETRIC STUDIES In this paper, results of parametric simulations are shown. The emissivity at vertical and horizontal polarization is simulated as a function of soil moisture content for various conditions of forest cover. Seasonal effects are considered, and the values of Leaf Area Index in winter and summer are taken as basic inputs. The difference between the two values is attributed partially to arboreous foliage and partially to understory, while the woody biomass is assumed to be constant in time. Results indicate that seasonal effects are limited, but not negligible. The simulations are repeated for different distributions of trunk diameters. If the distribution is centered on lower diameter values, the forest is optically thicker for a given biomass. Also the variations of brightness temperature due to a temperature gradient within the crown canopy have been estimated. The outputs are used to fit the values of a simple first-order RT model. 3. COMPARISONS WITH EXPERIMENTAL DATA Results of previous comparisons between model simulations and experimental data are summarized. Experimental

  20. Predicting protein-RNA interaction amino acids using random forest based on submodularity subset selection.

    PubMed

    Pan, Xiaoyong; Zhu, Lin; Fan, Yong-Xian; Yan, Junchi

    2014-11-13

    Protein-RNA interaction plays a crucial role in many biological processes, such as protein synthesis, transcription and post-transcriptional regulation of gene expression, and the pathogenesis of disease. In particular, RNAs often function through binding to proteins. Identification of the binding interface region is especially useful for cellular pathway analysis and drug design. In this study, we propose a novel approach for identifying binding sites in proteins, which not only integrates local and global features from the protein sequence directly, but also constructs a balanced training dataset using sub-sampling based on submodularity subset selection. Firstly, we extracted local and global features from the protein sequence, such as evolutionary information and molecular weight. Secondly, the number of non-interaction sites is much larger than the number of interaction sites, which leads to a sample imbalance problem and hence a biased machine learning model with a preference for non-interaction sites. To better resolve this problem, instead of randomly sub-sampling the over-represented non-interaction sites as in previous work, a novel sampling approach based on submodularity subset selection was employed, which selects a more representative data subset. Finally, random forests were trained on the optimally selected training subsets to predict interaction sites. Our results show that the proposed method is very promising for predicting protein-RNA interaction residues: it achieved an accuracy of 0.863, which is better than other state-of-the-art methods. Furthermore, random forest feature importance analysis indicated that the extracted global features have very strong discriminative ability for identifying interaction residues. PMID:25462339

  1. The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets.

    PubMed

    González-Recio, O; Jiménez-Montero, J A; Alenda, R

    2013-01-01

    In the next few years, with the advent of high-density single nucleotide polymorphism (SNP) arrays and genome sequencing, genomic evaluation methods will need to deal with a large number of genetic variants and an increasing sample size. The boosting algorithm is a machine-learning technique that may alleviate the drawbacks of dealing with such large data sets. This algorithm combines different predictors in a sequential manner with some shrinkage on them; each predictor is applied consecutively to the residuals from the committee formed by the previous ones to form a final prediction based on a subset of covariates. Here, a detailed description is provided and examples using a toy data set are included. A modification of the algorithm called "random boosting" was proposed to increase predictive ability and decrease computation time of genome-assisted evaluation in large data sets. Random boosting uses a random selection of markers to add a subsequent weak learner to the predictive model. These modifications were applied to a real data set composed of 1,797 bulls genotyped for 39,714 SNP. Deregressed proofs of 4 yield traits and 1 type trait from January 2009 routine evaluations were used as dependent variables. A 2-fold cross-validation scenario was implemented. Sires born before 2005 were used as a training sample (1,576 and 1,562 for production and type traits, respectively), whereas younger sires were used as a testing sample to evaluate predictive ability of the algorithm on yet-to-be-observed phenotypes. Comparison with the original algorithm was provided. The predictive ability of the algorithm was measured as Pearson correlations between observed and predicted responses. Further, estimated bias was computed as the average difference between observed and predicted phenotypes. The results showed that the modification of the original boosting algorithm could be run in 1% of the time used with the original algorithm and with negligible differences in accuracy
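
    Below is a toy version of the "random boosting" idea, assuming squared-error loss and single-marker linear weak learners; the paper's learners and loss may differ, but the mechanics are as described above: each new learner is fitted to the committee's residuals using only a random subset of markers, then shrunk before being added.

```python
import numpy as np

def random_boosting(X, y, n_rounds=300, shrinkage=0.05, frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pred = np.full(y.size, y.mean())              # start from the mean
    for _ in range(n_rounds):
        resid = y - pred                          # residuals of the committee
        cols = rng.choice(X.shape[1], size=max(1, int(frac * X.shape[1])),
                          replace=False)          # random subset of markers
        scores = [abs(np.corrcoef(X[:, c], resid)[0, 1]) for c in cols]
        j = cols[int(np.argmax(scores))]          # best marker in the subset
        beta = np.polyfit(X[:, j], resid, 1)      # weak learner on residuals
        pred += shrinkage * np.polyval(beta, X[:, j])
    return pred

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2000))              # toy SNP genotype matrix
y = X[:, 7] + 0.5 * X[:, 99] + rng.standard_normal(500)
pred = random_boosting(X, y)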

  2. Representation of high frequency Space Shuttle data by ARMA algorithms and random response spectra

    NASA Technical Reports Server (NTRS)

    Spanos, P. D.; Mushung, L. J.

    1990-01-01

    High frequency Space Shuttle lift-off data are treated by autoregressive (AR) and autoregressive-moving-average (ARMA) digital algorithms. These algorithms provide useful information on the spectral densities of the data. Further, they yield spectral models which lend themselves to incorporation into the concept of the random response spectrum. This concept yields a reasonably smooth power spectrum for the design of structural and mechanical systems when the available data bank is limited. Due to the non-stationarity of the lift-off event, the pertinent data are split into three slices. Each slice is associated with a rather distinguishable phase of the lift-off event, where stationarity can be expected. The presented results are rather preliminary in nature; the aim is to call attention to the availability of the discussed digital algorithms and to the need to augment the Space Shuttle data bank as more flights are completed.
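
    As a hedged illustration of the AR half of such modeling, the following sketch fits AR coefficients by the Yule-Walker equations and evaluates the implied power spectral density; the model order and the toy record are placeholders, not Shuttle data:

      import numpy as np

      def ar_spectrum(x, order=8, n_freq=512):
          # Yule-Walker estimate: solve R a = r for the AR coefficients, then
          # evaluate S(w) = sigma^2 / (2*pi*|1 - sum_k a_k e^{-ikw}|^2).
          x = np.asarray(x, float) - np.mean(x)
          r = np.correlate(x, x, "full")[len(x) - 1: len(x) + order] / len(x)
          R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
          a = np.linalg.solve(R, r[1: order + 1])
          sigma2 = r[0] - a @ r[1: order + 1]
          w = np.linspace(0, np.pi, n_freq)
          H = 1 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
          return w, sigma2 / (2 * np.pi * np.abs(H) ** 2)

      rng = np.random.default_rng(0)
      x = np.convolve(rng.normal(size=4096), np.ones(8) / 8, mode="same")  # toy record
      w, S = ar_spectrum(x)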

  3. Automatic estimation of voice onset time for word-initial stops by applying random forest to onset detection.

    PubMed

    Lin, Chi-Yueh; Wang, Hsiao-Chuan

    2011-07-01

    The voice onset time (VOT) of a stop consonant is the interval between its burst onset and voicing onset. Among a variety of research topics on VOT, one that has been studied for years is how VOTs can be measured efficiently. Manual annotation is feasible, but it becomes time-consuming when the corpus is large. This paper proposes an automatic VOT estimation method based on an onset detection algorithm. First, a forced alignment is applied to identify the locations of stop consonants. Then a random forest based onset detector searches each stop segment for its burst and voicing onsets to estimate a VOT. The proposed onset detection can detect the onsets in an efficient and accurate manner with only a small amount of training data. The evaluation data, extracted from the TIMIT corpus, comprised 2344 words with a word-initial stop. The experimental results showed that 83.4% of the estimations deviate by less than 10 ms from their manually labeled values, and 96.5% deviate by less than 20 ms. Factors that influence the proposed estimation method, such as the place of articulation, the voicing of the stop consonant, and the quality of the succeeding vowel, were also investigated. PMID:21786917
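
    A minimal sketch of what such a detector could look like, assuming NumPy and scikit-learn; the two frame features (log energy and positive spectral flux) and all parameter values are illustrative stand-ins for the paper's feature set:

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier

      def frame_features(signal, win=400, hop=160):
          # Log frame energy plus positive spectral flux, two cheap onset cues.
          frames = np.lib.stride_tricks.sliding_window_view(signal, win)[::hop]
          mag = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
          energy = np.log(np.sum(frames ** 2, axis=1) + 1e-9)
          flux = np.r_[0.0, np.maximum(np.diff(mag, axis=0), 0).sum(axis=1)]
          return np.column_stack([energy, flux])

      # Hypothetical usage: y_train marks frames labeled as burst/voicing onsets.
      # clf = RandomForestClassifier(n_estimators=300).fit(frame_features(x_train), y_train)
      # onset_frame = np.argmax(clf.predict_proba(frame_features(x_test))[:, 1])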

  4. Random forest predictive modeling of mineral prospectivity with small number of prospects and data with missing values in Abra (Philippines)

    NASA Astrophysics Data System (ADS)

    Carranza, Emmanuel John M.; Laborte, Alice G.

    2015-01-01

    Machine learning methods that have been used in data-driven predictive modeling of mineral prospectivity (e.g., artificial neural networks) invariably require a large number of training prospects/locations and are unable to handle missing values in certain evidential data. The Random Forests (RF) algorithm, a machine learning method, has recently been applied to data-driven predictive mapping of mineral prospectivity, and so it is instructive to further study its efficacy in this particular field. This case study, carried out using data from Abra (Philippines), examines (a) whether RF modeling can be used for data-driven modeling of mineral prospectivity in areas with few (i.e., <20) mineral occurrences and (b) whether RF modeling can handle evidential data with missing values. We found that RF modeling outperforms weights-of-evidence (WofE) modeling of porphyry-Cu prospectivity in the Abra area, where 12 porphyry-Cu prospects are known to exist. Moreover, just like WofE modeling, RF modeling allows analysis of the spatial associations of known prospects with individual layers of evidential data. Furthermore, RF modeling can handle missing values in evidential data through an RF-based imputation technique, whereas in WofE modeling missing values are simply assigned zero weights. Therefore, the RF algorithm is potentially more useful than existing methods that are currently used for data-driven predictive mapping of mineral prospectivity. In particular, it is not a purely black-box method like artificial neural networks in the context of data-driven predictive modeling of mineral prospectivity. However, further testing of the method in other areas with few mineral occurrences is needed to fully investigate its usefulness in data-driven predictive modeling of mineral prospectivity.
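
    The combination of an RF classifier with RF-based imputation of missing evidential values can be approximated in scikit-learn with a MissForest-style iterative imputer; this is an analogue under stated assumptions, not the authors' exact procedure, and the arrays are toys:

      import numpy as np
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer
      from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

      # X: evidential layers per cell, NaN where a layer has no data;
      # y: 1 for cells containing a known prospect, 0 elsewhere (toy data here).
      rng = np.random.default_rng(0)
      X = rng.normal(size=(1000, 6))
      X[rng.random(X.shape) < 0.1] = np.nan          # 10% missing values
      y = (rng.random(1000) < 0.02).astype(int)      # few known prospects

      imputer = IterativeImputer(
          estimator=RandomForestRegressor(n_estimators=50, random_state=0),
          max_iter=5, random_state=0)
      X_full = imputer.fit_transform(X)              # RF-based imputation
      model = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                                     random_state=0).fit(X_full, y)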

  5. (abstract) Using an Inversion Algorithm to Retrieve Parameters and Monitor Changes over Forested Areas from SAR Data

    NASA Technical Reports Server (NTRS)

    Moghaddam, Mahta

    1995-01-01

    In this work, the application of an inversion algorithm based on a nonlinear optimization technique to retrieve forest parameters from multifrequency polarimetric SAR data is discussed. The approach discussed here allows for retrieving and monitoring changes in forest parameters in a quantitative and systematic fashion using SAR data. The parameters to be inverted directly from the data are the electromagnetic scattering properties of the forest components, such as their dielectric constants and size characteristics. Once these are known, attributes such as canopy moisture content can be obtained, which are useful in ecosystem models.

  6. Image Quality Assessment Using Human Visual DOG Model Fused With Random Forest.

    PubMed

    Pei, Soo-Chang; Chen, Li-Heng

    2015-11-01

    Objective image quality assessment (IQA) plays an important role in the development of multimedia applications. The prediction of an IQA metric should be consistent with human perception. The release of the newest IQA database (TID2013) challenges most of the widely used quality metrics (e.g., peak signal-to-noise ratio and the structural similarity index). We propose a new methodology to build the metric model using a regression approach. The new IQA score is a nonlinear combination of features extracted from several difference-of-Gaussian (DOG) frequency bands, mimicking the human visual system (HVS). Experimental results show that the random forest regression model trained on the proposed DOG features corresponds closely to human perception and is also robust when tested on the available databases. PMID:26054064
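
    A compact sketch of the pipeline, assuming SciPy and scikit-learn; the per-band energies below are a deliberately crude stand-in for the paper's DOG feature design, and the training arrays F and mos are hypothetical:

      import numpy as np
      from scipy.ndimage import gaussian_filter
      from sklearn.ensemble import RandomForestRegressor

      def dog_features(img, sigmas=(1, 2, 4, 8)):
          # Energy per difference-of-Gaussian band, a rough HVS-style decomposition.
          img = img.astype(float)
          pyramid = [img] + [gaussian_filter(img, s) for s in sigmas]
          bands = [a - b for a, b in zip(pyramid[:-1], pyramid[1:])]
          return np.array([np.sqrt(np.mean(b ** 2)) for b in bands])

      # Hypothetical training: rows of F are per-image features, mos are
      # subjective quality scores.
      # model = RandomForestRegressor(n_estimators=500).fit(F, mos)
      # quality = model.predict(dog_features(test_img)[None, :])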

  7. Numerical Demultiplexing of Color Image Sensor Measurements via Non-linear Random Forest Modeling

    PubMed Central

    Deglint, Jason; Kazemzadeh, Farnoud; Cho, Daniel; Clausi, David A.; Wong, Alexander

    2016-01-01

    The simultaneous capture of imaging data at multiple wavelengths across the electromagnetic spectrum is highly challenging, requiring complex and costly multispectral image devices. In this study, we investigate the feasibility of simultaneous multispectral imaging using conventional image sensors with color filter arrays via a novel comprehensive framework for numerical demultiplexing of the color image sensor measurements. A numerical forward model characterizing the formation of sensor measurements from light spectra hitting the sensor is constructed based on a comprehensive spectral characterization of the sensor. A numerical demultiplexer is then learned via non-linear random forest modeling based on the forward model. Given the learned numerical demultiplexer, one can then demultiplex simultaneously-acquired measurements made by the color image sensor into reflectance intensities at discrete selectable wavelengths, resulting in a higher resolution reflectance spectrum. Experimental results demonstrate the feasibility of such a method for the purpose of simultaneous multispectral imaging. PMID:27346434

  8. Prediction of G Protein-Coupled Receptors with SVM-Prot Features and Random Forest

    PubMed Central

    Ju, Ying

    2016-01-01

    G protein-coupled receptors (GPCRs) are the largest receptor superfamily. In this paper, we employ physical-chemical properties derived from SVM-Prot to represent GPCRs. Random Forest was utilized as the classifier for distinguishing them from other protein sequences. The MEME suite was used to detect the 10 most significant conserved motifs of human GPCRs. On the testing datasets, the average accuracy was 91.61%, and the average AUC was 0.9282. MEME discovery analysis showed that many motifs aggregated in the seven hydrophobic transmembrane helices, consistent with the characteristics of GPCRs. All of the above indicate that our machine-learning method can successfully distinguish GPCRs from non-GPCRs. PMID:27529053

  9. Prediction of G Protein-Coupled Receptors with SVM-Prot Features and Random Forest.

    PubMed

    Liao, Zhijun; Ju, Ying; Zou, Quan

    2016-01-01

    G protein-coupled receptors (GPCRs) are the largest receptor superfamily. In this paper, we employ physical-chemical properties derived from SVM-Prot to represent GPCRs. Random Forest was utilized as the classifier for distinguishing them from other protein sequences. The MEME suite was used to detect the 10 most significant conserved motifs of human GPCRs. On the testing datasets, the average accuracy was 91.61%, and the average AUC was 0.9282. MEME discovery analysis showed that many motifs aggregated in the seven hydrophobic transmembrane helices, consistent with the characteristics of GPCRs. All of the above indicate that our machine-learning method can successfully distinguish GPCRs from non-GPCRs. PMID:27529053

  10. Numerical Demultiplexing of Color Image Sensor Measurements via Non-linear Random Forest Modeling.

    PubMed

    Deglint, Jason; Kazemzadeh, Farnoud; Cho, Daniel; Clausi, David A; Wong, Alexander

    2016-01-01

    The simultaneous capture of imaging data at multiple wavelengths across the electromagnetic spectrum is highly challenging, requiring complex and costly multispectral image devices. In this study, we investigate the feasibility of simultaneous multispectral imaging using conventional image sensors with color filter arrays via a novel comprehensive framework for numerical demultiplexing of the color image sensor measurements. A numerical forward model characterizing the formation of sensor measurements from light spectra hitting the sensor is constructed based on a comprehensive spectral characterization of the sensor. A numerical demultiplexer is then learned via non-linear random forest modeling based on the forward model. Given the learned numerical demultiplexer, one can then demultiplex simultaneously-acquired measurements made by the color image sensor into reflectance intensities at discrete selectable wavelengths, resulting in a higher resolution reflectance spectrum. Experimental results demonstrate the feasibility of such a method for the purpose of simultaneous multispectral imaging. PMID:27346434

  11. Prediction of Detailed Enzyme Functions and Identification of Specificity Determining Residues by Random Forests

    PubMed Central

    Nagao, Chioko; Nagano, Nozomi; Mizuguchi, Kenji

    2014-01-01

    Determining enzyme functions is essential for a thorough understanding of cellular processes. Although many prediction methods have been developed, it remains a significant challenge to predict enzyme functions at the fourth-digit level of the Enzyme Commission numbers. Functional specificity of enzymes often changes drastically by mutations of a small number of residues and therefore, information about these critical residues can potentially help discriminate detailed functions. However, because these residues must be identified by mutagenesis experiments, the available information is limited, and the lack of experimentally verified specificity determining residues (SDRs) has hindered the development of detailed function prediction methods and computational identification of SDRs. Here we present a novel method for predicting enzyme functions by random forests, EFPrf, along with a set of putative SDRs, the random forests derived SDRs (rf-SDRs). EFPrf consists of a set of binary predictors for enzymes in each CATH superfamily and the rf-SDRs are the residue positions corresponding to the most highly contributing attributes obtained from each predictor. EFPrf showed a precision of 0.98 and a recall of 0.89 in a cross-validated benchmark assessment. The rf-SDRs included many residues, whose importance for specificity had been validated experimentally. The analysis of the rf-SDRs revealed both a general tendency that functionally diverged superfamilies tend to include more active site residues in their rf-SDRs than in less diverged superfamilies, and superfamily-specific conservation patterns of each functional residue. EFPrf and the rf-SDRs will be an effective tool for annotating enzyme functions and for understanding how enzyme functions have diverged within each superfamily. PMID:24416252
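
    The rf-SDR idea, reading putative specificity determining residues off the forest's most highly contributing attributes, can be sketched as follows; the integer alignment encoding and all sizes are toy assumptions, not EFPrf itself:

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier

      rng = np.random.default_rng(0)
      X = rng.integers(0, 20, size=(200, 150))  # toy alignment: residue index per column
      y = rng.integers(0, 2, size=200)          # toy label: has the EC function or not

      clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
      rf_sdr_columns = np.argsort(clf.feature_importances_)[::-1][:10]
      print("putative SDR alignment columns:", rf_sdr_columns)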

  12. Merits of random forests emerge in evaluation of chemometric classifiers by external validation.

    PubMed

    Scott, I M; Lin, W; Liakata, M; Wood, J E; Vermeer, C P; Allaway, D; Ward, J L; Draper, J; Beale, M H; Corol, D I; Baker, J M; King, R D

    2013-11-01

    Real-world applications will inevitably entail divergence between the samples on which chemometric classifiers are trained and the unknowns requiring classification. This has long been recognized, but there is a shortage of empirical studies on which classifiers perform best in 'external validation' (EV), where the unknown samples are subject to sources of variation relative to the population used to train the classifier. A survey of 286 classification studies in analytical chemistry found only 6.6% that stated elements of variance between training and test samples. Instead, most tested classifiers using hold-outs or resampling (usually cross-validation) from the same population used in training. The present study evaluated a wide range of classifiers on NMR and mass spectra of plant and food materials, from four projects with different data properties (e.g., different numbers and prevalence of classes) and classification objectives. Use of cross-validation was found to be optimistic relative to EV on samples of different provenance to the training set (e.g., different genotypes, different growth conditions, different seasons of crop harvest). For classifier evaluations across the diverse tasks, we used ranks-based non-parametric comparisons and permutation-based significance tests. Although latent variable methods (e.g., PLSDA) were used in 64% of the surveyed papers, they were among the less successful classifiers in EV, and orthogonal signal correction was counterproductive. Instead, the best EV performances were obtained with machine learning schemes that coped with the high dimensionality (914-1898 features). Random forests confirmed their resilience to high dimensionality as the best overall performers on the full data, despite being used in only 4.5% of the surveyed papers. Most other machine learning classifiers were improved by a feature selection filter (ReliefF), but still did not out-perform random forests. PMID:24139571
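
    One way to make an internal estimate behave more like the external validation discussed above is to split by provenance, so that no batch (season, genotype, growth condition) spans both training and testing; a sketch using scikit-learn's GroupKFold on toy data:

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import GroupKFold, cross_val_score

      rng = np.random.default_rng(0)
      X = rng.normal(size=(120, 1000))           # toy high-dimensional spectra
      y = rng.integers(0, 2, size=120)
      groups = np.repeat(np.arange(6), 20)       # e.g., six harvests or genotypes

      scores = cross_val_score(RandomForestClassifier(n_estimators=300, n_jobs=-1),
                               X, y, groups=groups, cv=GroupKFold(n_splits=3))
      print(scores.mean())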

  13. An evaluation of ISOCLS and CLASSY clustering algorithms for forest classification in northern Idaho. [Elk River quadrangle of the Clearwater National Forest]

    NASA Technical Reports Server (NTRS)

    Werth, L. F. (Principal Investigator)

    1981-01-01

    Both the iterative self-organizing clustering system (ISOCLS) and the CLASSY algorithm were applied to forest and nonforest classes for one 1:24,000 quadrangle map of northern Idaho, and the classification and mapping accuracies were evaluated with 1:30,000 color infrared aerial photography. Confusion matrices for the two clustering algorithms were generated and studied to determine which is most applicable to forest and rangeland inventories in future projects. In an unsupervised mode, ISOCLS requires many trial-and-error runs to find the proper parameters to separate the desired information classes. CLASSY reveals more in a single run about the classes that can be separated, and shows more promise than ISOCLS for forest stratification and for consistency. One major drawback of CLASSY is that important forest and range classes smaller than the minimum cluster size will be combined with other classes. The algorithm also requires so much computer storage that only data sets as small as a single quadrangle can be processed at one time.

  14. Hyperspectral image clustering method based on artificial bee colony algorithm and Markov random fields

    NASA Astrophysics Data System (ADS)

    Sun, Xu; Yang, Lina; Gao, Lianru; Zhang, Bing; Li, Shanshan; Li, Jun

    2015-01-01

    Center-oriented hyperspectral image clustering methods have been widely applied to hyperspectral remote sensing image processing; however, their drawbacks are obvious, including overly simple computational models and underutilized spatial information. In recent years, some studies have tried to improve this situation. We introduce the artificial bee colony (ABC) and Markov random field (MRF) algorithms to propose an ABC-MRF-cluster model that addresses the problems mentioned above. In this model, a typical ABC algorithm framework is adopted, in which cluster centers and the results of the iterated conditional modes algorithm are treated as feasible solutions and objective functions, respectively, and MRF is modified to be capable of dealing with the clustering problem. Finally, four datasets and two indices are used to show that the ABC-cluster and ABC-MRF-cluster methods obtain better image accuracy than conventional methods. Specifically, the ABC-cluster method is superior in terms of spectral discrimination power, whereas the ABC-MRF-cluster method provides better results in terms of the adjusted Rand index. In experiments on simulated images with different signal-to-noise ratios, ABC-cluster and ABC-MRF-cluster showed good stability.

  15. Personalized PageRank Clustering: A graph clustering algorithm based on random walks

    NASA Astrophysics Data System (ADS)

    A. Tabrizi, Shayan; Shakery, Azadeh; Asadpour, Masoud; Abbasi, Maziar; Tavallaie, Mohammad Ali

    2013-11-01

    Graph clustering is an essential part of many methods, and thus its accuracy has a significant effect on many applications. In addition, the exponential growth of real-world graphs such as social networks, biological networks and electrical circuits demands clustering algorithms with nearly-linear time and space complexity. In this paper we propose Personalized PageRank Clustering (PPC), which employs the inherent cluster-exploring property of random walks to reveal the clusters of a given graph. We combine random walks and modularity to precisely and efficiently reveal the clusters of a graph. PPC is a top-down algorithm, so it can reveal the inherent clusters of a graph more accurately than other nearly-linear approaches, which are mainly bottom-up. It also gives a hierarchy of clusters, which is useful in many applications. PPC has linear time and space complexity and has been superior to most of the available clustering algorithms on many datasets. Furthermore, its top-down approach makes it a flexible solution for clustering problems with different requirements.
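
    The random-walk core of PPC can be illustrated with personalized PageRank computed by power iteration on a row-stochastic transition matrix; this minimal sketch uses dense matrices for brevity and omits PPC's modularity-based cluster extraction:

      import numpy as np

      def personalized_pagerank(A, seed, alpha=0.15, n_iter=200):
          # p <- alpha * e_seed + (1 - alpha) * P^T p, with P = D^{-1} A.
          P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
          p = np.zeros(len(A)); p[seed] = 1.0
          e = p.copy()
          for _ in range(n_iter):
              p = alpha * e + (1 - alpha) * (P.T @ p)
          return p

      A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], float)
      scores = personalized_pagerank(A, seed=0)
      # Ranking nodes by scores / degree and sweeping for a low-conductance
      # prefix is the usual way to cut a cluster around the seed node.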

  16. Random Forest-Based Recognition of Isolated Sign Language Subwords Using Data from Accelerometers and Surface Electromyographic Sensors.

    PubMed

    Su, Ruiliang; Chen, Xiang; Cao, Shuai; Zhang, Xu

    2015-01-01

    Sign language recognition (SLR) has been widely used for communication amongst the hearing-impaired and non-verbal community. This paper proposes an accurate and robust SLR framework using an improved decision tree as the base classifier of random forests. This framework was used to recognize Chinese sign language subwords using recordings from a pair of portable devices worn on both arms consisting of accelerometers (ACC) and surface electromyography (sEMG) sensors. The experimental results demonstrated the validity of the proposed random forest-based method for recognition of Chinese sign language (CSL) subwords. With the proposed method, 98.25% average accuracy was obtained for the classification of a list of 121 frequently used CSL subwords. Moreover, the random forests method demonstrated a superior performance in resisting the impact of bad training samples. When the proportion of bad samples in the training set reached 50%, the recognition error rate of the random forest-based method was only 10.67%, while that of a single decision tree adopted in our previous work was almost 27.5%. Our study offers a practical way of realizing a robust and wearable EMG-ACC-based SLR system. PMID:26784195

  17. An Introduction to Recursive Partitioning: Rationale, Application, and Characteristics of Classification and Regression Trees, Bagging, and Random Forests

    ERIC Educational Resources Information Center

    Strobl, Carolin; Malley, James; Tutz, Gerhard

    2009-01-01

    Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and…

  18. Random Forest-Based Recognition of Isolated Sign Language Subwords Using Data from Accelerometers and Surface Electromyographic Sensors

    PubMed Central

    Su, Ruiliang; Chen, Xiang; Cao, Shuai; Zhang, Xu

    2016-01-01

    Sign language recognition (SLR) has been widely used for communication amongst the hearing-impaired and non-verbal community. This paper proposes an accurate and robust SLR framework using an improved decision tree as the base classifier of random forests. This framework was used to recognize Chinese sign language subwords using recordings from a pair of portable devices worn on both arms consisting of accelerometers (ACC) and surface electromyography (sEMG) sensors. The experimental results demonstrated the validity of the proposed random forest-based method for recognition of Chinese sign language (CSL) subwords. With the proposed method, 98.25% average accuracy was obtained for the classification of a list of 121 frequently used CSL subwords. Moreover, the random forests method demonstrated a superior performance in resisting the impact of bad training samples. When the proportion of bad samples in the training set reached 50%, the recognition error rate of the random forest-based method was only 10.67%, while that of a single decision tree adopted in our previous work was almost 27.5%. Our study offers a practical way of realizing a robust and wearable EMG-ACC-based SLR system. PMID:26784195

  19. Automated Detection of Selective Logging in Amazon Forests Using Airborne Lidar Data and Pattern Recognition Algorithms

    NASA Astrophysics Data System (ADS)

    Keller, M. M.; d'Oliveira, M. N.; Takemura, C. M.; Vitoria, D.; Araujo, L. S.; Morton, D. C.

    2012-12-01

    Selective logging, the removal of several valuable timber trees per hectare, is an important land use in the Brazilian Amazon and may degrade forests through long-term changes in structure, loss of forest carbon, and loss of species diversity. Similar to deforestation, the annual area affected by selective logging has declined significantly in the past decade. Nonetheless, this land use affects several thousand km2 per year in Brazil. We studied a 1000 ha area of the Antimary State Forest (FEA) in the State of Acre, Brazil (9.304°S, 68.281°W), which has a basal area of 22.5 m2 ha-1 and an above-ground biomass of 231 Mg ha-1. Logging intensity was low, approximately 10 to 15 m3 ha-1. We collected small-footprint airborne lidar data using an Optech ALTM 3100EA over the study area once each in 2010 and 2011. The study area contained both recent and older logging performed using both conventional and technologically advanced logging techniques. Lidar return density averaged over 20 m-2 for both collection periods, with estimated horizontal and vertical precision of 0.30 and 0.15 m. A relative density model comparing returns from the 0-1 m elevation range to returns in the 1-5 m elevation range revealed the pattern of roads and skid trails. These patterns were confirmed by a ground-based GPS survey. A GIS model of the road and skid network was built using lidar and ground data. We tested and compared two pattern recognition approaches to automate logging detection. Both commercial eCognition segmentation and a Frangi filter algorithm identified the road and skid trail network when compared against the GIS model. We report on the effectiveness of these two techniques.

  20. Groupwise conditional random forests for automatic shape classification and contour quality assessment in radiotherapy planning.

    PubMed

    McIntosh, Chris; Svistoun, Igor; Purdie, Thomas G

    2013-06-01

    Radiation therapy is used to treat cancer patients around the world. High quality treatment plans maximally radiate the targets while minimally radiating healthy organs at risk. In order to judge plan quality and safety, segmentations of the targets and organs at risk are created, and the amount of radiation that will be delivered to each structure is estimated prior to treatment. If the targets or organs at risk are mislabelled, or the segmentations are of poor quality, the safety of the radiation doses will be erroneously reviewed and an unsafe plan could proceed. We propose a technique to automatically label groups of segmentations of different structures from a radiation therapy plan for the joint purposes of providing quality assurance and data mining. Given one or more segmentations and an associated image we seek to assign medically meaningful labels to each segmentation and report the confidence of that label. Our method uses random forests to learn joint distributions over the training features, and then exploits a set of learned potential group configurations to build a conditional random field (CRF) that ensures the assignment of labels is consistent across the group of segmentations. The CRF is then solved via a constrained assignment problem. We validate our method on 1574 plans, consisting of 17,579 segmentations, demonstrating an overall classification accuracy of 91.58%. Our results also demonstrate the stability of RF with respect to tree depth and the number of splitting variables in large data sets. PMID:23475352

  1. Random forest in remote sensing: A review of applications and future directions

    NASA Astrophysics Data System (ADS)

    Belgiu, Mariana; Drăguţ, Lucian

    2016-04-01

    A random forest (RF) classifier is an ensemble classifier that produces multiple decision trees, using a randomly selected subset of training samples and variables. This classifier has become popular within the remote sensing community due to the accuracy of its classifications. The overall objective of this work was to review the utilization of the RF classifier in remote sensing. This review has revealed that the RF classifier can successfully handle high data dimensionality and multicollinearity, being both fast and insensitive to overfitting. It is, however, sensitive to the sampling design. The variable importance (VI) measurement provided by the RF classifier has been extensively exploited in different scenarios, for example to reduce the dimensionality of hyperspectral data, to identify the most relevant multisource remote sensing and geographic data, and to select the most suitable season for classifying particular target classes. Further investigations are required into less commonly exploited uses of this classifier, such as sample proximity analysis to detect and remove outliers from the training samples.

  2. View-invariant, partially occluded human detection in still images using part bases and random forest

    NASA Astrophysics Data System (ADS)

    Ko, Byoung Chul; Son, Jung Eun; Nam, Jae-Yeal

    2015-05-01

    This paper presents a part-based human detection method that is invariant to variations in view and to partial occlusion by other objects. First, to address view variance, parts are extracted from three views: frontal-rear, left profile, and right profile. A random set of rectangular parts is then extracted from the upper, middle, and lower body according to a Gaussian distribution. Second, an individual part classifier is constructed using random forests across all parts extracted from the three views. From the part locations of each view, part vectors (PVs) are generated, and part bases (PBs) are formed by clustering the PVs, with a weight assigned to each PB. For testing, a PV for the frontal-rear view is estimated using the trained part detectors and is then applied to the trained PBs of each view class, and the distance between the PBs and PVs is computed. After applying the same process to the other two views, the human detection and view with the minimum score are selected. The proposed method is applied to pedestrian datasets; its detection precision is, on average, 0.14 higher than that of related methods, while achieving a faster or comparable processing time, averaging 1.85 s per image.

  3. Analysis of (1+1) evolutionary algorithm and randomized local search with memory.

    PubMed

    Sung, Chi Wan; Yuen, Shiu Yin

    2011-01-01

    This paper considers the scenario of the (1+1) evolutionary algorithm (EA) and randomized local search (RLS) with memory. Previously explored solutions are stored in memory until an improvement in fitness is obtained; then the stored information is discarded. This results in two new algorithms: (1+1) EA-m (with a raw list and hash table option) and RLS-m+ (and RLS-m if the function is a priori known to be unimodal). These two algorithms can be regarded as very simple forms of tabu search. Rigorous theoretical analysis of the expected time to find the globally optimal solutions for these algorithms is conducted for both unimodal and multimodal functions. A unified mathematical framework, involving the new concept of spatially invariant neighborhood, is proposed. Under this framework, both (1+1) EA with standard uniform mutation and RLS can be considered as particular instances and in the most general cases, all functions can be considered to be unimodal. Under this framework, it is found that for unimodal functions, the improvement by memory assistance is always positive but at most by one half. For multimodal functions, the improvement is significant; for functions with gaps and another hard function, the order of growth is reduced; for at least one example function, the order can change from exponential to polynomial. Empirical results, with a reasonable fitness evaluation time assumption, verify that (1+1) EA-m and RLS-m+ are superior to their conventional counterparts. Both new algorithms are promising for use in a memetic algorithm. In particular, RLS-m+ makes the previously impractical RLS practical, and surprisingly, does not require any extra memory in actual implementation. PMID:20868262
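
    A hedged sketch of the memory mechanism on bitstrings, using the hash-table option (a Python set) and standard uniform mutation; OneMax stands in for the fitness function, and details of the published (1+1) EA-m may differ:

      import numpy as np

      def one_plus_one_ea_m(fitness, n, max_evals=10000, seed=0):
          rng = np.random.default_rng(seed)
          x = rng.integers(0, 2, n)
          fx, evals = fitness(x), 1
          seen = {x.tobytes()}                  # memory of explored offspring
          while evals < max_evals:
              flips = rng.random(n) < 1.0 / n   # standard uniform mutation
              if not flips.any():
                  continue
              y = x.copy(); y[flips] ^= 1
              key = y.tobytes()
              if key in seen:                   # memory hit: skip re-evaluation
                  continue
              seen.add(key); evals += 1
              fy = fitness(y)
              if fy > fx:                       # improvement: discard stored info
                  x, fx, seen = y, fy, {key}
          return x, fx

      best, f_best = one_plus_one_ea_m(lambda b: int(b.sum()), n=30)  # OneMax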

  4. Landslide susceptibility estimation by random forests technique: sensitivity and scaling issues

    NASA Astrophysics Data System (ADS)

    Catani, F.; Lagomarsino, D.; Segoni, S.; Tofani, V.

    2013-11-01

    Despite the large number of recent advances and developments in landslide susceptibility mapping (LSM), there is still a lack of studies focusing on specific aspects of LSM model sensitivity. For example, the influence of factors such as the survey scale of the landslide conditioning variables (LCVs), the resolution of the mapping unit (MUR), and the optimal number and ranking of LCVs has never been investigated analytically, especially on large data sets. In this paper we attempt this experimentation, concentrating on the impact of model tuning choices on the final result rather than on a comparison of methodologies. To this end, we adopt a simple implementation of the random forest (RF), a machine learning technique, to produce an ensemble of landslide susceptibility maps for a set of different model settings, input data types, and scales. A random forest is a combination of decision trees that relates a set of predictors to the actual landslide occurrence. Being a nonparametric model, it can incorporate a range of numerical or categorical data layers, and there is no need to select unimodal training data as, for example, in linear discriminant analysis. Many widely acknowledged landslide predisposing factors are taken into account, mainly related to lithology, land use, geomorphology, and structural and anthropogenic constraints. In addition, for each factor we also include in the predictor set a measure of the standard deviation (for numerical variables) or the variety (for categorical ones) over the map unit. As in other systems, the use of RF enables one to estimate the relative importance of the single input parameters and to select the optimal configuration of the classification model. The model is initially applied using the complete set of input variables; then an iterative process is implemented and progressively smaller subsets of the parameter space are considered. The impact of scale and accuracy of input variables, as well as

  5. Lesion segmentation from multimodal MRI using random forest following ischemic stroke.

    PubMed

    Mitra, Jhimli; Bourgeat, Pierrick; Fripp, Jurgen; Ghose, Soumya; Rose, Stephen; Salvado, Olivier; Connelly, Alan; Campbell, Bruce; Palmer, Susan; Sharma, Gagan; Christensen, Soren; Carey, Leeanne

    2014-09-01

    Understanding structure-function relationships in the brain after stroke relies not only on the accurate anatomical delineation of the focal ischemic lesion, but also on previous infarcts, remote changes and the presence of white matter hyperintensities. The robust definition of primary stroke boundaries and secondary brain lesions will have significant impact on the investigation of brain-behavior relationships and lesion volume correlations with clinical measures after stroke. Here we present an automated approach to identify chronic ischemic infarcts in addition to other white matter pathologies, which may be used to aid the development of post-stroke management strategies. Our approach uses Bayesian-Markov Random Field (MRF) classification to segment probable lesion volumes present on fluid attenuated inversion recovery (FLAIR) MRI. Thereafter, a random forest classification of the information from multimodal (T1-weighted, T2-weighted, FLAIR, and apparent diffusion coefficient (ADC)) MRI images and other context-aware features (within the probable lesion areas) was used to extract areas with a high likelihood of being classified as lesions. The final segmentation of the lesion was obtained by thresholding the random forest probabilistic maps. The accuracy of the automated lesion delineation method was assessed in a total of 36 patients (24 male, 12 female, mean age: 64.57 ± 14.23 years) at 3 months after stroke onset and compared with lesion volumes manually segmented by an expert. Accuracy assessment was performed using the commonly used evaluation metrics. The mean sensitivity of segmentation was 0.53 ± 0.13, with a mean positive predictive value of 0.75 ± 0.18. The mean lesion volume difference was 32.32% ± 21.643%, with a high Pearson's correlation of r = 0.76 (p < 0.0001). The lesion overlap accuracy was measured in terms of the Dice similarity coefficient, with a mean of 0.60 ± 0.12, while the contour

  6. Random Search Algorithm for Solving the Nonlinear Fredholm Integral Equations of the Second Kind

    PubMed Central

    Hong, Zhimin; Yan, Zaizai; Yan, Jiao

    2014-01-01

    In this paper, a randomized numerical approach is used to obtain approximate solutions for a class of nonlinear Fredholm integral equations of the second kind. The proposed approach contains two steps: first, we define a discretized form of the integral equation by quadrature formula methods; the solution of this discretized form converges to the exact solution of the integral equation under some conditions on the kernel. We then convert the problem to an optimal control problem by introducing an artificial control function. In the next step, the solution of the discretized form is approximated by a kind of Monte Carlo (MC) random search algorithm. Finally, some examples are given to show the efficiency of the proposed approach. PMID:25072373
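
    A toy rendition of the two steps, assuming a trapezoid-rule discretization and an illustrative nonlinear kernel (neither the kernel nor the proposal step size comes from the paper):

      import numpy as np

      m = 21
      x = np.linspace(0, 1, m)
      w = np.full(m, 1.0 / (m - 1)); w[[0, -1]] /= 2      # trapezoid weights
      K = 0.5 * np.exp(-np.abs(x[:, None] - x[None, :]))  # illustrative kernel k(x,t)
      f = np.sin(np.pi * x)                               # illustrative right-hand side

      def residual(u):
          # Residual of the discretized nonlinear equation u = f + K (w * u^2).
          return np.linalg.norm(u - f - K @ (w * u ** 2))

      rng = np.random.default_rng(0)
      u, best = f.copy(), residual(f)
      for _ in range(50000):                              # Monte Carlo random search
          cand = u + rng.normal(scale=0.02, size=m)
          r = residual(cand)
          if r < best:
              u, best = cand, r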

  7. Evolving random fractal Cantor superlattices for the infrared using a genetic algorithm

    PubMed Central

    Bossard, Jeremy A.; Lin, Lan; Werner, Douglas H.

    2016-01-01

    Ordered and chaotic superlattices have been identified in Nature that give rise to a variety of colours reflected by the skin of various organisms. In particular, organisms such as silvery fish possess superlattices that reflect a broad range of light from the visible to the UV. Such superlattices have previously been identified as ‘chaotic’, but we propose that apparent ‘chaotic’ natural structures, which have been previously modelled as completely random structures, should have an underlying fractal geometry. Fractal geometry, often described as the geometry of Nature, can be used to mimic structures found in Nature, but deterministic fractals produce structures that are too ‘perfect’ to appear natural. Introducing variability into fractals produces structures that appear more natural. We suggest that the ‘chaotic’ (purely random) superlattices identified in Nature are more accurately modelled by multi-generator fractals. Furthermore, we introduce fractal random Cantor bars as a candidate for generating both ordered and ‘chaotic’ superlattices, such as the ones found in silvery fish. A genetic algorithm is used to evolve optimal fractal random Cantor bars with multiple generators targeting several desired optical functions in the mid-infrared and the near-infrared. We present optimized superlattices demonstrating broadband reflection as well as single and multiple pass bands in the near-infrared regime. PMID:26763335

  8. Evolving random fractal Cantor superlattices for the infrared using a genetic algorithm.

    PubMed

    Bossard, Jeremy A; Lin, Lan; Werner, Douglas H

    2016-01-01

    Ordered and chaotic superlattices have been identified in Nature that give rise to a variety of colours reflected by the skin of various organisms. In particular, organisms such as silvery fish possess superlattices that reflect a broad range of light from the visible to the UV. Such superlattices have previously been identified as 'chaotic', but we propose that apparent 'chaotic' natural structures, which have been previously modelled as completely random structures, should have an underlying fractal geometry. Fractal geometry, often described as the geometry of Nature, can be used to mimic structures found in Nature, but deterministic fractals produce structures that are too 'perfect' to appear natural. Introducing variability into fractals produces structures that appear more natural. We suggest that the 'chaotic' (purely random) superlattices identified in Nature are more accurately modelled by multi-generator fractals. Furthermore, we introduce fractal random Cantor bars as a candidate for generating both ordered and 'chaotic' superlattices, such as the ones found in silvery fish. A genetic algorithm is used to evolve optimal fractal random Cantor bars with multiple generators targeting several desired optical functions in the mid-infrared and the near-infrared. We present optimized superlattices demonstrating broadband reflection as well as single and multiple pass bands in the near-infrared regime. PMID:26763335

  9. Cooperative mobile agents search using beehive partitioned structure and Tabu Random search algorithm

    NASA Astrophysics Data System (ADS)

    Ramazani, Saba; Jackson, Delvin L.; Selmic, Rastko R.

    2013-05-01

    In search and surveillance operations, deploying a team of mobile agents provides a robust solution that has multiple advantages over using a single agent in terms of efficiency and reduced exploration time. This paper addresses the challenge of identifying a target in a given environment using a team of mobile agents by proposing a novel method for the mapping and movement of agent teams in a cooperative manner. The approach consists of two parts. First, the region is partitioned into a hexagonal beehive structure in order to provide equidistant movements in every direction and to allow for more natural and flexible environment mapping. Additionally, in search environments that are partitioned into hexagons, mobile agents have an efficient travel path while performing searches due to this partitioning approach. Second, we use a team of mobile agents that move in a cooperative manner and utilize the Tabu Random algorithm to search for the target. Due to the ever-increasing use of robotics and Unmanned Aerial Vehicle (UAV) platforms, the field of cooperative multi-agent search has recently developed many applications that would benefit from the approach presented in this work, including search and rescue operations, surveillance, data collection, and border patrol. In this paper, the increased efficiency of the Tabu Random Search algorithm in combination with hexagonal partitioning is simulated and analyzed, and the advantages of this approach are presented and discussed.

  10. Classification of nanoparticle diffusion processes in vital cells by a multifeature random forests approach: application to simulated data, darkfield, and confocal laser scanning microscopy

    NASA Astrophysics Data System (ADS)

    Wagner, Thorsten; Kroll, Alexandra; Wiemann, Martin; Lipinski, Hans-Gerd

    2016-04-01

    Darkfield and confocal laser scanning microscopy both allow for a simultaneous observation of live cells and single nanoparticles. Accordingly, a characterization of nanoparticle uptake and intracellular mobility appears possible within living cells. Single particle tracking makes it possible to characterize the particle and the surrounding cell. In the case of free diffusion, the mean squared displacement for each trajectory of a nanoparticle can be measured, which allows computing the corresponding diffusion coefficient and, if desired, converting it into the hydrodynamic diameter using the Stokes-Einstein equation and the viscosity of the fluid. However, within the more complex system of a cell's cytoplasm, unrestrained diffusion is scarce and several other types of movement may occur. Thus, confined or anomalous diffusion (e.g. diffusion in porous media), active transport, and combinations thereof have been described by several authors. To distinguish between these types of particle movement we developed an appropriate classification method, and simulated three types of particle motion in a 2D plane using a Monte Carlo approach: (1) normal diffusion, using random direction and step-length, (2) subdiffusion, using confinements like a reflective boundary with defined radius or reflective objects in the closer vicinity, and (3) superdiffusion, using a directed flow added to the normal diffusion. To simulate subdiffusion we devised a new method based on tracks of different length combined with equally probable obstacle interaction. Next we estimated the fractal dimension, the elongation, and the ratio of long-time/short-time diffusion coefficients. These features were used to train a random forests classification algorithm. The accuracy for simulated trajectories with 180 steps was 97% (95%-CI: 0.9481-0.9884). The balanced accuracy was 94%, 99% and 98% for normal-, sub- and superdiffusion, respectively. Nanoparticle tracking analysis was used with 100 nm polystyrene particles
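
    The backbone of such trajectory features is the time-averaged mean squared displacement and its scaling exponent; a minimal sketch (the paper's full feature set, e.g. fractal dimension and elongation, is not reproduced):

      import numpy as np

      def msd(track):
          # Time-averaged mean squared displacement of an (N, 2) trajectory.
          n = len(track)
          return np.array([np.mean(np.sum((track[lag:] - track[:-lag]) ** 2, axis=1))
                           for lag in range(1, n // 4)])

      def anomalous_exponent(track, dt=1.0):
          # Fit MSD ~ t^alpha; alpha < 1 suggests subdiffusion, alpha > 1
          # superdiffusion, alpha near 1 normal diffusion.
          m = msd(track)
          lags = np.arange(1, len(m) + 1) * dt
          alpha, _ = np.polyfit(np.log(lags), np.log(m), 1)
          return alpha

      rng = np.random.default_rng(0)
      walk = np.cumsum(rng.normal(size=(200, 2)), axis=0)   # toy free diffusion
      print(anomalous_exponent(walk))                       # expected near 1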

  11. SPAR: a random forest-based predictor for self-interacting proteins with fine-grained domain information.

    PubMed

    Liu, Xuhan; Yang, Shiping; Li, Chen; Zhang, Ziding; Song, Jiangning

    2016-07-01

    Protein self-interaction, i.e. the interaction between two or more identical proteins expressed by one gene, plays an important role in the regulation of cellular functions. Considering the limitations of experimental self-interaction identification, it is necessary to design specific bioinformatics tools for self-interacting protein (SIP) prediction from protein sequence information. In this study, we proposed an improved computational approach for SIP prediction, termed SPAR (Self-interacting Protein Analysis serveR). Firstly, we developed an improved encoding scheme named critical residues substitution (CRS), in which fine-grained domain-domain interaction information is taken into account. Then, by employing the Random Forest algorithm, the performance of CRS was evaluated and compared with several other encoding schemes commonly used for sequence-based protein-protein interaction prediction. Through tenfold cross-validation tests on a balanced training dataset, CRS performed the best, with an average accuracy of up to 72.01%. We further integrated CRS with other encoding schemes and identified the most important features using the mRMR (minimum redundancy maximum relevance) feature selection method. Our SPAR model with selected features achieved an average accuracy of 92.09% on the human-independent test set (the ratio of positives to negatives was about 1:11). Besides, we also evaluated the performance of SPAR on an independent yeast test set (the ratio of positives to negatives was about 1:8) and obtained an average accuracy of 76.96%. The results demonstrate that SPAR is capable of achieving reasonable performance in cross-species application. The SPAR server is freely available for academic use at http://systbio.cau.edu.cn/zzdlab/spar/ . PMID:27074717

  12. Track-Before-Detect Algorithm for Faint Moving Objects based on Random Sampling and Consensus

    NASA Astrophysics Data System (ADS)

    Dao, P.; Rast, R.; Schlaegel, W.; Schmidt, V.; Dentamaro, A.

    2014-09-01

    There are many algorithms developed for tracking and detecting faint moving objects in congested backgrounds. One obvious application is the detection of targets in images where each pixel corresponds to the received power in a particular location. In our application, a visible imager operated in stare mode observes geostationary objects as fixed, stars as moving, and non-geostationary objects as drifting in the field of view. We would like to achieve high sensitivity detection of the drifters. The ability to improve SNR with track-before-detect (TBD) processing, where target information is collected and collated before the detection decision is made, allows respectable performance against dim moving objects. Generally, a TBD algorithm consists of a pre-processing stage that highlights potential targets and a temporal filtering stage. However, the algorithms that have been successfully demonstrated, e.g. Viterbi-based and Bayesian-based, demand formidable processing power and memory. We propose an algorithm that exploits the quasi-constant velocity of objects, the predictability of the stellar clutter, and the intrinsically low false alarm rate of detecting signature candidates in 3-D, based on an iterative method called "RANdom SAmple Consensus" (RANSAC), and one that can run in real time on a typical PC. The technique is tailored for searching for objects with small telescopes in stare mode. Our RANSAC-MT (Moving Target) algorithm estimates the parameters of a mathematical model (e.g., linear motion) from a set of observed data which contains a significant number of outliers, while identifying inliers. In the pre-processing phase, candidate blobs are selected based on morphology and an intensity threshold that would normally generate an unacceptable level of false alarms. The RANSAC sampling then rejects candidates that conform to the predictable motion of the stars. Data collected with a 17 inch telescope by AFRL/RH and a COTS lens/EM-CCD sensor by the AFRL/RD Satellite Assessment Center is
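
    The core RANSAC moving-target step can be sketched in a few lines: sample two detections, hypothesize a constant-velocity track, and count inliers; the (t, x, y) detection format and the tolerance are assumptions, not the authors' values:

      import numpy as np

      def ransac_linear_track(det, n_iter=2000, tol=1.5, seed=0):
          # det rows are (t, x, y); returns the inlier mask of the best
          # constant-velocity hypothesis found.
          rng = np.random.default_rng(seed)
          best = np.zeros(len(det), bool)
          for _ in range(n_iter):
              i, j = rng.choice(len(det), size=2, replace=False)
              dt = det[j, 0] - det[i, 0]
              if dt == 0:
                  continue
              v = (det[j, 1:] - det[i, 1:]) / dt                 # velocity hypothesis
              pred = det[i, 1:] + np.outer(det[:, 0] - det[i, 0], v)
              inliers = np.linalg.norm(det[:, 1:] - pred, axis=1) < tol
              if inliers.sum() > best.sum():
                  best = inliers
          return best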

  13. An efficient voting algorithm for finding additive biclusters with random background.

    PubMed

    Xiao, Jing; Wang, Lusheng; Liu, Xiaowen; Jiang, Tao

    2008-12-01

    The biclustering problem has been extensively studied in many areas, including e-commerce, data mining, machine learning, pattern recognition, statistics, and, more recently, computational biology. Given an n × m matrix A (n ≥ m), the main goal of biclustering is to identify a subset of rows (called objects) and a subset of columns (called properties) such that some objective function that specifies the quality of the found bicluster (formed by the subsets of rows and of columns of A) is optimized. The problem has been proved or conjectured to be NP-hard for various objective functions. In this article, we study a probabilistic model for the implanted additive bicluster problem, where each element in the n × m background matrix is a random integer from [0, L - 1] for some integer L, and a k × k implanted additive bicluster is obtained from an error-free additive bicluster by randomly changing each element to a number in [0, L - 1] with probability θ. We propose an O(n^2 m) time algorithm based on voting to solve the problem. We show that when k ≥ Ω(√(n log n)), the voting algorithm can correctly find the implanted bicluster with probability at least 1 - 9/n^2. We also implement our algorithm as a C++ program named VOTE. The implementation incorporates several ideas for estimating the size of an implanted bicluster, adjusting the threshold in voting, dealing with small biclusters, and dealing with overlapping implanted biclusters. Our experimental results on both simulated and real datasets show that VOTE can find biclusters with high accuracy and speed. PMID:19040364

  14. Precise algorithm to generate random sequential addition of hard hyperspheres at saturation.

    PubMed

    Zhang, G; Torquato, S

    2013-11-01

    The study of the packing of hard hyperspheres in d-dimensional Euclidean space R^d has been a topic of great interest in statistical mechanics and condensed matter theory. While the densest known packings are ordered in sufficiently low dimensions, it has been suggested that in sufficiently large dimensions, the densest packings might be disordered. The random sequential addition (RSA) time-dependent packing process, in which congruent hard hyperspheres are randomly and sequentially placed into a system without interparticle overlap, is a useful packing model to study disorder in high dimensions. Of particular interest is the infinite-time saturation limit in which the available space for another sphere tends to zero. However, the associated saturation density has been determined in all previous investigations by extrapolating the density results for nearly saturated configurations to the saturation limit, which necessarily introduces numerical uncertainties. We have refined an algorithm devised by us [S. Torquato, O. U. Uche, and F. H. Stillinger, Phys. Rev. E 74, 061308 (2006)] to generate RSA packings of identical hyperspheres. The improved algorithm produces packings that are guaranteed to contain no available space in a large simulation box using finite computational time, with heretofore unattained precision and across the widest range of dimensions (2≤d≤8). We have also calculated the packing and covering densities, pair correlation function g2(r), and structure factor S(k) of the saturated RSA configurations. As the space dimension increases, we find that pair correlations markedly diminish, consistent with a recently proposed "decorrelation" principle, and the degree of "hyperuniformity" (suppression of infinite-wavelength density fluctuations) increases. We have also calculated the void exclusion probability in order to compute the so-called quantizer error of the RSA packings, which is related to the second moment of inertia of the average
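
    A toy two-dimensional analogue of the RSA process using plain rejection sampling; note that stopping after many consecutive failures only approaches saturation, whereas the paper's algorithm tracks the remaining available space exactly to guarantee it:

      import numpy as np

      def rsa_disks(box=20.0, diameter=1.0, max_fail=100000, seed=0):
          # Randomly and sequentially place non-overlapping disks; stop after
          # max_fail consecutive rejections (near, not at, saturation).
          rng = np.random.default_rng(seed)
          centers, fails = [], 0
          while fails < max_fail:
              p = rng.uniform(0, box, 2)
              if all(np.hypot(*(p - q)) >= diameter for q in centers):
                  centers.append(p); fails = 0
              else:
                  fails += 1
          return np.array(centers)

      packing = rsa_disks(box=10.0, max_fail=20000)
      density = len(packing) * np.pi * 0.25 / 10.0 ** 2   # roughly 0.55 for disks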

  15. Precise algorithm to generate random sequential addition of hard hyperspheres at saturation

    NASA Astrophysics Data System (ADS)

    Zhang, G.; Torquato, S.

    2013-11-01

    The study of the packing of hard hyperspheres in d-dimensional Euclidean space R^d has been a topic of great interest in statistical mechanics and condensed matter theory. While the densest known packings are ordered in sufficiently low dimensions, it has been suggested that in sufficiently large dimensions, the densest packings might be disordered. The random sequential addition (RSA) time-dependent packing process, in which congruent hard hyperspheres are randomly and sequentially placed into a system without interparticle overlap, is a useful packing model to study disorder in high dimensions. Of particular interest is the infinite-time saturation limit in which the available space for another sphere tends to zero. However, the associated saturation density has been determined in all previous investigations by extrapolating the density results for nearly saturated configurations to the saturation limit, which necessarily introduces numerical uncertainties. We have refined an algorithm devised by us [S. Torquato, O. U. Uche, and F. H. Stillinger, Phys. Rev. E 74, 061308 (2006)] to generate RSA packings of identical hyperspheres. The improved algorithm produces packings that are guaranteed to contain no available space in a large simulation box using finite computational time, with heretofore unattained precision and across the widest range of dimensions (2≤d≤8). We have also calculated the packing and covering densities, pair correlation function g2(r), and structure factor S(k) of the saturated RSA configurations. As the space dimension increases, we find that pair correlations markedly diminish, consistent with a recently proposed “decorrelation” principle, and the degree of “hyperuniformity” (suppression of infinite-wavelength density fluctuations) increases. We have also calculated the void exclusion probability in order to compute the so-called quantizer error of the RSA packings, which is related to the

  16. A Random Algorithm for Low-Rank Decomposition of Large-Scale Matrices With Missing Entries.

    PubMed

    Liu, Yiguang; Lei, Yinjie; Li, Chunguang; Xu, Wenzheng; Pu, Yifei

    2015-11-01

    A random submatrix method (RSM) is proposed to calculate the low-rank decomposition U_{m×r} V_{n×r}^T (r < m, n) of the matrix Y ∈ R^{m×n} (assuming m > n generally) with known entry percentage 0 < ρ ≤ 1. RSM is very fast, as only O(m r^2 ρ^r) or O(n^3 ρ^{3r}) floating-point operations (flops) are required, compared favorably with the O(mnr + r^2(m+n)) flops required by the state-of-the-art algorithms. Meanwhile, RSM has the advantage of a small memory requirement, as only max(n^2, mr+nr) real values need to be saved. Under the assumption that known entries are uniformly distributed in Y, submatrices formed by known entries are randomly selected from Y with statistical size k × nρ^k or mρ^l × l, where k or l usually takes the value r+1. We propose and prove a theorem: under random noise, the probability that the subspace associated with a smaller singular value will turn into the space associated with any one of the r largest singular values is smaller. Based on the theorem, the nρ^k - k null vectors or the l - r right singular vectors associated with the minor singular values are calculated for each submatrix. These vectors ought to be the null vectors of the submatrix formed by the chosen nρ^k or l columns of the ground truth of V^T. If enough submatrices are randomly chosen, V and U can be estimated accordingly. The experimental results on random synthetic matrices with sizes such as 131072 × 1024 and on real data sets such as dinosaur indicate that RSM is 4.30 ∼ 197.95 times faster than the state-of-the-art algorithms. It, meanwhile, has considerably high precision, achieving or approximating to the best. PMID:26208344

  17. Comparing Algorithms for Graph Isomorphism Using Discrete- and Continuous-Time Quantum Random Walks

    SciTech Connect

    Rudinger, Kenneth; Gamble, John King; Bach, Eric; Friesen, Mark; Joynt, Robert; Coppersmith, S. N.

    2013-07-01

    Berry and Wang [Phys. Rev. A 83, 042317 (2011)] show numerically that a discrete-time quantum random walk of two noninteracting particles is able to distinguish some non-isomorphic strongly regular graphs from the same family. Here we analytically demonstrate how it is possible for these walks to distinguish such graphs, while continuous-time quantum walks of two noninteracting particles cannot. We show analytically and numerically that even single-particle discrete-time quantum random walks can distinguish some strongly regular graphs, though not as many as two-particle noninteracting discrete-time walks. Additionally, we demonstrate how, given the same quantum random walk, subtle differences in the graph certificate construction algorithm can nontrivially impact the walk's distinguishing power. We also show that no continuous-time walk of a fixed number of particles can distinguish all strongly regular graphs when used in conjunction with any of the graph certificates we consider. We extend this constraint to discrete-time walks of fixed numbers of noninteracting particles for one kind of graph certificate; it remains an open question as to whether or not this constraint applies to the other graph certificates we consider.

  18. Comparing Algorithms for Graph Isomorphism Using Discrete- and Continuous-Time Quantum Random Walks

    DOE PAGESBeta

    Rudinger, Kenneth; Gamble, John King; Bach, Eric; Friesen, Mark; Joynt, Robert; Coppersmith, S. N.

    2013-07-01

    Berry and Wang [Phys. Rev. A 83, 042317 (2011)] show numerically that a discrete-time quantum random walk of two noninteracting particles is able to distinguish some non-isomorphic strongly regular graphs from the same family. Here we analytically demonstrate how it is possible for these walks to distinguish such graphs, while continuous-time quantum walks of two noninteracting particles cannot. We show analytically and numerically that even single-particle discrete-time quantum random walks can distinguish some strongly regular graphs, though not as many as two-particle noninteracting discrete-time walks. Additionally, we demonstrate how, given the same quantum random walk, subtle differences in the graph certificate construction algorithm can nontrivially impact the walk's distinguishing power. We also show that no continuous-time walk of a fixed number of particles can distinguish all strongly regular graphs when used in conjunction with any of the graph certificates we consider. We extend this constraint to discrete-time walks of fixed numbers of noninteracting particles for one kind of graph certificate; it remains an open question as to whether or not this constraint applies to the other graph certificates we consider.

  19. 3D statistical shape models incorporating 3D random forest regression voting for robust CT liver segmentation

    NASA Astrophysics Data System (ADS)

    Norajitra, Tobias; Meinzer, Hans-Peter; Maier-Hein, Klaus H.

    2015-03-01

    During image segmentation, 3D Statistical Shape Models (SSM) usually conduct a limited search for target landmarks within one-dimensional search profiles perpendicular to the model surface. In addition, landmark appearance is modeled only locally based on linear profiles and weak learners, altogether leading to segmentation errors from landmark ambiguities and limited search coverage. We present a new method for 3D SSM segmentation based on 3D Random Forest Regression Voting. For each surface landmark, a Random Regression Forest is trained that learns a 3D spatial displacement function between the corresponding reference landmark and a set of surrounding sample points, based on an infinite set of non-local randomized 3D Haar-like features. Landmark search is then conducted omni-directionally within 3D search spaces, where voxelwise forest predictions on landmark position contribute to a common voting map which reflects the overall position estimate. Segmentation experiments were conducted on a set of 45 CT volumes of the human liver, of which 40 images were randomly chosen for training and 5 for testing. Without parameter optimization, using a simple candidate selection and a single resolution approach, excellent results were achieved, while faster convergence and better concavity segmentation were observed, altogether underlining the potential of our approach in terms of increased robustness from distinct landmark detection and from better search coverage.
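
    A minimal 2D sketch of the voting mechanics (the paper works in 3D with Haar-like features; the synthetic features and offsets below are stand-ins): a forest maps local features to landmark displacements, and probe points accumulate votes whose mode is the landmark estimate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Training: feature vectors of sample points (standing in for Haar-like
# responses) paired with their known 2D offsets to the reference landmark.
X_train = rng.standard_normal((5000, 16))
offsets = X_train @ rng.standard_normal((16, 2)) + 0.1 * rng.standard_normal((5000, 2))
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, offsets)

# Search: each probe point casts a vote at (its position + predicted offset);
# the mode of the (optionally smoothed) vote map is the landmark estimate.
vote_map = np.zeros((64, 64))
positions = rng.uniform(8, 56, size=(500, 2))
probe_feats = rng.standard_normal((500, 16))
votes = positions + forest.predict(probe_feats)
for vx, vy in votes:
    ix, iy = int(round(vx)), int(round(vy))
    if 0 <= ix < 64 and 0 <= iy < 64:
        vote_map[iy, ix] += 1
landmark_y, landmark_x = np.unravel_index(vote_map.argmax(), vote_map.shape)
```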

  20. Random Forests Based Multiple Classifier System for Power-Line Scene Classification

    NASA Astrophysics Data System (ADS)

    Kim, H. B.; Sohn, G.

    2011-09-01

    The increasing use of electrical energy has increased the need for electric utilities, including transmission lines and electric pylons, which require real-time risk monitoring to prevent massive economic damage. Recently, Airborne Laser Scanning (ALS) has become one of the primary data acquisition tools for corridor mapping due to its ability to provide direct 3D measurements. For power-line risk management in particular, rapid and accurate classification of power-line objects is an extremely important task. We propose a 3D classification method combining results obtained from multiple classifiers trained with different features. As the base classifier, we employ Random Forests (RF), an ensemble classifier consisting of a number of decision trees populated through learning with bootstrap samples. Two different sets of features are investigated, extracted in a point domain and in a feature (i.e., line & polygon) domain. RANSAC and Minimum Description Length (MDL) are applied to create lines and a polygon in each volumetric pixel (voxel) for the line & polygon features. Two RFs are trained from the two groups of features, decorrelated by Principal Component Analysis (PCA), and their results are combined for the final classification. The experiment with two real datasets demonstrates that the proposed classification method shows a 10% improvement in classification accuracy compared to a single classifier.
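
    A compact sketch of the multiple-classifier idea on synthetic data: one RF per feature view, with class-probability averaging as the combination rule (the paper's exact combination scheme may differ).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Two feature views of the same samples, standing in for point-domain and
# line/polygon-domain descriptors; one forest is trained per view.
X, y = make_classification(n_samples=2000, n_features=24, n_informative=12,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_point, X_geom = X[:, :12], X[:, 12:]
idx_tr, idx_te = train_test_split(np.arange(len(y)), random_state=0)

rf_point = RandomForestClassifier(n_estimators=200, random_state=0)
rf_geom = RandomForestClassifier(n_estimators=200, random_state=1)
rf_point.fit(X_point[idx_tr], y[idx_tr])
rf_geom.fit(X_geom[idx_tr], y[idx_tr])

# Combine the two classifiers by averaging their class-probability outputs.
proba = (rf_point.predict_proba(X_point[idx_te]) +
         rf_geom.predict_proba(X_geom[idx_te])) / 2.0
combined_acc = (proba.argmax(axis=1) == y[idx_te]).mean()
print(f"combined accuracy: {combined_acc:.3f}")
```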

  1. 3D Fast Automatic Segmentation of Kidney Based on Modified AAM and Random Forest.

    PubMed

    Jin, Chao; Shi, Fei; Xiang, Dehui; Jiang, Xueqing; Zhang, Bin; Wang, Ximing; Zhu, Weifang; Gao, Enting; Chen, Xinjian

    2016-06-01

    In this paper, a fully automatic method is proposed to segment the kidney into multiple components: renal cortex, renal column, renal medulla, and renal pelvis, in clinical 3D CT abdominal images. The proposed fast automatic kidney segmentation method consists of two main parts: localization of the renal cortex and segmentation of the kidney components. In the localization phase, a method that fully combines the 3D Generalized Hough Transform (GHT) and 3D Active Appearance Models (AAM) is applied to localize the renal cortex. In the segmentation phase, a modified Random Forests (RF) method is proposed to segment the kidney into four components based on the result of the localization phase. During implementation, multithreading technology is applied to speed up the segmentation process. The proposed method was evaluated on a clinical abdominal CT data set of 37 contrast-enhanced volumes using a leave-one-out strategy. The overall true-positive and false-positive volume fractions were 93.15% and 0.37% for renal cortex segmentation; 83.09% and 0.97% for renal column segmentation; 81.92% and 0.55% for renal medulla segmentation; and 80.28% and 0.30% for renal pelvis segmentation, respectively. Segmenting the kidney into the four components took an average of 20 seconds. PMID:26742124

  2. Land cover mapping based on random forest classification of multitemporal spectral and thermal images.

    PubMed

    Eisavi, Vahid; Homayouni, Saeid; Yazdi, Ahmad Maleknezhad; Alimohammadi, Abbas

    2015-05-01

    Thematic mapping of complex landscapes with various phenological patterns from satellite imagery is a particularly challenging task. However, supplementary information, such as multitemporal data and/or land surface temperature (LST), has the potential to improve land cover classification accuracy and efficiency. In this paper, in order to map land covers, we evaluated the potential of multitemporal Landsat 8 spectral and thermal imagery using a random forest (RF) classifier. We used a grid search approach based on the out-of-bag (OOB) estimate of error to optimize the RF parameters. Four different scenarios were considered in this research: (1) RF classification of multitemporal spectral images, (2) RF classification of multitemporal LST images, (3) RF classification of all multitemporal LST and spectral images, and (4) RF classification of selected important or optimum features. The study area in this research was Naghadeh city and its surrounding region, located in West Azerbaijan Province, northwest of Iran. The overall accuracies of the first, second, third, and fourth scenarios were 86.48, 82.26, 90.63, and 91.82%, respectively. The quantitative assessments of the results demonstrated that the most important or optimum features increase class separability, while the spectral and thermal features produced a more moderate increase in land cover mapping accuracy. In addition, the contribution of multitemporal thermal information led to a considerable increase in the user and producer accuracies of classes with rapid temporal change behavior, such as crops and vegetation. PMID:25910718
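
    The OOB-based parameter grid search is straightforward to reproduce; a minimal sketch with synthetic data and an arbitrary grid follows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic multi-class data standing in for stacked multitemporal bands.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=12,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# Grid search driven by the out-of-bag accuracy estimate: no held-out set
# is needed because each tree is validated on the samples it never saw.
best = None
for n_trees in (100, 300, 500):
    for mtry in (2, 5, 10):
        rf = RandomForestClassifier(n_estimators=n_trees, max_features=mtry,
                                    oob_score=True, random_state=0).fit(X, y)
        if best is None or rf.oob_score_ > best[0]:
            best = (rf.oob_score_, n_trees, mtry)

print("best OOB accuracy %.3f with %d trees, max_features=%d" % best)
```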

  3. Random forest Granger causality for detection of effective brain connectivity using high-dimensional data.

    PubMed

    Furqan, Mohammad Shaheryar; Siyal, Mohammad Yakoob

    2016-03-01

    Studies have shown that brain functions are not localized to isolated areas and connections, but rather depend on the intricate network of connections and regions inside the brain. These networks are commonly analyzed using Granger causality (GC), which utilizes the ordinary least squares (OLS) method in its standard implementation. In the past, several approaches have been shown to address the limitations of OLS by using diverse regularization schemes. However, there are still some shortcomings in terms of accuracy, precision, and false discovery rate (FDR). In this paper, we propose a new strategy that uses Random Forest as a regularization technique for computing GC, improving on these shortcomings. We demonstrate the effectiveness of the proposed methodology by comparing the results with existing Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic-Net regularized implementations of GC using a simulated dataset. We then use the proposed approach to map the network involved in deductive reasoning using the real StarPlus dataset. PMID:26620192
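
    A toy sketch of the idea of replacing OLS with a Random Forest in Granger-causality estimation: regress the target series on lagged predictors and read directed influence off the importance of the cross-lags. The lag order, coupling strengths, and use of impurity importances are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
T, lag = 2000, 2

# Simulate two series where x drives y with a one-step lag.
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + 0.8 * x[t - 1] + 0.3 * rng.standard_normal()

# Design matrix of lagged y and lagged x; target is y at the current step.
Z = np.column_stack([y[lag - 1 - k:T - 1 - k] for k in range(lag)] +
                    [x[lag - 1 - k:T - 1 - k] for k in range(lag)])
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(Z, y[lag:])

# A large total importance of the x-lags suggests an x -> y influence.
print("importance of y-lags:", rf.feature_importances_[:lag].sum())
print("importance of x-lags:", rf.feature_importances_[lag:].sum())
```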

  4. Task-Dependent Band-Selection of Hyperspectral Images by Projection-Based Random Forests

    NASA Astrophysics Data System (ADS)

    Hänsch, R.; Hellwich, O.

    2016-06-01

    The automatic classification of land cover types from hyperspectral images is a challenging problem due to (among other things) the large number of spectral bands and their high spatial and spectral correlation. The extraction of meaningful features that enable a subsequent classifier to distinguish between different land cover classes is often limited to a subset of all available data dimensions, found by band selection techniques or other methods of dimensionality reduction. This work applies Projection-Based Random Forests to hyperspectral images, which not only remove the need for an explicit feature extraction, but also provide mechanisms to automatically select spectral bands that contain original (i.e., non-redundant) as well as highly meaningful information for the given classification task. The proposed method is applied to four challenging hyperspectral datasets, and it is shown that the effective number of spectral bands can be considerably limited without losing too much classification performance, e.g., a loss of 1% accuracy when roughly 13% of all available bands are used.
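
    A rough analogue with synthetic correlated "bands": rank bands with a forest's importances, keep roughly 13% of them, and compare accuracy. Note this uses plain impurity importances rather than the paper's projection-based selection mechanism.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for hyperspectral data: 200 highly redundant "bands",
# only a few of which carry independent class information.
X, y = make_classification(n_samples=3000, n_features=200, n_informative=15,
                           n_redundant=100, n_classes=5,
                           n_clusters_per_class=1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(Xtr, ytr)
full_acc = rf.score(Xte, yte)

# Keep roughly 13% of bands, ranked by importance, and retrain.
keep = np.argsort(rf.feature_importances_)[::-1][:26]
rf_sel = RandomForestClassifier(n_estimators=300, random_state=0)
rf_sel.fit(Xtr[:, keep], ytr)
print(f"all bands: {full_acc:.3f}, "
      f"selected bands: {rf_sel.score(Xte[:, keep], yte):.3f}")
```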

  5. A New MAC Address Spoofing Detection Technique Based on Random Forests.

    PubMed

    Alotaibi, Bandar; Elleithy, Khaled

    2016-01-01

    Media access control (MAC) addresses in wireless networks can be trivially spoofed using off-the-shelf devices. The aim of this research is to detect MAC address spoofing in wireless networks using a hard-to-spoof measurement that is correlated with the location of the wireless device, namely the received signal strength (RSS). We developed a passive solution that does not require modifications to standards or protocols. The solution was tested in a live test-bed (i.e., a wireless local area network with the aid of two air monitors acting as sensors) and achieved 99.77%, 93.16% and 88.38% accuracy when the attacker is 8-13 m, 4-8 m and less than 4 m away from the victim device, respectively. We implemented three previous methods on the same test-bed and found that our solution outperforms existing solutions. Our solution is based on an ensemble method known as random forests. PMID:26927103
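
    A toy version of the detection setup: RSS values at two monitors, generated here from a hypothetical log-distance path-loss model with shadowing noise, feed a random forest that separates frames of the legitimate device from frames sent from another location.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def rss_stream(pos, n):
    # Toy RSS at two fixed monitors from a device at position `pos`
    # (log-distance path loss plus shadowing; all parameters invented).
    monitors = np.array([[0.0, 0.0], [20.0, 0.0]])
    d = np.linalg.norm(monitors - pos, axis=1)
    return -40 - 30 * np.log10(d) + 4 * rng.standard_normal((n, 2))

victim = rss_stream(np.array([5.0, 3.0]), 1000)      # legitimate frames
attacker = rss_stream(np.array([14.0, 6.0]), 1000)   # spoofed frames elsewhere
X = np.vstack([victim, attacker])
y = np.r_[np.zeros(1000), np.ones(1000)]

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(Xtr, ytr)
print("toy detection accuracy:", rf.score(Xte, yte))
```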

  6. Discrimination of fish populations using parasites: Random Forests on a 'predictable' host-parasite system.

    PubMed

    Pérez-Del-Olmo, A; Montero, F E; Fernández, M; Barrett, J; Raga, J A; Kostadinova, A

    2010-10-01

    We address the effect of spatial scale and temporal variation on model generality when forming predictive models for fish assignment using a new data mining approach, Random Forests (RF), to variable biological markers (parasite community data). Models were implemented for a fish host-parasite system sampled along the Mediterranean and Atlantic coasts of Spain and were validated using independent datasets. We considered 2 basic classification problems in evaluating the importance of variations in parasite infracommunities for assignment of individual fish to their populations of origin: multiclass (2-5 population models, using 2 seasonal replicates from each of the populations) and 2-class task (using 4 seasonal replicates from 1 Atlantic and 1 Mediterranean population each). The main results are that (i) RF are well suited for multiclass population assignment using parasite communities in non-migratory fish; (ii) RF provide an efficient means for model cross-validation on the baseline data and this allows sample size limitations in parasite tag studies to be tackled effectively; (iii) the performance of RF is dependent on the complexity and spatial extent/configuration of the problem; and (iv) the development of predictive models is strongly influenced by seasonal change and this stresses the importance of both temporal replication and model validation in parasite tagging studies. PMID:20602856

  7. Semantic classification of urban buildings combining VHR image and GIS data: An improved random forest approach

    NASA Astrophysics Data System (ADS)

    Du, Shihong; Zhang, Fangli; Zhang, Xiuyuan

    2015-07-01

    While most existing studies have focused on extracting geometric information on buildings, only a few have concentrated on semantic information. The lack of semantic information leaves many demands related to resolving environmental and social issues unmet. This study presents an approach to semantically classify buildings into much finer categories than those of existing studies by learning a random forest (RF) classifier from a large number of imbalanced samples with high-dimensional features. First, a two-level segmentation mechanism combining GIS data and a VHR image produces single image objects at a large scale and intra-object components at a small scale. Second, a semi-supervised method chooses a large number of unbiased samples by considering the spatial proximity and intra-cluster similarity of buildings. Third, two important improvements to the RF classifier are made: a voting-distribution ranked rule that reduces the influence of imbalanced samples on classification accuracy, and a feature importance measurement that evaluates each feature's contribution to the recognition of each category. Fourth, the semantic classification of urban buildings is conducted in practice in Beijing, and the results demonstrate that the proposed approach is effective and accurate. The seven categories used in the study are finer than those in existing work and more helpful for studying many environmental and social problems.

  8. A New MAC Address Spoofing Detection Technique Based on Random Forests

    PubMed Central

    Alotaibi, Bandar; Elleithy, Khaled

    2016-01-01

    Media access control (MAC) addresses in wireless networks can be trivially spoofed using off-the-shelf devices. The aim of this research is to detect MAC address spoofing in wireless networks using a hard-to-spoof measurement that is correlated with the location of the wireless device, namely the received signal strength (RSS). We developed a passive solution that does not require modifications to standards or protocols. The solution was tested in a live test-bed (i.e., a wireless local area network with the aid of two air monitors acting as sensors) and achieved 99.77%, 93.16% and 88.38% accuracy when the attacker is 8–13 m, 4–8 m and less than 4 m away from the victim device, respectively. We implemented three previous methods on the same test-bed and found that our solution outperforms existing solutions. Our solution is based on an ensemble method known as random forests. PMID:26927103

  9. Automatic location of vertebrae on DXA images using random forest regression.

    PubMed

    Roberts, M G; Cootes, Timothy F; Adams, J E

    2012-01-01

    We provide a fully automatic method of segmenting vertebrae in DXA images. This is of clinical relevance to the diagnosis of osteoporosis by vertebral fracture, and to grading fractures in clinical trials. In order to locate the vertebrae we train detectors for the upper and lower vertebral endplates. Each detector uses random forest regressor voting applied to Haar-like input features. The regressors are applied at a grid of points across the image, and each tree votes for an endplate centre position. Modes in the smoothed vote image are endplate candidates, some of which are the neighbouring vertebrae of the one sought. The ambiguity is resolved by applying geometric constraints to the connections between vertebrae, although there can be some ambiguity about where the sequence starts (e.g., whether the lowest vertebra is L4 or L5; Fig. 2a). The endplate centres are used to initialise a final phase of active appearance model search for a detailed solution. The method is applied to a dataset of 320 DXA images. Accuracy is comparable to manually initialised AAM segmentation in 91% of images, but multiple grade 3 fractures can cause some edge confusion in severely osteoporotic cases. PMID:23286151

  10. Risk Prediction of One-Year Mortality in Patients with Cardiac Arrhythmias Using Random Survival Forest

    PubMed Central

    Miao, Fen; Cai, Yun-Peng; Zhang, Yu-Xiao; Li, Ye; Zhang, Yuan-Ting

    2015-01-01

    Existing models for predicting mortality based on the traditional Cox proportional hazards approach (CPH) often have low prediction accuracy. This paper aims to develop a clinical risk model with good accuracy for predicting 1-year mortality in cardiac arrhythmia patients using random survival forest (RSF), a robust approach for survival analysis. 10,488 cardiac arrhythmia patients available in the public MIMIC II clinical database were investigated, with 3,452 deaths occurring within the 1-year follow-up. Forty risk factors, including demographics, clinical and laboratory information, and antiarrhythmic agents, were analyzed as potential predictors of all-cause mortality. RSF was adopted to build a comprehensive survival model and a simplified risk model composed of the 14 top risk factors. The comprehensive model achieved a prediction accuracy of 0.81 measured by c-statistic with 10-fold cross-validation. The simplified risk model also achieved a good accuracy of 0.799. Both results outperformed traditional CPH (which achieved a c-statistic of 0.733 for the comprehensive model and 0.718 for the simplified model). Moreover, various factors are observed to have nonlinear impacts on cardiac arrhythmia prognosis. As a result, the RSF-based model, which takes nonlinearity into account, significantly outperformed the traditional Cox proportional hazards model and has great potential to be a more effective approach for survival analysis. PMID:26379761
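
    A minimal random-survival-forest sketch on synthetic right-censored data, assuming the scikit-survival package is available; the covariate effects, censoring rate, and hyperparameters are arbitrary choices for the toy.

```python
import numpy as np
from sksurv.ensemble import RandomSurvivalForest       # scikit-survival
from sksurv.metrics import concordance_index_censored

rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal((n, 10))

# Synthetic survival times with a nonlinear covariate effect, plus
# independent random censoring (~40% of records censored).
time = np.exp(1.5 + 0.5 * np.sin(X[:, 0]) - 0.3 * X[:, 1] ** 2
              + 0.2 * rng.standard_normal(n))
event = rng.random(n) < 0.6
y = np.array(list(zip(event, time)), dtype=[("event", "?"), ("time", "<f8")])

rsf = RandomSurvivalForest(n_estimators=200, min_samples_leaf=10,
                           random_state=0).fit(X, y)
risk = rsf.predict(X)  # higher score = higher estimated risk
cindex = concordance_index_censored(y["event"], y["time"], risk)[0]
print(f"c-index: {cindex:.3f}")  # analogous to the paper's c-statistic
```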

  11. Random Forest Segregation of Drug Responses May Define Regions of Biological Significance

    PubMed Central

    Bukhari, Qasim; Borsook, David; Rudin, Markus; Becerra, Lino

    2016-01-01

    The ability to assess brain responses in an unsupervised manner based on fMRI measures has remained a challenge. Here we have applied the Random Forest (RF) method to detect differences in the pharmacological MRI (phMRI) response in rats to treatment with an analgesic drug (buprenorphine) as compared to control (saline). Three groups of animals were studied: two treated with different doses of the opioid buprenorphine, low dose (LD) and high dose (HD), and one receiving saline. PhMRI responses were evaluated in 45 brain regions and RF analysis was applied to allocate rats to the individual treatment groups. RF analysis was able to identify drug effects based on differential phMRI responses in the hippocampus, amygdala, nucleus accumbens, superior colliculus, and the lateral and posterior thalamus for drug vs. saline. These structures have high levels of mu opioid receptors. In addition, these regions are involved in aversive signaling, which is inhibited by mu opioids. The results demonstrate that buprenorphine-mediated phMRI responses comprise characteristic features that allow a supervised differentiation from placebo-treated rats as well as proper allocation to the respective drug dose group using the RF method, a method that has been successfully applied in clinical studies. PMID:27014046

  12. Random Forests Are Able to Identify Differences in Clotting Dynamics from Kinetic Models of Thrombin Generation

    PubMed Central

    2016-01-01

    Current methods for distinguishing acute coronary syndromes such as heart attack from stable coronary artery disease, based on the kinetics of thrombin formation, have been limited to evaluating sensitivity of well-established chemical species (e.g., thrombin) using simple quantifiers of their concentration profiles (e.g., maximum level of thrombin concentration, area under the thrombin concentration versus time curve). In order to get an improved classifier, we use a 34-protein factor clotting cascade model and convert the simulation data into a high-dimensional representation (about 19000 features) using a piecewise cubic polynomial fit. Then, we systematically find plausible assays to effectively gauge changes in acute coronary syndrome/coronary artery disease populations by introducing a statistical learning technique called Random Forests. We find that differences associated with acute coronary syndromes emerge in combinations of a handful of features. For instance, concentrations of 3 chemical species, namely, active alpha-thrombin, tissue factor-factor VIIa-factor Xa ternary complex, and intrinsic tenase complex with factor X, at specific time windows, could be used to classify acute coronary syndromes to an accuracy of about 87.2%. Such a combination could be used to efficiently assay the coagulation system. PMID:27171403

  13. Random Forests Are Able to Identify Differences in Clotting Dynamics from Kinetic Models of Thrombin Generation.

    PubMed

    Arumugam, Jayavel; Bukkapatnam, Satish T S; Narayanan, Krishna R; Srinivasa, Arun R

    2016-01-01

    Current methods for distinguishing acute coronary syndromes such as heart attack from stable coronary artery disease, based on the kinetics of thrombin formation, have been limited to evaluating sensitivity of well-established chemical species (e.g., thrombin) using simple quantifiers of their concentration profiles (e.g., maximum level of thrombin concentration, area under the thrombin concentration versus time curve). In order to get an improved classifier, we use a 34-protein factor clotting cascade model and convert the simulation data into a high-dimensional representation (about 19000 features) using a piecewise cubic polynomial fit. Then, we systematically find plausible assays to effectively gauge changes in acute coronary syndrome/coronary artery disease populations by introducing a statistical learning technique called Random Forests. We find that differences associated with acute coronary syndromes emerge in combinations of a handful of features. For instance, concentrations of 3 chemical species, namely, active alpha-thrombin, tissue factor-factor VIIa-factor Xa ternary complex, and intrinsic tenase complex with factor X, at specific time windows, could be used to classify acute coronary syndromes to an accuracy of about 87.2%. Such a combination could be used to efficiently assay the coagulation system. PMID:27171403

  14. Selection of informative metabolites using random forests based on model population analysis.

    PubMed

    Huang, Jian-Hua; Yan, Jun; Wu, Qing-Hua; Duarte Ferro, Miguel; Yi, Lun-Zhao; Lu, Hong-Mei; Xu, Qing-Song; Liang, Yi-Zeng

    2013-12-15

    One of the main goals of metabolomics studies is to discover informative metabolites or biomarkers, which may be used to diagnose diseases and to uncover pathology. Sophisticated feature selection approaches are required to extract the information hidden in such complex 'omics' data. In this study, a new and robust selection method is proposed that combines random forests (RF) with model population analysis (MPA) to select informative metabolites from three metabolomic datasets. According to their contribution to classification accuracy, the metabolites were classified into three kinds: informative, non-informative, and interfering. Based on the proposed method, informative metabolites were selected for the three datasets; further analyses of these metabolites between healthy and diseased groups were then performed, with t-tests showing that the p-values for all selected metabolites were below 0.05. Moreover, the informative metabolites identified by the current method were demonstrated to be correlated with the clinical outcome under investigation. The Matlab source code of MPA-RF can be freely downloaded from http://code.google.com/p/my-research-list/downloads/list. PMID:24209380
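
    A rough sketch of combining RF with a model-population flavor: fit many sub-models on random subsamples and rank variables by the distribution of their importances. The authors' MPA-RF procedure is more elaborate (see their released Matlab code); the subsample fraction and sub-model count below are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic "metabolite" data: 40 variables, 6 truly informative.
X, y = make_classification(n_samples=300, n_features=40, n_informative=6,
                           random_state=0)

rng = np.random.default_rng(0)
n_models = 200
imp = np.zeros((n_models, X.shape[1]))
for i in range(n_models):
    # Each sub-model sees a random 70% subsample of the observations.
    rows = rng.choice(len(X), size=int(0.7 * len(X)), replace=False)
    rf = RandomForestClassifier(n_estimators=100, random_state=i)
    rf.fit(X[rows], y[rows])
    imp[i] = rf.feature_importances_

# Rank variables by the median of their importance distribution.
median_imp = np.median(imp, axis=0)
candidates = np.argsort(median_imp)[::-1][:10]
print("candidate informative variables:", candidates)
```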

  15. Incremental Learning of Random Forests for Large-Scale Image Classification.

    PubMed

    Ristin, Marko; Guillaumin, Matthieu; Gall, Juergen; Van Gool, Luc

    2016-03-01

    Large image datasets such as ImageNet or open-ended photo websites like Flickr are revealing new challenges to image classification that were not apparent in smaller, fixed sets. In particular, the efficient handling of dynamically growing datasets, where not only the amount of training data but also the number of classes increases over time, is a relatively unexplored problem. In this challenging setting, we study how two variants of Random Forests (RF) perform under four strategies to incorporate new classes while avoiding retraining the RFs from scratch. The various strategies account for different trade-offs between classification accuracy and computational efficiency. In our extensive experiments, we show that both RF variants, one based on Nearest Class Mean classifiers and the other on SVMs, outperform conventional RFs and are well suited for incrementally learning new classes. In particular, we show that RFs initially trained with just 10 classes can be extended to 1,000 classes with an acceptable loss of accuracy compared to training from the full data, and with great computational savings compared to retraining for each new batch of classes. PMID:27046493

  16. Genome-wide association study for backfat thickness in Canchim beef cattle using Random Forest approach

    PubMed Central

    2013-01-01

    Background Meat quality involves many traits, such as marbling, tenderness, juiciness, and backfat thickness, all of which require attention from livestock producers. Backfat thickness improvement by means of traditional selection techniques in Canchim beef cattle has been challenging due to its low heritability, and because it is measured late in an animal's life. Therefore, the implementation of new methodologies for the identification of single nucleotide polymorphisms (SNPs) linked to backfat thickness is an important strategy for the genetic improvement of carcass and meat quality. Results The set of SNPs identified by the random forest approach explained as much as 50% of the deregressed estimated breeding value (dEBV) variance associated with backfat thickness, and a small set of 5 SNPs was able to explain 34% of the dEBV for backfat thickness. Several quantitative trait loci (QTL) for fat-related traits were found in the surrounding areas of the SNPs, as well as many genes with roles in lipid metabolism. Conclusions These results provide a better understanding of backfat deposition and regulation pathways, and can be considered a starting point for the future implementation of a genomic selection program for backfat thickness in Canchim beef cattle. PMID:23738659

  17. In silico prediction of toxic action mechanisms of phenols for imbalanced data with Random Forest learner.

    PubMed

    Chen, Jing; Tang, Yuan Yan; Fang, Bin; Guo, Chang

    2012-05-01

    With an increasing need for the rapid and effective safety assessment of compounds in industrial and civil-use products, in silico toxicity exploration techniques provide an economical way to perform environmental hazard assessment. Previous in silico research has developed many quantitative structure-activity relationship models to predict toxicity mechanisms over the last decade. Most of these methods benefit from data analysis and machine learning techniques, which rely heavily on the characteristics of the data sets. For Tetrahymena pyriformis toxicity data sets, there is a great technical challenge: data imbalance. The skewness of the class distribution greatly deteriorates prediction performance on rare classes. Most previous research on predicting phenol mechanisms of toxic action did not consider this practical problem. In this work, we dealt with the problem by considering the difference between the two types of misclassification. A Random Forest learner was employed in a cost-sensitive learning framework to construct prediction models based on selected molecular descriptors. In computational experiments, both the global and local models obtained appreciable overall prediction accuracies, and performance on the rare classes was indeed improved. Moreover, for practical usage of these models, the balance between the two misclassification types can be adjusted by using different cost matrices according to the application goals. PMID:22481075
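
    In scikit-learn, a cost-sensitive RF can be approximated through class weights; the sketch below shows the typical effect on rare-class recall with synthetic 95:5 data. The 10:1 weighting is an arbitrary stand-in for the paper's cost matrices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced two-class toy data standing in for common vs. rare
# mechanisms of toxic action (95:5 class distribution).
X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

plain = RandomForestClassifier(n_estimators=300, random_state=0).fit(Xtr, ytr)
# Charging 10x more for misclassifying the rare class approximates an
# asymmetric cost matrix; the exact costs in the paper differ.
costed = RandomForestClassifier(n_estimators=300, class_weight={0: 1, 1: 10},
                                random_state=0).fit(Xtr, ytr)

print("rare-class recall, plain :", recall_score(yte, plain.predict(Xte)))
print("rare-class recall, costed:", recall_score(yte, costed.predict(Xte)))
```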

  18. Protein inter-domain linker prediction using Random Forest and amino acid physiochemical properties

    PubMed Central

    2014-01-01

    Background Protein chains are generally long and consist of multiple domains. Domains are distinct structural units of a protein that can evolve and function independently. The accurate prediction of protein domain linkers and boundaries is often regarded as the initial step of protein tertiary structure and function predictions. Such information not only enhances protein-targeted drug development but also reduces the experimental cost of protein analysis by allowing researchers to work on a set of smaller and independent units. In this study, we propose a novel and accurate domain-linker prediction approach based on protein primary structure information only. We utilize a nature-inspired machine-learning model called Random Forest along with a novel domain-linker profile that contains physiochemical and domain-linker information of amino acid sequences. Results The proposed approach was tested on two well-known benchmark protein datasets and achieved 68% sensitivity and 99% precision, which is better than any existing protein domain-linker predictor. Without applying any data balancing technique such as class weighting and data re-sampling, the proposed approach is able to accurately classify inter-domain linkers from highly imbalanced datasets. Conclusion Our experimental results prove that the proposed approach is useful for domain-linker identification in highly imbalanced single- and multi-domain proteins. PMID:25521329

  19. RF-Hydroxysite: a random forest based predictor for hydroxylation sites.

    PubMed

    Ismail, Hamid D; Newman, Robert H; Kc, Dukka B

    2016-07-19

    Protein hydroxylation is an emerging posttranslational modification involved in both normal cellular processes and a growing number of pathological states, including several cancers. Protein hydroxylation is mediated by members of the hydroxylase family of enzymes, which catalyze the conversion of an alkyl group at select lysine or proline residues on their target substrates to a hydroxyl group. Traditionally, hydroxylation has been identified using expensive and time-consuming experimental methods, such as tandem mass spectrometry. Therefore, to facilitate identification of putative hydroxylation sites and to complement existing experimental approaches, computational methods designed to predict hydroxylation sites in protein sequences have recently been developed. Building on these efforts, we have developed a new method, termed RF-Hydroxysite, that uses random forests to identify putative hydroxylysine and hydroxyproline residues in proteins using only the primary amino acid sequence as input. RF-Hydroxysite integrates features previously shown to contribute to hydroxylation site prediction with several new features that we found to augment performance remarkably. These include features that capture physicochemical, structural, sequence-order, and evolutionary information from the protein sequences. The features used in the final model were selected based on their contribution to the prediction; physicochemical information was found to contribute the most. The present study also sheds light on the contribution of evolutionary, sequence-order, and protein disordered-region information to hydroxylation site prediction. The web server for RF-Hydroxysite is available online at . PMID:27292874

  20. Diagnosis of colorectal cancer by near-infrared optical fiber spectroscopy and random forest

    NASA Astrophysics Data System (ADS)

    Chen, Hui; Lin, Zan; Wu, Hegang; Wang, Li; Wu, Tong; Tan, Chao

    2015-01-01

    Near-infrared (NIR) spectroscopy has advantages such as being noninvasive, fast, and relatively inexpensive, with no risk of ionizing radiation. Differences in NIR signals can reflect many physiological changes, which are in turn associated with factors such as vascularization, cellularity, oxygen consumption, or remodeling. NIR spectral differences between colorectal cancer and healthy tissues were investigated. A Fourier transform NIR spectroscopy instrument equipped with a fiber-optic probe was used to mimic in situ clinical measurements. A total of 186 spectra were collected and then underwent standard normal variate (SNV) preprocessing to remove unwanted background variance. All specimens and the spots used for spectral collection were confirmed by staining and examination by an experienced pathologist, to ensure they were representative of the pathology. Principal component analysis (PCA) was used to uncover possible clustering. Several methods, including random forest (RF), partial least squares-discriminant analysis (PLSDA), K-nearest neighbor, and classification and regression tree (CART), were used to extract spectral features and to construct the diagnostic models. The comparison reveals that, even though no obvious difference in misclassification ratio (MCR) was observed between these models, RF is preferable since it is quicker, more convenient, and insensitive to over-fitting. The results indicate that NIR spectroscopy coupled with an RF model can serve as a potential tool for discriminating colorectal cancer tissues from normal ones.
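
    A compact sketch of the pipeline on synthetic "spectra" (random baselines plus a class-dependent absorption band): SNV preprocessing, a PCA projection, and an RF classifier scored by its out-of-bag estimate. All spectral parameters are invented for the toy.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

def snv(spectra):
    # Standard normal variate: center and scale each spectrum individually,
    # removing baseline offsets and multiplicative scatter effects.
    mu = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, keepdims=True)
    return (spectra - mu) / sd

rng = np.random.default_rng(0)
wl = np.linspace(0, 1, 400)  # pseudo wavelength axis

def batch(center, n):
    # Random linear baselines plus a Gaussian band at a class-specific
    # position, with additive noise.
    base = rng.uniform(0.5, 2.0, (n, 1)) + rng.uniform(-1, 1, (n, 1)) * wl
    peak = np.exp(-(wl - center) ** 2 / 0.002)
    return base + peak + 0.02 * rng.standard_normal((n, wl.size))

X = snv(np.vstack([batch(0.45, 90), batch(0.50, 96)]))  # 186 toy spectra
y = np.r_[np.zeros(90), np.ones(96)]

scores = PCA(n_components=10).fit_transform(X)
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=0).fit(scores, y)
print("OOB accuracy:", rf.oob_score_)
```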

  1. Object-oriented mapping of urban trees using Random Forest classifiers

    NASA Astrophysics Data System (ADS)

    Puissant, Anne; Rougier, Simon; Stumpf, André

    2014-02-01

    Since vegetation in urban areas delivers crucial ecological services as a support to human well-being and to the urban population in general, its monitoring is a major issue for urban planners. Mapping and monitoring the changes in urban green spaces are important tasks because of their functions such as the management of air, climate and water quality, the reduction of noise, the protection of species and the development of recreational activities. In this context, the objective of this work is to propose a methodology to inventory and map the urban tree spaces from a mono-temporal very high resolution (VHR) optical image using a Random Forest classifier in combination with object-oriented approaches. The methodology is developed and its performance is evaluated on a dataset of the city of Strasbourg (France) for different categories of built-up areas. The results indicate a good accuracy and a high robustness for the classification of the green elements in terms of user and producer accuracies.

  2. Applications of the BIOPHYS Algorithm for Physically-Based Retrieval of Biophysical, Structural and Forest Disturbance Information

    NASA Technical Reports Server (NTRS)

    Peddle, Derek R.; Huemmrich, K. Fred; Hall, Forrest G.; Masek, Jeffrey G.; Soenen, Scott A.; Jackson, Chris D.

    2011-01-01

    Canopy reflectance model inversion using look-up table approaches provides powerful and flexible options for deriving improved forest biophysical structural information (BSI) compared with traditional statistical empirical methods. The BIOPHYS algorithm is an improved, physically-based inversion approach for deriving BSI for independent use and validation and for monitoring, inventory and quantifying forest disturbance as well as input to ecosystem, climate and carbon models. Based on the multiple-forward mode (MFM) inversion approach, BIOPHYS results were summarized from different studies (Minnesota/NASA COVER; Virginia/LEDAPS; Saskatchewan/BOREAS), sensors (airborne MMR; Landsat; MODIS) and models (GeoSail; GOMS). Application outputs included forest density, height, crown dimension, branch and green leaf area, canopy cover, disturbance estimates based on multi-temporal chronosequences, and structural change following recovery from forest fires over the last century. Good correspondences with validation field data were obtained. Integrated analyses of multiple solar and view angle imagery further improved retrievals compared with single pass data. Quantifying ecosystem dynamics such as the area and percent of forest disturbance, early regrowth and succession provide essential inputs to process-driven models of carbon flux. BIOPHYS is well suited for large-area, multi-temporal applications involving multiple image sets and mosaics for assessing vegetation disturbance and quantifying biophysical structural dynamics and change. It is also suitable for integration with forest inventory, monitoring, updating, and other programs.

  3. An Evaluation of the MOD17 Gross Primary Production Algorithm in a Mangrove Forest

    NASA Astrophysics Data System (ADS)

    Wells, H.; Najjar, R.; Herrmann, M.; Fuentes, J. D.; Ruiz-Plancarte, J.

    2015-12-01

    Though coastal wetlands occupy a small fraction of the Earth's surface, they are extremely active ecosystems and play a significant role in the global carbon budget. However, coastal wetlands are still poorly understood, especially when compared to open-ocean and terrestrial ecosystems. This is partly due to the limited in situ observations in these areas. One of the ways around the limited in situ data is to use remote sensing products. Here we present the first evaluation of the MOD17 remote sensing algorithm of gross primary productivity (GPP) in a mangrove forest using data from a flux tower in the Florida Everglades. MOD17 utilizes remote sensing products from the Moderate Resolution Imaging Spectroradiometer and meteorological fields from the NCEP/DOE Reanalysis 2. MOD17 is found to capture the long-term mean and seasonal amplitude of GPP but has significant errors describing the interannual variability, intramonthly variability, and the phasing of the annual cycle in GPP. Regarding the latter, MOD17 overestimates GPP when salinity is high and underestimates it when it is low, consistent with the fact that MOD17 ignores salinity and salinity tends to decrease GPP. Including salinity in the algorithm would then most likely improve its accuracy. MOD17 also assumes that GPP is linear with respect to PAR (photosynthetically active radiation), which does not hold true in the mangroves. Finally, the estimated PAR and air temperature inputs to MOD17 were found to be significantly lower than observed. In summary, while MOD17 captures some aspects of GPP variability at this mangrove site, it appears to be doing so for the wrong reasons.

  4. Data Security in Ad Hoc Networks Using Randomization of Cryptographic Algorithms

    NASA Astrophysics Data System (ADS)

    Krishna, B. Ananda; Radha, S.; Keshava Reddy, K. Chenna

    Ad hoc networks are a new wireless networking paradigm for mobile hosts. Unlike traditional mobile wireless networks, ad hoc networks do not rely on any fixed infrastructure; instead, hosts rely on each other to keep the network connected. Military tactical and other security-sensitive operations are still the main applications of ad hoc networks, although there is a trend to adopt them for commercial uses due to their unique properties. One main challenge in the design of these networks is how to feasibly detect and defend against major attacks on data, such as impersonation and unauthorized data modification. Also, in the same network some nodes may be malicious, with the objective of degrading network performance. In this study, we propose a security model in which packets are encrypted and decrypted using multiple algorithms, with the algorithm selected at random. The performance of the proposed model is analyzed, and it is observed that there is no increase in control overhead, although a slight delay is introduced due to the encryption process. We conclude that the proposed security model works well for heavily loaded networks with high mobility and can be extended to more cryptographic algorithms.
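
    A minimal sketch of randomized algorithm selection using the pyca/cryptography AEAD primitives, assuming pre-shared keys and a one-byte header identifying the cipher. Per-packet selection, the header format, and the two-algorithm pool are illustrative choices, not the paper's protocol.

```python
import os
import secrets
from cryptography.hazmat.primitives.ciphers.aead import AESGCM, ChaCha20Poly1305

# Pre-shared AEAD instances; the receiver holds the same table.
keys = {0: AESGCM(secrets.token_bytes(32)),
        1: ChaCha20Poly1305(secrets.token_bytes(32))}

def send(plaintext: bytes) -> bytes:
    # Pick an algorithm at random for this packet and prepend its index,
    # followed by a fresh 12-byte nonce, to the ciphertext.
    alg = secrets.randbelow(len(keys))
    nonce = os.urandom(12)
    return bytes([alg]) + nonce + keys[alg].encrypt(nonce, plaintext, None)

def recv(packet: bytes) -> bytes:
    # The header byte selects the decryption algorithm; AEAD authentication
    # rejects tampered packets.
    alg, nonce, ct = packet[0], packet[1:13], packet[13:]
    return keys[alg].decrypt(nonce, ct, None)

assert recv(send(b"hello ad hoc")) == b"hello ad hoc"
```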

  5. From analytical solutions of solute transport equations to multidimensional time-domain random walk (TDRW) algorithms

    NASA Astrophysics Data System (ADS)

    Bodin, Jacques

    2015-03-01

    In this study, new multi-dimensional time-domain random walk (TDRW) algorithms are derived from approximate one-dimensional (1-D), two-dimensional (2-D), and three-dimensional (3-D) analytical solutions of the advection-dispersion equation and from exact 1-D, 2-D, and 3-D analytical solutions of the pure-diffusion equation. These algorithms enable the calculation of both the time required for a particle to travel a specified distance in a homogeneous medium and the mass recovery at the observation point, which may be incomplete due to 2-D or 3-D transverse dispersion or diffusion. The method is extended to heterogeneous media, represented as a piecewise collection of homogeneous media. The particle motion is then decomposed along a series of intermediate checkpoints located on the medium interface boundaries. The accuracy of the multi-dimensional TDRW method is verified against (i) exact analytical solutions of solute transport in homogeneous media and (ii) finite-difference simulations in a synthetic 2-D heterogeneous medium of simple geometry. The results demonstrate that the method is ideally suited to purely diffusive transport and to advection-dispersion transport problems dominated by advection. Conversely, the method is not recommended for highly dispersive transport problems because the accuracy of the advection-dispersion TDRW algorithms degrades rapidly for a low Péclet number, consistent with the accuracy limit of the approximate analytical solutions. The proposed approach provides a unified methodology for deriving multi-dimensional time-domain particle equations and may be applicable to other mathematical transport models, provided that appropriate analytical solutions are available.
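
    The 1-D advective-dispersive building block of a TDRW is easy to state: the first-passage time across a homogeneous cell of length dx is inverse-Gaussian with mean dx/v and shape dx^2/(2D). A sketch for a piecewise-homogeneous medium follows; the segment parameters are arbitrary, and the multi-dimensional and incomplete-mass-recovery aspects of the paper are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)

def tdrw_travel_time(dx, v, D, n):
    # Travel times of n particles crossing a homogeneous cell of length dx
    # by advection-dispersion: first-passage times are inverse-Gaussian
    # with mean dx/v and shape dx^2/(2D).
    return rng.wald(dx / v, dx ** 2 / (2 * D), size=n)

# A piecewise-homogeneous medium: sum the per-segment times per particle,
# stepping between checkpoints on the segment interfaces.
segments = [(10.0, 1.0, 0.1),   # (dx, v, D)
            (5.0, 0.5, 0.05)]
t = sum(tdrw_travel_time(dx, v, D, 10_000) for dx, v, D in segments)
print("mean arrival time:", t.mean())   # ~ 10/1.0 + 5/0.5 = 20
```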

  6. Harmonics elimination algorithm for operational modal analysis using random decrement technique

    NASA Astrophysics Data System (ADS)

    Modak, S. V.; Rawal, Chetan; Kundra, T. K.

    2010-05-01

    Operational modal analysis (OMA) extracts the modal parameters of a structure using its output response, generally during operation. When applied to mechanical engineering structures, OMA is often faced with the problem of harmonics present in the output response, which can cause erroneous modal extraction. This paper demonstrates for the first time that the random decrement (RD) method can be efficiently employed to eliminate harmonics from the randomdec signatures. Further, the research shows effective elimination of large-amplitude harmonics as well, by proposing the inclusion of additional random excitation, which obviously need not be recorded for analysis, as is the case with any other OMA method. The free decays obtained from RD are used for modal identification of the system using the eigensystem realization algorithm (ERA). The proposed harmonic elimination method has an advantage over previous methods in that it does not require the harmonic frequencies to be known and can be used for multiple harmonics, including periodic signals. The theory behind harmonic elimination is first developed and validated. The effectiveness of the method is demonstrated through a simulated study and then by experimental studies on a beam and a more complex F-shaped structure, which resembles the skeleton of a drilling or milling machine tool.

  7. Medical Decision Support System for Diagnosis of Heart Arrhythmia using DWT and Random Forests Classifier.

    PubMed

    Alickovic, Emina; Subasi, Abdulhamit

    2016-04-01

    In this study, a Random Forests (RF) classifier is proposed for ECG heartbeat signal classification in the diagnosis of heart arrhythmia. The discrete wavelet transform (DWT) is used to decompose ECG signals into successive frequency bands. A set of statistical features was extracted from the obtained frequency bands to describe the distribution of wavelet coefficients. This study shows that the RF classifier achieves superior performance compared to other decision tree methods using 10-fold cross-validation on the ECG datasets, and the obtained results suggest that further significant improvements in classification accuracy can be accomplished by the proposed classification system. Accurate ECG signal classification is the major requirement for detecting all arrhythmia types. The performance of the proposed system was evaluated on two different databases, namely the MIT-BIH database and the St. Petersburg Institute of Cardiological Technics 12-lead Arrhythmia Database. For the MIT-BIH database, the RF classifier yielded an overall accuracy of 99.33%, against 98.44% and 98.67% for the C4.5 and CART classifiers, respectively. For the St. Petersburg Institute of Cardiological Technics 12-lead Arrhythmia Database, the RF classifier yielded an overall accuracy of 99.95%, against 99.80% for both the C4.5 and CART classifiers. The combined model with multiscale principal component analysis (MSPCA) de-noising, DWT, and the RF classifier also achieves better performance, with the area under the receiver operating characteristic (ROC) curve (AUC) and F-measure equal to 0.999 and 0.993 for the MIT-BIH database, and 1 and 0.999 for the St. Petersburg Institute of Cardiological Technics 12-lead Arrhythmia Database, respectively. The obtained results demonstrate that the proposed system is capable of reliable classification of ECG signals and can assist clinicians in making an accurate diagnosis of cardiovascular disorders (CVDs). PMID:26922592
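
    A toy version of the feature pipeline (synthetic "beats" stand in for ECG records; the PyWavelets package is assumed available): per-sub-band statistics of a DWT feed an RF classifier. The wavelet, level, and statistics are plausible choices, not necessarily the paper's.

```python
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def dwt_features(sig, wavelet="db4", level=4):
    # Statistical summaries of each DWT sub-band, describing the
    # distribution of wavelet coefficients per frequency band.
    feats = []
    for c in pywt.wavedec(sig, wavelet, level=level):
        feats += [np.mean(np.abs(c)), np.std(c),
                  np.mean(c ** 2), np.max(np.abs(c))]
    return feats

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
X, y = [], []
for _ in range(300):
    # Two synthetic beat morphologies stand in for arrhythmia classes.
    cls = int(rng.integers(2))
    f0 = 5 if cls == 0 else 9
    beat = np.sin(2 * np.pi * f0 * t) * np.exp(-((t - 0.5) ** 2) / 0.02)
    X.append(dwt_features(beat + 0.3 * rng.standard_normal(t.size)))
    y.append(cls)
X, y = np.array(X), np.array(y)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:200], y[:200])
print("toy accuracy:", rf.score(X[200:], y[200:]))
```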

  8. Prediction of hot spots in protein interfaces using a random forest model with hybrid features.

    PubMed

    Wang, Lin; Liu, Zhi-Ping; Zhang, Xiang-Sun; Chen, Luonan

    2012-03-01

    Prediction of hot spots in protein interfaces provides crucial information for research on protein-protein interaction and drug design. Existing machine learning methods generally judge whether a given residue is likely to be a hot spot by extracting features only from the target residue. However, hot spots usually form a small cluster of residues which are tightly packed together at the center of the protein interface. With this in mind, we present a novel method to extract hybrid features which incorporate a wide range of information about the target residue and its spatially neighboring residues, i.e., the nearest contact residue in the other face (mirror-contact residue) and the nearest contact residue in the same face (intra-contact residue). We provide a novel random forest (RF) model to effectively integrate these hybrid features for predicting hot spots in protein interfaces. Our method achieves an accuracy (ACC) of 82.4% and a Matthews correlation coefficient (MCC) of 0.482 on the Alanine Scanning Energetics Database, and an ACC of 77.6% and an MCC of 0.429 on the Binding Interface Database. In a comparison study, the performance of our RF model exceeds that of other existing methods, such as Robetta, FOLDEF, KFC, KFC2, MINERVA and HotPoint. Of our hybrid features, three physicochemical features of target residues (mass, polarizability, and isoelectric point), the relative side-chain accessible surface area, and the average depth index of mirror-contact residues are found to be the main discriminative features in hot spot prediction. We also confirm that hot spots tend to form large contact surface areas between two interacting proteins. Source data and code are available at: http://www.aporc.org/doc/wiki/HotSpot. PMID:22258275

  9. Using Random Forest to Improve the Downscaling of Global Livestock Census Data.

    PubMed

    Nicolas, Gaëlle; Robinson, Timothy P; Wint, G R William; Conchedda, Giulia; Cinardi, Giuseppina; Gilbert, Marius

    2016-01-01

    Large scale, high-resolution global data on farm animal distributions are essential for spatially explicit assessments of the epidemiological, environmental and socio-economic impacts of the livestock sector. This has been the major motivation behind the development of the Gridded Livestock of the World (GLW) database, which has been extensively used since its first publication in 2007. The database relies on a downscaling methodology whereby census counts of animals in sub-national administrative units are redistributed at the level of grid cells as a function of a series of spatial covariates. The recent upgrade of GLW1 to GLW2 involved automating the processing, improvement of input data, and downscaling at a spatial resolution of 1 km per cell (5 km per cell in the earlier version). The underlying statistical methodology, however, remained unchanged. In this paper, we evaluate new methods to downscale census data with a higher accuracy and increased processing efficiency. Two main factors were evaluated, based on sample census datasets of cattle in Africa and chickens in Asia. First, we implemented and evaluated Random Forest models (RF) instead of stratified regressions. Second, we investigated whether models that predicted the number of animals per rural person (per capita) could provide better downscaled estimates than the previous approach that predicted absolute densities (animals per km2). RF models consistently provided better predictions than the stratified regressions for both continents and species. The benefit of per capita over absolute density models varied according to the species and continent. In addition, different technical options were evaluated to reduce the processing time while maintaining their predictive power. Future GLW runs (GLW 3.0) will apply the new RF methodology with optimized modelling options. The potential benefit of per capita models will need to be further investigated with a better distinction between rural and agricultural
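
    A schematic of RF-based downscaling with synthetic data: a forest learns per-cell densities from spatial covariates using unit-level census means as targets, and the predictions are rescaled so each administrative unit's total matches its census count. The covariates, units, and density model are all invented for the sketch; GLW's actual covariates and per-capita variants are richer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_cells = 5000
cov = rng.standard_normal((n_cells, 6))      # spatial covariates per grid cell
unit = rng.integers(0, 50, n_cells)          # administrative unit of each cell

# Hypothetical truth: density responds nonlinearly to covariates; only the
# per-unit census totals are observed in practice.
true_density = np.exp(0.8 * cov[:, 0] - 0.5 * np.abs(cov[:, 1]))
census = np.bincount(unit, weights=true_density)

# Train on per-cell covariates against the unit-level mean density ...
target = (census / np.bincount(unit))[unit]
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(cov, target)

# ... then redistribute each census total proportionally to the predictions,
# so downscaled cell values sum back to the observed unit counts.
pred = np.clip(rf.predict(cov), 1e-9, None)
unit_pred_sum = np.bincount(unit, weights=pred)
downscaled = pred * census[unit] / unit_pred_sum[unit]
print("unit totals preserved:",
      np.allclose(np.bincount(unit, weights=downscaled), census))
```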

  10. Insight into Best Variables for COPD Case Identification: A Random Forests Analysis

    PubMed Central

    Leidy, Nancy K.; Malley, Karen G.; Steenrod, Anna W.; Mannino, David M.; Make, Barry J.; Bowler, Russ P.; Thomashow, Byron M.; Barr, R. G.; Rennard, Stephen I.; Houfek, Julia F.; Yawn, Barbara P.; Han, Meilan K.; Meldrum, Catherine A.; Bacci, Elizabeth D.; Walsh, John W.; Martinez, Fernando

    2016-01-01

    Rationale This study is part of a larger, multi-method project to develop a questionnaire for identifying undiagnosed cases of chronic obstructive pulmonary disease (COPD) in primary care settings, with specific interest in the detection of patients with moderate to severe airway obstruction or risk of exacerbation. Objectives To examine 3 existing datasets for insight into key features of COPD that could be useful in the identification of undiagnosed COPD. Methods Random forests analyses were applied to the following databases: COPD Foundation Peak Flow Study Cohort (N=5761), Burden of Obstructive Lung Disease (BOLD) Kentucky site (N=508), and COPDGene® (N=10,214). Four scenarios were examined to find the best, smallest sets of variables that distinguished cases and controls:(1) moderate to severe COPD (forced expiratory volume in 1 second [FEV1] <50% predicted) versus no COPD; (2) undiagnosed versus diagnosed COPD; (3) COPD with and without exacerbation history; and (4) clinically significant COPD (FEV1<60% predicted or history of acute exacerbation) versus all others. Results From 4 to 8 variables were able to differentiate cases from controls, with sensitivity ≥73 (range: 73–90) and specificity >68 (range: 68–93). Across scenarios, the best models included age, smoking status or history, symptoms (cough, wheeze, phlegm), general or breathing-related activity limitation, episodes of acute bronchitis, and/or missed work days and non-work activities due to breathing or health. Conclusions Results provide insight into variables that should be considered during the development of candidate items for a new questionnaire to identify undiagnosed cases of clinically significant COPD. PMID:26835508

  11. Using Random Forest to Improve the Downscaling of Global Livestock Census Data

    PubMed Central

    Nicolas, Gaëlle; Robinson, Timothy P.; Wint, G. R. William; Conchedda, Giulia; Cinardi, Giuseppina; Gilbert, Marius

    2016-01-01

    Large scale, high-resolution global data on farm animal distributions are essential for spatially explicit assessments of the epidemiological, environmental and socio-economic impacts of the livestock sector. This has been the major motivation behind the development of the Gridded Livestock of the World (GLW) database, which has been extensively used since its first publication in 2007. The database relies on a downscaling methodology whereby census counts of animals in sub-national administrative units are redistributed at the level of grid cells as a function of a series of spatial covariates. The recent upgrade of GLW1 to GLW2 involved automating the processing, improvement of input data, and downscaling at a spatial resolution of 1 km per cell (5 km per cell in the earlier version). The underlying statistical methodology, however, remained unchanged. In this paper, we evaluate new methods to downscale census data with a higher accuracy and increased processing efficiency. Two main factors were evaluated, based on sample census datasets of cattle in Africa and chickens in Asia. First, we implemented and evaluated Random Forest models (RF) instead of stratified regressions. Second, we investigated whether models that predicted the number of animals per rural person (per capita) could provide better downscaled estimates than the previous approach that predicted absolute densities (animals per km2). RF models consistently provided better predictions than the stratified regressions for both continents and species. The benefit of per capita over absolute density models varied according to the species and continent. In addition, different technical options were evaluated to reduce the processing time while maintaining their predictive power. Future GLW runs (GLW 3.0) will apply the new RF methodology with optimized modelling options. The potential benefit of per capita models will need to be further investigated with a better distinction between rural and agricultural

  12. iDNA-Prot: identification of DNA binding proteins using random forest with grey model.

    PubMed

    Lin, Wei-Zhong; Fang, Jian-An; Xiao, Xuan; Chou, Kuo-Chen

    2011-01-01

    DNA-binding proteins play crucial roles in various cellular processes. Developing high-throughput tools for rapidly and effectively identifying DNA-binding proteins is one of the major challenges in the field of genome annotation. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power. By incorporating features into the general form of pseudo amino acid composition that were extracted from protein sequences via the "grey model", and by adopting the random forest operation engine, we proposed a new predictor, called iDNA-Prot, for identifying uncharacterized proteins as DNA-binding or non-DNA-binding based on their amino acid sequence information alone. The overall success rate of iDNA-Prot was 83.96%, obtained via jackknife tests on a newly constructed stringent benchmark dataset in which none of the included proteins has ≥25% pairwise sequence identity to any other in the same subset. In addition to achieving a high success rate, the computational time for iDNA-Prot is remarkably short in comparison with the relevant existing predictors. Hence it is anticipated that iDNA-Prot may become a useful high-throughput tool for large-scale analysis of DNA-binding proteins. As a user-friendly web server, iDNA-Prot is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iDNA-Prot or http://www.jci-bioinfo.cn/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web server to get the desired results. PMID:21935457

  13. rpiCOOL: A tool for In Silico RNA-protein interaction detection using random forest.

    PubMed

    Akbaripour-Elahabad, Mohammad; Zahiri, Javad; Rafeh, Reza; Eslami, Morteza; Azari, Mahboobeh

    2016-08-01

    Understanding the principles of RNA-protein interactions (RPIs) is of critical importance for providing insights into post-transcriptional gene regulation and is useful for guiding studies of many complex diseases. The limitations and difficulties associated with the experimental determination of RPIs create an urgent need for computational methods for RPI prediction. In this paper, we propose a machine learning method to detect RNA-protein interactions based on sequence information. We used motif information and repetitive patterns, which were extracted from experimentally validated RNA-protein interactions, in combination with sequence composition as descriptors to build a model for RPI prediction via a random forest classifier. About 20% of the "sequence motifs" and "nucleotide composition" features were selected as informative features by the feature selection methods. These results suggest that these two feature types contribute effectively to RPI detection. Results of 10-fold cross-validation experiments on three non-redundant benchmark datasets show a better performance of the proposed method in comparison with the current state-of-the-art methods in terms of various performance measures. In addition, the results revealed that the accuracy of RPI prediction methods can vary considerably across different organisms. We have implemented the proposed method, named rpiCOOL, as a stand-alone tool with a user-friendly graphical user interface (GUI) that enables researchers to predict RNA-protein interactions. rpiCOOL is freely available at http://biocool.ir/rpicool.html for non-commercial use. PMID:27134008
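
    The general recipe described above (sequence-derived features feeding a random forest, scored by 10-fold cross-validation) can be sketched as follows; the toy dinucleotide-composition extractor and the labels are placeholders, not rpiCOOL's actual descriptors:

```python
# Hedged sketch of the feature-plus-random-forest recipe, not rpiCOOL's code.
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def dinucleotide_composition(seq, alphabet="ACGU"):
    """Fraction of each 2-mer in an RNA sequence (toy composition feature)."""
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = {p: 0 for p in pairs}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] = counts.get(seq[i:i + 2], 0) + 1
    total = max(len(seq) - 1, 1)
    return np.array([counts[p] / total for p in pairs])

# Hypothetical labeled pairs: 1 = interacting, 0 = non-interacting.
rna_seqs = ["ACGUACGUGG", "GGGCCAUUAC", "UUAGCCGAUA", "CCGGAAUUGC"] * 25
labels = np.array([1, 0, 1, 0] * 25)
X = np.vstack([dinucleotide_composition(s) for s in rna_seqs])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("10-fold CV accuracy:", cross_val_score(clf, X, labels, cv=10).mean())
```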

  14. Random forest classification of large volume structures for visuo-haptic rendering in CT images

    NASA Astrophysics Data System (ADS)

    Mastmeyer, Andre; Fortmeier, Dirk; Handels, Heinz

    2016-03-01

    For patient-specific voxel-based visuo-haptic rendering of CT scans of the liver area, fully automatic segmentation of large volume structures such as skin, soft tissue, lungs and intestine (risk structures) is important. Using a machine learning based approach, several existing segmentations from 10 segmented gold-standard patients are learned by random decision forests, individually and collectively. The core of this paper is feature selection and the application of the learned classifiers to a new patient data set. In a leave-some-out cross-validation, the obtained full volume segmentations are compared to the gold-standard segmentations of the untrained patients. The proposed classifiers use a multi-dimensional feature space to estimate the hidden truth, instead of relying on clinically standard threshold- and connectivity-based methods. The results of our efficient whole-body section classification are multi-label maps of the considered tissues. For visuo-haptic simulation, other small volume structures would have to be segmented additionally; we also take a look at these structures (liver vessels). In an experimental leave-some-out study of 10 patients, the proposed method performs much more efficiently than state-of-the-art methods. In two variants of leave-some-out experiments we obtain best mean DICE ratios of 0.79, 0.97, 0.63 and 0.83 for skin, soft tissue, hard bone and risk structures. Liver structures are segmented with DICE 0.93 for the liver, 0.43 for blood vessels and 0.39 for bile vessels.
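
    For reference, the DICE ratio quoted above has a simple closed form; a minimal sketch for one binary label:

```python
# Dice similarity coefficient between an automatic and a gold-standard mask.
import numpy as np

def dice(seg_a, seg_b):
    """DSC = 2|A n B| / (|A| + |B|) for boolean masks of equal shape."""
    a, b = np.asarray(seg_a, bool), np.asarray(seg_b, bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

auto = np.zeros((64, 64), bool); auto[10:40, 10:40] = True
gold = np.zeros((64, 64), bool); gold[12:42, 12:42] = True
print("DSC:", round(dice(auto, gold), 3))
```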

  15. Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models

    PubMed Central

    Svetlichnyy, Dmitry; Imrichova, Hana; Fiers, Mark; Kalender Atak, Zeynep; Aerts, Stein

    2015-01-01

    Cancer genomes contain vast amounts of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient is not available, thus relying only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with the reference sequence. This way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a “gain-of-target” for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes. PMID:26562774

  16. Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models.

    PubMed

    Svetlichnyy, Dmitry; Imrichova, Hana; Fiers, Mark; Kalender Atak, Zeynep; Aerts, Stein

    2015-11-01

    Cancer genomes contain vast amounts of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient is not available, thus relying only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with the reference sequence. This way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a "gain-of-target" for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes. PMID:26562774
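
    A conceptual sketch of the PRIME idea, scoring the mutated window against the reference with a per-TF random forest and taking the difference, might look like the following; the k-mer features, training labels and sequences are crude hypothetical stand-ins for the authors' pipeline:

```python
# Conceptual PRIME-style delta scoring (not the authors' pipeline): a per-TF
# random forest trained on k-mer features of regulatory regions scores the
# reference and the mutated sequence; the difference is the predicted impact.
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]

def kmer_features(seq):
    return np.array([seq.count(k) for k in KMERS], float)

# Hypothetical training set: active vs inactive regions for one TF.
rng = np.random.default_rng(1)
seqs = ["".join(rng.choice(list("ACGT"), 50)) for _ in range(200)]
y = rng.integers(0, 2, 200)          # placeholder activity labels
model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(np.vstack([kmer_features(s) for s in seqs]), y)

ref = ("ACGTTGCA" * 7)[:50]          # toy reference enhancer window
mut = ref[:24] + "T" + ref[25:]      # single substitution at position 24
delta = (model.predict_proba(kmer_features(mut)[None])[0, 1]
         - model.predict_proba(kmer_features(ref)[None])[0, 1])
print("predicted impact (mutant - reference):", round(delta, 3))
```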

  17. Pathway analysis of single-nucleotide polymorphisms potentially associated with glioblastoma multiforme susceptibility using random forests.

    PubMed

    Chang, Jeffrey S; Yeh, Ru-Fang; Wiencke, John K; Wiemels, Joseph L; Smirnov, Ivan; Pico, Alexander R; Tihan, Tarik; Patoka, Joe; Miike, Rei; Sison, Jennette D; Rice, Terri; Wrensch, Margaret R

    2008-06-01

    Glioma is a complex disease that is unlikely to result from the effect of a single gene. Genetic analysis at the pathway level involving multiple genes may be more likely to capture gene-disease associations than analyzing genes one at a time. The current pilot study included 112 Caucasians with glioblastoma multiforme and 112 Caucasian healthy controls frequency-matched to cases by age and gender. Subjects were genotyped using a commercially available (ParAllele/Affymetrix) assay panel of 10,177 nonsynonymous coding single-nucleotide polymorphisms (SNPs) spanning the genome as known at the time the panel was constructed. For this analysis, we selected 10 pathways potentially involved in gliomagenesis that had SNPs represented on the panel. We performed random forests (RF) analyses of SNPs within each pathway group and logistic regression to assess interaction among genes in the one pathway for which the RF prediction error was better than chance and the permutation P < 0.10. Only the DNA repair pathway had a better-than-chance classification of case-control status, with a prediction error of 45.5% and P = 0.09. Three SNPs (rs1047840 of EXO1, rs12450550 of EME1, and rs799917 of BRCA1) in the DNA repair pathway were identified as promising candidates for further replication. In addition, statistically significant interactions (P < 0.05) between rs1047840 of EXO1 and rs799917 or rs1799966 of BRCA1 were observed. Despite less than complete inclusion of genes and SNPs relevant to glioma and a small sample size, RF analysis identified one important biological pathway and several SNPs potentially associated with the development of glioblastoma. PMID:18559551
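
    The pathway-level screen described above can be sketched as follows, under the assumption that "prediction error" is the random forest out-of-bag error and that the permutation P value comes from refitting on shuffled case-control labels; genotypes here are synthetic:

```python
# Permutation check of RF prediction error against chance (hedged sketch).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, (224, 50))       # 224 subjects x 50 SNPs (0/1/2 genotypes)
y = rng.integers(0, 2, 224)             # case/control status

def oob_error(X, y):
    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    rf.fit(X, y)
    return 1.0 - rf.oob_score_

observed = oob_error(X, y)
# null distribution: OOB error when the labels carry no signal
null = [oob_error(X, rng.permutation(y)) for _ in range(50)]
p_value = np.mean([e <= observed for e in null])
print(f"OOB error {observed:.3f}, permutation P = {p_value:.2f}")
```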

  18. Dust storm detection using random forests and physical-based approaches over the Middle East

    NASA Astrophysics Data System (ADS)

    Souri, Amir Hossein; Vajedian, Sanaz

    2015-07-01

    Dust storms are important phenomena over large regions of the arid and semi-arid areas of the Middle East. Due to the influence of dust aerosols on climate and human daily activities, dust detection plays a crucial role in environmental and climatic studies. Detection of dust storms is critical to accurately understanding dust, its properties and its distribution. Currently, remotely sensed data such as MODIS (Moderate Resolution Imaging Spectroradiometer), with appropriate temporal and spectral resolutions, have been widely used for this purpose. This paper investigates, for the first time, the capability of two physical-based methods and a random forests (RF) classifier to detect dust storms using MODIS imagery. Since the physical-based approaches are empirical, they suffer from certain drawbacks such as high variability of thresholds depending on the underlying surface; therefore, classification-based approaches can be deployed as an alternative. In this paper, the most relevant bands are chosen based on the physical effects of the major classes, particularly dust, cloud and snow, on both emissive infrared and reflective bands. In order to verify the capability of the methods, the OMAERUV AAOD (aerosol absorption optical depth) product from the OMI (Ozone Monitoring Instrument) sensor is exploited. In addition, some small regions are selected manually to be considered as ground truth for measuring the probability of false detection (POFD) and probability of missing detection (POMD). The dust class generated by RF is qualitatively consistent with the location and extent of dust observed in OMAERUV and MODIS true colour images. Quantitatively, the dust classes generated for eight dust outbreaks in the Middle East are found to be accurate, with a POFD of 7% and a POMD of 6%, respectively. Moreover, results demonstrate the sound capability of RF in classifying dust plumes over both water and land simultaneously. The performance of the physical-based approaches is found weaker than RF
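
    Assuming the usual contingency-table definitions, the two verification scores used above reduce to a few lines:

```python
# POFD = false alarms among non-dust truth pixels;
# POMD = misses among dust truth pixels.
import numpy as np

def pofd_pomd(truth, pred):
    """truth, pred: boolean arrays where True marks dust pixels."""
    truth, pred = np.asarray(truth, bool), np.asarray(pred, bool)
    pofd = np.logical_and(~truth, pred).sum() / max((~truth).sum(), 1)
    pomd = np.logical_and(truth, ~pred).sum() / max(truth.sum(), 1)
    return pofd, pomd

truth = np.array([1, 1, 1, 0, 0, 0, 0, 0], bool)
pred = np.array([1, 1, 0, 1, 0, 0, 0, 0], bool)
print("POFD = %.2f, POMD = %.2f" % pofd_pomd(truth, pred))
```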

  19. In vivo MRI based prostate cancer localization with random forests and auto-context model.

    PubMed

    Qian, Chunjun; Wang, Li; Gao, Yaozong; Yousuf, Ambereen; Yang, Xiaoping; Oto, Aytekin; Shen, Dinggang

    2016-09-01

    Prostate cancer is one of the major causes of cancer death for men. Magnetic resonance (MR) imaging is being increasingly used as an important modality to localize prostate cancer. Therefore, localizing prostate cancer in MRI with automated detection methods has become an active area of research. Many methods have been proposed for this task. However, most previous methods focused on identifying cancer only in the peripheral zone (PZ), or on classifying suspicious cancer ROIs into benign tissue and cancer tissue. Little work has been done on developing a fully automatic method for cancer localization in the entire prostate region, including the central gland (CG) and transition zone (TZ). In this paper, we propose a novel learning-based multi-source integration framework to directly localize prostate cancer regions from in vivo MRI. We employ random forests to effectively integrate features from multi-source images for cancer localization. Here, multi-source images include initially the multi-parametric MRIs (i.e., T2, DWI, and dADC) and later also the iteratively estimated and refined tissue probability map of prostate cancer. Experimental results on data from 26 real patients show that our method can accurately localize cancerous sections. The higher section-based evaluation (SBE), combined with the ROC analysis results of individual patients, shows that the proposed method is promising for in vivo MRI-based prostate cancer localization, which can be used for guiding prostate biopsy, targeting the tumor in focal therapy planning, triage and follow-up of patients with active surveillance, as well as decision making in treatment selection. The common ROC analysis with an AUC value of 0.832 and the ROI-based ROC analysis with an AUC value of 0.883 both illustrate the effectiveness of our proposed method. PMID:27048995

  20. Random Forests to Predict Rectal Toxicity Following Prostate Cancer Radiation Therapy

    SciTech Connect

    Ospina, Juan D.; Zhu, Jian; Chira, Ciprian; Bossi, Alberto; Delobel, Jean B.; Beckendorf, Véronique; Dubray, Bernard; Lagrange, Jean-Léon; Correa, Juan C.; and others

    2014-08-01

    Purpose: To propose a random forest normal tissue complication probability (RF-NTCP) model to predict late rectal toxicity following prostate cancer radiation therapy, and to compare its performance to that of classic NTCP models. Methods and Materials: Clinical data and dose-volume histograms (DVH) were collected from 261 patients who received 3-dimensional conformal radiation therapy for prostate cancer, with at least 5 years of follow-up. The series was split 1000 times into training and validation cohorts. A RF was trained to predict the risk of 5-year overall rectal toxicity and bleeding. Parameters of the Lyman-Kutcher-Burman (LKB) model were identified and a logistic regression model was fit. The performance of all the models was assessed by computing the area under the receiver operating characteristic curve (AUC). Results: The 5-year grade ≥2 overall rectal toxicity and grade ≥1 and grade ≥2 rectal bleeding rates were 16%, 25%, and 10%, respectively. Predictive capabilities were obtained using the RF-NTCP model for all 3 toxicity endpoints, in both the training and validation cohorts. Age and use of anticoagulants were found to be predictors of rectal bleeding. The AUC for RF-NTCP ranged from 0.66 to 0.76, depending on the toxicity endpoint. The AUC values for the LKB-NTCP were statistically significantly inferior, ranging from 0.62 to 0.69. Conclusions: The RF-NTCP model may be a useful new tool in predicting late rectal toxicity, including variables other than the DVH, and thus appears to be a strong competitor to classic NTCP models.
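
    For comparison, the classic LKB model referred to above has a standard closed form (textbook definitions, not taken from the paper):

```latex
% Standard Lyman-Kutcher-Burman NTCP model: probit of the generalized
% equivalent uniform dose (gEUD).
\[
  \mathrm{NTCP} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^{2}/2}\,dx,
  \qquad
  t = \frac{\mathrm{gEUD} - TD_{50}}{m \, TD_{50}},
  \qquad
  \mathrm{gEUD} = \Bigl( \sum_{i} v_{i} D_{i}^{1/n} \Bigr)^{n},
\]
% where $v_i$ is the fractional organ volume receiving dose $D_i$, and
% $TD_{50}$, $m$, $n$ are the fitted model parameters.
```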

  1. Identifying Important Risk Factors for Survival in Kidney Graft Failure Patients Using Random Survival Forests

    PubMed Central

    HAMIDI, Omid; POOROLAJAL, Jalal; FARHADIAN, Maryam; TAPAK, Leili

    2016-01-01

    Background: Kidney transplantation is the best alternative treatment for end-stage renal disease. Several studies have been devoted to investigating predisposing factors of graft rejection; however, there is inconsistency between the results. The objective of the present study was to utilize an intuitive and robust approach for variable selection, random survival forests (RSF), to identify important risk factors in kidney transplantation patients. Methods: The data set included 378 patients with kidney transplantation obtained through a historical cohort study in Hamadan, western Iran, from 1994 to 2011. The event of interest was chronic nonreversible graft rejection, and the duration between kidney transplantation and rejection was considered the survival time. The RSF method was used to identify important risk factors for survival of the patients among the potential predictors of graft rejection. Results: The mean survival time was 7.35±4.62 yr. Thirty-seven episodes of rejection occurred. The most important predictors of survival were cold ischemic time, recipient’s age, creatinine level at discharge, donors’ age and duration of hospitalization. The RSF method predicted survival better than the conventional Cox proportional hazards model (out-of-bag C-index of 0.965 for RSF vs. 0.766 for the Cox model and integrated Brier score of 0.081 for RSF vs. 0.088 for the Cox model). Conclusion: An RSF model in kidney transplantation patients outperformed the traditional Cox proportional hazards model. RSF is a promising method that may serve as a more intuitive approach to identify important risk factors for graft rejection. PMID:27057518
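
    The out-of-bag C-index quoted above is, in its basic (Harrell) form, a count over usable pairs; a minimal sketch for right-censored data:

```python
# Harrell's concordance index: among usable pairs, the fraction where the
# model ranks the earlier-failing patient as higher risk.
import numpy as np

def c_index(time, event, risk):
    """time: survival times; event: 1 if rejection observed, 0 if censored;
    risk: model risk scores (higher = earlier failure expected)."""
    num, den = 0.0, 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is usable if i failed and did so before j's time
            if event[i] == 1 and time[i] < time[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

t = np.array([2.0, 5.0, 7.0, 9.0]); e = np.array([1, 1, 0, 1])
r = np.array([0.9, 0.6, 0.3, 0.2])
print("C-index:", c_index(t, e, r))   # 1.0: risk ordering matches failures
```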

  2. Water chemistry in 179 randomly selected Swedish headwater streams related to forest production, clear-felling and climate.

    PubMed

    Löfgren, Stefan; Fröberg, Mats; Yu, Jun; Nisell, Jakob; Ranneby, Bo

    2014-12-01

    From a policy perspective, it is important to understand forestry effects on surface waters from a landscape perspective. The EU Water Framework Directive demands remedial actions if good ecological status is not achieved. In Sweden, 44 % of the surface water bodies have moderate ecological status or worse. Many of these drain catchments with a mosaic of managed forests. It is important for the forestry sector and water authorities to be able to identify where, in the forested landscape, special precautions are necessary. The aim of this study was to quantify the relations between forestry parameters and headwater stream concentrations of nutrients, organic matter and acid-base chemistry. The results are put into the context of regional climate, sulphur and nitrogen deposition, as well as marine influences. Water chemistry was measured in 179 randomly selected headwater streams from two regions in southwest and central Sweden, corresponding to 10 % of the Swedish land area. Forest status was determined from satellite images and Swedish National Forest Inventory data using the probabilistic classifier method, which was used to model stream water chemistry with Bayesian model averaging. The results indicate that concentrations of e.g. nitrogen, phosphorus and organic matter are related to factors associated with forest production, but that it is not forestry per se that causes the excess losses. Instead, factors simultaneously affecting forest production and stream water chemistry, such as climate, extensive soil pools and nitrogen deposition, are the most likely candidates. The relationships with clear-felled and wetland areas are likely to be direct effects. PMID:25260924

  3. Using Dynamic Programming and Genetic Algorithms to Reduce Erosion Risks From Forest Roads

    NASA Astrophysics Data System (ADS)

    Madej, M.; Eschenbach, E.; Teasley, R.; Diaz, C.; Wartella, J.; Simi, J.

    2002-12-01

    Many anadromous fisheries streams in the Pacific Northwest have been damaged by various land use activities, including timber harvest and road construction. Unpaved forest roads can cause erosion and downstream sedimentation damage in anadromous fish-bearing streams. Although road decommissioning and road upgrading activities have been conducted on many of these roads, these activities have usually been implemented and evaluated on a site-specific basis without the benefit of a watershed perspective. Land managers still struggle with designing the most effective road treatment plan to minimize erosion while keeping costs reasonable across a large land base. Trade-offs between costs of different levels of treatment and the net effect on reducing sediment risks to streams need to be quantified. For example, which problems should be treated first, and by what treatment method? Is it better to fix one large problem or 100 small problems? If sediment reduction to anadromous fish-bearing streams is the desired outcome of road treatment activities, a more rigorous evaluation of risks and optimization of treatments is needed. Two approaches, Dynamic Programming (DP) and Genetic Algorithms (GA), were successfully used to determine the most effective treatment levels for roads and stream crossings in a pilot study basin with approximately 200 road segments and stream crossings and in an actual watershed with approximately 600 road segments and crossings. The optimization models determine the treatment levels for roads and crossings that maximize the total sediment saved within a watershed while maintaining the total treatment cost within the specified budget. The optimization models import GIS data on roads and crossings and export the optimal treatment level for each road and crossing to the GIS watershed model.
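
    With one treatment level per site, the budget-constrained problem described above reduces to a 0/1 knapsack; a minimal dynamic-programming sketch with hypothetical costs and sediment savings (the study's actual models allow multiple treatment levels per site):

```python
# 0/1 knapsack by dynamic programming: maximize sediment saved within budget.
def best_treatment_plan(costs, savings, budget):
    """costs, savings: per-site values in integer units; budget: integer."""
    n = len(costs)
    # dp[b] = max sediment saved with budget b using sites seen so far
    dp = [0] * (budget + 1)
    choice = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for b in range(budget, costs[i] - 1, -1):   # reverse: each site once
            if dp[b - costs[i]] + savings[i] > dp[b]:
                dp[b] = dp[b - costs[i]] + savings[i]
                choice[i][b] = True
    # backtrack to recover which sites to treat
    plan, b = [], budget
    for i in range(n - 1, -1, -1):
        if choice[i][b]:
            plan.append(i)
            b -= costs[i]
    return dp[budget], sorted(plan)

costs = [4, 2, 7, 3, 5]          # hypothetical treatment cost per site
savings = [10, 4, 15, 7, 9]      # hypothetical expected sediment saved
print(best_treatment_plan(costs, savings, budget=10))   # (21, [0, 1, 3])
```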

  4. SU-D-201-06: Random Walk Algorithm Seed Localization Parameters in Lung Positron Emission Tomography (PET) Images

    SciTech Connect

    Soufi, M; Asl, A Kamali; Geramifar, P

    2015-06-15

    Purpose: The objective of this study was to find the best seed localization parameters for the random walk algorithm applied to lung tumor delineation in Positron Emission Tomography (PET) images. Methods: PET images suffer from statistical noise, and therefore tumor delineation in these images is a challenging task. The random walk algorithm, a graph-based image segmentation technique, is robust to image noise, and its fast computation and fast editing characteristics make it powerful for clinical purposes. We implemented the random walk algorithm in MATLAB. The validation and verification of the algorithm were performed with a 4D-NCAT phantom with spherical lung lesions of different diameters from 20 to 90 mm (in incremental steps of 10 mm) and tumor-to-background ratios of 4:1 and 8:1. STIR (Software for Tomographic Image Reconstruction) was applied to reconstruct the phantom PET images with pixel sizes of 2×2×2 and 4×4×4 mm³. For seed localization, we selected pixels with different maximum Standardized Uptake Value (SUVmax) percentages: at least (70%, 80%, 90% and 100%) SUVmax for foreground seeds and up to (20% to 55%, in 5% increments) SUVmax for background seeds. Also, to investigate algorithm performance on clinical data, 19 patients with lung tumors were studied. The resulting contours were compared with manual contours by a nuclear medicine expert as ground truth. Results: Phantom and clinical lesion segmentation showed that the best segmentation results were obtained by selecting pixels with at least 70% SUVmax as foreground seeds and pixels up to 30% SUVmax as background seeds. Mean Dice Similarity Coefficients of 94% ± 5% (83% ± 6%) and mean Hausdorff Distances of 1 (2) pixels were obtained for the phantom (clinical) study. Conclusion: The accurate results of the random walk algorithm in PET image segmentation assure its application for radiation treatment planning and
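
    The seed-localization rule described above can be sketched directly; the synthetic slice stands in for a PET image, and the resulting map could be handed to a random-walker segmenter such as the one in scikit-image:

```python
# Voxels at/above a fraction of SUVmax become foreground seeds, voxels at/below
# a lower fraction become background seeds, the rest is left for the walker.
import numpy as np

def rw_seeds(suv, fg_frac=0.70, bg_frac=0.30):
    """Return a seed map: 1 = tumor seed, 2 = background seed, 0 = unlabeled."""
    suv_max = suv.max()
    seeds = np.zeros(suv.shape, np.uint8)
    seeds[suv >= fg_frac * suv_max] = 1
    seeds[suv <= bg_frac * suv_max] = 2
    return seeds

yy, xx = np.mgrid[:64, :64]
slice_ = 8.0 * np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / 60.0)   # hot lesion
slice_ += 0.4 * np.random.default_rng(0).random((64, 64))          # noise
seeds = rw_seeds(slice_)
print("tumor seeds:", (seeds == 1).sum(), "| background seeds:", (seeds == 2).sum())
```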

  5. A fast random walk algorithm for computing the pulsed-gradient spin-echo signal in multiscale porous media.

    PubMed

    Grebenkov, Denis S

    2011-02-01

    A new method for computing the signal attenuation due to restricted diffusion in a linear magnetic field gradient is proposed. A fast random walk (FRW) algorithm for simulating random trajectories of diffusing spin-bearing particles is combined with gradient encoding. As random moves of a FRW are continuously adapted to local geometrical length scales, the method is efficient for simulating pulsed-gradient spin-echo experiments in hierarchical or multiscale porous media such as concrete, sandstones, sedimentary rocks and, potentially, brain or lungs. PMID:21159532

  6. Random forest learning of ultrasonic statistical physics and object spaces for lesion detection in 2D sonomammography

    NASA Astrophysics Data System (ADS)

    Sheet, Debdoot; Karamalis, Athanasios; Kraft, Silvan; Noël, Peter B.; Vag, Tibor; Sadhu, Anup; Katouzian, Amin; Navab, Nassir; Chatterjee, Jyotirmoy; Ray, Ajoy K.

    2013-03-01

    Breast cancer is the most common form of cancer in women. Early diagnosis can significantly improve life expectancy and allow different treatment options. Clinicians favor 2D ultrasonography for breast tissue abnormality screening due to its high sensitivity and specificity compared to competing technologies. However, inter- and intra-observer variability in visual assessment and reporting of lesions often handicaps its performance. Existing Computer Assisted Diagnosis (CAD) systems, though able to detect solid lesions, are often restricted in performance. These restrictions include the inability to (1) detect lesions of multiple sizes and shapes, and (2) differentiate hypo-echoic lesions from their posterior acoustic shadowing. In this work we present a completely automatic system for detection and segmentation of breast lesions in 2D ultrasound images. We employ random forests for learning of a tissue-specific primal to discriminate breast lesions from surrounding normal tissues. This enables the system to detect lesions of multiple shapes and sizes, as well as discriminate hypo-echoic lesions from associated posterior acoustic shadowing. The primal comprises (i) multiscale estimated ultrasonic statistical physics and (ii) scale-space characteristics. The random forest learns the lesion vs. background primal from a database of 2D ultrasound images with labeled lesions. For segmentation, the posterior probabilities of lesion pixels estimated by the learnt random forest are hard-thresholded to provide a random walks segmentation stage with starting seeds. Our method achieves detection with 99.19% accuracy and segmentation with mean contour-to-contour error < 3 pixels on a set of 40 images with 49 lesions.

  7. Excitation energies from particle-particle random phase approximation: Davidson algorithm and benchmark studies.

    PubMed

    Yang, Yang; Peng, Degao; Lu, Jianfeng; Yang, Weitao

    2014-09-28

    The particle-particle random phase approximation (pp-RPA) has been used to investigate excitation problems in our recent paper [Y. Yang, H. van Aggelen, and W. Yang, J. Chem. Phys. 139, 224105 (2013)]. It has been shown to be capable of describing double, Rydberg, and charge transfer excitations, which are challenging for conventional time-dependent density functional theory (TDDFT). However, its performance on larger molecules is unknown as a result of its expensive O(N⁶) scaling. In this article, we derive and implement a Davidson iterative algorithm for the pp-RPA to calculate the lowest few excitations for large systems. The formal scaling is reduced to O(N⁴), which is comparable with the commonly used configuration interaction singles (CIS) and TDDFT methods. With this iterative algorithm, we carried out benchmark tests on molecules that are significantly larger than the molecules in our previous paper with a reasonably large basis set. Despite some self-consistent field convergence problems with ground-state calculations of (N - 2)-electron systems, we are able to accurately capture the lowest few excitations for systems with converged calculations. Compared to CIS and TDDFT, there is no systematic bias for the pp-RPA, with the mean signed error close to zero. The mean absolute error of pp-RPA with B3LYP or PBE references is similar to that of TDDFT, which suggests that the pp-RPA is a comparable method to TDDFT for large molecules. Moreover, excitations with relatively large non-HOMO excitation contributions are also well described in terms of excitation energies, as long as there is also a relatively large HOMO excitation contribution. These findings, in conjunction with the capability of pp-RPA for describing challenging excitations shown earlier, further demonstrate the potential of pp-RPA as a reliable and general method to describe excitations, and to be a good alternative to TDDFT methods. PMID:25273409
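
    The Davidson idea itself, iteratively expanding a small subspace with diagonally preconditioned residuals, can be sketched for a generic symmetric, diagonally dominant matrix (this is the algorithmic skeleton only, not the pp-RPA working equations):

```python
# Generic Davidson iteration for the lowest eigenvalue of a large symmetric,
# diagonally dominant matrix.
import numpy as np

def davidson_lowest(A, tol=1e-8, max_iter=50):
    n = A.shape[0]
    diag = np.diag(A)
    V = np.zeros((n, 1)); V[np.argmin(diag), 0] = 1.0   # initial guess vector
    for _ in range(max_iter):
        # Rayleigh-Ritz step in the current subspace
        H = V.T @ A @ V
        theta, s = np.linalg.eigh(H)
        theta, s = theta[0], s[:, 0]
        x = V @ s
        r = A @ x - theta * x                 # residual
        if np.linalg.norm(r) < tol:
            return theta, x
        # diagonal (Jacobi) preconditioner, then orthogonalize and expand
        denom = diag - theta
        denom[np.abs(denom) < 1e-12] = 1e-12
        t = r / denom
        t -= V @ (V.T @ t)
        norm = np.linalg.norm(t)
        if norm < 1e-12:
            return theta, x
        V = np.hstack([V, (t / norm)[:, None]])
    return theta, x

rng = np.random.default_rng(0)
n = 400
A = np.diag(np.arange(1.0, n + 1)) + 1e-2 * rng.standard_normal((n, n))
A = (A + A.T) / 2                              # symmetrize
theta, _ = davidson_lowest(A)
print("Davidson:", theta, "| exact:", np.linalg.eigvalsh(A)[0])
```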

  8. Variances in the projections, resulting from CLIMEX, Boosted Regression Trees and Random Forests techniques

    NASA Astrophysics Data System (ADS)

    Shabani, Farzin; Kumar, Lalit; Solhjouy-fard, Samaneh

    2016-05-01

    The aim of this study was to have a comparative investigation and evaluation of the capabilities of correlative and mechanistic modeling processes, applied to the projection of future distributions of date palm in novel environments and to establish a method of minimizing uncertainty in the projections of differing techniques. The location of this study on a global scale is in Middle Eastern Countries. We compared the mechanistic model CLIMEX (CL) with the correlative models MaxEnt (MX), Boosted Regression Trees (BRT), and Random Forests (RF) to project current and future distributions of date palm (Phoenix dactylifera L.). The Global Climate Model (GCM), the CSIRO-Mk3.0 (CS) using the A2 emissions scenario, was selected for making projections. Both indigenous and alien distribution data of the species were utilized in the modeling process. The common areas predicted by MX, BRT, RF, and CL from the CS GCM were extracted and compared to ascertain projection uncertainty levels of each individual technique. The common areas identified by all four modeling techniques were used to produce a map indicating suitable and unsuitable areas for date palm cultivation for Middle Eastern countries, for the present and the year 2100. The four different modeling approaches predict fairly different distributions. Projections from CL were more conservative than from MX. The BRT and RF were the most conservative methods in terms of projections for the current time. The combination of the final CL and MX projections for the present and 2100 provide higher certainty concerning those areas that will become highly suitable for future date palm cultivation. According to the four models, cold, hot, and wet stress, with differences on a regional basis, appears to be the major restrictions on future date palm distribution. The results demonstrate variances in the projections, resulting from different techniques. The assessment and interpretation of model projections requires reservations

  9. Highly efficient numerical algorithm based on random trees for accelerating parallel Vlasov-Poisson simulations

    NASA Astrophysics Data System (ADS)

    Acebrón, Juan A.; Rodríguez-Rozas, Ángel

    2013-10-01

    An efficient numerical method based on a probabilistic representation for the Vlasov-Poisson system of equations in the Fourier space has been derived. This has been done theoretically for arbitrary dimensional problems, and particularized to unidimensional problems for numerical purposes. Such a representation has been validated theoretically in the linear regime by comparing the solution obtained with the classical results of the linear Landau damping. The numerical strategy followed requires generating suitable random trees combined with a Padé approximant for accurately approximating a given divergent series. Such series are obtained by summing the partial contributions to the solution coming from trees with an arbitrary number of branches. These contributions, coming in general from multi-dimensional definite integrals, are efficiently computed by a quasi-Monte Carlo method. It is shown how the accuracy of the method can be effectively increased by considering more terms of the series. The new representation was used successfully to develop a Probabilistic Domain Decomposition method suited for massively parallel computers, which improves the scalability found in classical methods. Finally, a few numerical examples based on classical phenomena such as the non-linear Landau damping and the two-stream instability are given, illustrating the remarkable performance of the algorithm when comparing the results with those obtained using a classical method.
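
    The resummation step described above can be illustrated with the classic Euler series, a divergent asymptotic series whose Padé approximant stays close to the underlying Stieltjes integral; this stands in for the paper's tree-generated partial sums:

```python
# Pade resummation of a divergent series (sum_k (-1)^k k! x^k, the asymptotic
# expansion of the integral computed below).
import numpy as np
from math import factorial
from scipy.interpolate import pade
from scipy.integrate import quad

x = 0.5
coeffs = [(-1) ** k * factorial(k) for k in range(10)]   # Taylor coefficients

# naive partial sums diverge...
partial = np.cumsum([c * x ** k for k, c in enumerate(coeffs)])

# ...while the [4/5] Pade approximant should land near the integral's value
p, q = pade(coeffs, 5)
exact = quad(lambda t: np.exp(-t) / (1 + x * t), 0, np.inf)[0]
print("last partial sums:", partial[-3:])
print("Pade [4/5]:", p(x) / q(x), "| exact integral:", exact)
```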

  10. Comparison of Logistic Regression and Random Forests techniques for shallow landslide susceptibility assessment in Giampilieri (NE Sicily, Italy)

    NASA Astrophysics Data System (ADS)

    Trigila, Alessandro; Iadanza, Carla; Esposito, Carlo; Scarascia-Mugnozza, Gabriele

    2015-11-01

    The aim of this work is to define reliable susceptibility models for shallow landslides using Logistic Regression and Random Forests multivariate statistical techniques. The study area, located in North-East Sicily, was hit on October 1st 2009 by a severe rainstorm (225 mm of cumulative rainfall in 7 h) which caused flash floods and more than 1000 landslides. Several small villages, such as Giampilieri, were hit, with 31 fatalities, 6 missing persons and damage to buildings and transportation infrastructure. Landslides, mainly earth and debris translational slides evolving into debris flows, were triggered on steep slopes and involved colluvium and regolith materials which cover the underlying metamorphic bedrock. The work has been carried out with the following steps: i) realization of a detailed event landslide inventory map through field surveys coupled with observation of high-resolution aerial colour orthophotos; ii) identification of landslide source areas; iii) data preparation of landslide controlling factors and descriptive statistics based on a bivariate method (Frequency Ratio) to get an initial overview of existing relationships between causative factors and shallow landslide source areas; iv) choice of criteria for the selection and sizing of the mapping unit; v) implementation of 5 multivariate statistical susceptibility models based on Logistic Regression and Random Forests techniques and focused on landslide source areas; vi) evaluation of the influence of sample size and type of sampling on the results and performance of the models; vii) evaluation of the predictive capabilities of the models using ROC curves, AUC and contingency tables; viii) comparison of model results and obtained susceptibility maps; and ix) analysis of temporal variation of landslide susceptibility related to input parameter changes. Models based on Logistic Regression and Random Forests have demonstrated excellent predictive capabilities. Land use and wildfire

  11. Neither Host-specific nor Random: Vascular Epiphytes on Three Tree Species in a Panamanian Lowland Forest

    PubMed Central

    LAUBE, STEFAN; ZOTZ, GERHARD

    2006-01-01

    • Background and Aims A possible role of host tree identity in the structuring of vascular epiphyte communities has attracted scientific attention for decades. Specifically, it has been suggested that each host tree species has a specific subset of the local species pool according to its own set of properties, e.g. physicochemical characteristics of the bark, tree architecture, or leaf phenology patterns. • Methods A novel, quantitative approach to this question is presented, taking advantage of a complete census of the vascular epiphyte community in 0·4 ha of undisturbed lowland forest in Panama. For three locally common host-tree species (Socratea exorrhiza, Marila laxiflora, Perebea xanthochyma) null models were created of the expected epiphyte assemblages assuming that epiphyte colonization reflected random distribution of epiphytes in the forest. • Key Results In all three tree species, abundances of the majority of epiphyte species (69–81 %) were indistinguishable from random, while the remaining species were about equally over- or under-represented compared with their occurrence in the entire forest plot. Permutations based on the number of colonized trees (reflecting observed spatial patchiness) yielded similar results. Finally, a third analysis (canonical correspondence analysis) also confirmed host-specific differences in epiphyte assemblages. In spite of pronounced preferences of some epiphytes for particular host trees, no epiphyte species was restricted to a single host. • Conclusions The epiphytes on a given tree species are not simply a random sample of the local species pool, but there are no indications of host specificity either. PMID:16574691

  12. Segmentation of prostate from CT scans using a combined voxel random forests classification with spherical harmonics regularization

    NASA Astrophysics Data System (ADS)

    Commandeur, F.; Acosta, O.; Simon, A.; Ospina Arango, J. D.; Dillenseger, J. L.; Mathieu, R.; Haigron, P.; de Crevoisier, R.

    2015-01-01

    In prostate cancer external beam radiotherapy, identification of pelvic structures in computed tomography (CT) is required for treatment planning and is performed manually by experts. Manual delineation of the prostate in CT is time consuming and prone to observer variability. We propose a fully automated process using a combination of Random Forests (RF) classification and Spherical Harmonics (SPHARM) to identify the prostate boundaries. The proposed method outperformed a classical atlas-based approach from the literature. Combining RF to detect the prostate and SPHARM for shape regularization provided promising results for automatic prostate segmentation.

  13. Normalized algorithm for mapping and dating forest disturbances and regrowth for the United States

    NASA Astrophysics Data System (ADS)

    He, Liming; Chen, Jing M.; Zhang, Shaoliang; Gomez, Gustavo; Pan, Yude; McCullough, Kevin; Birdsey, Richard; Masek, Jeffrey G.

    2011-04-01

    Forest disturbances such as harvesting, wildfire and insect infestation are critical ecosystem processes affecting the carbon cycle. Because carbon dynamics are related to time since disturbance, forest stand age, which can be used as a surrogate for major clear-cut/fire disturbance information, has recently been recognized as an important input to forest carbon cycle models for improving prediction accuracy. In this study, forest disturbances in the USA for the period ~1990-2000 were mapped using 400+ pairs of re-sampled Landsat TM/ETM scenes at 500 m resolution, which were provided by the Landsat Ecosystem Disturbance Adaptive Processing System project. The detected disturbances were then separated into two five-year age groups, facilitated by Forest Inventory and Analysis (FIA) data, which were used to calculate the area of forest regeneration for each county in the USA. In this study, a disturbance index (DI) was defined as the ratio of short wave-infrared (SWIR, band 5) to near-infrared (NIR, band 4) reflectance. Forest disturbances were identified through the Normalized Difference of Disturbance Index (NDDI) between circa 2000 and 1990, where a positive NDDI means disturbance and a negative NDDI means regrowth. Axis rotation was performed on the plot between the DIs of the two matched Landsat scenes in order to reduce any difference in DIs caused by non-disturbance factors. The threshold of NDDI for each TM/ETM pair was determined by analysis of FIA data. Minor disturbances affecting small areas may be omitted due to the coarse resolution of the aggregated Landsat data, but the major stand-clearing disturbances (clear-cut harvest, fire) are captured. The spatial distribution of the detected disturbed areas was validated with Monitoring Trends in Burn Severity fire data in four states of the western USA (Washington, Oregon, Idaho, and California). Results indicate omission errors of 66.9%. An important application of this remote sensing-based disturbance map is
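
    Assuming the usual normalized-difference form for NDDI (the paper's exact normalization may differ), the two indices reduce to:

```python
# DI = SWIR/NIR; NDDI compares DI between the two epochs.
import numpy as np

def disturbance_index(swir_b5, nir_b4):
    return swir_b5 / np.maximum(nir_b4, 1e-6)

def nddi(di_2000, di_1990):
    return (di_2000 - di_1990) / (di_2000 + di_1990)

# toy reflectance pairs: an undisturbed pixel and a clear-cut pixel
nir_1990, swir_1990 = np.array([0.30, 0.30]), np.array([0.12, 0.12])
nir_2000, swir_2000 = np.array([0.31, 0.20]), np.array([0.12, 0.22])
di90 = disturbance_index(swir_1990, nir_1990)
di00 = disturbance_index(swir_2000, nir_2000)
print("NDDI:", nddi(di00, di90))   # ~0 for stable forest, > 0 for disturbance
```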

  14. An automatic water body area monitoring algorithm for satellite images based on Markov Random Fields

    NASA Astrophysics Data System (ADS)

    Elmi, Omid; Tourian, Mohammad J.; Sneeuw, Nico

    2016-04-01

    Our knowledge about the spatial and temporal variation of hydrological parameters is surprisingly poor, because most of it is based on in situ stations, and the number of stations has decreased dramatically during the past decades. On the other hand, remote sensing techniques have proven their ability to measure different parameters of Earth phenomena. Optical and SAR satellite imagery provide the opportunity to monitor spatial change in the coastline, which can serve as a way to determine the water extent repeatedly at an appropriate time interval. An appropriate classification technique to separate water and land is the backbone of any automatic water body monitoring. Due to changes in the water level, river and lake extent, atmosphere, sunlight radiation and onboard calibration of the satellite over time, most pixel-based classification techniques fail to determine accurate water masks. Beyond pixel intensity, spatial correlation between neighboring pixels is another source of information that should be used to decide the label of pixels. Water bodies have strong spatial correlation in satellite images. Therefore, including contextual information as an additional constraint in the procedure of water body monitoring improves the accuracy of the derived water masks significantly. In this study, we present an automatic algorithm for water body area monitoring based on maximum a posteriori (MAP) estimation of Markov Random Fields (MRF). First, we collect all available images from selected case studies during the monitoring period. Then, for each image separately, we apply k-means clustering to derive a primary water mask. After that, we develop an MRF using the pixel values and the primary water mask for each image. Then, among the different realizations of the field, we select the one that maximizes the posterior estimate. We solve this optimization problem using graph cut techniques. A graph with two terminals is constructed, after which the best labelling structure for

  15. Classification of Potential Water Bodies Using Landsat 8 OLI and a Combination of Two Boosted Random Forest Classifiers

    PubMed Central

    Ko, Byoung Chul; Kim, Hyeong Hun; Nam, Jae Yeal

    2015-01-01

    This study proposes a new water body classification method using top-of-atmosphere (TOA) reflectance and water indices (WIs) of the Landsat 8 Operational Land Imager (OLI) sensor and corresponding random forest classifiers. In this study, multispectral images from the OLI sensor are represented as TOA reflectance and WI values because classification using these two measures performs better than classification of the raw spectral images. Two types of boosted random forest (BRF) classifiers are learned using TOA reflectance and WI values, respectively, instead of heuristic threshold or unsupervised methods. The final probability is the linear sum of the probabilities of the two different BRFs, used to classify image pixels into the water class. This study first demonstrates that the Landsat 8 OLI sensor achieves a higher classification rate because it provides an improved signal-to-noise ratio by using 12-bit radiometric quantization of the data instead of the 8-bit available from other sensors. In addition, we show that the proposed combination of two BRF classifiers yields robust water body classification results, regardless of topology, river properties, and background environment. PMID:26110405

  16. Classification of Potential Water Bodies Using Landsat 8 OLI and a Combination of Two Boosted Random Forest Classifiers.

    PubMed

    Ko, Byoung Chul; Kim, Hyeong Hun; Nam, Jae Yeal

    2015-01-01

    This study proposes a new water body classification method using top-of-atmosphere (TOA) reflectance and water indices (WIs) of the Landsat 8 Operational Land Imager (OLI) sensor and corresponding random forest classifiers. In this study, multispectral images from the OLI sensor are represented as TOA reflectance and WI values because classification using these two measures performs better than classification of the raw spectral images. Two types of boosted random forest (BRF) classifiers are learned using TOA reflectance and WI values, respectively, instead of heuristic threshold or unsupervised methods. The final probability is the linear sum of the probabilities of the two different BRFs, used to classify image pixels into the water class. This study first demonstrates that the Landsat 8 OLI sensor achieves a higher classification rate because it provides an improved signal-to-noise ratio by using 12-bit radiometric quantization of the data instead of the 8-bit available from other sensors. In addition, we show that the proposed combination of two BRF classifiers yields robust water body classification results, regardless of topology, river properties, and background environment. PMID:26110405

  17. Identification of and handling of critical irradiance forecast uncertainties using a Random Forest scheme - a case study for southern Brazil

    NASA Astrophysics Data System (ADS)

    Beyer, Hans Georg; Preed Revheim, Pal; Kratenber, Manfred Georg; Zuern, Hans Helmut

    2015-04-01

    For the secure operation of the utility grid, especially in grids with a high penetration of volatile wind or solar generation, forecasts of the respective power flows are essential. Full benefit of the forecasts cannot, however, be taken without knowledge of their uncertainties. Based on irradiance data from southern Brazil, we present a scheme for the identification of situations for which elevated forecast errors are to be expected. For this, the classification technique Random Forests is applied, using the history of the site-specific irradiance forecast errors together with a set of auxiliary meteorological variables. As a byproduct, extracted systematic forecast errors are used to update the forecasts. The predictive performance of the Random Forest models is assessed by the predictions' ability to reduce the number of hours or days with forecast errors exceeding the limits, and by the resulting overall forecast RMSE. Little to no improvement is obtained when predicting next-hour forecast errors, while significant improvements are obtained when predicting next-day forecast errors. Setting a relatively low limit for forecast error exceedance is found to give the largest improvements in terms of reduction of forecast RMSE.

  18. An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests

    PubMed Central

    Strobl, Carolin; Malley, James; Tutz, Gerhard

    2010-01-01

    Recursive partitioning methods have become popular and widely used tools for non-parametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years. High dimensional problems are common not only in genetics, but also in some areas of psychological research, where only few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve a high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions. The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application. Application of the methods is illustrated using freely available implementations in the R system for statistical computing. PMID:19968396
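
    The paper illustrates these methods with R implementations; an equivalent minimal sketch in Python with scikit-learn, fitting a forest on synthetic high-dimensional data and reading off two flavors of variable importance:

```python
# Random forest variable importance on synthetic high-dimensional data
# (many predictors, few informative ones).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# impurity-based importances (fast, but biased toward high-cardinality features)
print("top features (impurity):", np.argsort(rf.feature_importances_)[::-1][:5])

# permutation importances (slower, generally more reliable)
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("top features (permutation):", np.argsort(perm.importances_mean)[::-1][:5])
```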

  19. A Novel Compressed Sensing Method for Magnetic Resonance Imaging: Exponential Wavelet Iterative Shrinkage-Thresholding Algorithm with Random Shift.

    PubMed

    Zhang, Yudong; Yang, Jiquan; Yang, Jianfei; Liu, Aijun; Sun, Ping

    2016-01-01

    Aim. Accelerating magnetic resonance imaging (MRI) scanning can help improve hospital throughput, and patients benefit from less waiting time. Task. In the last decade, various rapid MRI techniques based on compressed sensing (CS) were proposed; however, both the computation time and the reconstruction quality of traditional CS-MRI did not meet the requirements of clinical use. Method. In this study, a novel method named exponential wavelet iterative shrinkage-thresholding algorithm with random shift (abbreviated as EWISTARS) was proposed. It is composed of three components: (i) exponential wavelet transform, (ii) iterative shrinkage-thresholding algorithm, and (iii) random shift. Results. Experimental results validated that, compared to state-of-the-art approaches, EWISTARS obtained the least mean absolute error, the least mean-squared error, and the highest peak signal-to-noise ratio. Conclusion. EWISTARS is superior to state-of-the-art approaches. PMID:27066068

  20. A Novel Compressed Sensing Method for Magnetic Resonance Imaging: Exponential Wavelet Iterative Shrinkage-Thresholding Algorithm with Random Shift

    PubMed Central

    Zhang, Yudong; Yang, Jiquan; Yang, Jianfei; Liu, Aijun; Sun, Ping

    2016-01-01

    Aim. Accelerating magnetic resonance imaging (MRI) scanning can help improve hospital throughput, and patients benefit from less waiting time. Task. In the last decade, various rapid MRI techniques based on compressed sensing (CS) were proposed; however, both the computation time and the reconstruction quality of traditional CS-MRI did not meet the requirements of clinical use. Method. In this study, a novel method named exponential wavelet iterative shrinkage-thresholding algorithm with random shift (abbreviated as EWISTARS) was proposed. It is composed of three components: (i) exponential wavelet transform, (ii) iterative shrinkage-thresholding algorithm, and (iii) random shift. Results. Experimental results validated that, compared to state-of-the-art approaches, EWISTARS obtained the least mean absolute error, the least mean-squared error, and the highest peak signal-to-noise ratio. Conclusion. EWISTARS is superior to state-of-the-art approaches. PMID:27066068
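
    Component (ii) above, iterative shrinkage-thresholding, is the workhorse of most CS reconstructions; a minimal generic ISTA sketch on a toy sparse-recovery problem (the exponential wavelet transform and random shift of EWISTARS are not reproduced here):

```python
# One ISTA step solves min_x 0.5*||Ax - y||^2 + lam*||x||_1 by a gradient
# step followed by soft thresholding.
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam=0.1, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - y) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 120))          # undersampled measurement matrix
x_true = np.zeros(120); x_true[[5, 40, 77]] = [1.0, -2.0, 1.5]   # sparse signal
y = A @ x_true + 0.01 * rng.standard_normal(60)
x_hat = ista(A, y)
print("recovered support:", np.nonzero(np.abs(x_hat) > 0.5)[0])
```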

  1. Use of a porous material description of forests in infrasonic propagation algorithms.

    PubMed

    Swearingen, Michelle E; White, Michael J; Ketcham, Stephen A; McKenna, Mihan H

    2013-10-01

    Infrasound can propagate very long distances and remain at measurable levels. As a result, infrasound sensing is used for remote monitoring in many applications. At local ranges, on the order of 10 km, the influence of the presence or absence of forests on the propagation of infrasonic signals is considered. Because the wavelengths of interest are much larger than the scale of individual components, the forest is modeled as a porous material. This approximation is developed starting with the relaxation model of porous materials. This representation is then incorporated into a parabolic equation solver based on the Crank-Nicolson method to determine the relative impacts of the physical parameters of a forest (trunk size and basal area), the presence of gaps/trees in otherwise continuous forest/open terrain, and the effects of meteorology coupled with the porous layer. Finally, the simulations are compared to experimental data from a 10.9 kg blast propagated 14.5 km. Comparison to the experimental data shows that appropriate inclusion of a forest layer along the propagation path provides a closer fit to the data than solely changing the ground type, across the frequency range from 1 to 30 Hz. PMID:24116403

  2. Detecting Sirex noctilio grey-attacked and lightning-struck pine trees using airborne hyperspectral data, random forest and support vector machines classifiers

    NASA Astrophysics Data System (ADS)

    Abdel-Rahman, Elfatih M.; Mutanga, Onisimo; Adam, Elhadi; Ismail, Riyad

    2014-02-01

    The visual progression of sirex (Sirex noctilio) infestation symptoms has been categorized into three distinct infestation phases, namely the green, red and grey stages. The grey stage is the final stage, which leads to almost complete defoliation, resulting in dead standing trees or snags. Dead standing pine trees, however, could also be due to lightning damage. Hence, the objective of the present study was to distinguish amongst healthy, sirex grey-attacked and lightning-damaged pine trees using AISA Eagle hyperspectral data and random forest (RF) and support vector machines (SVM) classifiers. Our study also presents an opportunity to examine the possibility of separating the previously mentioned pine tree damage classes and other landscape classes in the study area. The results of the present study revealed the robustness of the two machine learning classifiers, with an overall accuracy of 74.50% (total disagreement = 26%) for RF and 73.50% (total disagreement = 27%) for SVM using all the remaining AISA Eagle spectral bands after removing the noisy ones. When the most useful spectral bands as measured by RF were exploited, the overall accuracy improved considerably: 78% (total disagreement = 22%) for RF and 76.50% (total disagreement = 24%) for SVM. There was no significant difference between the performances of the two classifiers, as demonstrated by the results of McNemar's test (chi-squared χ² = 0.14 and 0.03 when all the remaining AISA Eagle wavebands, after removing the noisy ones, and the most important wavebands were used, respectively). This study concludes that AISA Eagle data classified using the RF and SVM algorithms provide relatively accurate information that is important to the forest industry for making informed decisions regarding pine plantation health protocols.
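
    The McNemar statistic quoted above compares two classifiers on the same samples using only their discordant predictions; a sketch with the continuity-corrected chi-squared form (whether the authors applied the correction is an assumption):

```python
# McNemar's test on paired classifier outcomes.
import numpy as np

def mcnemar_chi2(correct_rf, correct_svm):
    """Boolean arrays: whether each classifier labeled each sample correctly."""
    b = np.sum(correct_rf & ~correct_svm)   # RF right, SVM wrong
    c = np.sum(~correct_rf & correct_svm)   # SVM right, RF wrong
    return (abs(b - c) - 1) ** 2 / (b + c) if (b + c) else 0.0

rng = np.random.default_rng(0)
rf_ok = rng.random(200) < 0.78              # toy per-sample correctness
svm_ok = rng.random(200) < 0.765
print("McNemar chi-squared:", round(mcnemar_chi2(rf_ok, svm_ok), 2))
# values near 0 (as in the study) indicate no significant difference
```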

  3. Tropical Forest Tree Height Retrieval With Tandem-X: Algorithm Development And Accuracy Analysis

    NASA Astrophysics Data System (ADS)

    Antropov, Oleg; Rauste, Yrjo; Hame, Tuomas; de Jong, Ben

    2013-12-01

    Two semi-empirical approaches suitable for forest tree height retrieval from interferometric SAR images were developed in this study. The methods developed are mainly intended for cases where a reference elevation model is missing. Spaceborne interferometric SAR data from the TanDEM-X mission were used in the study. The study site was located in the south-eastern part of the state of Chiapas, Mexico. The TanDEM-X images were acquired during spring and summer 2012. The height estimates obtained in the study varied between 15 and 35 meters in the area of interest covered by the TanDEM-X images. The Pearson correlation coefficients between estimated tree heights and reference ground plots from the National Forest Inventory were 0.25 and 0.32 for maximum and average tree height measures, respectively. This work on tropical forest tree height retrieval from interferometric TanDEM-X images was performed within the EU FP7 project ReCover.

  4. Model parameter adaption-based multi-model algorithm for extended object tracking using a random matrix.

    PubMed

    Li, Borui; Mu, Chundi; Han, Shuli; Bai, Tianming

    2014-01-01

    Traditional object tracking technology usually regards the target as a point source object. However, this approximation is no longer appropriate for tracking extended objects such as large targets and closely spaced group objects. Bayesian extended object tracking (EOT) using a random symmetrical positive definite (SPD) matrix is a very effective method to jointly estimate the kinematic state and physical extension of the target. The key issue in the application of this random matrix-based EOT approach is to model the physical extension and measurement noise accurately. Model parameter adaptive approaches for both the extension dynamics and the measurement noise are proposed in this study, based on the properties of the SPD matrix, to improve the performance of extension estimation. An interacting multi-model algorithm based on a model parameter adaptive filter using a random matrix is also presented. Simulation results demonstrate the effectiveness of the proposed adaptive approaches and the multi-model algorithm. The estimation performance for physical extension is better than that of the other algorithms, especially when the target maneuvers, and the kinematic state estimation error is lower as well. PMID:24763252

  5. GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran.

    PubMed

    Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Dixon, Barnali

    2016-01-01

    Groundwater is considered one of the most valuable fresh water resources. The main objective of this study was to produce groundwater spring potential maps in the Koohrang Watershed, Chaharmahal-e-Bakhtiari Province, Iran, using three machine learning models: boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). Thirteen hydrological-geological-physiographical (HGP) factors that influence locations of springs were considered in this research. These factors include slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density. Subsequently, groundwater spring potential was modeled and mapped using the CART, RF, and BRT algorithms. The predicted results from the three models were validated using the receiver operating characteristic (ROC) curve. Of the 864 springs identified, 605 (≈70%) locations were used for the spring potential mapping, while the remaining 259 (≈30%) springs were used for model validation. The area under the curve (AUC) for the BRT model was calculated as 0.8103, while for CART and RF the AUC values were 0.7870 and 0.7119, respectively. Therefore, it was concluded that the BRT model produced the best predictions of spring locations, followed by the CART and RF models. Geospatially integrated BRT, CART, and RF methods proved to be useful in generating the spring potential map (SPM) with reasonable accuracy. PMID:26687087
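
    A minimal sketch of the model-comparison step, assuming scikit-learn's GradientBoostingClassifier as a stand-in for BRT; the synthetic table of 13 HGP factors and the 70/30 split mirror the setup described above, but all data here are hypothetical:

        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import roc_auc_score

        # Hypothetical table: rows = locations, columns = the 13 HGP factors;
        # y = 1 at spring locations, 0 at non-spring locations (toy relationship).
        rng = np.random.default_rng(1)
        X = rng.normal(size=(1728, 13))
        y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=1728) > 0).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

        models = {
            "BRT": GradientBoostingClassifier(random_state=1),   # boosted regression trees
            "CART": DecisionTreeClassifier(random_state=1),
            "RF": RandomForestClassifier(n_estimators=500, random_state=1),
        }
        for name, m in models.items():
            m.fit(X_tr, y_tr)
            auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
            print(f"{name}: AUC = {auc:.4f}")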

  6. Addition of random run FM noise to the KPW time scale algorithm

    NASA Technical Reports Server (NTRS)

    Greenhall, C. A.

    2002-01-01

    The KPW (Kalman plus weights) time scale algorithm uses a Kalman filter to provide frequency and drift information to a basic time scale equation. This paper extends the algorithm to three-state clocks and gives results for a simulated eight-clock ensemble.

  7. An algorithm to detect chimeric clones and random noise in genomic mapping

    SciTech Connect

    Grigoriev, A.; Mott, R.; Lehrach, H.

    1994-07-15

    Experimental noise and chimeric clone inserts can pose serious problems in reconstructing genomic maps from hybridization data. The authors describe an algorithm that easily identifies false positive signals and clones containing chimeric inserts/internal deletions. The algorithm "dechimerizes" clones, splitting them into independent contiguous components and cleaning the initial library into a more consistent data set for further ordering. The effectiveness of the algorithm is demonstrated on both simulated data and the real YAC map of the whole genome of the fission yeast Schizosaccharomyces pombe. 8 refs., 3 figs., 1 tab.

  8. A randomized controlled trial of a diagnostic algorithm for symptoms of uncomplicated cystitis at an out-of-hours service

    PubMed Central

    Grude, Nils; Lindbaek, Morten

    2015-01-01

    Objective. To compare the clinical outcome of patients presenting with symptoms of uncomplicated cystitis who were seen by a doctor with that of patients who were given treatment following a diagnostic algorithm. Design. Randomized controlled trial. Setting. Out-of-hours service, Oslo, Norway. Intervention. Women with typical symptoms of uncomplicated cystitis were included in the trial in the period September 2010–November 2011. They were randomized into two groups. One group received standard treatment according to the diagnostic algorithm; the other group received treatment after a regular consultation by a doctor. Subjects. Women (n = 441) aged 16–55 years. Mean age in both groups was 27 years. Main outcome measures. Number of days until symptomatic resolution. Results. No significant differences were found between the groups in the basic patient demographics, severity of symptoms, or percentage of urine samples with single-culture growth. A median of three days until symptomatic resolution was found in both groups. By day four, 79% in the algorithm group and 72% in the regular consultation group were free of symptoms (p = 0.09). The number of patients who contacted a doctor again in the follow-up period and received alternative antibiotic treatment was non-significantly higher (p = 0.08) after regular consultation than after treatment according to the diagnostic algorithm. There were no cases of severe pyelonephritis or hospital admissions during the follow-up period. Conclusion. Using a diagnostic algorithm is a safe and efficient method for treating women with symptoms of uncomplicated cystitis at an out-of-hours service. This simplification of the treatment strategy can lead to more rational use of consultation time and stricter adherence to the National Antibiotic Guidelines for a common disorder. PMID:25961367

  9. Computed tomography synthesis from magnetic resonance images in the pelvis using multiple random forests and auto-context features

    NASA Astrophysics Data System (ADS)

    Andreasen, Daniel; Edmund, Jens M.; Zografos, Vasileios; Menze, Bjoern H.; Van Leemput, Koen

    2016-03-01

    In radiotherapy treatment planning based only on magnetic resonance imaging (MRI), the electron density information usually obtained from computed tomography (CT) must be derived from the MRI by synthesizing a so-called pseudo CT (pCT). This is a non-trivial task, since MRI intensities are neither uniquely nor quantitatively related to electron density. Typical approaches involve either a classification or regression model requiring specialized MRI sequences to resolve intensity ambiguities, or an atlas-based model necessitating multiple registrations between atlases and subject scans. In this work, we explore a machine learning approach for creating a pCT of the pelvic region from conventional MRI sequences without using atlases. We use a random forest provided with information about local texture, edges and spatial features derived from the MRI, which helps to resolve intensity ambiguities. Furthermore, we use the concept of auto-context by sequentially training a number of classification forests to create and improve context features, which are finally used to train a regression forest for pCT prediction. We evaluate the pCT quality in terms of the voxel-wise error and the radiologic accuracy as measured by water-equivalent path lengths. We compare the performance of our method against two baseline pCT strategies, which either set all MRI voxels in the subject equal to the CT value of water, or in addition transfer the bone volume from the real CT. We show improved performance compared to both baseline pCTs, suggesting that our method may be useful for MRI-only radiotherapy.
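
    The auto-context idea can be sketched as a loop that feeds each classification forest's class-probability outputs back in as extra features, with a regression forest trained last. This is a schematic sketch with hypothetical function and variable names, not the authors' implementation:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

        def auto_context_train(X_feat, y_class, y_ct, n_rounds=3):
            # Each round appends the previous classifier's class probabilities
            # as extra (context) features before training the next forest.
            context = np.empty((X_feat.shape[0], 0))
            classifiers = []
            for _ in range(n_rounds):
                clf = RandomForestClassifier(n_estimators=100, random_state=0)
                clf.fit(np.hstack([X_feat, context]), y_class)
                classifiers.append(clf)
                context = clf.predict_proba(np.hstack([X_feat, context]))
            # Final regression forest predicts pseudo-CT intensities from the
            # appearance features plus the learned context features.
            reg = RandomForestRegressor(n_estimators=100, random_state=0)
            reg.fit(np.hstack([X_feat, context]), y_ct)
            return classifiers, reg

        # toy demo with hypothetical per-voxel features and labels
        X_feat = np.random.default_rng(0).normal(size=(200, 8))
        y_class = np.random.default_rng(1).integers(0, 3, size=200)  # e.g. air/tissue/bone
        y_ct = np.random.default_rng(2).normal(size=200)             # toy CT numbers
        clfs, reg = auto_context_train(X_feat, y_class, y_ct)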

  10. Deep neural network and random forest hybrid architecture for learning to detect retinal vessels in fundus images.

    PubMed

    Maji, Debapriya; Santara, Anirban; Ghosh, Sambuddha; Sheet, Debdoot; Mitra, Pabitra

    2015-08-01

    Vision impairment due to pathological damage of the retina can largely be prevented through periodic screening using fundus color imaging. However, the challenge with large-scale screening is the inability to exhaustively detect fine blood vessels crucial to disease diagnosis. In this work we present a computational imaging framework using a deep and ensemble learning based hybrid architecture for reliable detection of blood vessels in fundus color images. A deep neural network (DNN) is used for unsupervised learning of vesselness dictionaries using sparse trained denoising auto-encoders (DAE), followed by supervised learning of the DNN response using a random forest for detecting vessels in color fundus images. In experimental evaluation with the DRIVE database, we achieve the objective of vessel detection with a maximum average accuracy of 0.9327 and area under the ROC curve of 0.9195. PMID:26736930

  11. Convergent Random Forest predictor: methodology for predicting drug response from genome-scale data applied to anti-TNF response.

    PubMed

    Bienkowska, Jadwiga R; Dalgin, Gul S; Batliwalla, Franak; Allaire, Normand; Roubenoff, Ronenn; Gregersen, Peter K; Carulli, John P

    2009-12-01

    Biomarker development for prediction of patient response to therapy is one of the goals of molecular profiling of human tissues. Due to the large number of transcripts, the relatively limited number of samples, and the high variability of data, identification of predictive biomarkers is a challenge for data analysis. Furthermore, many genes may be responsible for drug response differences, but often only a few are sufficient for accurate prediction. Here we present an analysis approach, the Convergent Random Forest (CRF) method, for the identification of highly predictive biomarkers. The aim is to select from genome-wide expression data a small number of non-redundant biomarkers that could be developed into a simple and robust diagnostic tool. Our method combines the Random Forest classifier and gene expression clustering to rank and select a small number of predictive genes. We evaluated the CRF approach by analyzing four different data sets. The first set contains transcript profiles of whole blood from rheumatoid arthritis patients, collected before anti-TNF treatment, and their subsequent response to the therapy. In this set, CRF identified 8 transcripts predicting response to therapy with 89% accuracy. We also applied the CRF to the analysis of three previously published expression data sets. For all sets, we compared the CRF and recursive support vector machine (RSVM) approaches to feature selection and classification. In all cases the CRF selects a much smaller number of features, five to eight genes, while achieving similar or better performance on both training and independent testing data sets. For both methods, performance estimates obtained using cross-validation are similar to performance on independent samples. The method has been implemented in R and is available from the authors upon request: Jadwiga.Bienkowska@biogenidec.com. PMID:19699293
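
    In the spirit of CRF, a rank-then-cluster selection can be sketched as follows: rank transcripts by RF importance, cluster the top-ranked ones by correlation, and keep one representative per cluster. The thresholds, helper name and toy data are illustrative assumptions, not the published method's parameters:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from scipy.cluster.hierarchy import linkage, fcluster

        def select_compact_panel(X, y, n_top=100, corr_cut=0.7):
            # Rank features with an RF, then cluster the top-ranked features
            # by |correlation| and keep the most important member per cluster.
            rf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)
            top = np.argsort(rf.feature_importances_)[::-1][:n_top]
            dist = 1 - np.abs(np.corrcoef(X[:, top].T))
            clusters = fcluster(linkage(dist[np.triu_indices(n_top, 1)], "average"),
                                t=1 - corr_cut, criterion="distance")
            panel = []
            for c in np.unique(clusters):
                members = top[clusters == c]
                panel.append(members[np.argmax(rf.feature_importances_[members])])
            return np.array(panel)

        # toy demo: hypothetical expression matrix and responder labels
        rng = np.random.default_rng(1)
        X = rng.normal(size=(60, 2000))
        y = rng.integers(0, 2, size=60)
        print(select_compact_panel(X, y))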

  12. A Comparison of Hourly Typhoon Rainfall Forecasting Models Based on Support Vector Machines and Random Forests with Different Predictor Sets

    NASA Astrophysics Data System (ADS)

    Lin, Kun-Hsiang; Tseng, Hung-Wei; Kuo, Chen-Min; Yang, Tao-Chang; Yu, Pao-Shan

    2016-04-01

    Typhoons with heavy rainfall and strong wind often cause severe floods and losses in Taiwan, which motivates the development of rainfall forecasting models as part of an early warning system. Thus, this study aims to develop rainfall forecasting models based on two machine learning methods, support vector machines (SVMs) and random forests (RFs), and to investigate the performance of the models with different predictor sets, searching for the optimal predictor set for forecasting. Four predictor sets were used to construct models for 1- to 6-hour-ahead rainfall forecasting: (1) antecedent rainfalls; (2) antecedent rainfalls and typhoon characteristics; (3) antecedent rainfalls and meteorological factors; and (4) antecedent rainfalls, typhoon characteristics and meteorological factors. An application to three rainfall stations in the Yilan River basin, northeastern Taiwan, was conducted. Firstly, the performance of the SVMs-based forecasting model with predictor set #1 was analyzed. The results show that the accuracy of the models for 2- to 6-hour-ahead forecasting decreases rapidly as compared to the accuracy of the model for 1-hour-ahead forecasting, which is acceptable. To improve model performance, each predictor set was further examined in the SVMs-based forecasting model. The results reveal that the SVMs-based model using predictor set #4 as input variables performs better than the other sets, and a significant improvement in model performance is found especially for long lead-time forecasting. Lastly, the performance of the SVMs-based model using predictor set #4 was compared with that of the RFs-based model using the same predictor set. It is found that the RFs-based model is superior to the SVMs-based model in hourly typhoon rainfall forecasting. Keywords: hourly typhoon rainfall forecasting, predictor selection, support vector machines, random forests
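
    A minimal sketch of lead-time forecasting with lagged predictors, comparing an SVM regressor against a random forest on synthetic data; the lag depth, lead time and extra "typhoon/meteorological" columns are hypothetical stand-ins for predictor set #4:

        import numpy as np
        from sklearn.svm import SVR
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.metrics import mean_squared_error

        def make_lagged(rain, extra, n_lags=6, lead=1):
            # Antecedent rainfalls plus other factors, for a given forecast lead time.
            X, y = [], []
            for t in range(n_lags, len(rain) - lead):
                X.append(np.concatenate([rain[t - n_lags:t], extra[t]]))
                y.append(rain[t + lead - 1])
            return np.array(X), np.array(y)

        rng = np.random.default_rng(2)
        rain = np.abs(rng.normal(size=2000))   # hypothetical hourly rainfall series
        extra = rng.normal(size=(2000, 5))     # hypothetical typhoon/meteorological factors
        X, y = make_lagged(rain, extra, lead=3)  # 3-hour-ahead forecast
        split = int(0.8 * len(X))
        for model in (SVR(C=10.0), RandomForestRegressor(n_estimators=300, random_state=2)):
            model.fit(X[:split], y[:split])
            rmse = mean_squared_error(y[split:], model.predict(X[split:])) ** 0.5
            print(type(model).__name__, f"RMSE = {rmse:.3f}")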

  13. A Proposed Extension to the Soil Moisture and Ocean Salinity Level 2 Algorithm for Mixed Forest and Moderate Vegetation Pixels

    NASA Technical Reports Server (NTRS)

    Panciera, Rocco; Walker, Jeffrey P.; Kalma, Jetse; Kim, Edward

    2011-01-01

    The Soil Moisture and Ocean Salinity (SMOS)mission, launched in November 2009, provides global maps of soil moisture and ocean salinity by measuring the L-band (1.4 GHz) emission of the Earth's surface with a spatial resolution of 40-50 km.Uncertainty in the retrieval of soilmoisture over large heterogeneous areas such as SMOS pixels is expected, due to the non-linearity of the relationship between soil moisture and the microwave emission. The current baseline soilmoisture retrieval algorithm adopted by SMOS and implemented in the SMOS Level 2 (SMOS L2) processor partially accounts for the sub-pixel heterogeneity of the land surface, by modelling the individual contributions of different pixel fractions to the overall pixel emission. This retrieval approach is tested in this study using airborne L-band data over an area the size of a SMOS pixel characterised by a mix Eucalypt forest and moderate vegetation types (grassland and crops),with the objective of assessing its ability to correct for the soil moisture retrieval error induced by the land surface heterogeneity. A preliminary analysis using a traditional uniform pixel retrieval approach shows that the sub-pixel heterogeneity of land cover type causes significant errors in soil moisture retrieval (7.7%v/v RMSE, 2%v/v bias) in pixels characterised by a significant amount of forest (40-60%). Although the retrieval approach adopted by SMOS partially reduces this error, it is affected by errors beyond the SMOS target accuracy, presenting in particular a strong dry bias when a fraction of the pixel is occupied by forest (4.1%v/v RMSE,-3.1%v/v bias). An extension to the SMOS approach is proposed that accounts for the heterogeneity of vegetation optical depth within the SMOS pixel. The proposed approach is shown to significantly reduce the error in retrieved soil moisture (2.8%v/v RMSE, -0.3%v/v bias) in pixels characterised by a critical amount of forest (40-60%), at the limited cost of only a crude estimate of the

  14. Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests

    PubMed Central

    2015-01-01

    Background Selection and identification of single-nucleotide polymorphisms (SNPs) are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data are very high dimensional and a large portion of the SNPs in the data are irrelevant to the disease. Advanced machine learning methods have been used successfully in genome-wide association studies (GWAS) for identification of genetic variants that have relatively big effects in some common, complex diseases. Among them, the most successful one is Random Forests (RF). Despite performing well in terms of prediction accuracy on some data sets of moderate size, RF still struggles in GWAS with selecting informative SNPs and building accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies a p-value assessment to find a cut-off point that separates the SNPs into informative and irrelevant groups. The informative SNP group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building trees for the forest, only SNPs from the two sub-groups are taken into account. The feature subspaces therefore always contain highly informative SNPs when used to split a node of a tree. Results This approach enables one to generate more accurate trees with a lower prediction error, while possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduced prediction
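
    The two-stage sampling idea can be sketched as a subspace sampler that never draws irrelevant SNPs; the split into sub-groups and the half-and-half draw below are assumptions for illustration, not the published ts-RF parameters:

        import numpy as np

        def two_stage_subspace(p_values, mtry, alpha=0.05, high_frac=0.5,
                               rng=np.random.default_rng(3)):
            # Stage 1: discard irrelevant SNPs by p-value cut-off; split the rest
            # into highly and weakly informative groups. Stage 2: draw each tree's
            # candidate subspace only from those two groups.
            informative = np.where(p_values < alpha)[0]
            order = informative[np.argsort(p_values[informative])]
            n_high = max(1, int(high_frac * len(order)))
            high, weak = order[:n_high], order[n_high:]
            n_from_high = max(1, mtry // 2)  # assumption: sample half from each group
            return np.concatenate([
                rng.choice(high, size=min(n_from_high, len(high)), replace=False),
                rng.choice(weak, size=min(mtry - n_from_high, len(weak)), replace=False),
            ])

        # toy demo: p-values from per-SNP association tests, subspace of size 30
        pvals = np.random.default_rng(3).uniform(size=10000)
        print(two_stage_subspace(pvals, mtry=30))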

  15. Identification by random forest method of HLA class I amino acid substitutions associated with lower survival at day 100 in unrelated donor hematopoietic cell transplantation

    PubMed Central

    Marino, Susana R.; Lin, Shang; Maiers, Martin; Haagenson, Michael; Spellman, Stephen; Klein, John P.; Binkowski, T. Andrew; Lee, Stephanie J.; van Besien, Koen

    2011-01-01

    The identification of important amino acid substitutions associated with low survival in hematopoietic cell transplantation (HCT) is hampered by the large number of observed substitutions compared to the small number of patients available for analysis. Random forest analysis is designed to address these limitations. We studied 2,107 HCT recipients with good or intermediate risk hematologic malignancies to identify HLA class I amino acid substitutions associated with reduced survival at day 100 post-transplant. Random forest analysis and traditional univariate and multivariate analyses were used. Random forest analysis identified amino acid substitutions in 33 positions that were associated with reduced 100 day survival, including HLA-A 9, 43, 62, 63, 76, 77, 95, 97, 114, 116, 152, 156, 166, and 167; HLA-B 97, 109, 116, and 156; and HLA-C 6, 9, 11, 14, 21, 66, 77, 80, 95, 97, 99, 116, 156, 163, and 173. Thirteen had been previously reported by other investigators using classical biostatistical approaches. Using the same dataset, traditional multivariate logistic regression identified only 5 amino acid substitutions associated with lower day 100 survival. Random forest analysis is a novel statistical methodology for analysis of HLA-mismatching and outcome studies, capable of identifying important amino acid substitutions missed by other methods. PMID:21441965

  16. A production-inventory model with permissible delay incorporating learning effect in random planning horizon using genetic algorithm

    NASA Astrophysics Data System (ADS)

    Kar, Mohuya B.; Bera, Shankar; Das, Debasis; Kar, Samarjit

    2015-10-01

    This paper presents a production-inventory model for deteriorating items with stock-dependent demand under inflation in a random planning horizon. The supplier offers the retailer a fully permissible delay in payment. It is assumed that the time horizon of the business period is random in nature and follows an exponential distribution with a known mean. A learning effect is also introduced for the production cost and setup cost. The model is formulated as a profit maximization problem with respect to the retailer and solved with the help of a genetic algorithm (GA) and particle swarm optimization (PSO). Moreover, the convergence of the two methods, GA and PSO, is studied against generation numbers, and it is seen that GA converges more rapidly than PSO. The optimum results from the two methods are compared both numerically and graphically. It is observed that the performance of GA is marginally better than that of PSO. We provide some numerical examples and sensitivity analyses to illustrate the model.
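
    A sketch of a simple real-coded GA of the kind that could be used for such a profit-maximization problem; the toy objective, operators and parameter values are illustrative assumptions, not the authors' formulation:

        import numpy as np

        def genetic_algorithm(profit, bounds, pop_size=50, n_gen=200,
                              mut_sigma=0.1, rng=np.random.default_rng(8)):
            # Minimal real-coded GA: tournament selection, blend crossover,
            # Gaussian mutation, one-elite survival; maximizes a black-box profit.
            lo, hi = np.array(bounds, dtype=float).T
            pop = rng.uniform(lo, hi, size=(pop_size, len(bounds)))
            for _ in range(n_gen):
                fit = np.array([profit(ind) for ind in pop])
                new = [pop[fit.argmax()]]                      # elitism
                while len(new) < pop_size:
                    i, j = rng.integers(pop_size, size=2)
                    a = pop[i] if fit[i] > fit[j] else pop[j]  # tournament pick 1
                    i, j = rng.integers(pop_size, size=2)
                    b = pop[i] if fit[i] > fit[j] else pop[j]  # tournament pick 2
                    w = rng.uniform(size=len(bounds))
                    child = w * a + (1 - w) * b                # blend crossover
                    child += rng.normal(scale=mut_sigma * (hi - lo))
                    new.append(np.clip(child, lo, hi))
                pop = np.array(new)
            fit = np.array([profit(ind) for ind in pop])
            return pop[fit.argmax()], fit.max()

        # toy stand-in for the retailer's expected profit over two decision variables
        best, val = genetic_algorithm(lambda x: -(x[0] - 3) ** 2 - (x[1] - 1) ** 2,
                                      bounds=[(0, 10), (0, 5)])
        print(best, val)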

  17. A multiresolution wavelet analysis and Gaussian Markov random field algorithm for breast cancer screening of digital mammography

    SciTech Connect

    Lee, C.G.; Chen, C.H.

    1996-12-31

    In this paper, a novel multiresolution wavelet analysis (MWA) and non-stationary Gaussian Markov random field (GMRF) technique is introduced for the identification of microcalcifications with high accuracy. The hierarchical multiresolution wavelet information, in conjunction with the contextual information of the images extracted from the GMRF, provides a highly efficient technique for microcalcification detection. A Bayesian learning paradigm realized via the expectation maximization (EM) algorithm was also introduced for edge detection or segmentation of larger lesions recorded on the mammograms. The effectiveness of the approach has been extensively tested with a number of mammographic images provided by a local hospital.

  18. Mass weighted urn design - a new randomization algorithm for unequal allocations

    PubMed Central

    Zhao, Wenle

    2015-01-01

    Unequal allocations have been used in clinical trials motivated by ethical, efficiency, or feasibility concerns. The commonly used permuted block randomization faces a tradeoff between effective imbalance control with a small block size and an accurate allocation target with a large block size. Few other unequal-allocation randomization designs have been proposed in the literature, with applications in real trials hardly ever reported, partly due to their complexity in implementation compared to permuted block randomization. Proposed in this paper is the mass weighted urn design, in which the number of balls in the urn equals the number of treatments and remains unchanged during the study. The chance of a ball being randomly selected is proportional to the mass of the ball. After each treatment assignment, part of the mass of the selected ball is redistributed to all balls based on the target allocation ratio. This design allows any desired optimal unequal allocation to be accurately targeted without approximation, and provides consistent imbalance control throughout the allocation sequence. The statistical properties of this new design are evaluated with the Euclidean distance between the observed treatment distribution and the desired treatment distribution as the treatment imbalance measure, and the Euclidean distance between the conditional allocation probability and the target allocation probability as the allocation predictability measure. Computer simulation results are presented comparing the mass weighted urn design with other randomization designs currently available for unequal allocations. PMID:26091947
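
    The mechanics described above translate almost directly into code. The sketch below follows the verbal description (selection proportional to mass, one unit of mass removed from the selected ball and redistributed toward the target ratio, negative masses clipped to zero for selection); the total-mass parameter alpha is an assumption, not the paper's recommended value:

        import numpy as np

        def mass_weighted_urn(target, n_subjects, alpha=2.0,
                              rng=np.random.default_rng(4)):
            # One ball per treatment; total mass stays equal to alpha throughout.
            target = np.asarray(target, dtype=float)
            target = target / target.sum()
            mass = alpha * target            # initial masses
            assignments = []
            for _ in range(n_subjects):
                p = np.clip(mass, 0, None)   # selection probability ~ positive mass
                k = rng.choice(len(target), p=p / p.sum())
                assignments.append(k)
                # remove one unit of mass from the chosen ball and give it back
                # to all balls in proportion to the target allocation ratio
                mass[k] -= 1.0
                mass += target
            return assignments

        # e.g. a 2:1 allocation of 12 subjects
        print(mass_weighted_urn([2, 1], 12))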

  19. Intra-and-Inter Species Biomass Prediction in a Plantation Forest: Testing the Utility of High Spatial Resolution Spaceborne Multispectral RapidEye Sensor and Advanced Machine Learning Algorithms

    PubMed Central

    Dube, Timothy; Mutanga, Onisimo; Adam, Elhadi; Ismail, Riyad

    2014-01-01

    The quantification of aboveground biomass using remote sensing is critical for better understanding the role of forests in carbon sequestration and for informed sustainable management. Although remote sensing techniques have proven useful in assessing forest biomass in general, more work is required to investigate their capabilities in predicting intra-and-inter species biomass, which is mainly characterised by non-linear relationships. In this study, we tested two machine learning algorithms, Stochastic Gradient Boosting (SGB) and Random Forest (RF) regression trees, to predict intra-and-inter species biomass using high resolution RapidEye reflectance bands as well as the derived vegetation indices in a commercial plantation. The results showed that the SGB algorithm yielded the best performance for intra-and-inter species biomass prediction, both using all the predictor variables and using the most important selected variables. For example, using the most important variables the algorithm produced an R2 of 0.80 and RMSE of 16.93 t·ha−1 for E. grandis; an R2 of 0.79 and RMSE of 17.27 t·ha−1 for P. taeda; and an R2 of 0.61 and RMSE of 43.39 t·ha−1 for the combined species data sets. Comparatively, RF yielded plausible results only for E. dunnii (R2 of 0.79; RMSE of 7.18 t·ha−1). We demonstrated that although the two statistical methods were able to predict biomass accurately, RF produced weaker results than SGB when applied to the combined species dataset. The result underscores the relevance of stochastic models in predicting biomass drawn from different species and genera using the new generation high resolution RapidEye sensor with strategically positioned bands. PMID:25140631

  20. A Prüfer-Sequence Based Algorithm for Calculating the Size of Ideal Randomly Branched Polymers.

    PubMed

    Singaram, Surendra W; Gopal, Ajaykumar; Ben-Shaul, Avinoam

    2016-07-01

    Branched polymers can be represented as tree graphs. A one-to-one correspondence exists between a tree graph comprised of N labeled vertices and a sequence of N - 2 integers, known as the Prüfer sequence. Permutations of this sequence yield sequences corresponding to tree graphs with the same vertex-degree distribution but (generally) different branching patterns. By repeatedly shuffling the Prüfer sequence we have generated large ensembles of random tree graphs, all with the same degree distributions. We also present and apply an efficient algorithm to determine graph distances directly from their Prüfer sequences. From the (Prüfer-sequence-derived) graph distances, 3D size metrics, e.g., the polymer's radius of gyration, Rg, and average end-to-end distance, were then calculated using several different theoretical approaches. Applying our method to ideal randomly branched polymers of different vertex-degree distributions, all their 3D size measures are found to obey the usual N^(1/4) scaling law. Among the branched polymers analyzed are RNA molecules comprised of equal proportions of the four randomly distributed nucleotides. Prior to Prüfer shuffling of their representative tree graphs, these "random-sequence" RNAs exhibit an Rg ∼ N^(1/3) scaling. PMID:27104292
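
    A minimal sketch of the sequence-to-tree machinery: decode a Prüfer sequence into a tree, shuffle the sequence to get a new branching pattern with the same degree distribution, and compute graph distances by BFS on the decoded tree (a straightforward stand-in for the paper's direct-from-sequence distance algorithm). Treating Rg^2 as proportional to the mean pairwise graph distance is the standard ideal-polymer assumption, not necessarily the authors' exact estimator:

        import heapq
        import random
        from collections import defaultdict

        def prufer_to_tree(seq):
            # Decode a Prüfer sequence (N-2 integers in 0..N-1) into the
            # adjacency list of the corresponding labeled tree.
            n = len(seq) + 2
            degree = [1] * n
            for v in seq:
                degree[v] += 1
            adj = defaultdict(list)
            leaves = [v for v in range(n) if degree[v] == 1]
            heapq.heapify(leaves)
            for v in seq:
                leaf = heapq.heappop(leaves)
                adj[leaf].append(v); adj[v].append(leaf)
                degree[v] -= 1
                if degree[v] == 1:
                    heapq.heappush(leaves, v)
            u, w = heapq.heappop(leaves), heapq.heappop(leaves)
            adj[u].append(w); adj[w].append(u)
            return adj

        def mean_pair_distance(adj, n):
            # Average graph distance over all vertex pairs (BFS from every vertex);
            # for ideal branched polymers, Rg^2 is proportional to this quantity.
            total = 0
            for s in range(n):
                dist, frontier = {s: 0}, [s]
                while frontier:
                    nxt = []
                    for u in frontier:
                        for v in adj[u]:
                            if v not in dist:
                                dist[v] = dist[u] + 1
                                nxt.append(v)
                    frontier = nxt
                total += sum(dist.values())
            return total / (n * n)

        n = 100
        seq = [random.randrange(n) for _ in range(n - 2)]   # a random tree
        before = mean_pair_distance(prufer_to_tree(seq), n)
        random.shuffle(seq)                                 # same degrees, new branching
        print(before, mean_pair_distance(prufer_to_tree(seq), n))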

  1. Fast Numerical Algorithms for 3-D Scattering from PEC and Dielectric Random Rough Surfaces in Microwave Remote Sensing

    NASA Astrophysics Data System (ADS)

    Zhang, Lisha

    We present fast and robust numerical algorithms for 3-D scattering from perfectly electrically conducting (PEC) and dielectric random rough surfaces in microwave remote sensing. The Coifman wavelets, or Coiflets, are employed to implement Galerkin's procedure in the method of moments (MoM). Due to the high-precision one-point quadrature, the Coiflets yield fast evaluations of most off-diagonal entries, reducing the matrix fill effort from O(N^2) to O(N). The orthogonality and Riesz basis of the Coiflets generate a well-conditioned impedance matrix, with rapid convergence for the conjugate gradient solver. The resulting impedance matrix is further sparsified by the matrix-formed standard fast wavelet transform (SFWT). By properly selecting multiresolution levels of the total transformation matrix, the solution precision can be enhanced while matrix sparsity and memory consumption are not noticeably sacrificed. The unified fast scattering algorithm for dielectric random rough surfaces asymptotically reduces to the PEC case when the loss tangent grows extremely large. Numerical results demonstrate that the reduced PEC model does not suffer from ill-posed problems. Compared with previous publications and laboratory measurements, good agreement is observed.

  2. Predictive modeling of groundwater nitrate pollution using Random Forest and multisource variables related to intrinsic and specific vulnerability: a case study in an agricultural setting (Southern Spain).

    PubMed

    Rodriguez-Galiano, Victor; Mendes, Maria Paula; Garcia-Soldado, Maria Jose; Chica-Olmo, Mario; Ribeiro, Luis

    2014-04-01

    Watershed management decisions need robust methods which allow accurate predictive modeling of pollutant occurrences. Random Forest (RF) is a powerful machine learning data-driven method that is rarely used in water resources studies, and thus has not been evaluated thoroughly in this field when compared to more conventional pattern recognition techniques. Key advantages of RF include its non-parametric nature, high predictive accuracy, and capability to determine variable importance. This last characteristic can be used to better understand the individual role and the combined effect of explanatory variables in both protecting and exposing groundwater from and to a pollutant. In this paper, the performance of RF regression for predictive modeling of nitrate pollution is explored, based on intrinsic and specific vulnerability assessment of the Vega de Granada aquifer. The applicability of this new machine learning technique is demonstrated in an agriculture-dominated area where nitrate concentrations in groundwater can exceed the trigger value of 50 mg/L at many locations. A comprehensive GIS database of twenty-four parameters related to intrinsic hydrogeologic properties, driving forces, remotely sensed variables and physical-chemical variables measured in situ was used as input to build different predictive models of nitrate pollution. RF measures of importance were also used to define the most significant predictors of nitrate pollution in groundwater, allowing the establishment of the pollution sources (pressures). The potential of RF for generating a vulnerability map to nitrate pollution is assessed considering multiple criteria related to variations in the algorithm parameters and the accuracy of the maps. The performance of RF is also evaluated in comparison to the logistic regression (LR) method using different efficiency measures to ensure their generalization ability. Prediction results show the ability of RF to build accurate models

  3. Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature

    PubMed Central

    Wu, Jiansheng; Liu, Hongde; Duan, Xueye; Ding, Yan; Wu, Hongtao; Bai, Yunfei; Sun, Xiao

    2009-01-01

    Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information which reflects the characteristics of the 20 amino acids with respect to two physical–chemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class. Results: The results show that the RF model achieves 91.41% overall accuracy with a Matthews correlation coefficient of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the computationally optimal approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional (3D) structural information. We have demonstrated that the prediction results are useful for understanding protein–DNA interactions. Availability: DBindR web server implementation is freely available at http://www.cbi.seu.edu.cn/DBindR/DBindR.htm. Contact: xsun@seu.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19008251
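
    The downsizing scheme for the unbalanced classes can be sketched as follows, with hypothetical feature arrays standing in for the hybrid evolutionary/SS/OBV features; only the training fold is rebalanced:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import matthews_corrcoef, roc_auc_score

        def downsample_majority(X, y, rng=np.random.default_rng(6)):
            # Randomly downsize the majority (non-binding) class to the size of
            # the minority (binding) class before training.
            pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
            keep = rng.choice(neg, size=len(pos), replace=False)
            idx = np.concatenate([pos, keep])
            return X[idx], y[idx]

        rng = np.random.default_rng(6)
        X = rng.normal(size=(5000, 40))                 # hypothetical per-residue features
        y = (rng.uniform(size=5000) < 0.1).astype(int)  # ~10% binding residues (toy)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  random_state=6, stratify=y)
        Xb, yb = downsample_majority(X_tr, y_tr)
        rf = RandomForestClassifier(n_estimators=500, random_state=6).fit(Xb, yb)
        print("MCC:", matthews_corrcoef(y_te, rf.predict(X_te)),
              "AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))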

  4. The adaptive dynamic community detection algorithm based on the non-homogeneous random walking

    NASA Astrophysics Data System (ADS)

    Xin, Yu; Xie, Zhi-Qiang; Yang, Jing

    2016-05-01

    As habits and customs change, people's social activity tends to be changeable, so a method for analyzing community evolution is required to mine the dynamic information in social networks. To this end, we design a random walking possibility function and a topology gain function to calculate the global influence matrix of the nodes. From the analysis of the global influence matrix, the clustering directions of the nodes can be obtained, and thus the NRW (Non-Homogeneous Random Walk) method for detecting static overlapping communities can be established. We then design the ANRW (Adaptive Non-Homogeneous Random Walk) method, which adapts the nodes impacted by dynamic events based on the NRW. The ANRW combines local community detection with dynamic adaptive adjustment to decrease its computational cost. Furthermore, the ANRW treats the node as the computational unit, so its running manner is well suited to parallel computing, which can meet the requirements of large-scale data mining. Finally, experimental analysis verifies the efficiency of the ANRW for dynamic community detection.

  5. Improved scaling of time-evolving block-decimation algorithm through reduced-rank randomized singular value decomposition

    NASA Astrophysics Data System (ADS)

    Tamascelli, D.; Rosenbach, R.; Plenio, M. B.

    2015-06-01

    When the amount of entanglement in a quantum system is limited, the relevant dynamics of the system is restricted to a very small part of the state space. When restricted to this subspace, the description of the system becomes efficient in the system size. A class of algorithms, exemplified by the time-evolving block-decimation (TEBD) algorithm, make use of this observation by selecting the relevant subspace through a decimation technique relying on the singular value decomposition (SVD). In these algorithms, the complexity of each time-evolution step is dominated by the SVD. Here we show that, by applying a randomized version of the SVD routine (RRSVD), the power law governing the computational complexity of TEBD is lowered by one degree, resulting in a considerable speed-up. We exemplify the potential gains in efficiency using some real-world examples to which TEBD can be successfully applied, and demonstrate that for those systems RRSVD delivers results as accurate as state-of-the-art deterministic SVD routines.
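
    The reduced-rank randomized SVD at the heart of this speed-up follows the well-known random-projection recipe (random test matrix, optional power iterations, QR, then a small deterministic SVD). A minimal dense-matrix sketch, with oversampling and iteration counts chosen arbitrarily:

        import numpy as np

        def rrsvd(A, rank, n_oversample=10, n_iter=2, rng=np.random.default_rng(5)):
            # Project onto a random subspace, orthonormalize, then do a small SVD.
            m, n = A.shape
            Y = A @ rng.normal(size=(n, rank + n_oversample))
            for _ in range(n_iter):          # power iterations sharpen the basis
                Y = A @ (A.conj().T @ Y)
            Q, _ = np.linalg.qr(Y)
            B = Q.conj().T @ A               # small (rank + oversample) x n matrix
            Ub, s, Vh = np.linalg.svd(B, full_matrices=False)
            return (Q @ Ub)[:, :rank], s[:rank], Vh[:rank]

        A = np.random.default_rng(5).normal(size=(400, 300))
        U, s, Vh = rrsvd(A, rank=20)
        print(np.linalg.norm(A - (U * s) @ Vh))   # error of the rank-20 approximation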

  6. Pharmacologic treatment for urgency-predominant urinary incontinence in women diagnosed using a simplified algorithm: a randomized trial

    PubMed Central

    Huang, Alison J.; Hess, Rachel; Arya, Lily A.; Richter, Holly E.; Subak, Leslee L.; Bradley, Catherine S.; Rogers, Rebecca G.; Myers, Deborah L.; Johnson, Karen C.; Gregory, W. Thomas; Kraus, Stephen R.; Schembri, Michael; Brown, Jeanette S.

    2013-01-01

    Objective The purpose of this study was to evaluate clinical outcomes associated with the initiation of treatment for urgency-predominant incontinence in women diagnosed by a simple 3-item questionnaire. Study Design We conducted a multicenter, double-blinded, 12-week randomized trial of pharmacologic therapy for urgency-predominant incontinence in ambulatory women diagnosed by the simple 3-item questionnaire. Participants (N = 645) were assigned randomly to fesoterodine therapy (4-8 mg daily) or placebo. Urinary incontinence was assessed with the use of voiding diaries; postvoid residual volume was measured after treatment. Results After 12 weeks, women who had been assigned randomly to fesoterodine therapy reported 0.9 fewer urgency and 1.0 fewer total incontinence episodes/day, compared with placebo (P ≤ .001). Four serious adverse events occurred in each group, none of which was related to treatment. No participant had postvoid residual volume of ≥250 mL after treatment. Conclusion Among ambulatory women with urgency-predominant incontinence diagnosed with a simple 3-item questionnaire, pharmacologic therapy resulted in a moderate decrease in incontinence frequency without increasing significant urinary retention or serious adverse events, which provides support for a streamlined algorithm for diagnosis and treatment of female urgency-predominant incontinence. PMID:22542122

  7. Robust 3D object localization and pose estimation for random bin picking with the 3DMaMa algorithm

    NASA Astrophysics Data System (ADS)

    Skotheim, Øystein; Thielemann, Jens T.; Berge, Asbjørn; Sommerfelt, Arne

    2010-02-01

    Enabling robots to automatically locate and pick up randomly placed and oriented objects from a bin is an important challenge in factory automation, replacing tedious and heavy manual labor. A system should be able to recognize and locate objects with a predefined shape and estimate their position with the precision necessary for a gripping robot to pick them up. We describe a system that consists of a structured light instrument for capturing 3D data and a robust approach for object location and pose estimation. The method does not depend on segmentation of range images, but instead searches through pairs of 2D manifolds to localize candidates for object match. This leads to an algorithm that is not very sensitive to scene complexity or the number of objects in the scene. Furthermore, the strategy for candidate search is easily reconfigurable to arbitrary objects. Experiments reported in this paper show the utility of the method on a general random bin picking problem, exemplified by localization of car parts with random position and orientation. Full pose estimation is done in less than 380 ms per image. We believe that the method is applicable for a wide range of industrial automation problems where precise localization of 3D objects in a scene is needed.

  8. Mapping Sub-Antarctic Cushion Plants Using Random Forests to Combine Very High Resolution Satellite Imagery and Terrain Modelling

    PubMed Central

    Bricher, Phillippa K.; Lucieer, Arko; Shaw, Justine; Terauds, Aleks; Bergstrom, Dana M.

    2013-01-01

    Monitoring changes in the distribution and density of plant species often requires accurate and high-resolution baseline maps of those species. Detecting such change at the landscape scale is often problematic, particularly in remote areas. We examine a new technique to improve accuracy and objectivity in mapping vegetation, combining species distribution modelling and satellite image classification on a remote sub-Antarctic island. In this study, we combine spectral data from very high resolution WorldView-2 satellite imagery and terrain variables from a high resolution digital elevation model to improve mapping accuracy, in both pixel- and object-based classifications. Random forest classification was used to explore the effectiveness of these approaches on mapping the distribution of the critically endangered cushion plant Azorella macquariensis Orchard (Apiaceae) on sub-Antarctic Macquarie Island. Both pixel- and object-based classifications of the distribution of Azorella achieved very high overall validation accuracies (91.6–96.3%, κ = 0.849–0.924). Both two-class and three-class classifications were able to accurately and consistently identify the areas where Azorella was absent, indicating that these maps provide a suitable baseline for monitoring expected change in the distribution of the cushion plants. Detecting such change is critical given the threats this species is currently facing under altering environmental conditions. The method presented here has applications to monitoring a range of species, particularly in remote and isolated environments. PMID:23940805

  9. Accurate Segmentation of CT Male Pelvic Organs via Regression-Based Deformable Models and Multi-Task Random Forests.

    PubMed

    Gao, Yaozong; Shao, Yeqin; Lian, Jun; Wang, Andrew Z; Chen, Ronald C; Shen, Dinggang

    2016-06-01

    Segmenting male pelvic organs from CT images is a prerequisite for prostate cancer radiotherapy. The efficacy of radiation treatment highly depends on segmentation accuracy. However, accurate segmentation of male pelvic organs is challenging due to low tissue contrast of CT images, as well as large variations of shape and appearance of the pelvic organs. Among existing segmentation methods, deformable models are the most popular, as shape prior can be easily incorporated to regularize the segmentation. Nonetheless, the sensitivity to initialization often limits their performance, especially for segmenting organs with large shape variations. In this paper, we propose a novel approach to guide deformable models, thus making them robust against arbitrary initializations. Specifically, we learn a displacement regressor, which predicts 3D displacement from any image voxel to the target organ boundary based on the local patch appearance. This regressor provides a non-local external force for each vertex of deformable model, thus overcoming the initialization problem suffered by the traditional deformable models. To learn a reliable displacement regressor, two strategies are particularly proposed. 1) A multi-task random forest is proposed to learn the displacement regressor jointly with the organ classifier; 2) an auto-context model is used to iteratively enforce structural information during voxel-wise prediction. Extensive experiments on 313 planning CT scans of 313 patients show that our method achieves better results than alternative classification or regression based methods, and also several other existing methods in CT pelvic organ segmentation. PMID:26800531

  10. Proteomics Analysis with a Nano Random Forest Approach Reveals Novel Functional Interactions Regulated by SMC Complexes on Mitotic Chromosomes*

    PubMed Central

    Ohta, Shinya; Montaño-Gutierrez, Luis F.; de Lima Alves, Flavia; Ogawa, Hiromi; Toramoto, Iyo; Sato, Nobuko; Morrison, Ciaran G.; Takeda, Shunichi; Hudson, Damien F.; Earnshaw, William C.

    2016-01-01

    Packaging of DNA into condensed chromosomes during mitosis is essential for the faithful segregation of the genome into daughter nuclei. Although the structure and composition of mitotic chromosomes have been studied for over 30 years, these aspects are yet to be fully elucidated. Here, we used stable isotope labeling with amino acids in cell culture to compare the proteomes of mitotic chromosomes isolated from cell lines harboring conditional knockouts of members of the condensin (SMC2, CAP-H, CAP-D3), cohesin (Scc1/Rad21), and SMC5/6 (SMC5) complexes. Our analysis revealed that these complexes associate with chromosomes independently of each other, with the SMC5/6 complex showing no significant dependence on any other chromosomal proteins during mitosis. To identify subtle relationships between chromosomal proteins, we employed a nano Random Forest (nanoRF) approach to detect protein complexes and the relationships between them. Our nanoRF results suggested that as few as 113 of 5058 detected chromosomal proteins are functionally linked to chromosome structure and segregation. Furthermore, nanoRF data revealed 23 proteins that were not previously suspected to have functional interactions with complexes playing important roles in mitosis. Subsequent small-interfering-RNA-based validation and localization tracking by green fluorescent protein-tagging highlighted novel candidates that might play significant roles in mitotic progression. PMID:27231315

  11. Using random forest to classify linear B-cell epitopes based on amino acid properties and molecular features.

    PubMed

    Huang, Jian-Hua; Wen, Ming; Tang, Li-Juan; Xie, Hua-Lin; Fu, Liang; Liang, Yi-Zeng; Lu, Hong-Mei

    2014-08-01

    Identification and characterization of B-cell epitopes in target antigens is one of the key steps in epitope-driven vaccine design, immunodiagnostic tests, and antibody production. Experimental determination of epitopes is labor-intensive and expensive; therefore, there is an urgent need for computational methods for reliable identification of B-cell epitopes. In the current study, we propose a novel peptide feature description method which combines peptide amino acid properties with chemical molecular features. Based on these combined features, a random forest (RF) classifier was adopted to classify B-cell epitopes and non-epitopes. RF is an ensemble method that uses recursive partitioning to generate many trees for aggregating the results, and it always produces highly competitive models. The classification accuracy, sensitivity, specificity, Matthews correlation coefficient (MCC), and area under the curve (AUC) values for the current method were 78.31%, 80.05%, 72.23%, 0.5836, and 0.8800, respectively. These results show that an appropriate combination of peptide amino acid features and chemical molecular features with an RF model can enhance the prediction performance for linear B-cell epitopes. Finally, a free online service is available at http://sysbio.yznu.cn/Research/Epitopesprediction.aspx. PMID:24721579

  12. In silico modelling of permeation enhancement potency in Caco-2 monolayers based on molecular descriptors and random forest.

    PubMed

    Welling, Søren H; Clemmensen, Line K H; Buckley, Stephen T; Hovgaard, Lars; Brockhoff, Per B; Refsgaard, Hanne H F

    2015-08-01

    Structural traits of permeation enhancers are important determinants of their capacity to promote enhanced drug absorption. Therefore, in order to obtain a better understanding of structure-activity relationships for permeation enhancers, a Quantitative Structure-Activity Relationship (QSAR) model has been developed. The random forest QSAR model was based upon Caco-2 data for 41 surfactant-like permeation enhancers from Whitehead et al. (2008) and molecular descriptors calculated from their structure. The QSAR model was validated by two test sets: (i) an eleven-compound experimental set with Caco-2 data and (ii) nine compounds with Caco-2 data from the literature. Feature contributions, a recently developed diagnostic tool, were applied to elucidate the contribution of individual molecular descriptors to the predicted potency. Feature contributions provided easily interpretable suggestions of important structural properties for potent permeation enhancers, such as segregation of hydrophilic and lipophilic domains. By focusing on surfactant-like properties, it is possible to model the potency of these complex pharmaceutical excipients. For the first time, a QSAR model has been developed for permeation enhancement. The model is a valuable in silico approach for both screening of new permeation enhancers and physicochemical optimisation of surfactant enhancer systems. PMID:26004819

  13. Mapping sub-antarctic cushion plants using random forests to combine very high resolution satellite imagery and terrain modelling.

    PubMed

    Bricher, Phillippa K; Lucieer, Arko; Shaw, Justine; Terauds, Aleks; Bergstrom, Dana M

    2013-01-01

    Monitoring changes in the distribution and density of plant species often requires accurate and high-resolution baseline maps of those species. Detecting such change at the landscape scale is often problematic, particularly in remote areas. We examine a new technique to improve accuracy and objectivity in mapping vegetation, combining species distribution modelling and satellite image classification on a remote sub-Antarctic island. In this study, we combine spectral data from very high resolution WorldView-2 satellite imagery and terrain variables from a high resolution digital elevation model to improve mapping accuracy, in both pixel- and object-based classifications. Random forest classification was used to explore the effectiveness of these approaches on mapping the distribution of the critically endangered cushion plant Azorella macquariensis Orchard (Apiaceae) on sub-Antarctic Macquarie Island. Both pixel- and object-based classifications of the distribution of Azorella achieved very high overall validation accuracies (91.6-96.3%, κ = 0.849-0.924). Both two-class and three-class classifications were able to accurately and consistently identify the areas where Azorella was absent, indicating that these maps provide a suitable baseline for monitoring expected change in the distribution of the cushion plants. Detecting such change is critical given the threats this species is currently facing under altering environmental conditions. The method presented here has applications to monitoring a range of species, particularly in remote and isolated environments. PMID:23940805

  14. Random Forest Classification of Sediments on Exposed Intertidal Flats Using ALOS-2 Quad-Polarimetric SAR Data

    NASA Astrophysics Data System (ADS)

    Wang, W.; Yang, X.; Liu, G.; Zhou, H.; Ma, W.; Yu, Y.; Li, Z.

    2016-06-01

    Coastal zones are among the world's most densely populated areas, and an accurate, cost-effective, frequent, and synoptic method of monitoring these complex ecosystems is needed. However, misclassification of sediments on exposed intertidal flats restricts the development of coastal zone surveillance. With the advent of SAR (Synthetic Aperture Radar) satellites, polarimetric SAR satellite imagery plays an increasingly important role in monitoring changes in coastal wetlands. This research investigated the necessity of combining SAR polarimetric features with optical data, and their contribution to accurate sediment classification. Three experimental groups were set up to assess the most appropriate descriptors: (i) several SAR polarimetric descriptors extracted from the scattering matrix using the Cloude-Pottier, Freeman-Durden and Yamaguchi methods; (ii) optical remote sensing (RS) data with R, G and B channels as the second feature combination; and (iii) the chosen SAR and optical RS indicators combined in one classifier. Classification was carried out using Random Forest (RF) classifiers, and a general sediment map of the intertidal flats was generated. Experiments were implemented using ALOS-2 L-band satellite imagery and GF-1 optical multi-spectral data acquired in the same period. The weights of the descriptors were evaluated by VI (RF Variable Importance). Results suggested that the optical data source has few advantages for sediment classification and can even reduce the effectiveness of the SAR indicators. Polarimetric SAR feature sets show great potential for intertidal flat classification and are promising for classifying mud flats, sand flats, bare farmland and tidal water.

  15. Polarimetric SAR decomposition parameter subset selection and their optimal dynamic range evaluation for urban area classification using Random Forest

    NASA Astrophysics Data System (ADS)

    Hariharan, Siddharth; Tirodkar, Siddhesh; Bhattacharya, Avik

    2016-02-01

    Urban area classification is important for monitoring the ever increasing urbanization and studying its environmental impact. Two NASA JPL's UAVSAR datasets of L-band (wavelength: 23 cm) were used in this study for urban area classification. The two datasets used in this study are different in terms of urban area structures, building patterns, their geometric shapes and sizes. In these datasets, some urban areas appear oriented about the radar line of sight (LOS) while some areas appear non-oriented. In this study, roll invariant polarimetric SAR decomposition parameters were used to classify these urban areas. Random Forest (RF), which is an ensemble decision tree learning technique, was used in this study. RF performs parameter subset selection as a part of its classification procedure. In this study, parameter subsets were obtained and analyzed to infer scattering mechanisms useful for urban area classification. The Cloude-Pottier α, the Touzi dominant scattering amplitude αs1 and the anisotropy A were among the top six important parameters selected for both the datasets. However, it was observed that these parameters were ranked differently for the two datasets. The urban area classification using RF was compared with the Support Vector Machine (SVM) and the Maximum Likelihood Classifier (MLC) for both the datasets. RF outperforms SVM by 4% and MLC by 12% in Dataset 1. It also outperforms SVM and MLC by 3.5% and 11% respectively in Dataset 2.

  16. Random forests in non-invasive sensorimotor rhythm brain-computer interfaces: a practical and convenient non-linear classifier.

    PubMed

    Steyrl, David; Scherer, Reinhold; Faller, Josef; Müller-Putz, Gernot R

    2016-02-01

    There is general agreement in the brain-computer interface (BCI) community that although non-linear classifiers can provide better results in some cases, linear classifiers are preferable, particularly as non-linear classifiers often involve a number of parameters that must be carefully chosen. However, new non-linear classifiers have been developed over the last decade. One of them is the random forest (RF) classifier. Although popular in other fields of science, RFs are not common in BCI research. In this work, we address three open questions regarding RFs in sensorimotor rhythm (SMR) BCIs: parametrization, online applicability, and performance compared to regularized linear discriminant analysis (LDA). We found that the performance of RF is constant over a large range of parameter values. We demonstrate, for the first time, that RFs are applicable online in SMR-BCIs. Further, we show in an offline BCI simulation that RFs statistically significantly outperform regularized LDA by about 3%. These results confirm that RFs are practical and convenient non-linear classifiers for SMR-BCIs. Taking into account further properties of RFs, such as independence from feature distributions, maximum margin behavior, and multiclass and advanced data mining capabilities, we argue that RFs should be taken into consideration for future BCIs. PMID:25830903

  17. Proteomics Analysis with a Nano Random Forest Approach Reveals Novel Functional Interactions Regulated by SMC Complexes on Mitotic Chromosomes.

    PubMed

    Ohta, Shinya; Montaño-Gutierrez, Luis F; de Lima Alves, Flavia; Ogawa, Hiromi; Toramoto, Iyo; Sato, Nobuko; Morrison, Ciaran G; Takeda, Shunichi; Hudson, Damien F; Rappsilber, Juri; Earnshaw, William C

    2016-08-01

    Packaging of DNA into condensed chromosomes during mitosis is essential for the faithful segregation of the genome into daughter nuclei. Although the structure and composition of mitotic chromosomes have been studied for over 30 years, these aspects are yet to be fully elucidated. Here, we used stable isotope labeling with amino acids in cell culture to compare the proteomes of mitotic chromosomes isolated from cell lines harboring conditional knockouts of members of the condensin (SMC2, CAP-H, CAP-D3), cohesin (Scc1/Rad21), and SMC5/6 (SMC5) complexes. Our analysis revealed that these complexes associate with chromosomes independently of each other, with the SMC5/6 complex showing no significant dependence on any other chromosomal proteins during mitosis. To identify subtle relationships between chromosomal proteins, we employed a nano Random Forest (nanoRF) approach to detect protein complexes and the relationships between them. Our nanoRF results suggested that as few as 113 of 5058 detected chromosomal proteins are functionally linked to chromosome structure and segregation. Furthermore, nanoRF data revealed 23 proteins that were not previously suspected to have functional interactions with complexes playing important roles in mitosis. Subsequent small-interfering-RNA-based validation and localization tracking by green fluorescent protein-tagging highlighted novel candidates that might play significant roles in mitotic progression. PMID:27231315

  18. Analysis of a stack algorithm for random multiple-access communication

    NASA Astrophysics Data System (ADS)

    Fayolle, G.; Flajolet, P.; Jacquet, P.; Hofri, M.

    1985-03-01

    The present investigation is concerned with the performance of a protocol for managing the use of a single-channel packet-switching communications network such as the one employed in the Aloha network. The analysis mainly concerns the Capetanakis-Tsybakov-Mikhailov (CTM) access protocols and collision resolution algorithms (CRAs), coupled with free, or continuous, access of newly arriving packets into the contention. The considered scheme proves ergodic as long as the rate of generation of new packets is below a certain bound. Attention is given to the basic equation and the collision resolution interval (CRI) duration, the direct evaluation of the mean delay, the distribution of the states of the top of the stack, the moments of packet delay time, and the numerical results.
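
    A toy simulation of a binary stack-type CRA with free access conveys the flavor of the scheme being analyzed: newly arriving packets join the contention immediately, and a collision splits the colliding set by fair coin flips. This is an illustrative model under Poisson arrivals, not the paper's analytical machinery:

        import numpy as np

        def ctm_free_access(arrival_rate, n_slots, rng=np.random.default_rng(9)):
            # stack[i] = number of packets waiting at stack level i
            stack = [0]
            throughput = 0
            for _ in range(n_slots):
                stack[0] += rng.poisson(arrival_rate)  # free access: newcomers join now
                if stack[0] <= 1:                      # idle slot or successful send
                    throughput += stack[0]
                    stack.pop(0)                       # all deeper levels move up
                    if not stack:
                        stack = [0]
                else:                                  # collision: split level 0 by coin flips
                    stay = rng.binomial(stack[0], 0.5)
                    stack[0:1] = [stay, stack[0] - stay]
            return throughput / n_slots

        # stable for rates below the scheme's capacity bound (~0.36 for this variant)
        print(ctm_free_access(0.3, 100000))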

  19. Efficient asymmetric image authentication schemes based on photon counting-double random phase encoding and RSA algorithms.

    PubMed

    Moon, Inkyu; Yi, Faliu; Han, Mingu; Lee, Jieun

    2016-06-01

    Recently, double random phase encoding (DRPE) has been integrated with the photon counting (PC) imaging technique for the purpose of secure image authentication. In this scheme, the same key should be securely distributed and shared between the sender and receiver, but this is one of the most vexing problems of symmetric cryptosystems. In this study, we propose an efficient asymmetric image authentication scheme by combining the PC-DRPE and RSA algorithms, which solves key management and distribution problems. The retrieved image from the proposed authentication method contains photon-limited encrypted data obtained by means of PC-DRPE. Therefore, the original image can be protected while the retrieved image can be efficiently verified using a statistical nonlinear correlation approach. Experimental results demonstrate the feasibility of our proposed asymmetric image authentication method. PMID:27411183

  20. Temporal optimisation of image acquisition for land cover classification with Random Forest and MODIS time-series

    NASA Astrophysics Data System (ADS)

    Nitze, Ingmar; Barrett, Brian; Cawkwell, Fiona

    2015-02-01

    The analysis and classification of land cover is one of the principal applications in terrestrial remote sensing. Due to the seasonal variability of different vegetation types and land surface characteristics, the ability to discriminate land cover types changes over time. Multi-temporal classification can help to improve the classification accuracies, but different constraints, such as financial restrictions or atmospheric conditions, may impede its application. The optimisation of image acquisition timing and frequencies can help to increase the effectiveness of the classification process. For this purpose, the Feature Importance (FI) measure of the state-of-the-art machine learning method Random Forest was used to determine the optimal image acquisition periods for a general (Grassland, Forest, Water, Settlement, Peatland) and Grassland specific (Improved Grassland, Semi-Improved Grassland) land cover classification in central Ireland based on a 9-year time-series of MODIS Terra 16 day composite data (MOD13Q1). Feature Importances for each acquisition period of the Enhanced Vegetation Index (EVI) and Normalised Difference Vegetation Index (NDVI) were calculated for both classification scenarios. In the general land cover classification, the months December and January showed the highest, and July and August the lowest separability for both VIs over the entire nine-year period. This temporal separability was reflected in the classification accuracies, where the optimal choice of image dates outperformed the worst image date by 13% using NDVI and 5% using EVI on a mono-temporal analysis. With the addition of the next best image periods to the data input the classification accuracies converged quickly to their limit at around 8-10 images. The binary classification schemes, using two classes only, showed a stronger seasonal dependency with a higher intra-annual, but lower inter-annual variation. Nonetheless, anomalous weather conditions, such as the cold winter of
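
    A minimal Python sketch of the underlying idea: fit a Random Forest on per-period vegetation-index features and rank acquisition periods by Feature Importance. The data below is synthetic (23 composites standing in for one year of MOD13Q1), so the recovered ranking is illustrative only.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        n_pixels, n_periods = 2000, 23            # 16-day composites in one year
        X = rng.normal(size=(n_pixels, n_periods))
        y = (X[:, 0] + X[:, 22] > 0).astype(int)  # winter periods carry the signal

        rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
        ranking = np.argsort(rf.feature_importances_)[::-1]
        print("acquisition periods ranked by Feature Importance:", ranking[:5])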

  1. Q-ary collision resolution algorithms in random-access systems with free or blocked channel access

    NASA Astrophysics Data System (ADS)

    Mathys, P.; Flajolet, P.

    1985-03-01

    The throughput characteristics of contention-based random-access systems (RAS's) which use Q-ary tree algorithms (where Q, equal to or greater than 2, is the number of groups into which contending users are split) of the Capetanakis-Tsybakov-Mikhailov-Vvedenskaya type are analyzed for an infinite population of identical users generating packets according to a Poisson process. Both free and blocked channel-access protocols are considered in combination with Q-ary collision resolution algorithms that exploit either binary ('collision/no collision') or ternary ('collision/success/idle') feedback. For the resulting RAS's, functional equations for transformed generating functions of the first two moments of the collision resolution interval length are obtained and solved. The maximum stable throughput as a function of Q is given. The results of a packet-delay analysis are also given, and the analyzed RAS's are compared among themselves and with the slotted ALOHA system in terms of both system throughput and packet delay. It is concluded that the 'practical optimum' RAS (in terms of ease of implementation combined with good performance) uses free (i.e., immediate) channel access and ternary splitting (i.e., Q = 3) with binary feedback.

  2. Use of random forest to estimate population attributable fractions from a case-control study of Salmonella enterica serotype Enteritidis infections.

    PubMed

    Gu, W; Vieira, A R; Hoekstra, R M; Griffin, P M; Cole, D

    2015-10-01

    To design effective food safety programmes we need to estimate how many sporadic foodborne illnesses are caused by specific food sources based on case-control studies. Logistic regression has substantive limitations for analysing structured questionnaire data with numerous exposures and missing values. We adapted random forest to analyse data of a case-control study of Salmonella enterica serotype Enteritidis illness for source attribution. For estimation of summary population attributable fractions (PAFs) of exposures grouped into transmission routes, we devised a counterfactual estimator to predict reductions in illness associated with removing grouped exposures. For the purpose of comparison, we fitted the data using logistic regression models with stepwise forward and backward variable selection. Our results show that the forward and backward variable selection of logistic regression models were not consistent for parameter estimation, with different significant exposures identified. By contrast, the random forest model produced estimated PAFs of grouped exposures consistent in rank order with results obtained from outbreak data, with egg-related exposures having the highest estimated PAF (22·1%, 95% confidence interval 8·5-31·8). Random forest might be structurally more coherent and efficient than logistic regression models for attributing Salmonella illnesses to sources involving many causal pathways. PMID:25672399
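
    The counterfactual PAF idea can be sketched in a few lines of Python: fit a Random Forest on the exposure matrix, set a group of exposures to "unexposed" for everyone, and compare predicted illness probabilities. The exposures, effect sizes and data below are invented placeholders, not the study's questionnaire items.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        X = rng.integers(0, 2, size=(3000, 10))        # binary exposure indicators
        risk = 0.05 + 0.15 * X[:, 0] + 0.10 * X[:, 1]  # columns 0-1: one route
        y = rng.random(3000) < risk                    # case/control outcome

        rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

        X_cf = X.copy()
        X_cf[:, [0, 1]] = 0                            # remove grouped exposures
        p_obs = rf.predict_proba(X)[:, 1].mean()
        p_cf = rf.predict_proba(X_cf)[:, 1].mean()
        print(f"summary PAF for the grouped exposures: {(p_obs - p_cf) / p_obs:.2f}")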

  3. Applicability of random sequential adsorption algorithm for simulation of surface plasma polishing kinetics

    NASA Astrophysics Data System (ADS)

    Minárik, Stanislav; Vaňa, Dušan

    2015-11-01

    The applicability of the random sequential adsorption (RSA) model to material removal during surface plasma polishing is discussed. The mechanical nature of the plasma polishing process is taken into consideration in a modified version of the RSA model. During plasma polishing the surface layer is aligned such that molecules of material are removed from the surface mechanically as a consequence of the surface deformation induced by plasma particle impact. We propose a modification of the RSA technique to describe the reduction of material on the surface, provided that the sequential character of molecule release from the surface is maintained throughout the polishing process. This empirical model is able to estimate the depth profile of material density on the surface during plasma polishing. We have shown that preliminary results obtained from this model are in good agreement with experimental results. We believe that molecular dynamics simulation of the polishing process, and possibly also other types of surface treatment, can be based on this model. However, the influence of material parameters and processing conditions (including plasma characteristics) must be taken into account using appropriate model variables.
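
    For readers unfamiliar with RSA, the Python sketch below shows the core mechanic in its standard, forward form: candidate disks are placed at random and accepted only if they do not overlap previously accepted ones. The paper adapts this sequential acceptance logic to removal events; the disk radius and attempt count here are arbitrary.

        import numpy as np

        def rsa_disks(radius=0.03, attempts=5000, seed=0):
            rng = np.random.default_rng(seed)
            placed = []
            for _ in range(attempts):
                p = rng.random(2)                 # candidate centre in unit square
                if all(np.hypot(*(p - q)) >= 2 * radius for q in placed):
                    placed.append(p)              # accepted: no overlap
            return placed

        disks = rsa_disks()
        coverage = len(disks) * np.pi * 0.03**2
        print(f"{len(disks)} disks accepted, coverage fraction ~ {coverage:.2f}")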

  4. Selecting Optimal Random Forest Predictive Models: A Case Study on Predicting the Spatial Distribution of Seabed Hardness

    PubMed Central

    Li, Jin; Tran, Maggie; Siwabessy, Justy

    2016-01-01

    Spatially continuous predictions of seabed hardness are important baseline environmental information for sustainable management of Australia’s marine jurisdiction. Seabed hardness is often inferred from multibeam backscatter data with unknown accuracy and can be inferred from underwater video footage at limited locations. In this study, we classified the seabed into four classes based on two new seabed hardness classification schemes (i.e., hard90 and hard70). We developed optimal predictive models to predict seabed hardness using random forest (RF) based on the point data of hardness classes and spatially continuous multibeam data. Five feature selection (FS) methods, namely variable importance (VI), averaged variable importance (AVI), knowledge-informed AVI (KIAVI), Boruta, and regularized RF (RRF), were tested based on predictive accuracy. Effects of highly correlated, important and unimportant predictors on the accuracy of RF predictive models were examined. Finally, spatial predictions generated using the most accurate models were visually examined and analysed. This study confirmed that: 1) hard90 and hard70 are effective seabed hardness classification schemes; 2) seabed hardness of four classes can be predicted with a high degree of accuracy; 3) the typical approach used to pre-select predictive variables by excluding highly correlated variables needs to be re-examined; 4) the identification of the important and unimportant predictors provides useful guidelines for further improving predictive models; 5) FS methods select the most accurate predictive model(s) instead of the most parsimonious ones, and AVI and Boruta are recommended for future studies; and 6) RF is an effective modelling method with high predictive accuracy for multi-level categorical data and can be applied to ‘small p and large n’ problems in environmental sciences. Additionally, automated computational programs for AVI need to be developed to increase its computational efficiency and

  5. Selecting Optimal Random Forest Predictive Models: A Case Study on Predicting the Spatial Distribution of Seabed Hardness.

    PubMed

    Li, Jin; Tran, Maggie; Siwabessy, Justy

    2016-01-01

    Spatially continuous predictions of seabed hardness are important baseline environmental information for sustainable management of Australia's marine jurisdiction. Seabed hardness is often inferred from multibeam backscatter data with unknown accuracy and can be inferred from underwater video footage at limited locations. In this study, we classified the seabed into four classes based on two new seabed hardness classification schemes (i.e., hard90 and hard70). We developed optimal predictive models to predict seabed hardness using random forest (RF) based on the point data of hardness classes and spatially continuous multibeam data. Five feature selection (FS) methods, namely variable importance (VI), averaged variable importance (AVI), knowledge-informed AVI (KIAVI), Boruta, and regularized RF (RRF), were tested based on predictive accuracy. Effects of highly correlated, important and unimportant predictors on the accuracy of RF predictive models were examined. Finally, spatial predictions generated using the most accurate models were visually examined and analysed. This study confirmed that: 1) hard90 and hard70 are effective seabed hardness classification schemes; 2) seabed hardness of four classes can be predicted with a high degree of accuracy; 3) the typical approach used to pre-select predictive variables by excluding highly correlated variables needs to be re-examined; 4) the identification of the important and unimportant predictors provides useful guidelines for further improving predictive models; 5) FS methods select the most accurate predictive model(s) instead of the most parsimonious ones, and AVI and Boruta are recommended for future studies; and 6) RF is an effective modelling method with high predictive accuracy for multi-level categorical data and can be applied to 'small p and large n' problems in environmental sciences. Additionally, automated computational programs for AVI need to be developed to increase its computational efficiency and
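
    A compact Python sketch of the averaged-variable-importance (AVI) idea: average RF importances over several seeds, drop the weakest predictor, refit, and keep the subset with the best cross-validated accuracy. This is a simplification of the paper's procedure, run on synthetic data with placeholder predictor indices.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                                   random_state=0)
        features = list(range(X.shape[1]))
        best_score, best_set = -np.inf, features[:]

        while len(features) > 1:
            avi = np.mean([RandomForestClassifier(n_estimators=200, random_state=s)
                           .fit(X[:, features], y).feature_importances_
                           for s in range(5)], axis=0)   # averaged importance
            score = cross_val_score(RandomForestClassifier(n_estimators=200,
                                                           random_state=0),
                                    X[:, features], y, cv=5).mean()
            if score > best_score:
                best_score, best_set = score, features[:]
            features.pop(int(np.argmin(avi)))            # drop weakest predictor

        print(f"best subset {best_set} with accuracy {best_score:.3f}")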

  6. Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin

    NASA Astrophysics Data System (ADS)

    Mellor, Andrew; Boukir, Samia; Haywood, Andrew; Jones, Simon

    2015-07-01

    Studies have demonstrated the robust performance of the ensemble machine learning classifier, random forests, for remote sensing land cover classification, particularly across complex landscapes. This study introduces new ensemble margin criteria to evaluate the performance of Random Forests (RF) in the context of large area land cover classification and examines the effect of different training data characteristics (imbalance and mislabelling) on classification accuracy and uncertainty. The study presents a new margin-weighted confusion matrix, which, used in combination with the traditional confusion matrix, provides confidence estimates associated with correctly and misclassified instances in the RF classification model. Landsat TM satellite imagery, topographic and climate ancillary data are used to build binary (forest/non-forest) and multiclass (forest canopy cover classes) classification models, trained using sample aerial photograph maps, across Victoria, Australia. Experiments were undertaken to reveal insights into the behaviour of RF over large and complex data, in which training data are not evenly distributed among classes (imbalance) and contain systematically mislabelled instances. Results of experiments reveal that while the error rate of the RF classifier is relatively insensitive to mislabelled training data (in the multiclass experiment, overall 78.3% Kappa with no mislabelled instances to 70.1% with 25% mislabelling in each class), the level of associated confidence falls at a faster rate than overall accuracy with increasing amounts of mislabelled training data. In general, balanced training data resulted in the lowest overall error rates for classification experiments (82.3% and 78.3% for the binary and multiclass experiments respectively). However, results of the study demonstrate that imbalance can be introduced to improve error rates of more difficult classes, without adversely affecting overall classification accuracy.
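
    The ensemble margin itself is straightforward to compute from a fitted scikit-learn forest, as in the Python sketch below: the fraction of trees voting for the winning class minus the fraction voting for the runner-up. Low margins flag low-confidence instances; the data here is synthetic.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier

        X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)
        rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

        # per-sample fraction of trees voting for each class
        votes = np.stack([tree.predict(X) for tree in rf.estimators_])
        frac = np.stack([(votes == c).mean(axis=0) for c in rf.classes_])
        sorted_frac = np.sort(frac, axis=0)
        margin = sorted_frac[-1] - sorted_frac[-2]   # winner minus runner-up
        print("mean ensemble margin:", margin.mean().round(3))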

  7. An evidence gathering and assessment technique designed for a forest cover classification algorithm based on the Dempster-Shafer theory of evidence

    NASA Astrophysics Data System (ADS)

    Szymanski, David Lawrence

    This thesis presents a new approach for classifying Landsat 5 Thematic Mapper (TM) imagery that utilizes digitally represented, non-spectral data in the classification step. A classification algorithm that is based on the Dempster-Shafer theory of evidence is developed and tested for its ability to provide an accurate representation of forest cover on the ground at the Anderson et al. (1976) level II. The research focuses on defining an objective, systematic method of gathering and assessing the evidence from digital sources including TM data, the normalized difference vegetation index, soils, slope, aspect, and elevation. The algorithm is implemented using the ESRI ArcView Spatial Analyst software package and the Grid spatial data structure, with software coded in both ArcView Avenue and C. The methodology uses frequency of occurrence information to gather evidence and also introduces measures of evidence quality that quantify the ability of the evidence source to differentiate the Anderson forest cover classes. The measures are derived objectively and empirically and are based on common principles of legal argument. The evidence assessment measures augment the Dempster-Shafer theory, and the research determines whether they provide an argument that is sound, credible, and consistent. This research produces a method for identifying, assessing, and combining evidence sources using the Dempster-Shafer theory that results in a classified image containing the Anderson forest cover class. Test results indicate that the new classifier performs with accuracy that is similar to the traditional maximum likelihood approach. However, confusion among the deciduous and mixed classes remains. The utility of both the evidence gathering and evidence assessment methods is demonstrated and confirmed. The algorithm presents an operational method of using the Dempster-Shafer theory of evidence for forest classification.
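
    At the heart of such a classifier is Dempster's rule of combination. The Python sketch below combines two evidence sources over a small frame of discernment; the class names and mass values are invented for illustration, and the code assumes the sources are not totally conflicting.

        from itertools import product

        frame = frozenset({"deciduous", "coniferous", "mixed"})

        def combine(m1, m2):
            """Dempster's rule: multiply masses, keep intersections, renormalise."""
            out, conflict = {}, 0.0
            for (a, x), (b, y) in product(m1.items(), m2.items()):
                inter = a & b
                if inter:
                    out[inter] = out.get(inter, 0.0) + x * y
                else:
                    conflict += x * y
            return {k: v / (1.0 - conflict) for k, v in out.items()}

        m_spectral = {frozenset({"deciduous"}): 0.6, frame: 0.4}
        m_terrain = {frozenset({"deciduous", "mixed"}): 0.5, frame: 0.5}
        print(combine(m_spectral, m_terrain))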

  8. Using Logistic Regression and Random Forests multivariate statistical methods for landslide spatial probability assessment in North-Est Sicily, Italy

    NASA Astrophysics Data System (ADS)

    Trigila, Alessandro; Iadanza, Carla; Esposito, Carlo; Scarascia-Mugnozza, Gabriele

    2015-04-01

    The first phase of the work aimed to identify the spatial relationships between landslide locations and the 13 related factors by using the Frequency Ratio bivariate statistical method. The analysis was then carried out by adopting a multivariate statistical approach, according to the Logistic Regression technique and the Random Forests technique, which gave the best results in terms of AUC. The models were run and evaluated with different sample sizes and also taking into account the temporal variation of input variables such as areas burned by wildfire. The most significant outcomes of this work are: the relevant influence of the sample size on the model results and the strong importance of some environmental factors (e.g. land use and wildfires) for the identification of the depletion zones of extremely rapid shallow landslides.

  9. Combining random forest and 2D correlation analysis to identify serum spectral signatures for neuro-oncology.

    PubMed

    Smith, Benjamin R; Ashton, Katherine M; Brodbelt, Andrew; Dawson, Timothy; Jenkinson, Michael D; Hunt, Neil T; Palmer, David S; Baker, Matthew J

    2016-06-01

    Fourier transform infrared (FTIR) spectroscopy has long been established as an analytical technique for the measurement of vibrational modes of molecular systems. More recently, FTIR has been used for the analysis of biofluids with the aim of becoming a tool to aid diagnosis. For the clinician, this represents a convenient, fast, non-subjective option for the study of biofluids and the diagnosis of disease states. The patient also benefits from this method, as the procedure for the collection of serum is much less invasive and stressful than traditional biopsy. This is especially true of patients in whom brain cancer is suspected. A brain biopsy is very unpleasant for the patient, potentially dangerous and can occasionally be inconclusive. We therefore present a method for the diagnosis of brain cancer from serum samples using FTIR and machine learning techniques. The study involved 433 patients, from each of whom 9 spectra were collected in the range 600-4000 cm(-1). To begin the development of the novel method, various pre-processing steps were investigated and ranked in terms of final accuracy of the diagnosis. Random forest machine learning was utilised as a classifier to separate patients into cancer or non-cancer categories based upon the intensities of wavenumbers present in their spectra. Generalised 2D correlational analysis was then employed to further augment the machine learning, and also to establish spectral features important for the distinction between cancer and non-cancer serum samples. Using these methods, sensitivities of up to 92.8% and specificities of up to 91.5% were possible. Furthermore, ratiometrics were also investigated in order to establish any correlations present in the dataset. We show a rapid, computationally light, accurate, statistically robust methodology for the identification of spectral features present in differing disease states. With current advances in IR technology, such as the development of rapid discrete

  10. Monitoring grass nutrients and biomass as indicators of rangeland quality and quantity using random forest modelling and WorldView-2 data

    NASA Astrophysics Data System (ADS)

    Ramoelo, Abel; Cho, M. A.; Mathieu, R.; Madonsela, S.; van de Kerchove, R.; Kaszta, Z.; Wolff, E.

    2015-12-01

    Land use and climate change could have huge impacts on food security and the health of various ecosystems. Leaf nitrogen (N) and above-ground biomass are some of the key factors limiting agricultural production and ecosystem functioning. Leaf N and biomass can be used as indicators of rangeland quality and quantity. Conventional methods for assessing these vegetation parameters at the landscape scale are time consuming and tedious. Remote sensing provides a bird's-eye view of the landscape, which creates an opportunity to assess these vegetation parameters over wider rangeland areas. Estimation of leaf N has been successful during peak productivity or high biomass, and few studies have estimated leaf N in the dry season. The estimation of above-ground biomass has been hindered by signal saturation problems when using conventional vegetation indices. The objective of this study is to monitor leaf N and above-ground biomass as indicators of rangeland quality and quantity using WorldView-2 satellite images and the random forest technique in the north-eastern part of South Africa. A series of field campaigns to collect samples for leaf N and biomass was undertaken in March 2013, April or May 2012 (end of wet season), and July 2012 (dry season). Several conventional and red edge based vegetation indices were computed. Overall results indicate that random forest and vegetation indices explained over 89% of leaf N concentrations for grass and trees, and less than 89% for all the years of assessment. The red edge based vegetation indices were among the important variables for predicting leaf N. For the biomass, the random forest model explained over 84% of biomass variation in all years, and visible bands as well as red edge based vegetation indices were found to be important. The study demonstrated that leaf N can be monitored using high spatial resolution imagery with red edge band capability, which is important for rangeland assessment and monitoring.

  11. Estimating Digital Terrain Model in forest areas from TanDEM-X and Stereo-photogrammetric technique by means of Random Volume over Ground model

    NASA Astrophysics Data System (ADS)

    Lee, S. K.; Fatoyinbo, T. E.; Lagomasino, D.; Osmanoglu, B.; Feliciano, E. A.

    2015-12-01

    The Digital Terrain Model (DTM) in forest areas is invaluable information for various environmental, hydrological and ecological studies, for example, watershed delineation, vegetation canopy height estimation, water dynamics modeling, and forest biomass and carbon estimation. There are few solutions for extracting bare-earth Digital Elevation Model (DEM) information. Airborne lidar systems are widely and successfully used for estimating bare-earth DEMs with centimeter-order accuracy and high spatial resolution. However, the high cost of operation and small image coverage prevent the use of airborne lidar sensors at large or global scales. Although ICESat/GLAS (Ice, Cloud, and Land Elevation Satellite/Geoscience Laser Altimeter System) lidar data sets have been available for global DTM estimation at relatively low cost, the large footprint size of 70 m and the interval of 172 m are insufficient for various applications. In this study we propose to extract a higher-resolution bare-earth DEM over vegetated areas from the combination of interferometric complex coherence from single-pass TanDEM-X (TDX) data at HH polarization and a Digital Surface Model (DSM) derived from high-resolution WorldView (WV) images by means of the random volume over ground (RVoG) model. The RVoG model is a widely and successfully used model for polarimetric SAR interferometry (Pol-InSAR) forest canopy height inversion. The bare-earth DEM is obtained by complex volume decorrelation in the RVoG model with the DSM estimated by the stereo-photogrammetric technique. Forest canopy height can be estimated by subtracting the estimated bare-earth model from the DSM. Finally, the DTM from an airborne lidar system was used to validate the bare-earth DEM and forest canopy height estimates.

  12. Decision Tree, Bagging and Random Forest methods detect TEC seismo-ionospheric anomalies around the time of the Chile, (Mw = 8.8) earthquake of 27 February 2010

    NASA Astrophysics Data System (ADS)

    Akhoondzadeh, Mehdi

    2016-06-01

    In this paper, ensemble methods including Decision Tree, Bagging and Random Forest are proposed for the first time in the field of earthquake precursors to detect GPS-TEC (Total Electron Content) seismo-ionospheric anomalies around the time and location of the Chile earthquake of 27 February 2010. All of the implemented ensemble methods detected a striking anomaly in the time series of TEC data, 1 day after the earthquake at 14:00 UTC. The results indicate that, owing to their performance, speed and simplicity, the proposed methods are quite promising and deserve serious attention as new predictive tools for the detection of seismo-ionospheric anomalies.
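
    A simplified Python stand-in for this kind of detector: train an ensemble regressor to predict each TEC value from the preceding samples, then flag points whose residual exceeds a 2-sigma bound. The series, lag length and threshold are illustrative assumptions, not the paper's configuration.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        t = np.arange(200)
        tec = 20 + 5 * np.sin(2 * np.pi * t / 27) + rng.normal(0, 0.5, 200)
        tec[150] += 6.0                    # implanted anomaly for illustration

        lag = 5                            # predict each value from the 5 before
        X = np.stack([tec[i - lag:i] for i in range(lag, len(tec))])
        y = tec[lag:]
        rf = RandomForestRegressor(n_estimators=300, random_state=0)
        rf.fit(X[:120], y[:120])           # train on the quiet early window

        resid = y[120:] - rf.predict(X[120:])
        flags = np.abs(resid - resid.mean()) > 2 * resid.std()   # 2-sigma rule
        print("anomalous time indices:", np.where(flags)[0] + 120 + lag)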

  13. A random forest approach for predicting the presence of Echinococcus multilocularis intermediate host Ochotona spp. presence in relation to landscape characteristics in western China

    PubMed Central

    Marston, Christopher G.; Danson, F. Mark; Armitage, Richard P.; Giraudoux, Patrick; Pleydell, David R.J.; Wang, Qian; Qui, Jiamin; Craig, Philip S.

    2014-01-01

    Understanding distribution patterns of hosts implicated in the transmission of zoonotic disease remains a key goal of parasitology. Here, random forests are employed to model spatial patterns of the presence of the plateau pika (Ochotona spp.) small mammal intermediate host for the parasitic tapeworm Echinococcus multilocularis which is responsible for a significant burden of human zoonoses in western China. Landsat ETM+ satellite imagery and digital elevation model data were utilized to generate quantified measures of environmental characteristics across a study area in Sichuan Province, China. Land cover maps were generated identifying the distribution of specific land cover types, with landscape metrics employed to describe the spatial organisation of land cover patches. Random forests were used to model spatial patterns of Ochotona spp. presence, enabling the relative importance of the environmental characteristics in relation to Ochotona spp. presence to be ranked. An index of habitat aggregation was identified as the most important variable in influencing Ochotona spp. presence, with area of degraded grassland the most important land cover class variable. 71% of the variance in Ochotona spp. presence was explained, with a 90.98% accuracy rate as determined by ‘out-of-bag’ error assessment. Identification of the environmental characteristics influencing Ochotona spp. presence enables us to better understand distribution patterns of hosts implicated in the transmission of Em. The predictive mapping of this Em host enables the identification of human populations at increased risk of infection, enabling preventative strategies to be adopted. PMID:25386042

  14. A random forest approach for predicting the presence of Echinococcus multilocularis intermediate host Ochotona spp. presence in relation to landscape characteristics in western China.

    PubMed

    Marston, Christopher G; Danson, F Mark; Armitage, Richard P; Giraudoux, Patrick; Pleydell, David R J; Wang, Qian; Qui, Jiamin; Craig, Philip S

    2014-12-01

    Understanding distribution patterns of hosts implicated in the transmission of zoonotic disease remains a key goal of parasitology. Here, random forests are employed to model spatial patterns of the presence of the plateau pika (Ochotona spp.) small mammal intermediate host for the parasitic tapeworm Echinococcus multilocularis which is responsible for a significant burden of human zoonoses in western China. Landsat ETM+ satellite imagery and digital elevation model data were utilized to generate quantified measures of environmental characteristics across a study area in Sichuan Province, China. Land cover maps were generated identifying the distribution of specific land cover types, with landscape metrics employed to describe the spatial organisation of land cover patches. Random forests were used to model spatial patterns of Ochotona spp. presence, enabling the relative importance of the environmental characteristics in relation to Ochotona spp. presence to be ranked. An index of habitat aggregation was identified as the most important variable in influencing Ochotona spp. presence, with area of degraded grassland the most important land cover class variable. 71% of the variance in Ochotona spp. presence was explained, with a 90.98% accuracy rate as determined by 'out-of-bag' error assessment. Identification of the environmental characteristics influencing Ochotona spp. presence enables us to better understand distribution patterns of hosts implicated in the transmission of Em. The predictive mapping of this Em host enables the identification of human populations at increased risk of infection, enabling preventative strategies to be adopted. PMID:25386042

  15. The potential of random forest and neural networks for biomass and recombinant protein modeling in Escherichia coli fed-batch fermentations.

    PubMed

    Melcher, Michael; Scharl, Theresa; Spangl, Bernhard; Luchner, Markus; Cserjan, Monika; Bayer, Karl; Leisch, Friedrich; Striedner, Gerald

    2015-09-01

    Product quality assurance strategies in the production of biopharmaceuticals currently undergo a transformation from empirical "quality by testing" to rational, knowledge-based "quality by design" approaches. The major challenges in this context are the fragmentary understanding of bioprocesses and the severely limited real-time access to process variables related to product quality and quantity. Data-driven modeling of process variables in combination with model predictive process control concepts represents a potential solution to these problems. The selection of statistical techniques best qualified for bioprocess data analysis and modeling is a key criterion. In this work a series of recombinant Escherichia coli fed-batch production processes with varying cultivation conditions, employing a comprehensive on- and offline process monitoring platform, was conducted. The applicability of two machine learning methods, random forest and neural networks, for the prediction of cell dry mass and recombinant protein based on online available process parameters and two-dimensional multi-wavelength fluorescence spectroscopy is investigated. Models solely based on routinely measured process variables give a satisfying prediction accuracy of about ± 4% for the cell dry mass, while additional spectroscopic information allows for an estimation of the protein concentration within ± 12%. The results clearly argue for a combined approach: neural networks as modeling technique and random forest as variable selection tool. PMID:26121295
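
    The combined approach the authors argue for can be sketched as follows in Python: rank variables with a Random Forest, then train a neural network on the top-ranked subset. The data, subset size and network shape are placeholder assumptions.

        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPRegressor
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        # synthetic stand-in for online process variables and cell dry mass
        X, y = make_regression(n_samples=400, n_features=50, n_informative=8,
                               noise=5.0, random_state=0)
        Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

        rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(Xtr, ytr)
        top = np.argsort(rf.feature_importances_)[::-1][:10]   # RF as selector

        nn = make_pipeline(StandardScaler(),
                           MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000,
                                        random_state=0))
        nn.fit(Xtr[:, top], ytr)
        print("NN R^2 on held-out data:", round(nn.score(Xte[:, top], yte), 3))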

  16. Analysis and Recognition of Traditional Chinese Medicine Pulse Based on the Hilbert-Huang Transform and Random Forest in Patients with Coronary Heart Disease

    PubMed Central

    Guo, Rui; Wang, Yiqin; Yan, Hanxia; Yan, Jianjun; Yuan, Fengyin; Xu, Zhaoxia; Liu, Guoping; Xu, Wenjie

    2015-01-01

    Objective. This research provides objective and quantitative parameters of the traditional Chinese medicine (TCM) pulse conditions for distinguishing between patients with the coronary heart disease (CHD) and normal people by using the proposed classification approach based on Hilbert-Huang transform (HHT) and random forest. Methods. The energy and the sample entropy features were extracted by applying the HHT to TCM pulse by treating these pulse signals as time series. By using the random forest classifier, the extracted two types of features and their combination were, respectively, used as input data to establish classification model. Results. Statistical results showed that there were significant differences in the pulse energy and sample entropy between the CHD group and the normal group. Moreover, the energy features, sample entropy features, and their combination were inputted as pulse feature vectors; the corresponding average recognition rates were 84%, 76.35%, and 90.21%, respectively. Conclusion. The proposed approach could be appropriately used to analyze pulses of patients with CHD, which can lay a foundation for research on objective and quantitative criteria on disease diagnosis or Zheng differentiation. PMID:26180536

  17. Classification of savanna tree species, in the Greater Kruger National Park region, by integrating hyperspectral and LiDAR data in a Random Forest data mining environment

    NASA Astrophysics Data System (ADS)

    Naidoo, L.; Cho, M. A.; Mathieu, R.; Asner, G.

    2012-04-01

    The accurate classification and mapping of individual trees at species level in the savanna ecosystem can provide numerous benefits for the managerial authorities. Such benefits include the mapping of economically useful tree species, which are a key source of food production and fuel wood for the local communities, and of problematic alien invasive and bush encroaching species, which can threaten the integrity of the environment and livelihoods of the local communities. Species level mapping is particularly challenging in African savannas which are complex, heterogeneous, and open environments with high intra-species spectral variability due to differences in geology, topography, rainfall, herbivory and human impacts within relatively short distances. Savanna vegetation is also highly irregular in canopy and crown shape, height and other structural dimensions, with a combination of open grassland patches and dense woody thicket, a stark contrast to the more homogeneous forest vegetation. This study classified eight common savanna tree species in the Greater Kruger National Park region, South Africa, using a combination of hyperspectral and Light Detection and Ranging (LiDAR)-derived structural parameters, in the form of seven predictor datasets, in an automated Random Forest modelling approach. The most important predictors, which were found to play an important role in the different classification models and contributed to the success of the hybrid dataset model when combined, were species tree height; NDVI; the chlorophyll b wavelength (466 nm) and a selection of raw, continuum removed and Spectral Angle Mapper (SAM) bands. It was also concluded that the hybrid predictor dataset Random Forest model yielded the highest classification accuracy and prediction success for the eight savanna tree species with an overall classification accuracy of 87.68% and KHAT value of 0.843.

  18. Development and comparison of algorithms for generating a scan sequence for a random access scanner. [ZAP (and flow charts for ZIP and SCAN), in FORTRAN for DEC-10

    SciTech Connect

    Eason, R. O.

    1980-09-01

    Many data acquisition systems incorporate high-speed scanners to convert analog signals into digital format for further processing. Some systems multiplex many channels into a single scanner. A random access scanner whose scan sequence is specified by a table in random access memory will permit different scan rates on different channels. Generation of this scan table can be a tedious manual task when there are many channels (e.g. 50), when there are more than a few scan rates (e.g. 5), and/or when the ratio of the highest scan rate to the lowest scan rate becomes large (e.g. 100:1). An algorithm is developed that generates these scan sequences for the random access scanner, and it is implemented on a digital computer. Application of number theory to the mathematical statement of the problem led to the development of several algorithms which were implemented in FORTRAN. The most efficient of these algorithms operates by partitioning the problem into a set of subproblems. Through recursion it solves each subproblem by partitioning it repeatedly into even smaller parts, continuing until a set of simple problems is created. From this process, a pictorial representation or wheel diagram of the problem can be constructed. From the wheel diagram and a description of the original problem, a scan table can be constructed. In addition, the wheel diagram can be used as a method of storing the scan sequence in a smaller amount of memory. The most efficient partitioning algorithm solved most scan table problems in less than a second of CPU time. Some types of problems, however, required as much as a few minutes of CPU time. 26 figures, 2 tables.
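
    The sketch below is not the report's recursive partitioning algorithm; it is a simple greedy Python illustration of the scan-table problem itself: fill one table cycle so that each channel appears at its required rate, always scanning the channel furthest behind its ideal schedule. The rates are placeholder values.

        def scan_table(rates):
            """Fill one table cycle, picking the channel furthest behind schedule."""
            length = sum(rates)                   # slots per table cycle
            period = [length / r for r in rates]  # ideal spacing per channel
            next_due = [0.0] * len(rates)
            table = []
            for _ in range(length):
                i = min(range(len(rates)), key=lambda c: next_due[c])
                table.append(i)
                next_due[i] += period[i]
            return table

        # channel 0 scanned 4x per cycle, channel 1 twice, channels 2-3 once
        print(scan_table([4, 2, 1, 1]))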

  19. Effect of sample size on multi-parametric prediction of tissue outcome in acute ischemic stroke using a random forest classifier

    NASA Astrophysics Data System (ADS)

    Forkert, Nils Daniel; Fiehler, Jens

    2015-03-01

    The tissue outcome prediction in acute ischemic stroke patients is highly relevant for clinical and research purposes. It has been shown that the combined analysis of diffusion and perfusion MRI datasets using high-level machine learning techniques leads to an improved prediction of final infarction compared to single perfusion parameter thresholding. However, most high-level classifiers require a previous training and, until now, it is ambiguous how many subjects are required for this, which is the focus of this work. 23 MRI datasets of acute stroke patients with known tissue outcome were used in this work. Relative values of diffusion and perfusion parameters as well as the binary tissue outcome were extracted on a voxel-by-voxel level for all patients and used for training of a random forest classifier. The number of patients used for training set definition was iteratively and randomly reduced from using all 22 other patients to only one other patient. Thus, 22 tissue outcome predictions were generated for each patient using the trained random forest classifiers and compared to the known tissue outcome using the Dice coefficient. Overall, a logarithmic relation between the number of patients used for training set definition and tissue outcome prediction accuracy was found. Quantitatively, a mean Dice coefficient of 0.45 was found for the prediction using the training set consisting of the voxel information from only one other patient, which increases to 0.53 if using all other patients (n=22). Based on extrapolation, 50-100 patients appear to be a reasonable tradeoff between tissue outcome prediction accuracy and effort required for data acquisition and preparation.
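
    The experiment's shape is easy to mock up in Python: pool voxels from k training patients, fit a Random Forest, and watch the Dice coefficient grow with k. The voxel features below are synthetic stand-ins for the relative diffusion and perfusion parameters.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)

        def patient():                  # synthetic (voxels, features) and outcome
            Xp = rng.normal(size=(500, 4))
            yp = (Xp[:, 0] + 0.5 * Xp[:, 1] + rng.normal(0, 1, 500) > 0).astype(int)
            return Xp, yp

        def dice(a, b):
            return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

        test_X, test_y = patient()
        for k in (1, 5, 10, 20):
            data = [patient() for _ in range(k)]
            X = np.vstack([d[0] for d in data])
            y = np.hstack([d[1] for d in data])
            pred = RandomForestClassifier(n_estimators=200, random_state=0)\
                   .fit(X, y).predict(test_X)
            print(f"{k:2d} training patients: Dice = {dice(pred, test_y):.2f}")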

  20. Reliable rain rates from optical satellite sensors - a random forests-based approach for the hourly retrieval of rainfall rates from Meteosat SEVIRI

    NASA Astrophysics Data System (ADS)

    Kühnlein, Meike; Appelhans, Tim; Thies, Boris; Nauss, Thomas

    2013-04-01

    Many ecological and biodiversity-oriented projects require area-wide precipitation information, and satellite-based rainfall retrievals are often the only option. Using optical and microphysical cloud property retrievals, area-wide information about the distribution of precipitating clouds can generally be provided from optical sensors aboard geostationary (GEO) weather satellites. However, the retrieval of spatio-temporal high resolution rainfall amounts from such sensors bears large uncertainties. In existing optical retrievals, the rainfall rate is generally retrieved as a function of the cloud-top temperature, which leads to sufficient results for deep-convective systems, but such a concept is inappropriate for any kind of advective/stratiform precipitation formation processes. To overcome this drawback, several authors suggest using optical and microphysical cloud parameters not only for the rain-area delineation but also for the rain rate retrieval. In the present study, a method has been developed to estimate hourly rainfall rates using cloud physical properties retrieved from MSG SEVIRI data. The rainfall rate assignment is realized by using an ensemble classification and regression technique, called random forests. This method is already widely established in other disciplines, but has not yet been utilized extensively by climatologists. The random forests technique is used to assign rainfall rates to already identified rain areas in a two-step approach. First, the rain area is separated into areas of precipitation processes. Next, rainfall rates are assigned to these areas. For the development and validation of the new technique, radar-based precipitation data of the German Weather Service is used. The so-called RADOLAN RW product provides gauge-adjusted precipitation amounts at a temporal resolution of one hour. Germany is chosen as the study area of the new technique. The region can be regarded as sufficiently representative of mid-latitude precipitation

  1. Landslide susceptibility assessment in Lianhua County (China): A comparison between a random forest data mining technique and bivariate and multivariate statistical models

    NASA Astrophysics Data System (ADS)

    Hong, Haoyuan; Pourghasemi, Hamid Reza; Pourtaghi, Zohre Sadat

    2016-04-01

    Landslides are an important natural hazard that causes a great amount of damage around the world every year, especially during the rainy season. The Lianhua area is located in the middle of China's southern mountainous area, west of Jiangxi Province, and is known to be an area prone to landslides. The aim of this study was to evaluate and compare landslide susceptibility maps produced using the random forest (RF) data mining technique with those produced by bivariate (evidential belief function and frequency ratio) and multivariate (logistic regression) statistical models for Lianhua County, China. First, a landslide inventory map was prepared using aerial photograph interpretation, satellite images, and extensive field surveys. In total, 163 landslide events were recognized in the study area, with 114 landslides (70%) used for training and 49 landslides (30%) used for validation. Next, the landslide conditioning factors, including the slope angle, altitude, slope aspect, topographic wetness index (TWI), slope-length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, distance to roads, annual precipitation, land use, normalized difference vegetation index (NDVI), and lithology, were derived from the spatial database. Finally, the landslide susceptibility maps of Lianhua County were generated in ArcGIS 10.1 based on the random forest (RF), evidential belief function (EBF), frequency ratio (FR), and logistic regression (LR) approaches and were validated using a receiver operating characteristic (ROC) curve. The ROC plot assessment results showed that for landslide susceptibility maps produced using the EBF, FR, LR, and RF models, the area under the curve (AUC) values were 0.8122, 0.8134, 0.7751, and 0.7172, respectively. Therefore, we can conclude that all four models have an AUC of more than 0.70 and can be used in landslide susceptibility mapping in the study area; meanwhile, the EBF and FR models had the best performance for Lianhua

  2. Segmentation of the Cerebellar Peduncles Using a Random Forest Classifier and a Multi-object Geometric Deformable Model: Application to Spinocerebellar Ataxia Type 6

    PubMed Central

    Ye, Chuyang; Yang, Zhen; Ying, Sarah H.; Prince, Jerry L.

    2016-01-01

    The cerebellar peduncles, comprising the superior cerebellar peduncles (SCPs), the middle cerebellar peduncle (MCP), and the inferior cerebellar peduncles (ICPs), are white matter tracts that connect the cerebellum to other parts of the central nervous system. Methods for automatic segmentation and quantification of the cerebellar peduncles are needed for objectively and efficiently studying their structure and function. Diffusion tensor imaging (DTI) provides key information to support this goal, but it remains challenging because the tensors change dramatically in the decussation of the SCPs (dSCP), the region where the SCPs cross. This paper presents an automatic method for segmenting the cerebellar peduncles, including the dSCP. The method uses volumetric segmentation concepts based on extracted DTI features. The dSCP and noncrossing portions of the peduncles are modeled as separate objects, and are initially classified using a random forest classifier together with the DTI features. To obtain geometrically correct results, a multi-object geometric deformable model is used to refine the random forest classification. The method was evaluated using a leave-one-out cross-validation on five control subjects and four patients with spinocerebellar ataxia type 6 (SCA6). It was then used to evaluate group differences in the peduncles in a population of 32 controls and 11 SCA6 patients. In the SCA6 group, we have observed significant decreases in the volumes of the dSCP and the ICPs and significant increases in the mean diffusivity in the noncrossing SCPs, the MCP, and the ICPs. These results are consistent with a degeneration of the cerebellar peduncles in SCA6 patients. PMID:25749985

  3. Influence of multi-source and multi-temporal remotely sensed and ancillary data on the accuracy of random forest classification of wetlands in northern Minnesota

    USGS Publications Warehouse

    Corcoran, Jennifer M.; Knight, Joseph F.; Gallant, Alisa L.

    2013-01-01

    Wetland mapping at the landscape scale using remotely sensed data requires both affordable data and an efficient accurate classification method. Random forest classification offers several advantages over traditional land cover classification techniques, including a bootstrapping technique to generate robust estimations of outliers in the training data, as well as the capability of measuring classification confidence. Though the random forest classifier can generate complex decision trees with a multitude of input data and still not run a high risk of overfitting, there is a great need to reduce computational and operational costs by including only key input data sets without sacrificing a significant level of accuracy. Our main questions for this study site in Northern Minnesota were: (1) how do classification accuracy and confidence of mapping wetlands compare using different remote sensing platforms and sets of input data; (2) what are the key input variables for accurate differentiation of upland, water, and wetlands, including wetland type; and (3) which datasets and seasonal imagery yield the best accuracy for wetland classification. Our results show the key input variables include terrain (elevation and curvature) and soils descriptors (hydric), along with an assortment of remotely sensed data collected in the spring (satellite visible, near infrared, and thermal bands; satellite normalized vegetation index and Tasseled Cap greenness and wetness; and horizontal-horizontal (HH) and horizontal-vertical (HV) polarization using L-band satellite radar). We undertook this exploratory analysis to inform decisions by natural resource managers charged with monitoring wetland ecosystems and to aid in designing a system for consistent operational mapping of wetlands across landscapes similar to those found in Northern Minnesota.

  4. Free variable selection QSPR study to predict 19F chemical shifts of some fluorinated organic compounds using Random Forest and RBF-PLS methods

    NASA Astrophysics Data System (ADS)

    Goudarzi, Nasser

    2016-04-01

    In this work, two new and powerful chemometrics methods are applied for the modeling and prediction of the 19F chemical shift values of some fluorinated organic compounds. The radial basis function-partial least square (RBF-PLS) and random forest (RF) methods are employed to construct models to predict the 19F chemical shifts. In this study, no separate variable selection method was used, since the RF method can serve as both a variable selection and a modeling technique. The effects of important parameters on RF prediction power, such as the number of trees (nt) and the number of randomly selected variables used to split each node (m), were investigated. The root-mean-square errors of prediction (RMSEP) for the training set and the prediction set for the RBF-PLS and RF models were 44.70, 23.86, 29.77, and 23.69, respectively. Also, the correlation coefficients of the prediction set for the RBF-PLS and RF models were 0.8684 and 0.9313, respectively. The results obtained reveal that the RF model can be used as a powerful chemometrics tool for quantitative structure-property relationship (QSPR) studies.
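
    A hedged Python sketch of the parameter study: sweep the number of trees (nt) and the number of variables tried at each split (m) on a synthetic QSPR-like regression task and report RMSEP on a held-out set. The descriptor data and parameter grids are assumptions, not the paper's dataset.

        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.metrics import mean_squared_error
        from sklearn.model_selection import train_test_split

        X, y = make_regression(n_samples=200, n_features=30, n_informative=10,
                               noise=10.0, random_state=0)  # stand-in descriptors
        Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

        for nt in (100, 500):
            for m in (5, 10, 30):
                rf = RandomForestRegressor(n_estimators=nt, max_features=m,
                                           random_state=0).fit(Xtr, ytr)
                rmsep = mean_squared_error(yte, rf.predict(Xte)) ** 0.5
                print(f"nt={nt:3d} m={m:2d}: RMSEP={rmsep:.1f}")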

  5. Hierarchical Bayesian spatial models for predicting multiple forest variables using waveform LiDAR, hyperspectral imagery, and large inventory datasets

    USGS Publications Warehouse

    Finley, Andrew O.; Banerjee, Sudipto; Cook, Bruce D.; Bradford, John B.

    2013-01-01

    In this paper we detail a multivariate spatial regression model that couples LiDAR, hyperspectral and forest inventory data to predict forest outcome variables at a high spatial resolution. The proposed model is used to analyze forest inventory data collected on the US Forest Service Penobscot Experimental Forest (PEF), ME, USA. In addition to helping meet the regression model's assumptions, results from the PEF analysis suggest that the addition of multivariate spatial random effects improves model fit and predictive ability, compared with two commonly applied modeling approaches. This improvement results from explicitly modeling the covariation among forest outcome variables and spatial dependence among observations through the random effects. Direct application of such multivariate models to even moderately large datasets is often computationally infeasible because of cubic order matrix algorithms involved in estimation. We apply a spatial dimension reduction technique to help overcome this computational hurdle without sacrificing richness in modeling.

  6. Random forest estimation of genomic breeding values for disease susceptibility over different disease incidences and genomic architectures in simulated cow calibration groups.

    PubMed

    Naderi, S; Yin, T; König, S

    2016-09-01

    A simulation study was conducted to investigate the performance of random forest (RF) and genomic BLUP (GBLUP) for genomic predictions of binary disease traits based on cow calibration groups. Training and testing sets were modified in different scenarios according to disease incidence, the quantitative-genetic background of the trait (h(2)=0.30 and h(2)=0.10), and the genomic architecture [725 quantitative trait loci (QTL) and 290 QTL, populations with high and low levels of linkage disequilibrium (LD)]. For all scenarios, 10,005 SNP (depicting a low-density 10K SNP chip) and 50,025 SNP (depicting a 50K SNP chip) were evenly spaced along 29 chromosomes. Training and testing sets included 20,000 cows (4,000 sick, 16,000 healthy, disease incidence 20%) from the last 2 generations. Initially, 4,000 sick cows were assigned to the testing set, and the remaining 16,000 healthy cows represented the training set. In the ongoing allocation schemes, the number of sick cows in the training set increased stepwise by moving 10% of the sick animals from the testing set to the training set, and vice versa. The size of the training and testing sets was kept constant. Evaluation criteria for both GBLUP and RF were the correlations between genomic breeding values and true breeding values (prediction accuracy), and the area under the receiver operating characteristic curve (AUROC). Prediction accuracy and AUROC increased for both methods and all scenarios as increasing percentages of sick cows were allocated to the training set. Highest prediction accuracies were observed for disease incidences in training sets that reflected the population disease incidence of 0.20. For this allocation scheme, the largest prediction accuracies of 0.53 for RF and of 0.51 for GBLUP, and the largest AUROC of 0.66 for RF and of 0.64 for GBLUP, were achieved using 50,025 SNP, a heritability of 0.30, and 725 QTL. Heritability decreases from 0.30 to 0.10 and QTL reduction from 725 to 290 were associated
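
    Both evaluation criteria are one-liners once a model produces scores, as in this Python sketch on synthetic SNP genotypes; the genomic model is simplified to a plain RF classifier, and the genotype coding, QTL count and effect sizes are invented.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import roc_auc_score

        rng = np.random.default_rng(0)
        snp = rng.integers(0, 3, size=(2000, 1000))       # 0/1/2 genotype codes
        tbv = snp[:, :50] @ rng.normal(0, 0.3, 50)        # true breeding values
        sick = rng.random(2000) < 1 / (1 + np.exp(-(tbv - tbv.mean())))

        rf = RandomForestClassifier(n_estimators=300, random_state=0)
        rf.fit(snp[:1500], sick[:1500])                   # cow calibration group
        score = rf.predict_proba(snp[1500:])[:, 1]        # genomic prediction

        print("AUROC:", round(roc_auc_score(sick[1500:], score), 3))
        print("accuracy (corr):", round(np.corrcoef(score, tbv[1500:])[0, 1], 3))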

  7. Gender classification in low-resolution surveillance video: in-depth comparison of random forests and SVMs

    NASA Astrophysics Data System (ADS)

    Geelen, Christopher D.; Wijnhoven, Rob G. J.; Dubbelman, Gijs; de With, Peter H. N.

    2015-03-01

    This research considers gender classification in surveillance environments, typically involving low-resolution images and a large amount of viewpoint variations and occlusions. Gender classification is inherently difficult due to the large intra-class variation and interclass correlation. We have developed a gender classification system, which is successfully evaluated on two novel datasets that realistically reflect the above conditions, typical of surveillance. The system reaches a mean accuracy of up to 90% and approaches our human baseline of 92.6%, demonstrating a high-quality gender classification system. We also present an in-depth discussion of the fundamental differences between SVM and RF classifiers. We conclude that balancing the degree of randomization in any classifier is required for the highest classification accuracy. For our problem, an RF-SVM hybrid classifier exploiting the combination of HSV and LBP features results in the highest classification accuracy of 89.9 ± 0.2%, while classification computation time is negligible compared to the detection time of pedestrians.

  8. Integration of Random Forest with population-based outlier analyses provides insight on the genomic basis and evolution of run timing in Chinook salmon (Oncorhynchus tshawytscha).

    PubMed

    Brieuc, Marine S O; Ono, Kotaro; Drinan, Daniel P; Naish, Kerry A

    2015-06-01

    Anadromous Chinook salmon populations vary in the period of river entry at the initiation of adult freshwater migration, facilitating optimal arrival at natal spawning. Run timing is a polygenic trait that shows evidence of rapid parallel evolution in some lineages, signifying a key role for this phenotype in the ecological divergence between populations. Studying the genetic basis of local adaptation in quantitative traits is often impractical in wild populations. Therefore, we used a novel approach, Random Forest, to detect markers linked to run timing across 14 populations from contrasting environments in the Columbia River and Puget Sound, USA. The approach permits detection of loci of small effect on the phenotype. Divergence between populations at these loci was then examined using both principal component analysis and FST outlier analyses, to determine whether shared genetic changes resulted in similar phenotypes across different lineages. Sequencing of 9107 RAD markers in 414 individuals identified 33 predictor loci explaining 79.2% of trait variance. Discriminant analysis of principal components of the predictors revealed both shared and unique evolutionary pathways in the trait across different lineages, characterized by minor allele frequency changes. However, genome mapping of predictor loci also identified positional overlap with two genomic outlier regions, consistent with selection on loci of large effect. Therefore, the results suggest selective sweeps on a few loci and minor changes in the loci detected by this study. Use of a polygenic framework has provided initial insight into how divergence in a trait has occurred in the wild. PMID:25913096

  9. Identification of copper phthalocyanine blue polymorphs in unaged and aged paint systems by means of micro-Raman spectroscopy and Random Forest.

    PubMed

    Anghelone, Marta; Jembrih-Simbürger, Dubravka; Schreiner, Manfred

    2015-10-01

    Copper phthalocyanine (CuPc) blues (PB15) are largely used in art and industry as pigments. In these fields mainly three different polymorphic modifications of PB15 are employed: alpha, beta and epsilon. Differentiating among these CuPc forms can give important information for developing conservation strategies and can help in relative dating, since each form was introduced to the market in a different time period. This study focuses on the classification of Raman spectra measured using a 532 nm excitation wavelength on: (i) dry pigment powders, (ii) unaged mock-ups of self-made paints, (iii) unaged commercial paints, and (iv) paints subjected to accelerated UV ageing. The ratios among integrated Raman bands are taken into consideration as features to perform Random Forest (RF) classification. Feature selection based on the Gini Contrast score was carried out on the measured dataset to determine the Raman band ratios with higher predictive power. These were used as polymorphic markers, in order to establish an easy and accessible method for the identification. Three different ratios and the presence of a characteristic vibrational band allowed the identification of the crystal modification in pigment powders as well as in unaged and aged paint films. PMID:25974675

  10. Fitting Nonlinear Ordinary Differential Equation Models with Random Effects and Unknown Initial Conditions Using the Stochastic Approximation Expectation-Maximization (SAEM) Algorithm.

    PubMed

    Chow, Sy-Miin; Lu, Zhaohua; Sherwood, Andrew; Zhu, Hongtu

    2016-03-01

    The past decade has seen an increased prevalence of irregularly spaced longitudinal data in the social sciences. Clearly lacking, however, are modeling tools that allow researchers to fit dynamic models to irregularly spaced data, particularly data that show nonlinearity and heterogeneity in dynamical structures. We consider the issue of fitting multivariate nonlinear differential equation models with random effects and unknown initial conditions to irregularly spaced data. A stochastic approximation expectation-maximization algorithm is proposed and its performance is evaluated using a benchmark nonlinear dynamical systems model, namely, the Van der Pol oscillator equations. The empirical utility of the proposed technique is illustrated using a set of 24-h ambulatory cardiovascular data from 168 men and women. Pertinent methodological challenges and unresolved issues are discussed. PMID:25416456
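
    For readers unfamiliar with the benchmark, the Van der Pol oscillator is easy to simulate; a minimal SciPy sketch follows, with a hypothetical damping parameter mu and a fixed initial condition, whereas the paper treats random effects and initial conditions as unknowns to be estimated.

        import numpy as np
        from scipy.integrate import solve_ivp

        def van_der_pol(t, state, mu):
            x, y = state
            return [y, mu * (1.0 - x ** 2) * y - x]   # the classic Van der Pol system

        # mu = 3.0 and the initial condition [1, 0] are illustrative choices
        sol = solve_ivp(van_der_pol, (0.0, 50.0), [1.0, 0.0], args=(3.0,),
                        t_eval=np.linspace(0.0, 50.0, 2000))
        x_of_t = sol.y[0]                             # oscillator displacement over time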

  11. Random-iteration algorithm-based optical parallel architecture for fractal-image decoding by use of iterated-function system codes.

    PubMed

    Chang, H T; Kuo, C J

    1998-03-10

    An optical parallel architecture for the random-iteration algorithm to decode a fractal image by use of iterated-function system (IFS) codes is proposed. The code value is first converted into transmittance in film or a spatial light modulator in the optical part of the system. With an optical-to-electrical converter, an electrical-to-optical converter, and some electronic circuits for addition and delay, we can perform the contractive affine transformation (CAT) denoted in the IFS codes. In the proposed decoding architecture all CATs generate points (image pixels) in parallel, and these points are then joined for display purposes. The decoding speed is therefore greatly improved compared with existing serial-decoding architectures. In addition, an error and stability analysis that considers nonperfect elements is presented for the proposed optical system. Finally, simulation results are given to validate the proposed architecture. PMID:18268718
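
    In software form, the random-iteration (chaos game) decoding that the optical architecture parallelizes takes only a few lines; the two-map IFS below is hypothetical, and the sketch is serial, which is exactly the bottleneck the optical system addresses.

        import random

        # Each IFS code is a contractive affine transform (a, b, c, d, e, f) with
        # weight p: (x, y) -> (a*x + b*y + e, c*x + d*y + f)
        ifs = [((0.5, 0.0, 0.0, 0.5, 0.0, 0.0), 0.5),    # hypothetical two-map IFS
               ((0.5, 0.0, 0.0, 0.5, 0.5, 0.5), 0.5)]

        def decode(ifs, n_points=100_000):
            maps, weights = zip(*ifs)
            x, y = 0.0, 0.0
            points = []
            for i in range(n_points):
                a, b, c, d, e, f = random.choices(maps, weights)[0]  # pick a CAT
                x, y = a * x + b * y + e, c * x + d * y + f
                if i > 20:               # discard the transient before the attractor
                    points.append((x, y))
            return points

        pixels = decode(ifs)             # the joined points approximate the image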

  12. Simulated annealing and metaheuristic for randomized priority search algorithms for the aerial refuelling parallel machine scheduling problem with due date-to-deadline windows and release times

    NASA Astrophysics Data System (ADS)

    Kaplan, Sezgin; Rabadi, Ghaith

    2013-01-01

    This article addresses the aerial refuelling scheduling problem (ARSP), where a set of fighter jets (jobs) with certain ready times must be refuelled from tankers (machines) by their due dates; otherwise, they reach a low fuel level (deadline), incurring a high cost. ARSP is an identical parallel machine scheduling problem with release times and due date-to-deadline windows to minimize the total weighted tardiness. Simulated annealing (SA) and the metaheuristic for randomized priority search (Meta-RaPS), with the newly introduced composite dispatching rule, apparent piecewise tardiness cost with ready times (APTCR), are applied to the problem. Computational experiments compared the algorithms' solutions to optimal solutions for small problems and to each other for larger problems. To obtain optimal solutions, a mixed integer program with a piecewise weighted tardiness objective function was solved for up to 12 jobs. The results show that Meta-RaPS performs better in terms of average relative error, but SA is more efficient.
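
    A stripped-down simulated annealing sketch for the weighted-tardiness objective is given below; the job data, cooling schedule and swap neighbourhood are all hypothetical, and neither Meta-RaPS nor the APTCR dispatching rule is reproduced here.

        import math
        import random

        # Hypothetical jobs: (ready_time, processing_time, due_date, weight)
        jobs = [(0, 3, 5, 2), (1, 2, 4, 1), (2, 4, 9, 3), (0, 5, 8, 2), (3, 2, 6, 1)]
        n_machines = 2

        def weighted_tardiness(order):
            free = [0] * n_machines              # next-free time per machine (tanker)
            total = 0
            for j in order:
                r, p, d, w = jobs[j]
                m = min(range(n_machines), key=lambda i: free[i])
                start = max(free[m], r)          # respect the job's release time
                free[m] = start + p
                total += w * max(0, free[m] - d) # weighted tardiness past the due date
            return total

        order = list(range(len(jobs)))
        best, cost, T = order[:], weighted_tardiness(order), 10.0
        while T > 0.01:                          # geometric cooling schedule
            i, k = random.sample(range(len(order)), 2)
            cand = order[:]
            cand[i], cand[k] = cand[k], cand[i]  # swap two jobs in the sequence
            c = weighted_tardiness(cand)
            if c < cost or random.random() < math.exp((cost - c) / T):
                order, cost = cand, c
                if cost < weighted_tardiness(best):
                    best = order[:]
            T *= 0.995
        print("best order:", best, "weighted tardiness:", weighted_tardiness(best))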

  13. Estimating spatial variation in Alberta forest biomass from a combination of forest inventory and remote sensing data

    NASA Astrophysics Data System (ADS)

    Zhang, J.; Huang, S.; Hogg, E. H.; Lieffers, V.; Qin, Y.; He, F.

    2013-12-01

    Uncertainties in the estimation of tree biomass carbon storage across large areas pose challenges for the study of forest carbon cycling at regional and global scales. In this study, we attempted to estimate the present biomass carbon storage in Alberta, Canada, by taking advantage of a spatially explicit dataset derived from a combination of forest inventory data from 1968 plots and spaceborne light detection and ranging (LiDAR) canopy height data. Ten climatic variables, together with elevation, were used for model development and assessment. Four approaches, including spatial interpolation, non-spatial and spatial regression models, and decision-tree-based modelling with the random forests algorithm (a machine-learning technique), were compared to find the "best" estimates. We found that the random forests approach provided the best accuracy for biomass estimates. Non-spatial and spatial regression models gave estimates similar to random forests, while spatial interpolation greatly overestimated the biomass storage. Using random forests, the total biomass stock in Alberta forests was estimated to be 3.11 × 109 Mg, with an average biomass density of 77.59 Mg ha-1. At the species level, three major tree species, lodgepole pine, trembling aspen and white spruce, stocked about 1.91 × 109 Mg biomass, accounting for 61% of total estimated biomass. Spatial distribution of biomass varied with natural regions, land cover types, and species. Furthermore, the relative importance of predictor variables in determining biomass distribution varied with species. This study showed that the combination of ground-based inventory data, spaceborne LiDAR data, land cover classification, and climatic and environmental variables was an efficient way to estimate the quantity, distribution and variation of forest biomass carbon stocks across large regions.
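
    A schematic of this kind of random forests biomass regression, including the out-of-bag accuracy check and the predictor importances discussed above, is shown below; the plot table is synthetic, and only the plot count is taken from the abstract.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(42)
        # Hypothetical plot-level predictors: lidar canopy height, elevation,
        # and ten climatic variables, for the 1968 inventory plots
        X = rng.normal(size=(1968, 12))
        biomass = 50 + 20 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 10, 1968)

        rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=42)
        rf.fit(X, biomass)
        print("OOB R^2:", rf.oob_score_)          # out-of-bag accuracy estimate
        print("relative importances:", rf.feature_importances_)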

  14. Estimating spatial variation in Alberta forest biomass from a combination of forest inventory and remote sensing data

    NASA Astrophysics Data System (ADS)

    Zhang, J.; Huang, S.; Hogg, E. H.; Lieffers, V.; Qin, Y.; He, F.

    2014-05-01

    Uncertainties in the estimation of tree biomass carbon storage across large areas pose challenges for the study of forest carbon cycling at regional and global scales. In this study, we attempted to estimate the present aboveground biomass (AGB) in Alberta, Canada, by taking advantage of a spatially explicit data set derived from a combination of forest inventory data from 1968 plots and spaceborne light detection and ranging (lidar) canopy height data. Ten climatic variables, together with elevation, were used for model development and assessment. Four approaches, including spatial interpolation, non-spatial and spatial regression models, and decision-tree-based modeling with the random forests algorithm (a machine-learning technique), were compared to find the "best" estimates. We found that the random forests approach provided the best accuracy for biomass estimates. Non-spatial and spatial regression models gave estimates similar to random forests, while spatial interpolation greatly overestimated the biomass storage. Using random forests, the total AGB stock in Alberta forests was estimated to be 2.26 × 109 Mg (megagram), with an average AGB density of 56.30 ± 35.94 Mg ha-1. At the species level, three major tree species, lodgepole pine, trembling aspen and white spruce, stocked about 1.39 × 109 Mg biomass, accounting for nearly 62% of total estimated AGB. Spatial distribution of biomass varied with natural regions, land cover types, and species. Furthermore, the relative importance of predictor variables in determining biomass distribution varied with species. This study showed that the combination of ground-based inventory data, spaceborne lidar data, land cover classification, and climatic and environmental variables was an efficient way to estimate the quantity, distribution and variation of forest biomass carbon stocks across large regions.

  15. Mapping forests in Monsoon Asia with ALOS PALSAR and MODIS imagery in 2010

    NASA Astrophysics Data System (ADS)

    Qin, Y.

    2015-12-01

    Spatial distribution and temporal dynamics of forests are important to climate, the carbon cycle, and biodiversity. An accurate forest map is required for monsoon Asia, where extensive forest changes have occurred. An algorithm was developed to map the distribution of forests in monsoon Asia in 2010, integrating structure information from the Advanced Land Observation System (ALOS) Phased Array L-band Synthetic Aperture Radar (PALSAR) mosaic dataset with phenology information from the MOD13Q1 NDVI and MOD09A1 land surface reflectance products. The PALSAR-based forest map was generated with a decision tree classification and assessed with randomly selected ground truth samples from high-spatial-resolution images in Google Earth. Spatial and area comparisons were performed between our forest map (OU/Fudan F/NF) and other forest maps generated by the Japan Aerospace Exploration Agency (JAXA F/NF), European Space Agency (ESA F/NF), Boston University (MCD12Q1 F/NF), Food and Agricultural Organization (FAO FRA), and University of Maryland (Landsat forests). We then investigated the reasons for the large uncertainties among these typical forest maps in 2010. This study could provide a way to monitor the dynamics of forests using Synthetic Aperture Radar (SAR) and optical satellite images, and the resultant F/NF datasets can be used to analyze the impacts of forest changes on climate and ecosystems.

  16. Preoperative overnight parenteral nutrition (TPN) improves skeletal muscle protein metabolism indicated by microarray algorithm analyses in a randomized trial.

    PubMed

    Iresjö, Britt-Marie; Engström, Cecilia; Lundholm, Kent

    2016-06-01

    Loss of muscle mass is associated with increased risk of morbidity and mortality in hospitalized patients. Uncertainties remain about the treatment efficiency of short-term artificial nutrition, specifically improvement of protein balance in skeletal muscles. In this study, algorithmic microarray analysis was applied to map cellular changes related to muscle protein metabolism in human skeletal muscle tissue during provision of overnight preoperative total parenteral nutrition (TPN). Twenty-two patients (11/group) scheduled for upper GI surgery due to malignant or benign disease received a continuous peripheral all-in-one TPN infusion (30 kcal/kg/day, 0.16 gN/kg/day) or saline infusion for 12 h prior to operation. Biopsies from the rectus abdominis muscle were taken at the start of operation for isolation of muscle RNA. RNA expression microarray analyses were performed with Agilent Sureprint G3 8 × 60K arrays using one-color labeling. 447 mRNAs were differentially expressed between study and control patients (P < 0.1). mRNAs related to ribosomal biogenesis, mRNA processing, and translation were upregulated during overnight nutrition, particularly anabolic signaling via S6K1 (P < 0.01-0.1). Transcripts of genes associated with lysosomal degradation showed consistently lower expression during TPN, while mRNAs for ubiquitin-mediated degradation of proteins, as well as transcripts related to intracellular signaling pathways (PI3 kinase/MAP kinase), were either increased or decreased. In conclusion, overnight standard TPN infusion at a constant rate altered mRNAs associated with mTOR signaling, increased initiation of protein translation, and suppressed autophagy/lysosomal degradation of proteins. This indicates that overnight preoperative parenteral nutrition is effective in promoting muscle protein metabolism. PMID:27273879

  17. A Comparative Assessment of the Influences of Human Impacts on Soil Cd Concentrations Based on Stepwise Linear Regression, Classification and Regression Tree, and Random Forest Models

    PubMed Central

    Qiu, Lefeng; Wang, Kai; Long, Wenli; Wang, Ke; Hu, Wei; Amable, Gabriel S.

    2016-01-01

    Soil cadmium (Cd) contamination has attracted a great deal of attention because of its detrimental effects on animals and humans. This study aimed to develop and compare the performances of stepwise linear regression (SLR), classification and regression tree (CART) and random forest (RF) models in the prediction and mapping of the spatial distribution of soil Cd and to identify likely sources of Cd accumulation in Fuyang County, eastern China. Soil Cd data from 276 topsoil (0–20 cm) samples were collected and randomly divided into calibration (222 samples) and validation datasets (54 samples). Auxiliary data, including detailed land use information, soil organic matter, soil pH, and topographic data, were incorporated into the models to simulate the soil Cd concentrations and further identify the main factors influencing soil Cd variation. The predictive models for soil Cd concentration exhibited acceptable overall accuracies (72.22% for SLR, 70.37% for CART, and 75.93% for RF). The SLR model exhibited the largest predicted deviation, with a mean error (ME) of 0.074 mg/kg, a mean absolute error (MAE) of 0.160 mg/kg, and a root mean squared error (RMSE) of 0.274 mg/kg, and the RF model produced the results closest to the observed values, with an ME of 0.002 mg/kg, an MAE of 0.132 mg/kg, and an RMSE of 0.198 mg/kg. The RF model also exhibited the greatest R2 value (0.772). The CART model predictions closely followed, with ME, MAE, RMSE, and R2 values of 0.013 mg/kg, 0.154 mg/kg, 0.230 mg/kg and 0.644, respectively. The three prediction maps generally exhibited similar and realistic spatial patterns of soil Cd contamination. The heavily Cd-affected areas were primarily located in the alluvial valley plain of the Fuchun River and its tributaries because of the dramatic industrialization and urbanization processes that have occurred there. The most important variable for explaining high levels of soil Cd accumulation was the presence of metal smelting industries. The

  18. A Comparative Assessment of the Influences of Human Impacts on Soil Cd Concentrations Based on Stepwise Linear Regression, Classification and Regression Tree, and Random Forest Models.

    PubMed

    Qiu, Lefeng; Wang, Kai; Long, Wenli; Wang, Ke; Hu, Wei; Amable, Gabriel S

    2016-01-01

    Soil cadmium (Cd) contamination has attracted a great deal of attention because of its detrimental effects on animals and humans. This study aimed to develop and compare the performances of stepwise linear regression (SLR), classification and regression tree (CART) and random forest (RF) models in the prediction and mapping of the spatial distribution of soil Cd and to identify likely sources of Cd accumulation in Fuyang County, eastern China. Soil Cd data from 276 topsoil (0-20 cm) samples were collected and randomly divided into calibration (222 samples) and validation datasets (54 samples). Auxiliary data, including detailed land use information, soil organic matter, soil pH, and topographic data, were incorporated into the models to simulate the soil Cd concentrations and further identify the main factors influencing soil Cd variation. The predictive models for soil Cd concentration exhibited acceptable overall accuracies (72.22% for SLR, 70.37% for CART, and 75.93% for RF). The SLR model exhibited the largest predicted deviation, with a mean error (ME) of 0.074 mg/kg, a mean absolute error (MAE) of 0.160 mg/kg, and a root mean squared error (RMSE) of 0.274 mg/kg, and the RF model produced the results closest to the observed values, with an ME of 0.002 mg/kg, an MAE of 0.132 mg/kg, and an RMSE of 0.198 mg/kg. The RF model also exhibited the greatest R2 value (0.772). The CART model predictions closely followed, with ME, MAE, RMSE, and R2 values of 0.013 mg/kg, 0.154 mg/kg, 0.230 mg/kg and 0.644, respectively. The three prediction maps generally exhibited similar and realistic spatial patterns of soil Cd contamination. The heavily Cd-affected areas were primarily located in the alluvial valley plain of the Fuchun River and its tributaries because of the dramatic industrialization and urbanization processes that have occurred there. The most important variable for explaining high levels of soil Cd accumulation was the presence of metal smelting industries. The
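
    The error metrics used to compare the three models (ME, MAE, RMSE, R2) are straightforward to compute; a sketch with hypothetical hold-out predictions, where only the 54-sample validation size mirrors the abstract:

        import numpy as np

        def report(name, y_true, y_pred):
            err = y_pred - y_true
            me = err.mean()                           # mean error (bias)
            mae = np.abs(err).mean()                  # mean absolute error
            rmse = np.sqrt((err ** 2).mean())         # root mean squared error
            r2 = 1 - (err ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
            print(f"{name}: ME={me:.3f} MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")

        rng = np.random.default_rng(0)
        y_val = rng.lognormal(mean=-1.5, sigma=0.5, size=54)   # hypothetical Cd, mg/kg
        for name, noise in [("SLR", 0.27), ("CART", 0.23), ("RF", 0.20)]:
            report(name, y_val, y_val + rng.normal(0, noise, 54))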

  19. Using the random forest method to detect a response shift in the quality of life of multiple sclerosis patients: a cohort study

    PubMed Central

    2013-01-01

    Background Multiple sclerosis (MS), a common neurodegenerative disease, has well-described associations with quality of life (QoL) impairment. QoL changes found in longitudinal studies are difficult to interpret due to the potential response shift (RS) corresponding to respondents’ changing standards, values, and conceptualization of QoL. This study proposes to test the capacity of Random Forest (RF) for detecting reprioritization RS, i.e., changes over time in the relative importance of QoL domains. Methods This was a longitudinal observational study. The main inclusion criteria were patients 18 years old or older with relapsing-remitting multiple sclerosis. Every 6 months up to month 24, QoL was recorded using generic and MS-specific questionnaires (MusiQoL and SF-36). At 24 months, individuals were divided into two ‘disability change’ groups: worsened and not-worsened patients. The RF method was performed based on Breiman’s description. Analyses were performed to determine which QoL scores of the SF-36 predicted the MusiQoL index. The average variable importance (AVI) was estimated. Results A total of 417 (79.6%) patients were defined as not-worsened and 107 (20.4%) as worsened. A clear RS was identified in worsened patients. While the mental score AVI was almost one third higher than the physical score AVI at 12 months, it was 1.5 times lower at 24 months. Conclusion This work confirms that the RF method offers a useful statistical approach for RS detection. How to integrate the RS in the interpretation of QoL scores remains a challenge for future research. Trial registration ClinicalTrials.gov identifier: NCT00702065 PMID:23414459

  20. Using random forests to explore the effects of site attributes and soil properties on near-saturated and saturated hydraulic conductivity

    NASA Astrophysics Data System (ADS)

    Jorda, Helena; Koestel, John; Jarvis, Nicholas

    2014-05-01

    Knowledge of the near-saturated and saturated hydraulic conductivity of soil is fundamental for understanding important processes like groundwater contamination risks or runoff and soil erosion. Hydraulic conductivities are, however, difficult and time-consuming to determine by direct measurement, especially at the field scale or larger. So far, pedotransfer functions do not offer a reliable alternative, since published approaches exhibit poor predictive performance. In our study we aimed to build pedotransfer functions by growing random forests (a statistical learning approach) on 486 datasets from the meta-database on tension-disk infiltrometer measurements collected from the peer-reviewed literature and recently presented by Jarvis et al. (2013, Influence of soil, land use and climatic factors on the hydraulic conductivity of soil. Hydrol. Earth Syst. Sci. 17(12), 5185-5195). When some data from a specific source publication were allowed to enter the training set while others were used for validation, the results of a 10-fold cross-validation showed reasonable coefficients of determination of 0.53 for hydraulic conductivity at 10 cm tension, K10, and 0.41 for saturated conductivity, Ks. The estimated average annual temperature and precipitation at the site were the most important predictors for K10, while bulk density and estimated average annual temperature were most important for Ks prediction. The soil organic carbon content and the diameter of the disk infiltrometer were also important for the prediction of both K10 and Ks. However, coefficients of determination were around zero when all datasets of a specific source publication were excluded from the training set and used exclusively for validation. This may indicate experimenter bias, that better predictors have to be found, or that a larger dataset has to be used to infer meaningful pedotransfer functions for saturated and near-saturated hydraulic conductivities. More research is in progress
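
    The gap the authors report between mixed and leave-publication-out validation can be illustrated with grouped cross-validation; everything below (predictors, response, group labels) is synthetic, with scikit-learn standing in for their random forest setup.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import GroupKFold, KFold, cross_val_score

        rng = np.random.default_rng(7)
        X = rng.normal(size=(486, 6))            # climate, bulk density, SOC, disk size...
        log_k10 = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 486)
        source = rng.integers(0, 25, size=486)   # hypothetical source-publication IDs

        rf = RandomForestRegressor(n_estimators=300, random_state=7)
        mixed = cross_val_score(rf, X, log_k10, cv=KFold(10, shuffle=True, random_state=7))
        by_pub = cross_val_score(rf, X, log_k10, cv=GroupKFold(10), groups=source)
        # R^2 drops when whole publications are held out and sources differ systematically
        print("mixed CV R^2:", mixed.mean(), "leave-publication-out R^2:", by_pub.mean())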

  1. Predicting protein-protein interactions from primary protein sequences using a novel multi-scale local feature representation scheme and the random forest.

    PubMed

    You, Zhu-Hong; Chan, Keith C C; Hu, Pengwei

    2015-01-01

    The study of protein-protein interactions (PPIs) is very important for the understanding of biological cellular functions. However, detecting PPIs in the laboratory is both time-consuming and expensive. For this reason, there has been much recent effort to develop techniques for the computational prediction of PPIs, as this can complement laboratory procedures and provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale. Although much progress has already been achieved in this direction, the problem is still far from being solved. More effective approaches are still required to overcome the limitations of the current ones. In this study, a novel Multi-scale Local Descriptor (MLD) feature representation scheme is proposed to extract features from a protein sequence. This scheme can capture multi-scale local information by varying the length of protein-sequence segments. Based on the MLD, an ensemble learning method, the Random Forest (RF) method, is used as the classifier. The MLD feature representation scheme facilitates the mining of interaction information from multi-scale continuous amino acid segments, making it easier to capture multiple overlapping continuous binding patterns within a protein sequence. When the proposed method is tested with the PPI data of Saccharomyces cerevisiae, it achieves a prediction accuracy of 94.72% with 94.34% sensitivity at a precision of 98.91%. Extensive experiments are performed to compare our method with existing sequence-based methods. Experimental results show that our predictor also outperforms several other state-of-the-art predictors on the H. pylori dataset. The reason why such good results are achieved can largely be credited to the learning capabilities of the RF model and the novel MLD feature representation scheme. The experimental results show that the proposed approach can be very promising for predicting PPIs and can be a useful tool for future

  2. Discovery of Novel Hepatitis C Virus NS5B Polymerase Inhibitors by Combining Random Forest, Multiple e-Pharmacophore Modeling and Docking

    PubMed Central

    Wei, Yu; Li, Jinlong; Qing, Jie; Huang, Mingjie; Wu, Ming; Gao, Fenghua; Li, Dongmei; Hong, Zhangyong; Kong, Lingbao; Huang, Weiqiang; Lin, Jianping

    2016-01-01

    The NS5B polymerase is one of the most attractive targets for developing new drugs to block Hepatitis C virus (HCV) infection. We describe the discovery of novel potent HCV NS5B polymerase inhibitors by employing a virtual screening (VS) approach, which is based on random forest (RB-VS), e-pharmacophore (PB-VS), and docking (DB-VS) methods. In the RB-VS stage, after feature selection, a model with 16 descriptors was used. In the PB-VS stage, six energy-based pharmacophore (e-pharmacophore) models from different crystal structures of the NS5B polymerase with ligands binding at the palm I, thumb I and thumb II regions were used. In the DB-VS stage, the Glide SP and XP docking protocols with default parameters were employed. In the virtual screening approach, the RB-VS, PB-VS and DB-VS methods were applied in increasing order of complexity to screen the InterBioScreen database. From the final hits, we selected 5 compounds for further anti-HCV activity and cellular cytotoxicity assay. All 5 compounds were found to inhibit NS5B polymerase with IC50 values of 2.01–23.84 μM and displayed anti-HCV activities with EC50 values ranging from 1.61 to 21.88 μM, and all compounds displayed no cellular cytotoxicity (CC50 > 100 μM) except compound N2, which displayed weak cytotoxicity with a CC50 value of 51.3 μM. The hit compound N2 had the best antiviral activity against HCV, with a selective index of 32.1. The 5 hit compounds with new scaffolds could potentially serve as NS5B polymerase inhibitors through further optimization and development. PMID:26845440

  3. Repeated measurements of blood lactate concentration as a prognostic marker in horses with acute colitis evaluated with classification and regression trees (CART) and random forest analysis.

    PubMed

    Petersen, M B; Tolver, A; Husted, L; Tølbøll, T H; Pihl, T H

    2016-07-01

    The objective of this study was to investigate the prognostic value of single and repeated measurements of blood l-lactate (Lac) and ionised calcium (iCa) concentrations, packed cell volume (PCV) and plasma total protein (TP) concentration in horses with acute colitis. A total of 66 adult horses admitted with acute colitis (<24 h) to a referral hospital in the 2002-2011 period were included. The prognostic value of Lac, iCa, PCV and TP recorded at admission and 6 h post admission was analysed with univariate analysis, logistic regression, classification and regression trees, as well as random forest analysis. Ponies and Icelandic horses made up 59% of the population, whilst the remaining 41% were horses. Blood lactate concentration at admission was the only individual parameter significantly associated with probability of survival to discharge (P < 0.001). In a training sample, a Lac cut-off value of 7 mmol/L had a sensitivity of 0.66 and a specificity of 0.92 in predicting survival. In independent test data, the sensitivity was 0.69 and the specificity was 0.76. At the observed survival rate (38%), the optimal decision tree identified horses as non-survivors when the Lac at admission was ≥4.3 mmol/L and the Lac 6 h post admission stayed at >2 mmol/L (sensitivity, 0.72; specificity, 0.8). In conclusion, blood lactate concentration measured at admission and repeated 6 h later aided the prognostic evaluation of horses with acute colitis in this population with a very high mortality rate. This should allow clinicians to give a more reliable prognosis for the horse. PMID:27240909

  4. Quantumness, Randomness and Computability

    NASA Astrophysics Data System (ADS)

    Solis, Aldo; Hirsch, Jorge G.

    2015-06-01

    Randomness plays a central role in the quantum mechanical description of our interactions. We review the relationship between the violation of Bell inequalities, non-signaling and randomness. We discuss the challenge of defining a random string, and show that algorithmic information theory provides a necessary condition for randomness using Borel normality. We close with a view on incomputability and its implications in physics.

  5. Woody vegetation cover monitoring with multi-temporal Landsat data and Random Forests: the case of the Northwest Province (South Africa)

    NASA Astrophysics Data System (ADS)

    Symeonakis, Elias; Higginbottom, Thomas; Petroulaki, Kyriaki

    2016-04-01

    Land degradation and desertification (LDD) are serious global threats to humans and the environment. Globally, 10-20% of drylands and 24% of the world's productive lands are potentially degraded, which affects 1.5 billion people and reduces GDP by €3.4 billion. In Africa, LDD processes affect up to a third of savannahs, leading to a decline in the ecosystem services provided to some of the continent's poorest and most vulnerable communities. Indirectly, LDD can be monitored using relevant indicators. The encroachment of woody plants into grasslands, and the subsequent conversion of savannahs and open woodlands into shrublands, has attracted a lot of attention over the last decades and has been identified as an indicator of LDD. According to some assessments, bush encroachment has rendered 1.1 million ha of South African savanna unusable, threatens another 27 million ha (~17% of the country), and has reduced the grazing capacity throughout the region by up to 50%. Mapping woody cover encroachment over large areas can only be effectively achieved using remote sensing data and techniques. The longest continuously operating Earth-observation program, the Landsat series, is now freely available as an atmospherically corrected, cloud-masked surface reflectance product. The availability and length of the Landsat archive is thus an unparalleled Earth-observation resource, particularly for long-term change detection and monitoring. Here, we map and monitor woody vegetation cover in the Northwest Province of South Africa, a mosaic of 12 Landsat scenes that extends over more than 100,000 km2. We employ a multi-temporal approach with dry-season TM, ETM+ and OLI data from 15 epochs between 1989 and 2015. We use 0.5 m-pixel colour aerial photography to collect >15,000 samples for training and validating a Random Forest model to map woody cover, grasses, crops, urban and bare areas. High classification accuracies are achieved, especially so for the two cover types indirectly

  6. Holocene local forest history at two sites in Småland, southern Sweden - insights from quantitative reconstructions using the Landscape Reconstruction Algorithm

    NASA Astrophysics Data System (ADS)

    Cui, Qiaoyu; Gaillard, Marie-José; Lemdahl, Geoffrey; Olsson, Fredrik; Sugita, Shinya

    2010-05-01

    Quantitative reconstruction of past vegetation using fossil pollen has long been problematic. It is well known that pollen percentages and pollen accumulation rates do not represent vegetation abundance properly, because pollen values are influenced by many factors, of which inter-taxonomic differences in pollen productivity and vegetation structure are the most important ones. It is also recognized that pollen assemblages from large sites (lakes or bogs) record the characteristics of the regional vegetation, while pollen assemblages from small sites record local features. Based on the theoretical understanding of the factors and mechanisms that affect pollen representation of vegetation, Sugita (2007a, b) proposed the Landscape Reconstruction Algorithm (LRA) to estimate vegetation abundance in percentage cover for well-defined spatial scales. The LRA includes two models, REVEALS and LOVE. REVEALS estimates regional vegetation abundance at a spatial scale of 100 km x 100 km. LOVE estimates local vegetation abundance at the spatial scale of the relevant source area of pollen (RSAP sensu Sugita 1993) of the pollen site. REVEALS estimates are needed to apply LOVE in order to calculate the RSAP and the vegetation cover within the RSAP. The two models were validated theoretically and empirically. Two small bogs in southern Sweden were studied for pollen, plant macrofossils, charcoal, and coleoptera in order to reconstruct the local Holocene forest and fire history (e.g. Greisman and Gaillard 2009; Olsson et al. 2009). We applied the LOVE model in order to 1) compare the LOVE estimates with pollen percentages for a better understanding of the local forest history; and 2) obtain more precise information on the local vegetation to explain between-site differences in fire history. We used pollen records from two large lakes in Småland to obtain REVEALS estimates for twelve continuous 500-yr time windows. Following the strategy of the Swedish VR LANDCLIM project (see Gaillard

  7. Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests

    PubMed Central

    2011-01-01

    Background Dementia and cognitive impairment associated with aging are a major medical and social concern. Neuropsychological testing is a key element in the diagnostic procedures of Mild Cognitive Impairment (MCI), but presently has a limited value in the prediction of progression to dementia. We advance the hypothesis that newer statistical classification methods derived from data mining and machine learning, like Neural Networks, Support Vector Machines and Random Forests, can improve the accuracy, sensitivity and specificity of predictions obtained from neuropsychological testing. Seven non-parametric classifiers derived from data mining methods (Multilayer Perceptron Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, CART, CHAID and QUEST Classification Trees, and Random Forests) were compared to three traditional classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression) in terms of overall classification accuracy, specificity, sensitivity, area under the ROC curve and Press' Q. Model predictors were 10 neuropsychological tests currently used in the diagnosis of dementia. Statistical distributions of classification parameters obtained from a 5-fold cross-validation were compared using Friedman's nonparametric test. Results Press' Q test showed that all classifiers performed better than chance alone (p < 0.05). Support Vector Machines showed the largest overall classification accuracy (median (Me) = 0.76) and an area under the ROC curve of Me = 0.90. However, this method showed high specificity (Me = 1.0) but low sensitivity (Me = 0.3). Random Forests ranked second in overall accuracy (Me = 0.73), with a high area under the ROC curve (Me = 0.73), specificity (Me = 0.73) and sensitivity (Me = 0.64). Linear Discriminant Analysis also showed acceptable overall accuracy (Me = 0.66), with an acceptable area under the ROC curve (Me = 0.72), specificity (Me = 0.66) and sensitivity (Me = 0.64). The remaining classifiers showed
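
    A schematic of this kind of cross-validated classifier comparison in scikit-learn is given below; the data are a synthetic stand-in for the 10 neuropsychological test scores, and CHAID and QUEST, which have no direct scikit-learn equivalent, are omitted.

        from sklearn.datasets import make_classification
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVC

        # Hypothetical stand-in for 10 neuropsychological test scores
        X, y = make_classification(n_samples=400, n_features=10, random_state=0)
        models = {"LDA": LinearDiscriminantAnalysis(),
                  "Logistic": LogisticRegression(max_iter=1000),
                  "SVM": SVC(),
                  "RF": RandomForestClassifier(n_estimators=500, random_state=0)}
        for name, model in models.items():
            auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
            print(f"{name}: AUC = {auc.mean():.2f} +/- {auc.std():.2f}")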

  8. Landslide Susceptibility Analysis by the comparison and integration of Random Forest and Logistic Regression methods; application to the disaster of Nova Friburgo - Rio de Janeiro, Brasil (January 2011)

    NASA Astrophysics Data System (ADS)

    Esposito, Carlo; Barra, Anna; Evans, Stephen G.; Scarascia Mugnozza, Gabriele; Delaney, Keith

    2014-05-01

    The study of landslide susceptibility by multivariate statistical methods is based on finding a quantitative relationship between controlling factors and landslide occurrence. Such studies have become popular in the last few decades thanks to the development of geographic information system (GIS) software and the related improvements in data management. In this work we applied a statistical approach to an area of high landslide susceptibility, mainly due to its tropical climate and geological-geomorphological setting. The study area is located in the south-east region of Brazil, which has frequently been affected by flood and landslide hazards, especially because of heavy rainfall events during the summer season. We studied a disastrous event that occurred on January 11th and 12th of 2011, which involved the Região Serrana (the mountainous region of Rio de Janeiro State) and caused more than 5000 landslides and at least 904 deaths. In order to produce susceptibility maps, we focused our attention on an area of 93.6 km2 that includes the city of Nova Friburgo. We utilized two different multivariate statistical methods: Logistic Regression (LR), already widely used in the applied geosciences, and Random Forest (RF), which has only recently been applied to landslide susceptibility analysis. With reference to each mapping unit, the first method (LR) results in a probability of landslide occurrence, while the second one (RF) gives a prediction in terms of the percentage of area susceptible to slope failure. With this aim in mind, a landslide inventory map (related to the studied event) was drawn up through analyses of high-resolution GeoEye satellite images in a GIS environment. Data layers of 11 causative factors were created and processed in order to be used as continuous numerical or discrete categorical variables in the statistical analysis. In particular, the logistic regression method has frequent difficulties in managing numerical continuous and discrete categorical variables

  9. Experimental evidence of quantum randomness incomputability

    SciTech Connect

    Calude, Cristian S.; Dinneen, Michael J.; Dumitrescu, Monica; Svozil, Karl

    2010-08-15

    In contrast with software-generated randomness (called pseudo-randomness), quantum randomness can be proven incomputable; that is, it is not exactly reproducible by any algorithm. We provide experimental evidence of incomputability--an asymptotic property--of quantum randomness by performing finite tests of randomness inspired by algorithmic information theory.

  10. Use of Comprehensive Two-Dimensional Gas Chromatography with Time-of-Flight Mass Spectrometric Detection and Random Forest Pattern Recognition Techniques for Classifying Chemical Threat Agents and Detecting Chemical Attribution Signatures.

    PubMed

    Strozier, Erich D; Mooney, Douglas D; Friedenberg, David A; Klupinski, Theodore P; Triplett, Cheryl A

    2016-07-19

    In this proof of concept study, chemical threat agent (CTA) samples were classified to their sources with accuracies of 87-100% by applying a random forest statistical pattern recognition technique to analytical data acquired by comprehensive two-dimensional gas chromatography with time-of-flight mass spectrometric detection (GC × GC-TOFMS). Three organophosphate pesticides, chlorpyrifos, dichlorvos, and dicrotophos, were used as the model CTAs, with data collected for 4-6 sources per CTA and 7-10 replicate analyses per source. The analytical data were also evaluated to determine tentatively identified chemical attribution signatures for the CTAs by comparing samples from different sources according to either the presence/absence of peaks or the relative responses of peaks. These results demonstrate that GC × GC-TOFMS analysis in combination with a random forest technique can be useful in sample classification and signature identification for pesticides. Furthermore, the results suggest that this combination of analytical chemistry and statistical approaches can be applied to forensic analysis of other chemicals for similar purposes. PMID:27295356

  11. SIDRA: a blind algorithm for signal detection in photometric surveys

    NASA Astrophysics Data System (ADS)

    Mislis, D.; Bachelet, E.; Alsubai, K. A.; Bramich, D. M.; Parley, N.

    2016-01-01

    We present the Signal Detection using Random-Forest Algorithm (SIDRA). SIDRA is a detection and classification algorithm based on the random forest machine learning technique. The goal of this paper is to show the power of SIDRA for quick and accurate signal detection and classification. We first diagnose the power of the method with simulated light curves and then try it on a subset of the Kepler space mission catalogue. We use five classes of simulated light curves (CONSTANT, TRANSIT, VARIABLE, MLENS and EB for constant light curves, transiting exoplanets, variables, microlensing events and eclipsing binaries, respectively) to analyse the power of the method. The algorithm uses four features in order to classify the light curves. The training sample contains 5000 light curves (1000 from each class) and 50 000 random light curves for testing. The total SIDRA success ratio is ≥90 per cent. Furthermore, the success ratio reaches 95-100 per cent for the CONSTANT, VARIABLE, EB and MLENS classes and 92 per cent for the TRANSIT class with a decision probability of 60 per cent. Because the TRANSIT class is the one that fails the most, we run a simultaneous fit using SIDRA and a Box Least Square (BLS)-based algorithm to search for transiting exoplanets. As a result, our algorithm detects 7.5 per cent more planets than a classic BLS algorithm, with better results for lower signal-to-noise light curves. SIDRA succeeds in catching 98 per cent of the planet candidates in the Kepler sample and fails for 7 per cent of the false-alarm subset. SIDRA promises to be useful as a detection algorithm and/or classifier for large photometric surveys such as the TESS and PLATO exoplanet space missions.

  12. Mapping forests in monsoon Asia with ALOS PALSAR 50-m mosaic images and MODIS imagery in 2010.

    PubMed

    Qin, Yuanwei; Xiao, Xiangming; Dong, Jinwei; Zhang, Geli; Roy, Partha Sarathi; Joshi, Pawan Kumar; Gilani, Hammad; Murthy, Manchiraju Sri Ramachandra; Jin, Cui; Wang, Jie; Zhang, Yao; Chen, Bangqian; Menarguez, Michael Angelo; Biradar, Chandrashekhar M; Bajgain, Rajen; Li, Xiangping; Dai, Shengqi; Hou, Ying; Xin, Fengfei; Moore, Berrien

    2016-01-01

    Extensive forest changes have occurred in monsoon Asia, substantially affecting climate, carbon cycle and biodiversity. Accurate forest cover maps at fine spatial resolutions are required to qualify and quantify these effects. In this study, an algorithm was developed to map forests in 2010, with the use of structure and biomass information from the Advanced Land Observation System (ALOS) Phased Array L-band Synthetic Aperture Radar (PALSAR) mosaic dataset and the phenological information from MODerate Resolution Imaging Spectroradiometer (MOD13Q1 and MOD09A1) products. Our forest map (PALSARMOD50 m F/NF) was assessed through randomly selected ground truth samples from high spatial resolution images and had an overall accuracy of 95%. Total area of forests in monsoon Asia in 2010 was estimated to be ~6.3 × 106 km2. The distribution of evergreen and deciduous forests agreed reasonably well with the median Normalized Difference Vegetation Index (NDVI) in winter. PALSARMOD50 m F/NF map showed good spatial and areal agreements with selected forest maps generated by the Japan Aerospace Exploration Agency (JAXA F/NF), European Space Agency (ESA F/NF), Boston University (MCD12Q1 F/NF), Food and Agricultural Organization (FAO FRA), and University of Maryland (Landsat forests), but relatively large differences and uncertainties in tropical forests and evergreen and deciduous forests. PMID:26864143

  13. Mapping forests in monsoon Asia with ALOS PALSAR 50-m mosaic images and MODIS imagery in 2010

    PubMed Central

    Qin, Yuanwei; Xiao, Xiangming; Dong, Jinwei; Zhang, Geli; Roy, Partha Sarathi; Joshi, Pawan Kumar; Gilani, Hammad; Murthy, Manchiraju Sri Ramachandra; Jin, Cui; Wang, Jie; Zhang, Yao; Chen, Bangqian; Menarguez, Michael Angelo; Biradar, Chandrashekhar M.; Bajgain, Rajen; Li, Xiangping; Dai, Shengqi; Hou, Ying; Xin, Fengfei; Moore III, Berrien

    2016-01-01

    Extensive forest changes have occurred in monsoon Asia, substantially affecting climate, carbon cycle and biodiversity. Accurate forest cover maps at fine spatial resolutions are required to qualify and quantify these effects. In this study, an algorithm was developed to map forests in 2010, with the use of structure and biomass information from the Advanced Land Observation System (ALOS) Phased Array L-band Synthetic Aperture Radar (PALSAR) mosaic dataset and the phenological information from MODerate Resolution Imaging Spectroradiometer (MOD13Q1 and MOD09A1) products. Our forest map (PALSARMOD50 m F/NF) was assessed through randomly selected ground truth samples from high spatial resolution images and had an overall accuracy of 95%. Total area of forests in monsoon Asia in 2010 was estimated to be ~6.3 × 106 km2. The distribution of evergreen and deciduous forests agreed reasonably well with the median Normalized Difference Vegetation Index (NDVI) in winter. PALSARMOD50 m F/NF map showed good spatial and areal agreements with selected forest maps generated by the Japan Aerospace Exploration Agency (JAXA F/NF), European Space Agency (ESA F/NF), Boston University (MCD12Q1 F/NF), Food and Agricultural Organization (FAO FRA), and University of Maryland (Landsat forests), but relatively large differences and uncertainties in tropical forests and evergreen and deciduous forests. PMID:26864143

  14. Mapping forests in monsoon Asia with ALOS PALSAR 50-m mosaic images and MODIS imagery in 2010

    NASA Astrophysics Data System (ADS)

    Qin, Yuanwei; Xiao, Xiangming; Dong, Jinwei; Zhang, Geli; Roy, Partha Sarathi; Joshi, Pawan Kumar; Gilani, Hammad; Murthy, Manchiraju Sri Ramachandra; Jin, Cui; Wang, Jie; Zhang, Yao; Chen, Bangqian; Menarguez, Michael Angelo; Biradar, Chandrashekhar M.; Bajgain, Rajen; Li, Xiangping; Dai, Shengqi; Hou, Ying; Xin, Fengfei; Moore, Berrien, III

    2016-02-01

    Extensive forest changes have occurred in monsoon Asia, substantially affecting climate, carbon cycle and biodiversity. Accurate forest cover maps at fine spatial resolutions are required to qualify and quantify these effects. In this study, an algorithm was developed to map forests in 2010, with the use of structure and biomass information from the Advanced Land Observation System (ALOS) Phased Array L-band Synthetic Aperture Radar (PALSAR) mosaic dataset and the phenological information from MODerate Resolution Imaging Spectroradiometer (MOD13Q1 and MOD09A1) products. Our forest map (PALSARMOD50 m F/NF) was assessed through randomly selected ground truth samples from high spatial resolution images and had an overall accuracy of 95%. Total area of forests in monsoon Asia in 2010 was estimated to be ~6.3 × 106 km2. The distribution of evergreen and deciduous forests agreed reasonably well with the median Normalized Difference Vegetation Index (NDVI) in winter. PALSARMOD50 m F/NF map showed good spatial and areal agreements with selected forest maps generated by the Japan Aerospace Exploration Agency (JAXA F/NF), European Space Agency (ESA F/NF), Boston University (MCD12Q1 F/NF), Food and Agricultural Organization (FAO FRA), and University of Maryland (Landsat forests), but relatively large differences and uncertainties in tropical forests and evergreen and deciduous forests.

  15. Algorithms and Algorithmic Languages.

    ERIC Educational Resources Information Center

    Veselov, V. M.; Koprov, V. M.

    This paper is intended as an introduction to a number of problems connected with the description of algorithms and algorithmic languages, particularly the syntaxes and semantics of algorithmic languages. The terms "letter, word, alphabet" are defined and described. The concept of the algorithm is defined and the relation between the algorithm and…

  16. Forest Management.

    ERIC Educational Resources Information Center

    Weicherding, Patrick J.; And Others

    This bulletin deals with forest management and provides an overview of forestry for the non-professional. The bulletin is divided into six sections: (1) What Is Forestry Management?; (2) How Is the Forest Measured?; (3) What Is Forest Protection?; (4) How Is the Forest Harvested?; (5) What Is Forest Regeneration?; and (6) What Is Forest…

  17. Scalable Nearest Neighbor Algorithms for High Dimensional Data.

    PubMed

    Muja, Marius; Lowe, David G

    2014-11-01

    For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching. PMID:26353063
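
    FLANN's randomized k-d forest is exposed through OpenCV's bindings; a minimal matching sketch follows, with random descriptors standing in for real image features and illustrative parameter values.

        import numpy as np
        import cv2

        rng = np.random.default_rng(0)
        train = rng.random((10_000, 128), dtype=np.float32)  # e.g. SIFT-like descriptors
        query = rng.random((500, 128), dtype=np.float32)

        FLANN_INDEX_KDTREE = 1                   # randomized k-d forest index type
        index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=4)
        search_params = dict(checks=64)          # more checks: higher precision, slower
        matcher = cv2.FlannBasedMatcher(index_params, search_params)
        matches = matcher.knnMatch(query, train, k=2)
        # Lowe-style ratio test keeps only unambiguous approximate matches
        good = [m for m, n in matches if m.distance < 0.8 * n.distance]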

  18. ISINA: INTEGRAL Source Identification Network Algorithm

    NASA Astrophysics Data System (ADS)

    Scaringi, S.; Bird, A. J.; Clark, D. J.; Dean, A. J.; Hill, A. B.; McBride, V. A.; Shaw, S. E.

    2008-11-01

    We give an overview of ISINA: INTEGRAL Source Identification Network Algorithm. This machine learning algorithm, using random forests, is applied to the IBIS/ISGRI data set in order to ease the production of unbiased future soft gamma-ray source catalogues. First, we introduce the data set and the problems encountered when dealing with images obtained using the coded mask technique. The initial step of source candidate searching is introduced and an initial candidate list is created. A description of the feature extraction on the initial candidate list is then performed together with feature merging for these candidates. Three training and testing sets are created in order to deal with the diverse time-scales encountered when dealing with the gamma-ray sky. Three independent random forests are built: one dealing with faint persistent source recognition, one dealing with strong persistent sources and a final one dealing with transients. For the latter, a new transient detection technique is introduced and described: the transient matrix. Finally the performance of the network is assessed and discussed using the testing set and some illustrative source examples. Based on observations with INTEGRAL, an ESA project with instruments and science data centre funded by ESA member states (especially the PI countries: Denmark, France, Germany, Italy, Spain), Czech Republic and Poland, and the participation of Russia and the USA.

  19. Mapping forested wetlands in the Great Zhan River Basin through integrating optical, radar, and topographical data classification techniques.

    PubMed

    Na, X D; Zang, S Y; Wu, C S; Li, W L

    2015-11-01

    Knowledge of the spatial extent of forested wetlands is essential to many studies including wetland functioning assessment, greenhouse gas flux estimation, and wildlife suitable habitat identification. For discriminating forested wetlands from their adjacent land cover types, researchers have resorted to image analysis techniques applied to numerous remotely sensed data. While with some success, there is still no consensus on the optimal approaches for mapping forested wetlands. To address this problem, we examined two machine learning approaches, random forest (RF) and K-nearest neighbor (KNN) algorithms, and applied these two approaches to the framework of pixel-based and object-based classifications. The RF and KNN algorithms were constructed using predictors derived from Landsat 8 imagery, Radarsat-2 advanced synthetic aperture radar (SAR), and topographical indices. The results show that the object-based classifications performed better than per-pixel classifications using the same algorithm (RF) in terms of overall accuracy, and the difference in their kappa coefficients is statistically significant (p<0.01). There were noticeable omissions for forested and herbaceous wetlands in the per-pixel classifications using the RF algorithm. As for the object-based image analysis, there were also statistically significant differences (p<0.01) in kappa coefficient between results based on the RF and KNN algorithms. The object-based classification using RF provided a more visually adequate distribution of the land cover types of interest, while the object-based classifications using the KNN algorithm showed noticeable commissions for forested wetlands and omissions for agricultural land. This research proves that object-based classification with RF using optical, radar, and topographical data improved the mapping accuracy of land covers and provided a feasible approach to discriminating forested wetlands from the other land cover types in forestry areas. PMID

  20. Mapping the distribution of the main host for plague in a complex landscape in Kazakhstan: An object-based approach using SPOT-5 XS, Landsat 7 ETM+, SRTM and multiple Random Forests.

    PubMed

    Wilschut, L I; Addink, E A; Heesterbeek, J A P; Dubyanskiy, V M; Davis, S A; Laudisoit, A; Begon, M; Burdelov, L A; Atshabar, B B; de Jong, S M

    2013-08-01

    Plague is a zoonotic infectious disease present in great gerbil populations in Kazakhstan. Infectious disease dynamics are influenced by the spatial distribution of the carriers (hosts) of the disease. The great gerbil, the main host in our study area, lives in burrows, which can be recognized on high resolution satellite imagery. In this study, using earth observation data at various spatial scales, we map the spatial distribution of burrows in a semi-desert landscape. The study area consists of various landscape types. To evaluate whether identification of burrows by classification is possible in these landscape types, the study area was subdivided into eight landscape units, on the basis of Landsat 7 ETM+ derived Tasselled Cap Greenness and Brightness, and SRTM derived standard deviation in elevation. In the field, 904 burrows were mapped. Using two segmented 2.5 m resolution SPOT-5 XS satellite scenes, reference object sets were created. Random Forests were built for both SPOT scenes and used to classify the images. Additionally, a stratified classification was carried out, by building separate Random Forests per landscape unit. Burrows were successfully classified in all landscape units. In the 'steppe on floodplain' areas, classification worked best: producer's and user's accuracy in those areas reached 88% and 100%, respectively. In the 'floodplain' areas with a more heterogeneous vegetation cover, classification worked least well; there, accuracies were 86 and 58% respectively. Stratified classification improved the results in all landscape units where comparison was possible (four), increasing kappa coefficients by 13, 10, 9 and 1%, respectively. In this study, an innovative stratification method using high- and medium resolution imagery was applied in order to map host distribution on a large spatial scale. The burrow maps we developed will help to detect changes in the distribution of great gerbil populations and, moreover, serve as a unique empirical

  1. Mapping the distribution of the main host for plague in a complex landscape in Kazakhstan: An object-based approach using SPOT-5 XS, Landsat 7 ETM+, SRTM and multiple Random Forests

    NASA Astrophysics Data System (ADS)

    Wilschut, L. I.; Addink, E. A.; Heesterbeek, J. A. P.; Dubyanskiy, V. M.; Davis, S. A.; Laudisoit, A.; Begon, M.; Burdelov, L. A.; Atshabar, B. B.; de Jong, S. M.

    2013-08-01

    Plague is a zoonotic infectious disease present in great gerbil populations in Kazakhstan. Infectious disease dynamics are influenced by the spatial distribution of the carriers (hosts) of the disease. The great gerbil, the main host in our study area, lives in burrows, which can be recognized on high resolution satellite imagery. In this study, using earth observation data at various spatial scales, we map the spatial distribution of burrows in a semi-desert landscape. The study area consists of various landscape types. To evaluate whether identification of burrows by classification is possible in these landscape types, the study area was subdivided into eight landscape units, on the basis of Landsat 7 ETM+ derived Tasselled Cap Greenness and Brightness, and SRTM derived standard deviation in elevation. In the field, 904 burrows were mapped. Using two segmented 2.5 m resolution SPOT-5 XS satellite scenes, reference object sets were created. Random Forests were built for both SPOT scenes and used to classify the images. Additionally, a stratified classification was carried out, by building separate Random Forests per landscape unit. Burrows were successfully classified in all landscape units. In the ‘steppe on floodplain’ areas, classification worked best: producer's and user's accuracy in those areas reached 88% and 100%, respectively. In the ‘floodplain’ areas with a more heterogeneous vegetation cover, classification worked least well; there, accuracies were 86 and 58% respectively. Stratified classification improved the results in all landscape units where comparison was possible (four), increasing kappa coefficients by 13, 10, 9 and 1%, respectively. In this study, an innovative stratification method using high- and medium resolution imagery was applied in order to map host distribution on a large spatial scale. The burrow maps we developed will help to detect changes in the distribution of great gerbil populations and, moreover, serve as a unique

  2. Is random access memory random?

    NASA Technical Reports Server (NTRS)

    Denning, P. J.

    1986-01-01

    Most software is constructed on the assumption that the programs and data are stored in random access memory (RAM). Physical limitations on the relative speeds of processor and memory elements lead to a variety of memory organizations that match processor addressing rate with memory service rate. These include interleaved and cached memory. A very high fraction of a processor's address requests can be satisfied from the cache without reference to the main memory. The cache requests information from main memory in blocks that can be transferred at the full memory speed. Programmers who organize algorithms for locality can realize the highest performance from these computers.
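
    The locality principle in the last sentence is easy to demonstrate: traversing an array along its memory layout is served largely from cache, while traversing against it is not. A small timing sketch (NumPy arrays are row-major by default):

```python
# Timing sketch of the locality effect: NumPy arrays are row-major, so
# row-wise traversal streams through memory (cache-friendly) while
# column-wise traversal jumps by a full row length per access.
import time
import numpy as np

a = np.ones((3000, 3000))

t0 = time.perf_counter()
total = sum(a[i, :].sum() for i in range(a.shape[0]))   # good locality
t1 = time.perf_counter()
total = sum(a[:, j].sum() for j in range(a.shape[1]))   # poor locality
t2 = time.perf_counter()

print(f"row-wise:    {t1 - t0:.3f} s")
print(f"column-wise: {t2 - t1:.3f} s")
```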

  3. Mapping forest biomass from space - Fusion of hyperspectral EO1-hyperion data and Tandem-X and WorldView-2 canopy height models

    NASA Astrophysics Data System (ADS)

    Kattenborn, Teja; Maack, Joachim; Faßnacht, Fabian; Enßle, Fabian; Ermert, Jörg; Koch, Barbara

    2015-03-01

    Spaceborne sensors allow for wide-scale assessments of forest ecosystems. Combining the products of multiple sensors is hypothesized to improve the estimation of forest biomass. We applied interferometric (Tandem-X) and photogrammetric (WorldView-2) based predictors, e.g. canopy height models, in combination with hyperspectral predictors (EO1-Hyperion) by using 4 different machine learning algorithms for biomass estimation in temperate forest stands near Karlsruhe, Germany. An iterative model selection procedure was used to identify the optimal combination of predictors. The most accurate model (Random Forest) reached an r2 of 0.73 with an RMSE of 14.9% (29.4 t/ha). Further results revealed that the predictive accuracy depended highly on the statistical model and the area size of the field samples. We conclude that a fusion of canopy height and spectral information allows for accurate estimations of forest biomass from space.
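
    The fusion idea, canopy-height plus spectral predictors feeding a Random Forest regressor scored by r2 and RMSE, can be sketched as follows. The predictors and the synthetic biomass response are placeholders, not the study's data.

```python
# Sketch of biomass estimation from fused height and spectral predictors
# with a Random Forest regressor. All variables are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([
    rng.normal(20, 5, n),     # e.g. TanDEM-X canopy height model (m)
    rng.normal(18, 4, n),     # e.g. WorldView-2 photogrammetric height (m)
    rng.normal(0.4, 0.1, n),  # e.g. a Hyperion narrow-band index
])
y = 8.0 * X[:, 0] + rng.normal(0, 20, n)   # synthetic biomass (t/ha)

model = RandomForestRegressor(n_estimators=500, random_state=0)
pred = cross_val_predict(model, X, y, cv=5)
print("r2  :", r2_score(y, pred))
print("RMSE:", mean_squared_error(y, pred) ** 0.5, "t/ha")
```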

  4. Angular Distribution of Particles Emerging from a Diffusive Region and its Implications for the Fleck-Canfield Random Walk Algorithm for Implicit Monte Carlo Radiation Transport

    SciTech Connect

    Cooper, M.A.

    2000-07-03

    We present various approximations for the angular distribution of particles emerging from an optically thick, purely isotropically scattering region into a vacuum. Our motivation is to use such a distribution for the Fleck-Canfield random walk method [1] for implicit Monte Carlo (IMC) [2] radiation transport problems. We demonstrate that the cosine distribution recommended in the original random walk paper [1] is a poor approximation to the angular distribution predicted by transport theory. Then we examine other approximations that more closely match the transport angular distribution.
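
    The cosine (Lambertian) distribution criticized here is straightforward to sample by inverse transform, which makes the comparison with transport-theory distributions concrete. A sketch of the baseline only; the improved approximations of the paper are not reproduced:

```python
# Inverse-transform sampling of the cosine (Lambertian) distribution
# p(mu) = 2*mu for mu = cos(theta) in [0, 1]: the CDF is mu^2, so
# mu = sqrt(u) for uniform u. E[mu] should come out near 2/3.
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(100_000)
mu = np.sqrt(u)
print("mean cos(theta):", mu.mean())
```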

  5. Random broadcast on random geometric graphs

    SciTech Connect

    Bradonjic, Milan; Elsasser, Robert; Friedrich, Tobias

    2009-01-01

    In this work, we consider the random broadcast time on random geometric graphs (RGGs). The classic random broadcast model, also known as the push algorithm, is defined as follows: starting with one informed node, in each succeeding round every informed node chooses one of its neighbors uniformly at random and informs it. We consider the random broadcast time on RGGs when, with high probability, (i) the RGG is connected, or (ii) there exists a giant component in the RGG. We show that the random broadcast time is bounded by O(√n + diam(component)), where diam(component) is the diameter of the entire graph or of the giant component, for regimes (i) and (ii), respectively. In other words, for both regimes, we derive the broadcast time to be Θ(diam(G)), which is asymptotically optimal.
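
    The push model defined above is simple to simulate. A sketch on a random geometric graph using networkx; the parameters n and radius are arbitrary illustrative choices:

```python
# Simulation of the push broadcast model: each round, every informed
# node informs one uniformly random neighbor. Parameters are arbitrary.
import random
import networkx as nx

random.seed(0)
G = nx.random_geometric_graph(500, 0.1, seed=0)

informed = {0}
rounds = 0
while len(informed) < len(G) and rounds < 10_000:  # guard: G may be disconnected
    newly = set()
    for v in informed:
        nbrs = list(G[v])
        if nbrs:
            newly.add(random.choice(nbrs))
    informed |= newly
    rounds += 1

print(f"informed {len(informed)}/{len(G)} nodes in {rounds} rounds")
```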

  6. Benchmarking protein classification algorithms via supervised cross-validation.

    PubMed

    Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor

    2008-04-24

    Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. A combination of supervised and
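
    The core idea, cutting cross-validation folds along known subtypes so that test proteins come from subtypes unseen in training, corresponds to grouped splitting. A minimal sketch with scikit-learn's GroupKFold on synthetic data; features and group labels are placeholders:

```python
# Sketch of supervised cross-validation: folds follow known subtypes
# (groups), so the test set contains subtypes never seen in training,
# unlike random k-fold splits. All data here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))          # e.g. sequence-similarity features
y = rng.integers(0, 2, size=300)        # protein class labels
groups = rng.integers(0, 6, size=300)   # known subtype of each protein

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=3))
print("per-fold accuracy:", scores)
```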

  7. On the likelihood of forests

    NASA Astrophysics Data System (ADS)

    Shang, Yilun

    2016-08-01

    How complex a network is crucially impacts its function and performance. In many modern applications, the networks involved have a growth property and sparse structures, which pose challenges to physicists and applied mathematicians. In this paper, we introduce the forest likelihood as a plausible measure to gauge how difficult it is to construct a forest in a non-preferential attachment way. Based on the notions of admittable labeling and path construction, we propose algorithms for computing the forest likelihood of a given forest. Concrete examples as well as the distributions of forest likelihoods for all forests with some fixed numbers of nodes are presented. Moreover, we illustrate the ideas on real-life networks, including a benzenoid tree, a mathematical family tree, and a peer-to-peer network.

  8. Mapping the distribution of the main host for plague in a complex landscape in Kazakhstan: An object-based approach using SPOT-5 XS, Landsat 7 ETM+, SRTM and multiple Random Forests

    PubMed Central

    Wilschut, L.I.; Addink, E.A.; Heesterbeek, J.A.P.; Dubyanskiy, V.M.; Davis, S.A.; Laudisoit, A.; Begon, M.; Burdelov, L.A.; Atshabar, B.B.; de Jong, S.M.

    2013-01-01

    Plague is a zoonotic infectious disease present in great gerbil populations in Kazakhstan. Infectious disease dynamics are influenced by the spatial distribution of the carriers (hosts) of the disease. The great gerbil, the main host in our study area, lives in burrows, which can be recognized on high resolution satellite imagery. In this study, using earth observation data at various spatial scales, we map the spatial distribution of burrows in a semi-desert landscape. The study area consists of various landscape types. To evaluate whether identification of burrows by classification is possible in these landscape types, the study area was subdivided into eight landscape units, on the basis of Landsat 7 ETM+ derived Tasselled Cap Greenness and Brightness, and SRTM derived standard deviation in elevation. In the field, 904 burrows were mapped. Using two segmented 2.5 m resolution SPOT-5 XS satellite scenes, reference object sets were created. Random Forests were built for both SPOT scenes and used to classify the images. Additionally, a stratified classification was carried out, by building separate Random Forests per landscape unit. Burrows were successfully classified in all landscape units. In the ‘steppe on floodplain’ areas, classification worked best: producer's and user's accuracy in those areas reached 88% and 100%, respectively. In the ‘floodplain’ areas with a more heterogeneous vegetation cover, classification worked least well; there, accuracies were 86 and 58% respectively. Stratified classification improved the results in all landscape units where comparison was possible (four), increasing kappa coefficients by 13, 10, 9 and 1%, respectively. In this study, an innovative stratification method using high- and medium resolution imagery was applied in order to map host distribution on a large spatial scale. The burrow maps we developed will help to detect changes in the distribution of great gerbil populations and, moreover, serve as a unique

  9. Aboveground Biomass Monitoring over Siberian Boreal Forest Using Radar Remote Sensing Data

    NASA Astrophysics Data System (ADS)

    Stelmaszczuk-Gorska, M. A.; Thiel, C. J.; Schmullius, C.

    2014-12-01

    Aboveground biomass (AGB) plays an essential role in ecosystem research and global cycles, and is of vital importance in climate studies. AGB accumulated in forests is of special monitoring interest, as forests contain the most biomass compared with other land biomes. The largest of the land biomes is the boreal forest, which has a substantial carbon accumulation capability; its carbon stock is estimated to be 272 +/- 23 Pg C (32%) [1]. Russia's forests are of particular concern, as they are the largest source of uncertainty in global carbon stock calculations [1] and their inventory data have not been updated in the last 25 years [2]. In this research, new empirical models for AGB estimation are proposed. A processing scheme was developed that uses radar L-band data for AGB retrieval and optical data for an update of in situ data. The approach was trained and validated in the Asian part of the boreal forest, in southern Russian Central Siberia, covering two Siberian Federal Districts: Krasnoyarsk Kray and Irkutsk Oblast. Together, the training and testing forest territories cover an area of approximately 3,500 km2. ALOS PALSAR L-band single (HH - horizontal transmitted and received) and dual (HH and HV - horizontal transmitted, horizontal and vertical received) polarization data in Single Look Complex (SLC) format were used to calculate the backscattering coefficient in gamma nought and coherence. In total, more than 150 images acquired between 2006 and 2011 were available. The data were obtained through the ALOS Kyoto and Carbon Initiative Project (K&C) and were used to calibrate a randomForest algorithm. Additionally, a simple linear and multiple-regression approach was used. The uncertainty of the AGB estimation at pixel and stand level, calculated by validation against an independent dataset, was approximately 35%. Previous studies employing ALOS PALSAR data over boreal forests reported uncertainties of 39.4% using a randomForest approach [2] or 42.8% using a semi-empirical approach [3].

  10. A random-walk algorithm for modeling lithospheric density and the role of body forces in the evolution of the Midcontinent Rift

    NASA Astrophysics Data System (ADS)

    Levandowski, Will; Boyd, Oliver S.; Briggs, Rich W.; Gold, Ryan D.

    2015-12-01

    This paper develops a Monte Carlo algorithm for extracting three-dimensional lithospheric density models from geophysical data. Empirical scaling relationships between velocity and density create a 3-D starting density model, which is then iteratively refined until it reproduces observed gravity and topography. This approach permits deviations from uniform crustal velocity-density scaling, which provide insight into crustal lithology and prevent spurious mapping of crustal anomalies into the mantle. We test this algorithm on the Proterozoic Midcontinent Rift (MCR), north-central United States. The MCR provides a challenge because it hosts a gravity high overlying low shear-wave velocity crust in a generally flat region. Our initial density estimates are derived from a seismic velocity/crustal thickness model based on joint inversion of surface-wave dispersion and receiver functions. By adjusting these estimates to reproduce gravity and topography, we generate a lithospheric-scale model that reveals dense middle crust and eclogitized lowermost crust within the rift. Mantle lithospheric density beneath the MCR is not anomalous, consistent with geochemical evidence that lithospheric mantle was not the primary source of rift-related magmas and suggesting that extension occurred in response to far-field stress rather than a hot mantle plume. Similarly, the subsequent inversion of normal faults resulted from changing far-field stress that exploited not only warm, recently faulted crust but also a gravitational potential energy low in the MCR. The success of this density modeling algorithm in the face of such apparently contradictory geophysical properties suggests that it may be applicable to a variety of tectonic and geodynamic problems.
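
    The iterative refinement loop described here can be caricatured as a random walk over model space that keeps perturbations reducing the misfit to observations. In the sketch below the forward operator is a placeholder linear map, not a real gravity or topography calculation:

```python
# Highly simplified sketch of the refinement loop: randomly perturb a
# density model and keep perturbations that reduce the misfit to the
# observations. The forward operator A is a placeholder linear map.
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_obs = 50, 20
A = rng.normal(size=(n_obs, n_cells))        # placeholder forward operator
rho_true = rng.normal(3000, 100, n_cells)    # "true" densities (kg/m^3)
d_obs = A @ rho_true

rho = np.full(n_cells, 3000.0)               # starting model from scaling laws
misfit = np.sum((A @ rho - d_obs) ** 2)
for _ in range(20_000):
    trial = rho.copy()
    trial[rng.integers(n_cells)] += rng.normal(0, 10)  # random-walk step
    m = np.sum((A @ trial - d_obs) ** 2)
    if m < misfit:                           # keep only improving steps
        rho, misfit = trial, m

print("final misfit:", misfit)
```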

  11. A random-walk algorithm for modeling lithospheric density and the role of body forces in the evolution of the Midcontinent Rift

    USGS Publications Warehouse

    Levandowski, William Brower; Boyd, Oliver; Briggs, Richard; Gold, Ryan D.

    2015-01-01

    We test this algorithm on the Proterozoic Midcontinent Rift (MCR), north-central U.S. The MCR provides a challenge because it hosts a gravity high overlying low shear-wave velocity crust in a generally flat region. Our initial density estimates are derived from a seismic velocity/crustal thickness model based on joint inversion of surface-wave dispersion and receiver functions. By adjusting these estimates to reproduce gravity and topography, we generate a lithospheric-scale model that reveals dense middle crust and eclogitized lowermost crust within the rift. Mantle lithospheric density beneath the MCR is not anomalous, consistent with geochemical evidence that lithospheric mantle was not the primary source of rift-related magmas and suggesting that extension occurred in response to far-field stress rather than a hot mantle plume. Similarly, the subsequent inversion of normal faults resulted from changing far-field stress that exploited not only warm, recently faulted crust but also a gravitational potential energy low in the MCR. The success of this density modeling algorithm in the face of such apparently contradictory geophysical properties suggests that it may be applicable to a variety of tectonic and geodynamic problems. 

  12. Handling packet dropouts and random delays for unstable delayed processes in NCS by optimal tuning of PIλDμ controllers with evolutionary algorithms.

    PubMed

    Pan, Indranil; Das, Saptarshi; Gupta, Amitava

    2011-10-01

    The issues of stochastically varying network delays and packet dropouts in Networked Control System (NCS) applications have been simultaneously addressed by time domain optimal tuning of fractional order (FO) PID controllers. Different variants of evolutionary algorithms are used for the tuning process and their performances are compared. Also the effectiveness of the fractional order PI(λ)D(μ) controllers over their integer order counterparts is looked into. Two standard test bench plants with time delay and unstable poles which are encountered in process control applications are tuned with the proposed method to establish the validity of the tuning methodology. The proposed tuning methodology is independent of the specific choice of plant and is also applicable for less complicated systems. Thus it is useful in a wide variety of scenarios. The paper also shows the superiority of FOPID controllers over their conventional PID counterparts for NCS applications. PMID:21621208
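
    The time-domain tuning loop can be sketched with one of the evolutionary algorithms available in SciPy, differential evolution, minimizing the integral absolute error of a step response. The delayed first-order plant and the integer-order PID below are simple stand-ins for the paper's test-bench plants and fractional-order controller:

```python
# Sketch of evolutionary time-domain tuning: SciPy's differential
# evolution searches PID gains minimizing the integral absolute error
# (IAE) of a step response for a first-order plant with transport delay.
from scipy.optimize import differential_evolution

def iae(gains, a=0.95, b=0.05, delay=5, steps=400):
    kp, ki, kd = gains
    y = integ = prev_e = 0.0
    u_buf = [0.0] * delay              # transport / network delay line
    cost = 0.0
    for _ in range(steps):
        e = 1.0 - y                    # unit step setpoint
        integ += e
        u = kp * e + ki * integ + kd * (e - prev_e)
        prev_e = e
        u_buf.append(u)
        y = a * y + b * u_buf.pop(0)   # delayed first-order plant
        cost += abs(e)
        if cost > 1e9:                 # penalize unstable gain sets
            return 1e9
    return cost

result = differential_evolution(iae, bounds=[(0, 10), (0, 1), (0, 10)],
                                seed=0)
print("tuned (kp, ki, kd):", result.x)
```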

  13. An Automated Three-Dimensional Detection and Segmentation Method for Touching Cells by Integrating Concave Points Clustering and Random Walker Algorithm

    PubMed Central

    Gong, Hui; Chen, Shangbin; Zhang, Bin; Ding, Wenxiang; Luo, Qingming; Li, Anan

    2014-01-01

    Characterizing cytoarchitecture is crucial for understanding brain functions and neural diseases. In neuroanatomy, it is an important task to accurately extract cell populations' centroids and contours. Recent advances have permitted imaging at single cell resolution for an entire mouse brain using the Nissl staining method. However, it is difficult to precisely segment numerous cells, especially those cells touching each other. As presented herein, we have developed an automated three-dimensional detection and segmentation method applied to the Nissl staining data, with the following two key steps: 1) concave points clustering to determine the seed points of touching cells; and 2) random walker segmentation to obtain cell contours. Also, we have evaluated the performance of our proposed method with several mouse brain datasets, which were captured with the micro-optical sectioning tomography imaging system, and the datasets include closely touching cells. Comparing with traditional detection and segmentation methods, our approach shows promising detection accuracy and high robustness. PMID:25111442
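
    The second step, random walker segmentation from seed points, is available in scikit-image. In this sketch the paper's concave-point seed detection is replaced by hand-placed seeds on a synthetic image of two touching blobs:

```python
# Random walker segmentation from seeds with scikit-image. The paper's
# concave-point seed detection is replaced by hand-placed seeds here.
import numpy as np
from skimage.segmentation import random_walker

# Two overlapping Gaussian blobs standing in for touching cells.
yy, xx = np.mgrid[0:100, 0:100]
img = (np.exp(-((xx - 40) ** 2 + (yy - 50) ** 2) / 200.0)
       + np.exp(-((xx - 60) ** 2 + (yy - 50) ** 2) / 200.0))

labels = np.zeros(img.shape, dtype=int)
labels[img < 0.1] = 1          # background seeds
labels[50, 40] = 2             # seed for the first "cell"
labels[50, 60] = 3             # seed for the second "cell"

seg = random_walker(img, labels, beta=130)
print("segment sizes:", np.bincount(seg.ravel()))
```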

  14. The Effectiveness of Parent Training as a Treatment for Preschool Attention-Deficit/Hyperactivity Disorder: Study Protocol for a Randomized Controlled, Multicenter Trial of the New Forest Parenting Program in Everyday Clinical Practice

    PubMed Central

    Daley, David; Frydenberg, Morten; Rask, Charlotte U; Sonuga-Barke, Edmund; Thomsen, Per H

    2016-01-01

    Background Parent training is recommended as the first-line treatment for attention-deficit/hyperactivity disorder (ADHD) in preschool children. The New Forest Parenting Programme (NFPP) is an evidence-based parenting program developed specifically to target preschool ADHD. Objective The objective of this trial is to investigate whether the NFPP can be effectively delivered for children referred through official community pathways in everyday clinical practice. Methods A multicenter randomized controlled parallel arm trial design is employed. There are two treatment arms, NFPP and treatment as usual. NFPP consists of eight individually delivered parenting sessions, where the child attends during three of the sessions. Outcomes are examined at three time points (T1, T2, T3): T1 (baseline), T2 (week 12, post intervention), and T3 (6-month follow-up). 140 children aged 3-7 years, with a clinical diagnosis of ADHD, informed by the Development and Well-Being Assessment, and recruited from three child and adolescent psychiatry departments in Denmark will take part. Randomization is on a 1:1 basis, stratified for age and gender. Results The primary endpoint is change in ADHD symptoms as measured by the Preschool ADHD-Rating Scale (ADHD-RS) by T2. Secondary outcome measures include: effects on this measure at T3, and T2 and T3 measures of teacher-reported Preschool ADHD-RS scores, parent- and teacher-rated scores on the Strengths and Difficulties Questionnaire, direct observation of ADHD behaviors during the Child's Solo Play, observation of parent-child interaction, parent sense of competence, and family stress. Results will be reported using the standards set out in the Consolidated Standards of Reporting Trials Statement for Randomized Controlled Trials of nonpharmacological treatments. Conclusions The trial will provide evidence as to whether NFPP is a more effective treatment for preschool ADHD than the treatment usually offered in everyday clinical practice. Trial

  15. World's forests

    SciTech Connect

    Sedjo, R.A.; Clawson, M.

    1982-10-01

    Determining an appropriate rate of deforestation is complicated because forests are associated with many problems involving local economic and social needs, the global need for wood, and the environmental impact on climates and the biological genetic pool. Stable forest land exists in the developed regions of North America, Europe, the USSR, Oceania, and China in the Temperate Zone. Tropical deforestation, however, is estimated at 0.58% per year, with the pressure lowest on virgin forests. While these data omit plantation forests, the level of replacement does not offset the decline. There is some disagreement over the rate and definition of deforestation, but studies showing that the world is in little danger of running out of forests should not discourage tropical areas where forests are declining from making appropriate responses to the problem. 3 references. (DCK)

  16. The LeFE algorithm: embracing the complexity of gene expression in the interpretation of microarray data.

    PubMed

    Eichler, Gabriel S; Reimers, Mark; Kane, David; Weinstein, John N

    2007-01-01

    Interpretation of microarray data remains a challenge, and most methods fail to consider the complex, nonlinear regulation of gene expression. To address that limitation, we introduce Learner of Functional Enrichment (LeFE), a statistical/machine learning algorithm based on Random Forest, and demonstrate it on several diverse datasets: smoker/never smoker, breast cancer classification, and cancer drug sensitivity. We also compare it with previously published algorithms, including Gene Set Enrichment Analysis. LeFE regularly identifies statistically significant functional themes consistent with known biology. PMID:17845722
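
    The LeFE strategy as described, asking whether the genes of a category earn higher Random Forest importance than random background genes, can be sketched as follows. The expression matrix, the category, and the significance test are placeholders:

```python
# Sketch of the LeFE idea: train a Random Forest on a gene category plus
# random background genes and test whether category genes earn higher
# importance scores. Data and the test choice are placeholders.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_genes = 80, 500
X = rng.normal(size=(n_samples, n_genes))     # expression matrix
y = rng.integers(0, 2, size=n_samples)        # e.g. smoker / never smoker

category = np.arange(20)                      # genes in the category
background = rng.choice(np.arange(20, n_genes), size=100, replace=False)
cols = np.concatenate([category, background])

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X[:, cols], y)
imp = clf.feature_importances_

# Are category genes more important than the background genes?
stat, p = mannwhitneyu(imp[:len(category)], imp[len(category):],
                       alternative="greater")
print("Mann-Whitney p-value:", p)
```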

  17. Algorithmic chemistry

    SciTech Connect

    Fontana, W.

    1990-12-13

    In this paper complex adaptive systems are defined by a self-referential loop in which objects encode functions that act back on these objects. A model for this loop is presented. It uses a simple recursive formal language, derived from the lambda-calculus, to provide a semantics that maps character strings into functions that manipulate symbols on strings. The interaction between two functions, or algorithms, is defined naturally within the language through function composition, and results in the production of a new function. An iterated map acting on sets of functions and a corresponding graph representation are defined. Their properties are useful to discuss the behavior of a fixed size ensemble of randomly interacting functions. This "function gas", or "Turing gas", is studied under various conditions, and evolves cooperative interaction patterns of considerable intricacy. These patterns adapt under the influence of perturbations consisting of the addition of new random functions to the system. Different organizations emerge depending on the availability of self-replicators.

  18. The Search for Gravitational Waves from the Coalescence of Black Hole Binary Systems in Data from the LIGO and Virgo Detectors Or: A Dark Walk through a Random Forest

    NASA Astrophysics Data System (ADS)

    Hodge, Kari Alison

    The LIGO and Virgo gravitational-wave observatories are complex and extremely sensitive strain detectors that can be used to search for a wide variety of gravitational waves from astrophysical and cosmological sources. In this thesis, I motivate the search for the gravitational wave signals from coalescing black hole binary systems with total mass between 25 and 100 solar masses. The mechanisms for formation of such systems are not well-understood, and we do not have many observational constraints on the parameters that guide the formation scenarios. Detection of gravitational waves from such systems (or, in the absence of detection, the tightening of upper limits on the rate of such coalescences) will provide valuable information that can inform the astrophysics of the formation of these systems. I review the search for these systems and place upper limits on the rate of black hole binary coalescences with total mass between 25 and 100 solar masses. I then show how the sensitivity of this search can be improved by up to 40% by the application of the multivariate statistical classifier known as a random forest of bagged decision trees to more effectively discriminate between signal and non-Gaussian instrumental noise. I also discuss the use of this classifier in the search for the ringdown signal from the merger of two black holes with total mass between 50 and 450 solar masses and present upper limits. I also apply multivariate statistical classifiers to the problem of quantifying the non-Gaussianity of LIGO data. Despite these improvements, no gravitational-wave signals have been detected in LIGO data so far. However, the use of multivariate statistical classification can significantly improve the sensitivity of the Advanced LIGO detectors to such signals.

  19. Forest Fragmentation

    EPA Science Inventory

    This indicator describes forest fragmentation in the contiguous United States circa 2001. This information provides a broad, recent picture of the spatial pattern of the nation’s forests and the extent to which they are being broken into smaller patches and pierced or interspe...

  20. Improved forest change detection with terrain illumination corrected landsat images

    Technology Transfer Automated Retrieval System (TEKTRAN)

    An illumination correction algorithm has been developed to improve the accuracy of forest change detection from Landsat reflectance data. This algorithm is based on an empirical rotation model and was tested on the Landsat imagery pair over Cherokee National Forest, Tennessee, Uinta-Wasatch-Cache N...
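
    The abstract does not spell out the rotation model, so the following is only a guess at its flavor: a generic empirical correction that regresses a band against the illumination condition cos(i) and removes the fitted dependence.

```python
# A generic empirical rotation-style illumination correction sketch:
# regress reflectance against cos(i) and rotate out the fitted slope.
# This is an assumed form, not the TEKTRAN abstract's exact model.
import numpy as np

def rotation_correction(band, cos_i):
    """Remove the fitted linear dependence of reflectance on cos(i)."""
    slope, intercept = np.polyfit(cos_i, band, 1)
    return band - slope * (cos_i - cos_i.mean())

rng = np.random.default_rng(0)
cos_i = rng.uniform(0.2, 1.0, 10_000)             # per-pixel illumination
band = 0.3 * cos_i + rng.normal(0, 0.02, 10_000)  # shaded-slope darkening
corrected = rotation_correction(band, cos_i)
print("correlation before:", np.corrcoef(band, cos_i)[0, 1])
print("correlation after :", np.corrcoef(corrected, cos_i)[0, 1])
```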

  1. Satellite-based forest monitoring: spatial and temporal forecast of growing index and short-wave infrared band.

    PubMed

    Bayr, Caroline; Gallaun, Heinz; Kleb, Ulrike; Kornberger, Birgit; Steinegger, Martin; Winter, Martin

    2016-01-01

    For detecting anomalies or interventions in the field of forest monitoring we propose an approach based on the spatial and temporal forecast of satellite time series data. For each pixel of the satellite image three different types of forecasts are provided, namely spatial, temporal and combined spatio-temporal forecast. Spatial forecast means that a clustering algorithm is used to group the time series data based on the features normalised difference vegetation index (NDVI) and the short-wave infrared band (SWIR). For estimation of the typical temporal trajectory of the NDVI and SWIR during the vegetation period of each spatial cluster, we apply several methods of functional data analysis including functional principal component analysis, and a novel form of random regression forests with online learning (streaming) capability. The temporal forecast is carried out by means of functional time series analysis and an autoregressive integrated moving average model. The combination of the temporal forecasts, which is based on the past of the considered pixel, and spatial forecasts, which is based on highly correlated pixels within one cluster and their past, is performed by functional data analysis, and a variant of random regression forests adapted to online learning capabilities. For evaluation of the methods, the approaches are applied to a study area in Germany for monitoring forest damages caused by wind-storm, and to a study area in Spain for monitoring forest fires. PMID:27087034
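
    Two of the ingredients named here, a spatial clustering of pixel time series and a per-pixel temporal forecast with an ARIMA model, can be sketched as follows. Cluster count, ARIMA order, and the synthetic seasonal NDVI series are illustrative assumptions:

```python
# Sketch of two ingredients named above: KMeans clustering of pixel
# time series (spatial grouping) and an ARIMA forecast for one pixel
# (temporal forecast). All numbers are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
n_pixels, n_steps = 200, 48
t = np.arange(n_steps)
ndvi = (0.5 + 0.2 * np.sin(2 * np.pi * t / 12)
        + rng.normal(0, 0.03, (n_pixels, n_steps)))

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(ndvi)
print("pixels per cluster:", np.bincount(clusters))

fit = ARIMA(ndvi[0], order=(1, 0, 1)).fit()
print("next-step NDVI forecast:", fit.forecast(1))
```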

  2. Mapping tree health using airborne laser scans and hyperspectral imagery: a case study for a floodplain eucalypt forest

    NASA Astrophysics Data System (ADS)

    Shendryk, Iurii; Tulbure, Mirela; Broich, Mark; McGrath, Andrew; Alexandrov, Sergey; Keith, David

    2016-04-01

    Airborne laser scanning (ALS) and hyperspectral imaging (HSI) are two complementary remote sensing technologies that provide comprehensive structural and spectral characteristics of forests over large areas. In this study we developed two algorithms: one for individual tree delineation utilizing ALS and the other utilizing ALS and HSI to characterize the health of delineated trees in a structurally complex floodplain eucalypt forest. We conducted experiments in the world's largest eucalypt (river red gum) forest, located in the south-east of Australia, which has experienced severe dieback over the past six decades. For detection of individual trees from ALS we developed a novel bottom-up approach based on Euclidean distance clustering to detect tree trunks and random walks segmentation to further delineate tree crowns. Overall, our algorithm was able to detect 67% of tree trunks with diameter larger than 13 cm. We assessed the accuracy of tree delineations in terms of crown height and width, with correct delineation of 68% of tree crowns. The increase in ALS point density from ~12 to ~24 points/m2 resulted in increases in tree trunk detection and crown delineation of 11% and 13%, respectively. Trees with incorrectly delineated crowns were generally attributed to areas with high tree density along water courses. The accurate delineation of trees allowed us to classify the health of this forest using machine learning and field-measured tree crown dieback and transparency ratios, which were good predictors of tree health in this forest. ALS and HSI derived indices were used as predictor variables to train and test an object-oriented random forest classifier. Returned pulse width, intensity and density related ALS indices were the most important predictors in the tree health classifications. At the forest level in terms of tree crown dieback, 77% of trees were classified as healthy, 14% as declining and 9% as dying or dead with 81% mapping accuracy. Similarly, in terms of tree

  3. Evaluation of Algorithms for a Miles-in-Trail Decision Support Tool

    NASA Technical Reports Server (NTRS)

    Bloem, Michael; Hattaway, David; Bambos, Nicholas

    2012-01-01

    Four machine learning algorithms were prototyped and evaluated for use in a proposed decision support tool that would assist air traffic managers as they set Miles-in-Trail restrictions. The tool would display probabilities that each possible Miles-in-Trail value should be used in a given situation. The algorithms were evaluated with an expected Miles-in-Trail cost that assumes traffic managers set restrictions based on the tool-suggested probabilities. Basic Support Vector Machine, random forest, and decision tree algorithms were evaluated, as was a softmax regression algorithm that was modified to explicitly reduce the expected Miles-in-Trail cost. The algorithms were evaluated with data from the summer of 2011 for air traffic flows bound to the Newark Liberty International Airport (EWR) over the ARD, PENNS, and SHAFF fixes. The algorithms were provided with 18 input features that describe the weather at EWR, the runway configuration at EWR, the scheduled traffic demand at EWR and the fixes, and other traffic management initiatives in place at EWR. Features describing other traffic management initiatives at EWR and the weather at EWR achieved relatively high information gain scores, indicating that they are the most useful for estimating Miles-in-Trail. In spite of a high variance or over-fitting problem, the decision tree algorithm achieved the lowest expected Miles-in-Trail costs when the algorithms were evaluated using 10-fold cross validation with the summer 2011 data for these air traffic flows.
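
    The evaluation pattern, fitting several classifiers to the same features and comparing them by cross-validation, looks roughly like the sketch below. The synthetic features stand in for the 18 inputs described, and plain accuracy replaces the paper's expected Miles-in-Trail cost:

```python
# Sketch of the comparison pattern: several classifiers, one feature
# set, 10-fold cross-validation. Accuracy stands in for the paper's
# cost-based evaluation; data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 18))        # 18 features, as in the abstract
y = rng.integers(0, 4, size=400)      # possible Miles-in-Trail values

models = {
    "SVM": SVC(),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "softmax": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name:14s} {scores.mean():.3f}")
```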

  4. Parallel algorithms and architectures

    SciTech Connect

    Albrecht, A.; Jung, H.; Mehlhorn, K.

    1987-01-01

    The contents of this book are the following: Preparata: Deterministic simulation of idealized parallel computers on more realistic ones; Convex hull of randomly chosen points from a polytope; Dataflow computing; Parallel in sequence; Towards the architecture of an elementary cortical processor; Parallel algorithms and static analysis of parallel programs; Parallel processing of combinatorial search; Communications; An O(n log n) cost parallel algorithm for the single function coarsest partition problem; Systolic algorithms for computing the visibility polygon and triangulation of a polygonal region; RELACS - A recursive layout computing system; and Parallel linear conflict-free subtree access.

  5. Time fluctuation analysis of forest fire sequences

    NASA Astrophysics Data System (ADS)

    Vega Orozco, Carmen D.; Kanevski, Mikhaïl; Tonini, Marj; Golay, Jean; Pereira, Mário J. G.

    2013-04-01

    Forest fires are complex events involving both space and time fluctuations. Understanding of their dynamics and pattern distribution is of great importance in order to improve the resource allocation and support fire management actions at local and global levels. This study aims at characterizing the temporal fluctuations of forest fire sequences observed in Portugal, which is the country that holds the largest wildfire land dataset in Europe. This research applies several exploratory data analysis measures to 302,000 forest fires that occurred from 1980 to 2007. The applied clustering measures are: Morisita clustering index, fractal and multifractal dimensions (box-counting), Ripley's K-function, Allan Factor, and variography. These algorithms enable a global time structural analysis describing the degree of clustering of a point pattern and defining whether the observed events occur randomly, in clusters or in a regular pattern. The considered methods are of general importance and can be used for other spatio-temporal events (e.g. crime, epidemiology, biodiversity, geomarketing, etc.). An important contribution of this research deals with the analysis and estimation of local measures of clustering that help in understanding their temporal structure. Each measure is described and executed for the raw data (forest fires geo-database) and results are compared to reference patterns generated under the null hypothesis of randomness (Poisson processes) embedded in the same time period of the raw data. This comparison enables estimating the degree of deviation of the real data from a Poisson process. Generalizations to functional measures of these clustering methods, taking into account the phenomena, were also applied and adapted to detect time dependences in a measured variable (e.g. burned area). The time clustering of the raw data is compared several times with the Poisson processes at different thresholds of the measured function. Then, the clustering measure value
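
    As one concrete example of the clustering measures listed, the Allan Factor over windows of length T is AF(T) = E[(N_{k+1} - N_k)^2] / (2 E[N_k]), where N_k counts events in the k-th window; it is near 1 for a homogeneous Poisson process and above 1 for clustered sequences. A sketch on synthetic event times:

```python
# Allan Factor for a point process: near 1 for Poisson event times,
# well above 1 for clustered (bursty) event times. Data are synthetic.
import numpy as np

def allan_factor(times, window, t_max):
    edges = np.arange(0.0, t_max + window, window)
    counts, _ = np.histogram(times, bins=edges)
    diffs = np.diff(counts)
    return np.mean(diffs ** 2) / (2.0 * np.mean(counts))

rng = np.random.default_rng(0)
poisson_times = np.sort(rng.uniform(0, 10_000, 5000))
centers = rng.uniform(0, 10_000, 250)            # burst centers
clustered_times = np.sort(
    (centers[:, None] + rng.exponential(5, (250, 20))).ravel())

for name, t in [("Poisson  ", poisson_times), ("clustered", clustered_times)]:
    print(name, allan_factor(t, window=100.0, t_max=10_000.0))
```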

  6. Random Walks on Random Graphs

    NASA Astrophysics Data System (ADS)

    Cooper, Colin; Frieze, Alan

    The aim of this article is to discuss some of the notions and applications of random walks on finite graphs, especially as they apply to random graphs. In this section we give some basic definitions, in Section 2 we review applications of random walks in computer science, and in Section 3 we focus on walks in random graphs.
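
    A minimal simulation of the object under discussion, a simple random walk that steps to a uniformly random neighbor, recording how long it takes to visit every vertex. The graph is an arbitrary example:

```python
# Simple random walk on a graph: step to a uniformly random neighbor
# and record the cover time (steps until every vertex is visited).
import random
import networkx as nx

random.seed(0)
G = nx.erdos_renyi_graph(100, 0.08, seed=0)

v = 0
visited = {v}
steps = 0
while len(visited) < len(G) and steps < 1_000_000:
    nbrs = list(G[v])
    if not nbrs:                      # guard against an isolated vertex
        break
    v = random.choice(nbrs)
    visited.add(v)
    steps += 1

print(f"covered {len(visited)}/{len(G)} vertices in {steps} steps")
```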

  7. Forest Cover Estimation in Ireland Using Radar Remote Sensing: A Comparative Analysis of Forest Cover Assessment Methodologies.

    PubMed

    Devaney, John; Barrett, Brian; Barrett, Frank; Redmond, John; O Halloran, John

    2015-01-01

    Quantification of spatial and temporal changes in forest cover is an essential component of forest monitoring programs. Due to its cloud free capability, Synthetic Aperture Radar (SAR) is an ideal source of information on forest dynamics in countries with near-constant cloud-cover. However, few studies have investigated the use of SAR for forest cover estimation in landscapes with highly sparse and fragmented forest cover. In this study, the potential use of L-band SAR for forest cover estimation in two regions (Longford and Sligo) in Ireland is investigated and compared to forest cover estimates derived from three national (Forestry2010, Prime2, National Forest Inventory), one pan-European (Forest Map 2006) and one global forest cover (Global Forest Change) product. Two machine-learning approaches (Random Forests and Extremely Randomised Trees) are evaluated. Both Random Forests and Extremely Randomised Trees classification accuracies were high (98.1-98.5%), with differences between the two classifiers being minimal (<0.5%). Increasing levels of post classification filtering led to a decrease in estimated forest area and an increase in overall accuracy of SAR-derived forest cover maps. All forest cover products were evaluated using an independent validation dataset. For the Longford region, the highest overall accuracy was recorded with the Forestry2010 dataset (97.42%) whereas in Sligo, highest overall accuracy was obtained for the Prime2 dataset (97.43%), although accuracies of SAR-derived forest maps were comparable. Our findings indicate that spaceborne radar could aid inventories in regions with low levels of forest cover in fragmented landscapes. The reduced accuracies observed for the global and pan-continental forest cover maps in comparison to national and SAR-derived forest maps indicate that caution should be exercised when applying these datasets for national reporting. PMID:26262681
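
    The two classifiers compared here are both available in scikit-learn, so the comparison pattern can be sketched directly. The synthetic features below stand in for L-band backscatter statistics:

```python
# Sketch of the comparison in the abstract: Random Forests versus
# Extremely Randomised Trees on the same features. Data are synthetic.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))        # e.g. HH/HV backscatter and texture
y = rng.integers(0, 2, size=1000)     # forest / non-forest

for clf in (RandomForestClassifier(n_estimators=300, random_state=0),
            ExtraTreesClassifier(n_estimators=300, random_state=0)):
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(type(clf).__name__, f"{acc:.3f}")
```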

  8. Forest Cover Estimation in Ireland Using Radar Remote Sensing: A Comparative Analysis of Forest Cover Assessment Methodologies

    PubMed Central

    Devaney, John; Barrett, Brian; Barrett, Frank; Redmond, John; O'Halloran, John

    2015-01-01

    Quantification of spatial and temporal changes in forest cover is an essential component of forest monitoring programs. Due to its cloud free capability, Synthetic Aperture Radar (SAR) is an ideal source of information on forest dynamics in countries with near-constant cloud-cover. However, few studies have investigated the use of SAR for forest cover estimation in landscapes with highly sparse and fragmented forest cover. In this study, the potential use of L-band SAR for forest cover estimation in two regions (Longford and Sligo) in Ireland is investigated and compared to forest cover estimates derived from three national (Forestry2010, Prime2, National Forest Inventory), one pan-European (Forest Map 2006) and one global forest cover (Global Forest Change) product. Two machine-learning approaches (Random Forests and Extremely Randomised Trees) are evaluated. Both Random Forests and Extremely Randomised Trees classification accuracies were high (98.1–98.5%), with differences between the two classifiers being minimal (<0.5%). Increasing levels of post classification filtering led to a decrease in estimated forest area and an increase in overall accuracy of SAR-derived forest cover maps. All forest cover products were evaluated using an independent validation dataset. For the Longford region, the highest overall accuracy was recorded with the Forestry2010 dataset (97.42%) whereas in Sligo, highest overall accuracy was obtained for the Prime2 dataset (97.43%), although accuracies of SAR-derived forest maps were comparable. Our findings indicate that spaceborne radar could aid inventories in regions with low levels of forest cover in fragmented landscapes. The reduced accuracies observed for the global and pan-continental forest cover maps in comparison to national and SAR-derived forest maps indicate that caution should be exercised when applying these datasets for national reporting. PMID:26262681

  9. Genetic algorithms

    NASA Technical Reports Server (NTRS)

    Wang, Lui; Bayer, Steven E.

    1991-01-01

    Genetic algorithms are mathematical, highly parallel, adaptive search procedures (i.e., problem solving methods) based loosely on the processes of natural genetics and Darwinian survival of the fittest. Basic genetic algorithms concepts are introduced, genetic algorithm applications are introduced, and results are presented from a project to develop a software tool that will enable the widespread use of genetic algorithm technology.
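
    A minimal genetic algorithm of the kind introduced here, with survival of the fittest, one-point crossover, and mutation over bit strings, solving the classic OneMax toy problem:

```python
# Minimal genetic algorithm on the OneMax problem: evolve bit strings
# toward all ones via selection, one-point crossover, and mutation.
import random

random.seed(0)
LENGTH, POP, GENS, MUT = 40, 60, 80, 0.01
fitness = lambda s: sum(s)            # OneMax: count the ones

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:POP // 2]                     # survival of the fittest
    children = []
    while len(children) < POP - len(survivors):
        a, b = random.sample(survivors, 2)
        cut = random.randrange(1, LENGTH)          # one-point crossover
        child = a[:cut] + b[cut:]
        child = [1 - g if random.random() < MUT else g for g in child]
        children.append(child)
    pop = survivors + children

print("best fitness:", fitness(max(pop, key=fitness)), "of", LENGTH)
```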

  10. Geological Mapping Using Machine Learning Algorithms

    NASA Astrophysics Data System (ADS)

    Harvey, A. S.; Fotopoulos, G.

    2016-06-01

    Remotely sensed spectral imagery, geophysical (magnetic and gravity), and geodetic (elevation) data are useful in a variety of Earth science applications such as environmental monitoring and mineral exploration. Using these data with Machine Learning Algorithms (MLA), which are widely used in image analysis and statistical pattern recognition applications, may enhance preliminary geological mapping and interpretation. This approach contributes towards a rapid and objective means of geological mapping in contrast to conventional field expedition techniques. In this study, four supervised MLAs (naïve Bayes, k-nearest neighbour, random forest, and support vector machines) are compared in order to assess their performance for correctly identifying geological rock types in an area with complete ground validation information. Geological maps of the Sudbury region are used for calibration and validation. The percentage of correct classifications was used as the indicator of performance. Results show that random forest is the best approach. As expected, MLA performance improves with more calibration clusters, i.e. a more uniform distribution of calibration data over the study region. Performance is generally low, though geological trends that correspond to a ground validation map are visualized. Low performance may be the result of poor spectral imaging of bare rock, which can be covered by vegetation or water. The distribution of calibration clusters and MLA input parameters affect the performance of the MLAs. Generally, performance improves with more uniform sampling, though this increases required computational effort and time. With the achievable performance levels in this study, the technique is useful in identifying regions of interest and identifying general rock type trends. In particular, phase I geological site investigations will benefit from this approach and lead to the selection of sites for advanced surveys.

  11. A unifying graph-cut image segmentation framework: algorithms it encompasses and equivalences among them

    NASA Astrophysics Data System (ADS)

    Ciesielski, Krzysztof Chris; Udupa, Jayaram K.; Falcão, A. X.; Miranda, P. A. V.

    2012-02-01

    We present a general graph-cut segmentation framework GGC, in which the delineated objects returned by the algorithms optimize the energy functions associated with the lp norm, 1 <= p <= ∞. Two classes of well-known algorithms belong to GGC: the standard graph cut GC (such as the min-cut/max-flow algorithm) and the relative fuzzy connectedness algorithms RFC (including iterative RFC, IRFC). The norm-based description of GGC provides a more elegant and mathematically better-recognized framework for our earlier results from [18, 19]. Moreover, it allows a precise theoretical comparison of GGC representable algorithms with the algorithms discussed in a recent paper [22] (min-cut/max-flow graph cut, random walker, shortest path/geodesic, Voronoi diagram, power watershed/shortest path forest), which optimize, via lp norms, the intermediate segmentation step, the labeling of scene voxels, but for which the final object need not optimize the used lp energy function. Actually, the comparison of the GGC representable algorithms with those encompassed in the framework described in [22] constitutes the main contribution of this work.
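
    The GC member of the framework, min-cut/max-flow segmentation, can be sketched on a tiny one-dimensional "image": terminal capacities encode intensity likelihoods and pairwise capacities encode smoothness. All weights are illustrative:

```python
# Binary segmentation as a minimum s-t cut with networkx: terminal
# edges carry intensity likelihoods, neighbor edges carry smoothness.
import networkx as nx

intens = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85]    # two obvious regions

G = nx.DiGraph()
for i, v in enumerate(intens):
    G.add_edge("s", i, capacity=1.0 - v)     # affinity to the dark object
    G.add_edge(i, "t", capacity=v)           # affinity to the bright object
for i in range(len(intens) - 1):
    w = 1.0 - abs(intens[i] - intens[i + 1])  # smoothness term
    G.add_edge(i, i + 1, capacity=w)
    G.add_edge(i + 1, i, capacity=w)

cut_value, (src_side, sink_side) = nx.minimum_cut(G, "s", "t")
print("dark object (source side):", sorted(n for n in src_side if n != "s"))
```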

  12. A Geospatial Assessment of Mountain Pine Beetle Infestations and Their Effect on Forest Health in Okanogan-Wenatchee National Forest

    NASA Astrophysics Data System (ADS)

    Allain, M.; Nguyen, A.; Johnson, E.; Williams, E.; Tsai, S.; Prichard, S.; Freed, T.; Skiles, J. W.

    2010-12-01

    Fire-suppression over the past century has resulted in an accumulation of forest litter and increased tree density. As nutrients are sequestered in forest litter and not recycled by forest fires, soil nutrient concentrations have decreased. The forests of Northern Washington are in poor health as a result of these factors coupled with sequential droughts. The mountain pine beetle (MPB) thrives in such conditions, giving rise to an outbreak in Washington’s Okanogan-Wenatchee National Forest. These outbreaks occur in three successive stages— the green, red, and gray stages. Beetles first infest the tree in the green phase, leading to discoloration of needles in the red phase and eventually death in the gray phase. With the use of geospatial technology, these outbreaks can be better mapped and assessed to evaluate forest health. Field work on seventeen randomly selected sites was conducted using the point-centered quarter method. The stratified random sampling technique ensured that the sampled trees were representative of all classifications present. Additional measurements taken were soil nutrient concentrations (sodium [Na+], nitrate [NO3-], and potassium [K+]), soil pH, and tree temperatures. Satellite imagery was used to define infestation levels and geophysical parameters, such as land cover, vegetation classification, and vegetation stress. ASTER images were used with the Ratio Vegetation Index (RVI) to explore the differences in vegetation, while MODIS images were used to analyze the Disturbance Index (DI). Four other vegetation indices from Landsat TM5 were used to distinguish the green, red and gray phases. Selected imagery from the Hyperion sensor was used to run a minimum distance supervised classification in ENVI, thus testing the ability of Hyperion imagery to detect the green phase. The National Agricultural Imagery Program (NAIP) archive was used to generate accurate maps of beetle-infested regions. This algorithm was used to detect bark beetle

  13. Randomized selection on the GPU

    SciTech Connect

    Monroe, Laura Marie; Wendelberger, Joanne R; Michalak, Sarah E

    2011-01-13

    We implement here a fast and memory-sparing probabilistic top N selection algorithm on the GPU. To our knowledge, this is the first direct selection in the literature for the GPU. The algorithm proceeds via a probabilistic guess-and-check process searching for the Nth element. It always gives a correct result and always terminates. The use of randomization reduces the amount of data that needs heavy processing, and so reduces the average time required for the algorithm. Probabilistic Las Vegas algorithms of this kind are a form of stochastic optimization and can be well suited to more general parallel processors with limited amounts of fast memory.
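
    A CPU sketch in the same Las Vegas spirit: a randomized quickselect whose random pivot prunes most of the data each round, giving expected linear work while always returning the exact answer. The GPU-specific guessing strategy of the paper is not reproduced:

```python
# Randomized quickselect (a Las Vegas algorithm): a random pivot prunes
# the candidate set each round; the result is always exactly correct.
import random

def quickselect(items, n):
    """Return the n-th smallest element (0-indexed)."""
    while True:
        pivot = random.choice(items)
        less = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if n < len(less):
            items = less
        elif n < len(less) + equal_count:
            return pivot
        else:
            n -= len(less) + equal_count
            items = [x for x in items if x > pivot]

random.seed(0)
vals = [random.random() for _ in range(100_000)]
print(quickselect(vals, len(vals) // 2))   # the median
```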

  14. Forested wetlands

    SciTech Connect

    Lugo, A.E.; Brinson, M.; Brown, S.

    1990-01-01

    Forested wetlands have important roles in global biogeochemical cycles, in supporting freshwater and saltwater commercial fisheries, and in providing a place for wildlife of all kinds to flourish. Scientific attention towards these ecosystems has lagged, with only a few comprehensive works on the forested wetlands of the world. A major emphasis of this book is to develop unifying principles and databases on the structure and function of forested wetlands, in order to stimulate scientific study of them. Wetlands are areas that are inundated or saturated by surface water or groundwater at such a frequency and duration that under natural conditions they support organisms adapted to poorly aerated and/or saturated soil. The major topics covered are a strategy for classifying the conditions that control the structure and behavior of forested wetlands, which assumes that the physiognomy and floristic composition of the system reflect the total energy expenditure of the ecosystem, and the structural and functional characteristics of forested wetlands from different parts of the world.

  15. Randomized parallel speedups for list ranking

    SciTech Connect

    Vishkin, U.

    1987-06-01

    The following problem is considered: given a linked list of length n, compute the distance of each element of the linked list from the end of the list. The problem has two standard deterministic algorithms: a linear time serial algorithm, and an O((n log n)/rho + log n) time parallel algorithm using rho processors. The authors present a randomized parallel algorithm for the problem. The algorithm is designed for an exclusive-read exclusive-write parallel random access machine (EREW PRAM). It runs almost surely in time O(n/rho + log n log* n) using rho processors. Using a recently published parallel prefix sums algorithm the list-ranking algorithm can be adapted to run on a concurrent-read concurrent-write parallel random access machine (CRCW PRAM) almost surely in time O(n/rho + log n) using rho processors.
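
    Parallel list ranking is typically built on pointer jumping (Wyllie's scheme): each round, every element adds its successor's rank and then jumps over it, so O(log n) rounds suffice when the updates run in parallel. A sequential simulation:

```python
# Sequential simulation of pointer jumping for list ranking: each round,
# every element adds its successor's rank and jumps over it. The inner
# loop is conceptually parallel; ~log2(n) rounds suffice.
NIL = -1

def list_rank(succ):
    n = len(succ)
    rank = [0 if s == NIL else 1 for s in succ]
    nxt = list(succ)
    for _ in range(n.bit_length()):           # ~log2(n) jumping rounds
        new_rank, new_nxt = list(rank), list(nxt)
        for i in range(n):                    # conceptually parallel
            if nxt[i] != NIL:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        rank, nxt = new_rank, new_nxt
    return rank

# A list 0 -> 1 -> 2 -> 3 -> 4 (element 4 is the end of the list).
print(list_rank([1, 2, 3, 4, NIL]))           # distances to the end
```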

  16. Mapping stand-age distribution of Russian forests from satellite data

    NASA Astrophysics Data System (ADS)

    Chen, D.; Loboda, T. V.; Hall, A.; Channan, S.; Weber, C. Y.

    2013-12-01

    -based indices. The resultant map provides an estimate of forest age based on the regrowth curves observed from Landsat imagery. The accuracy of the resultant map is assessed against three datasets: 1) a subset of the disturbance maps developed within the algorithm, 2) independent disturbance maps created by the Northern Eurasia Land Dynamics Analysis (NELDA) project, and 3) field-based stand-age distribution from forestry inventory units. The current version of the product presents a considerable improvement on the previous version, which used Landsat data samples at a set of randomly selected locations, resulting in a strong bias of the training samples towards the Landsat-rich regions (e.g. European Russia), whereas regions such as Siberia were under-sampled. Aiming at improving accuracy, the current method significantly increases the number of training Landsat samples compared to the previous work. Aside from the previously used data, the current method uses all available Landsat data for the under-sampled regions in order to increase the representativeness of the total samples. The final accuracy assessment is still ongoing; however, the initial results suggest an overall accuracy expressed in Kappa > 0.8. We plan to release both the training data and the final disturbance map of the Russian boreal forest to the public after the validation is completed.

  17. Genomic-enabled prediction with classification algorithms.

    PubMed

    Ornella, L; Pérez, P; Tapia, E; González-Camacho, J M; Burgueño, J; Zhang, X; Singh, S; Vicente, F S; Bonnett, D; Dreisigacker, S; Singh, R; Long, N; Crossa, J

    2014-06-01

    Pearson's correlation coefficient (ρ) is the most commonly reported metric of the success of prediction in genomic selection (GS). However, in real breeding ρ may not be very useful for assessing the quality of the regression in the tails of the distribution, where individuals are chosen for selection. This research used 14 maize and 16 wheat data sets with different trait-environment combinations. Six different models were evaluated by means of a cross-validation scheme (50 random partitions each, with 90% of the individuals in the training set and 10% in the testing set). The predictive accuracy of these algorithms for selecting individuals belonging to the best α=10, 15, 20, 25, 30, 35, 40% of the distribution was estimated using Cohen's kappa coefficient (κ) and an ad hoc measure, which we call relative efficiency (RE), which indicates the expected genetic gain due to selection when individuals are selected based on GS exclusively. We put special emphasis on the analysis for α=15%, because it is a percentile commonly used in plant breeding programmes (for example, at CIMMYT). We also used ρ as a criterion for overall success. The algorithms used were: Bayesian LASSO (BL), Ridge Regression (RR), Reproducing Kernel Hilbert Spaces (RKHS), Random Forest Regression (RFR), and Support Vector Regression (SVR) with linear (lin) and Gaussian kernels (rbf). The performance of regression methods for selecting the best individuals was compared with that of three supervised classification algorithms: Random Forest Classification (RFC) and Support Vector Classification (SVC) with linear (lin) and Gaussian (rbf) kernels. Classification methods were evaluated using the same cross-validation scheme but with the response vector of the original training sets dichotomised using a given threshold. For α=15%, SVC-lin presented the highest κ coefficients in 13 of the 14 maize data sets, with best values ranging from 0.131 to 0.722 (statistically significant in 9 data sets
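
    The evaluation described, dichotomising the phenotype at the best α=15% of the distribution, training a classifier on markers, and scoring selection with Cohen's kappa, can be sketched as follows. The marker matrix and trait are synthetic placeholders for the maize and wheat data sets:

```python
# Sketch of classification-based genomic selection evaluation: label the
# top 15% of the trait distribution, fit a linear SVC on markers, and
# score with Cohen's kappa. All data are synthetic placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 1000)).astype(float)  # SNP markers (0/1/2)
y = X[:, :25].sum(axis=1) + rng.normal(0, 2, 500)       # synthetic trait

cut = np.quantile(y, 0.85)                 # top 15% are the "selected" class
labels = (y >= cut).astype(int)

X_tr, X_te, l_tr, l_te = train_test_split(X, labels, test_size=0.1,
                                          random_state=0)
clf = SVC(kernel="linear").fit(X_tr, l_tr)
print("kappa:", cohen_kappa_score(l_te, clf.predict(X_te)))
```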

  18. Genomic-enabled prediction with classification algorithms

    PubMed Central

    Ornella, L; Pérez, P; Tapia, E; González-Camacho, J M; Burgueño, J; Zhang, X; Singh, S; Vicente, F S; Bonnett, D; Dreisigacker, S; Singh, R; Long, N; Crossa, J

    2014-01-01

    Pearson's correlation coefficient (ρ) is the most commonly reported metric of the success of prediction in genomic selection (GS). However, in real breeding ρ may not be very useful for assessing the quality of the regression in the tails of the distribution, where individuals are chosen for selection. This research used 14 maize and 16 wheat data sets with different trait–environment combinations. Six different models were evaluated by means of a cross-validation scheme (50 random partitions each, with 90% of the individuals in the training set and 10% in the testing set). The predictive accuracy of these algorithms for selecting individuals belonging to the best α=10, 15, 20, 25, 30, 35, 40% of the distribution was estimated using Cohen's kappa coefficient (κ) and an ad hoc measure, which we call relative efficiency (RE), which indicates the expected genetic gain due to selection when individuals are selected based on GS exclusively. We put special emphasis on the analysis for α=15%, because it is a percentile commonly used in plant breeding programmes (for example, at CIMMYT). We also used ρ as a criterion for overall success. The algorithms used were: Bayesian LASSO (BL), Ridge Regression (RR), Reproducing Kernel Hilbert Spaces (RKHS), Random Forest Regression (RFR), and Support Vector Regression (SVR) with linear (lin) and Gaussian kernels (rbf). The performance of regression methods for selecting the best individuals was compared with that of three supervised classification algorithms: Random Forest Classification (RFC) and Support Vector Classification (SVC) with linear (lin) and Gaussian (rbf) kernels. Classification methods were evaluated using the same cross-validation scheme but with the response vector of the original training sets dichotomised using a given threshold. For α=15%, SVC-lin presented the highest κ coefficients in 13 of the 14 maize data sets, with best values ranging from 0.131 to 0.722 (statistically significant in 9 data sets

  19. Testing an earthquake prediction algorithm

    USGS Publications Warehouse

    Kossobokov, V.G.; Healy, J.H.; Dewey, J.W.

    1997-01-01

    A test to evaluate earthquake prediction algorithms is being applied to a Russian algorithm known as M8. The M8 algorithm makes intermediate-term predictions for earthquakes to occur in a large circle, based on integral counts of transient seismicity in the circle. In a retroactive prediction for the period January 1, 1985 to July 1, 1991, the algorithm as configured for the forward test would have predicted eight of ten strong earthquakes in the test area. A null hypothesis, based on random assignment of predictions, predicts eight earthquakes in 2.87% of the trials. The forward test began July 1, 1991 and will run through December 31, 1997. As of July 1, 1995, the algorithm had forward predicted five out of nine earthquakes in the test area, a success ratio that would have been achieved in 53% of random trials under the null hypothesis.
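
    The null-hypothesis figure quoted here can be reproduced in spirit by Monte Carlo: assign predictions at random with some space-time alarm fraction and count how often at least k of m earthquakes are caught. The alarm fraction below is a hypothetical stand-in, not the value used in the M8 test:

```python
# Monte Carlo null-hypothesis sketch: if random predictions cover a
# fixed fraction of space-time, how often are >= k of m events caught?
# The alarm fraction is a hypothetical stand-in for the M8 test's value.
import numpy as np

rng = np.random.default_rng(0)
alarm_fraction = 0.45        # hypothetical fraction of space-time on alarm
m, k, trials = 10, 8, 100_000

hits = rng.random((trials, m)) < alarm_fraction
p_value = np.mean(hits.sum(axis=1) >= k)
print(f"P(>= {k} of {m} predicted by chance) = {p_value:.4f}")
```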

  20. Forests & Trees.

    ERIC Educational Resources Information Center

    Gage, Susan

    1989-01-01

    This newsletter discusses the disappearance of the world's forests and the resulting environmental problems of erosion and flooding; loss of genetic diversity; climatic changes such as less rainfall, and intensifying of the greenhouse effect; and displacement and destruction of indigenous cultures. The articles, lessons, and activities are…

  1. Forest Imaging

    NASA Technical Reports Server (NTRS)

    1992-01-01

    NASA's Technology Applications Center, with other government and academic agencies, provided technology for improved resources management to the Cibola National Forest. Landsat satellite images enabled vegetation over a large area to be classified for purposes of timber analysis, wildlife habitat, range measurement and development of general vegetation maps.

  2. Application of AIS Technology to Forest Mapping

    NASA Technical Reports Server (NTRS)

    Yool, S. R.; Star, J. L.

    1985-01-01

    Concerns about the environmental effects of large-scale deforestation have prompted efforts to map forests over large areas using various remote sensing data and image processing techniques. Basic research on the spectral characteristics of forest vegetation is required to form a basis for the development of new techniques and for image interpretation. Examination of LANDSAT data and image processing algorithms over a portion of boreal forest has demonstrated the complexity of the relations between the various expressions of forest canopies, environmental variability, and the relative capacities of different image processing algorithms to achieve high classification accuracies under these conditions. Airborne Imaging Spectrometer (AIS) data may in part provide the means to interpret the responses of standard data and techniques to the vegetation, based on its relatively high spectral resolution.

  3. Improvements to the stand and hit algorithm

    SciTech Connect

    Boneh, A.; Boneh, S.; Caron, R.; Jibrin, S.

    1994-12-31

    The stand and hit algorithm is a probabilistic algorithm for detecting necessary constraints. The algorithm stands at a point in the feasible region and hits constraints by moving towards the boundary along randomly generated directions. In this talk we discuss methods for choosing the standing point. We also present the undetected-first rule for determining the hit constraints.
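
    A minimal sketch of the stand-and-hit idea for linear constraints a_i·x <= b_i (the set-up here is my own illustration; the talk's standing-point selection methods are not reproduced): from an interior standing point, walk along random unit directions and record which constraint boundary is reached first; constraints that are never hit are candidates for redundancy.

      import numpy as np

      def stand_and_hit(A, b, x0, n_dirs=1000, seed=0):
          rng = np.random.default_rng(seed)
          hit = set()
          for _ in range(n_dirs):
              d = rng.normal(size=A.shape[1])
              d /= np.linalg.norm(d)              # random unit direction
              Ad = A @ d
              slack = b - A @ x0                  # positive at an interior point
              t = np.full(len(b), np.inf)
              fwd = Ad > 1e-12                    # boundaries ahead of the walk
              t[fwd] = slack[fwd] / Ad[fwd]       # step length to each boundary
              hit.add(int(np.argmin(t)))          # first boundary reached
          return hit

      A = np.array([[1.0, 0], [0, 1], [-1, 0], [0, -1], [1, 1]])
      b = np.array([1.0, 1, 0, 0, 10])            # x + y <= 10 is redundant
      print(stand_and_hit(A, b, x0=np.array([0.25, 0.25])))  # {0, 1, 2, 3}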

  4. AncesTrees: ancestry estimation with randomized decision trees.

    PubMed

    Navega, David; Coelho, Catarina; Vicente, Ricardo; Ferreira, Maria Teresa; Wasterlain, Sofia; Cunha, Eugénia

    2015-09-01

    In forensic anthropology, ancestry estimation is essential in establishing the individual biological profile. The aim of this study is to present a new program--AncesTrees--developed for assessing ancestry based on metric analysis. AncesTrees relies on a machine learning ensemble algorithm, random forest, to classify the human skull. In the ensemble learning paradigm, several models are generated and jointly used to arrive at the final decision. The random forest algorithm creates ensembles of decision tree classifiers, a non-linear and non-parametric classification technique. The database used in AncesTrees is composed of 23 craniometric variables from 1,734 individuals, representative of six major ancestral groups and selected from Howells' craniometric series. The program was tested on 128 adult crania from the following collections: the African slaves' skeletal collection of Valle da Gafaria; the Medical School Skull Collection and the Identified Skeletal Collection of 21st Century, both curated at the University of Coimbra. The first step of the test analysis was to perform ancestry estimation including all the ancestral groups of the database. The second stage of our test analysis was to conduct ancestry estimation including only the European and the African ancestral groups. In the first test analysis, 75% of the individuals of African ancestry and 79.2% of the individuals of European ancestry were correctly identified. The model involving only African and European ancestral groups had a better performance: 93.8% of all individuals were correctly classified. The obtained results show that AncesTrees can be a valuable tool in forensic anthropology. PMID:25053239

  5. Entangled decision forests and their application for semantic segmentation of CT images.

    PubMed

    Montillo, Albert; Shotton, Jamie; Winn, John; Iglesias, Juan Eugenio; Metaxas, Dimitri; Criminisi, Antonio

    2011-01-01

    This work addresses the challenging problem of simultaneously segmenting multiple anatomical structures in highly varied CT scans. We propose the entangled decision forest (EDF) as a new discriminative classifier which augments the state of the art decision forest, resulting in higher prediction accuracy and shortened decision time. Our main contribution is two-fold. First, we propose entangling the binary tests applied at each tree node in the forest, such that the test result can depend on the result of tests applied earlier in the same tree and at image points offset from the voxel to be classified. This is demonstrated to improve accuracy and capture long-range semantic context. Second, during training, we propose injecting randomness in a guided way, in which node feature types and parameters are randomly drawn from a learned (nonuniform) distribution. This further improves classification accuracy. We assess our probabilistic anatomy segmentation technique using a labeled database of CT image volumes of 250 different patients from various scan protocols and scanner vendors. In each volume, 12 anatomical structures have been manually segmented. The database comprises highly varied body shapes and sizes, a wide array of pathologies, scan resolutions, and diverse contrast agents. Quantitative comparisons with state of the art algorithms demonstrate both superior test accuracy and computational efficiency. PMID:21761656

  6. Discriminative boosted forest with convolutional neural network-based patch descriptor for object detection

    NASA Astrophysics Data System (ADS)

    Xiang, Tao; Li, Tao; Ye, Mao; Li, Xudong

    2016-01-01

    Object detection with intraclass variations is challenging. The existing methods have not achieved the optimal combinations of classifiers and features, especially features learned by convolutional neural networks (CNNs). To solve this problem, we propose an object-detection method based on an improved random forest and local image patches represented by CNN features. First, we compute CNN-based patch descriptors for each sample by modified CNNs. Then, a random forest is built whose split functions are defined by a patch selector and a linear projection learned by a linear support vector machine. To improve the classification accuracy, the split functions at each depth of the forest make up a local classifier, and all local classifiers are assembled in a layer-wise manner by a boosting algorithm. The main contributions of our approach are summarized as follows: (1) We propose a new local patch descriptor based on CNN features. (2) We define a patch-based split function which is optimized with maximum class-label purity and minimum classification error over the samples of the node. (3) Each local classifier is assembled by minimizing the global classification error. We evaluate the method on three well-known challenging datasets: TUD pedestrians, INRIA pedestrians, and UIUC cars. The experiments demonstrate that our method achieves state-of-the-art or competitive performance.

  7. How random are random numbers generated using photons?

    NASA Astrophysics Data System (ADS)

    Solis, Aldo; Angulo Martínez, Alí M.; Ramírez Alarcón, Roberto; Cruz Ramírez, Hector; U'Ren, Alfred B.; Hirsch, Jorge G.

    2015-06-01

    Randomness is fundamental in quantum theory, with many philosophical and practical implications. In this paper we discuss the concept of algorithmic randomness, which provides a quantitative method to assess the Borel normality of a given sequence of numbers, a necessary condition for it to be considered random. We use Borel normality as a tool to investigate the randomness of ten sequences of bits generated from the differences between detection times of photon pairs generated by spontaneous parametric downconversion. These sequences are shown to fulfil the randomness criteria without difficulties. As deviations from Borel normality for photon-generated random number sequences have been reported in previous work, a strategy to understand these diverging findings is outlined.
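
    For readers unfamiliar with the criterion, a hedged sketch of a finite Borel-normality check in the spirit of Calude's formulation: every k-bit block should occur with frequency close to 2^-k, within a tolerance that shrinks with sequence length (the sqrt(log2(n)/n) bound below is a commonly used finite-size choice, an assumption on my part rather than the paper's exact test).

      import math
      import random
      from collections import Counter

      def borel_normal(bits, k_max=4):
          n = len(bits)
          tol = math.sqrt(math.log2(n) / n)        # finite-size tolerance
          for k in range(1, k_max + 1):
              m = n // k                           # non-overlapping k-blocks
              counts = Counter(bits[i*k:(i+1)*k] for i in range(m))
              for j in range(2 ** k):
                  word = format(j, f"0{k}b")
                  if abs(counts[word] / m - 2.0 ** -k) > tol:
                      return False, k, word        # offending block found
          return True, None, None

      bits = "".join(random.choice("01") for _ in range(2 ** 16))
      print(borel_normal(bits))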

  8. Marble Algorithm: a solution to estimating ecological niches from presence-only records

    PubMed Central

    Qiao, Huijie; Lin, Congtian; Jiang, Zhigang; Ji, Liqiang

    2015-01-01

    We describe an algorithm that helps to predict potential distributional areas for species using presence-only records. The Marble Algorithm (MA) is a density-based clustering program based on Hutchinson's concept of ecological niches as multidimensional hypervolumes in environmental space. The algorithm characterizes this niche space using the density-based spatial clustering of applications with noise (DBSCAN) algorithm. When MA is provided with a set of occurrence points in environmental space, the algorithm determines two parameters that allow the points to be grouped into several clusters. These clusters are used as reference sets describing the ecological niche, which can then be mapped onto geographic space and used as the potential distribution of the species. We used both virtual species and ten empirical datasets to compare MA with other distribution-modeling tools, including Bioclimate Analysis and Prediction System, Environmental Niche Factor Analysis, the Genetic Algorithm for Rule-set Production, Maximum Entropy Modeling, Artificial Neural Networks, Climate Space Models, Classification Tree Analysis, Generalised Additive Models, Generalised Boosted Models, Generalised Linear Models, Multivariate Adaptive Regression Splines and Random Forests. Results indicate that MA predicts potential distributional areas with high accuracy, moderate robustness, and above-average transferability on all datasets, particularly when dealing with small numbers of occurrences. PMID:26387771
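
    A minimal sketch of the MA idea (invented two-dimensional environmental data, not the authors' implementation): cluster occurrence points with DBSCAN, drop the noise points, and treat proximity to the clustered points as membership in the niche.

      import numpy as np
      from sklearn.cluster import DBSCAN

      rng = np.random.default_rng(42)
      # Hypothetical occurrences in (temperature, rainfall) environmental space.
      occ = np.vstack([rng.normal([22, 1200], [1.5, 120], (60, 2)),
                       rng.normal([18, 1900], [1.0, 100], (40, 2))])
      occ_std = (occ - occ.mean(0)) / occ.std(0)   # standardise the axes

      labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(occ_std)
      core = occ_std[labels >= 0]                  # drop noise points (-1)

      def suitable(sites, reference, radius=0.5):
          """A site is inside the niche if near any clustered occurrence."""
          d = np.linalg.norm(sites[:, None, :] - reference[None, :, :], axis=2)
          return d.min(axis=1) <= radius

      print(labels[:10], suitable(occ_std[:5], core))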

  9. Image segmentation using random features

    NASA Astrophysics Data System (ADS)

    Bull, Geoff; Gao, Junbin; Antolovich, Michael

    2014-01-01

    This paper presents a novel algorithm for selecting random features via compressed sensing to improve the performance of Normalized Cuts in image segmentation. Normalized Cuts is a clustering algorithm that has been widely applied to segmenting images, using features such as brightness, intervening contours and Gabor filter responses. Some drawbacks of Normalized Cuts are that computation times and memory usage can be excessive, and the obtained segmentations are often poor. This paper addresses the need to improve the processing time of Normalized Cuts while improving the segmentations. A significant proportion of the time in calculating Normalized Cuts is spent computing an affinity matrix. A new algorithm has been developed that selects random features using compressed sensing techniques to reduce the computation needed for the affinity matrix. The new algorithm, when compared to the standard implementation of Normalized Cuts for segmenting images from the BSDS500, produces better segmentations in significantly less time.
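
    A sketch of the core trick as I read it (details assumed, not taken from the paper): compress each pixel's feature vector with a random sensing matrix before building the affinity matrix, so pairwise distances are approximately preserved at a fraction of the cost.

      import numpy as np

      rng = np.random.default_rng(2)
      n_pixels, d_full, d_cmp = 1500, 64, 8
      F = rng.random((n_pixels, d_full))       # e.g. per-pixel Gabor responses
      R = rng.normal(0, 1 / np.sqrt(d_cmp), (d_full, d_cmp))  # random projection
      Fc = F @ R                               # compressed features

      # Gaussian affinity on compressed features: exp(-||fi - fj||^2 / 2 sigma^2)
      sq = np.sum(Fc ** 2, axis=1)
      d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Fc @ Fc.T, 0)
      W = np.exp(-d2 / (2 * 0.5 ** 2))
      print(W.shape)                           # affinity matrix for Normalized Cuts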

  10. Electromagnetic wave extinction within a forested canopy

    NASA Technical Reports Server (NTRS)

    Karam, M. A.; Fung, A. K.

    1989-01-01

    A forested canopy is modeled by a collection of randomly oriented finite-length cylinders shaded by randomly oriented and distributed disk- or needle-shaped leaves. For a plane wave exciting the forested canopy, the extinction coefficient is formulated in terms of the extinction cross sections (ECSs) in the local frame of each forest component and the Eulerian angles of orientation (used to describe the orientation of each component). The ECSs in the local frame for the finite-length cylinders used to model the branches are obtained by using the forward-scattering theorem. ECSs in the local frame for the disk- and needle-shaped leaves are obtained by the summation of the absorption and scattering cross-sections. The behavior of the extinction coefficients with the incidence angle is investigated numerically for both deciduous and coniferous forest. The dependencies of the extinction coefficients on the orientation of the leaves are illustrated numerically.
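
    A formulation of this kind typically takes the following standard radiative-transfer form (a hedged sketch of the notation, not the paper's exact equation): the canopy extinction coefficient at incidence angle θ is the sum over component types of number density times the extinction cross section averaged over the Eulerian-angle orientation distribution,

      \kappa_e(\theta) \;=\; \sum_{j} n_j \int \sigma_{\mathrm{ext}}^{(j)}(\theta;\alpha,\beta,\gamma)\, p_j(\alpha,\beta,\gamma)\, d\alpha\, d\beta\, d\gamma ,

    where n_j is the number density of component j (branches, disk- or needle-shaped leaves) and p_j is the probability density of its Eulerian orientation angles.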

  11. Mapping in random-structures

    SciTech Connect

    Reidys, C.M.

    1996-06-01

    A mapping in random-structures is defined on the vertices of a generalized hypercube Q_α^n. A random-structure consists of (1) a random contact graph and (2) a family of relations imposed on adjacent vertices. The vertex set of a random contact graph is the set of all coordinates of a vertex P ∈ Q_α^n. Its edge set is the union of the edge sets of two random graphs: the first is a random 1-regular graph on 2m vertices (coordinates), and the second is a random graph G_p with p = c_2/n on all n vertices (coordinates). The structure of the random contact graphs is investigated, and it is shown that for certain values of m and c_2 the mapping in random-structures makes it possible to search the set of random-structures. This is applied to mappings in RNA secondary structures. The results on random-structures might also be helpful for designing 3D-folding algorithms for RNA.

  13. A hybrid color space for skin detection using genetic algorithm heuristic search and principal component analysis technique.

    PubMed

    Maktabdar Oghaz, Mahdi; Maarof, Mohd Aizaini; Zainal, Anazida; Rohani, Mohd Foad; Yaghoubyan, S Hadi

    2015-01-01

    Color is one of the most prominent features of an image and used in many skin and face detection applications. Color space transformation is widely used by researchers to improve face and skin detection performance. Despite the substantial research efforts in this area, choosing a proper color space in terms of skin and face classification performance which can address issues like illumination variations, various camera characteristics and diversity in skin color tones has remained an open issue. This research proposes a new three-dimensional hybrid color space termed SKN by employing the Genetic Algorithm heuristic and Principal Component Analysis to find the optimal representation of human skin color in over seventeen existing color spaces. Genetic Algorithm heuristic is used to find the optimal color component combination setup in terms of skin detection accuracy while the Principal Component Analysis projects the optimal Genetic Algorithm solution to a less complex dimension. Pixel wise skin detection was used to evaluate the performance of the proposed color space. We have employed four classifiers including Random Forest, Naïve Bayes, Support Vector Machine and Multilayer Perceptron in order to generate the human skin color predictive model. The proposed color space was compared to some existing color spaces and shows superior results in terms of pixel-wise skin detection accuracy. Experimental results show that by using Random Forest classifier, the proposed SKN color space obtained an average F-score and True Positive Rate of 0.953 and False Positive Rate of 0.0482 which outperformed the existing color spaces in terms of pixel wise skin detection accuracy. The results also indicate that among the classifiers used in this study, Random Forest is the most suitable classifier for pixel wise skin detection applications. PMID:26267377

  14. Quasi-Random Sequence Generators.

    1994-03-01

    Version 00 LPTAU generates quasi-random sequences. The sequences are uniformly distributed sets of L=2**30 points in the N-dimensional unit cube: I**N=[0,1]. The sequences are used as nodes for multidimensional integration, as searching points in global optimization, as trial points in multicriteria decision making, as quasi-random points for quasi-Monte Carlo algorithms.
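
    An illustrative comparison of quasi-random versus pseudo-random nodes for multidimensional integration, one of the uses listed above (scipy's Sobol generator stands in here for LPTAU's LP-tau sequences; the two play the same role):

      import numpy as np
      from scipy.stats import qmc

      f = lambda p: np.prod(p, axis=1)        # integral over [0,1]^2 is 0.25
      n, d = 2 ** 12, 2

      sob = qmc.Sobol(d=d, scramble=True, seed=7).random(n)  # quasi-random
      psu = np.random.default_rng(7).random((n, d))          # pseudo-random

      print("quasi-random error :", abs(f(sob).mean() - 0.25))
      print("pseudo-random error:", abs(f(psu).mean() - 0.25))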

  15. Simulating California reservoir operation using the classification and regression-tree algorithm combined with a shuffled cross-validation scheme

    NASA Astrophysics Data System (ADS)

    Yang, Tiantian; Gao, Xiaogang; Sorooshian, Soroosh; Li, Xin

    2016-03-01

    The controlled outflows from a reservoir or dam are highly dependent on the decisions made by the reservoir operators, instead of a natural hydrological process. Differences exist between the natural upstream inflows to reservoirs and the controlled outflows from reservoirs that supply the downstream users. With the decision maker's awareness of changing climate, reservoir management requires adaptable means to incorporate more information into decision making, such as water delivery requirements, environmental constraints, dry/wet conditions, etc. In this paper, a robust reservoir outflow simulation model is presented, which incorporates one of the well-developed data-mining models (Classification and Regression Tree, CART) to predict the complicated human-controlled reservoir outflows and extract the reservoir operation patterns. A shuffled cross-validation approach is further implemented to improve CART's predictive performance. An application study of nine major reservoirs in California is carried out. Results produced by the enhanced CART, the original CART, and random forest are compared with observations. The statistical measurements show that the enhanced CART and random forest outperform the CART control run in general, and the enhanced CART algorithm gives better predictive performance than random forest in simulating the peak flows. The results also show that the proposed model is able to consistently and reasonably predict the expert release decisions. Experiments indicate that the release operation in the Oroville Lake is significantly dominated by the SWP allocation amount, and reservoirs with low elevation are more sensitive to inflow amount than others.
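
    A small sketch of the approach described above, with invented inputs and an invented operating rule (the paper's predictors and reservoirs are not reproduced): regress controlled outflow on inflow and operational context with a CART, evaluated under shuffled cross-validation.

      import numpy as np
      from sklearn.model_selection import KFold, cross_val_score
      from sklearn.tree import DecisionTreeRegressor

      rng = np.random.default_rng(3)
      n = 500
      inflow  = rng.gamma(2.0, 50.0, n)        # hypothetical daily inflow
      storage = rng.uniform(0.2, 0.9, n)       # fraction of capacity
      month   = rng.integers(1, 13, n)         # seasonality proxy
      demand  = rng.uniform(0, 1, n)           # allocation/demand proxy
      X = np.column_stack([inflow, storage, month, demand])
      # Invented rule: pass most inflow, spill when storage is high.
      y = (0.7 * inflow + 800 * np.maximum(storage - 0.8, 0)
           + 30 * demand + rng.normal(0, 10, n))

      cart = DecisionTreeRegressor(max_depth=6, min_samples_leaf=10)
      cv = KFold(n_splits=5, shuffle=True, random_state=0)  # shuffled CV
      print(cross_val_score(cart, X, y, cv=cv, scoring="r2").round(2))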

  16. Discriminant forest classification method and system

    DOEpatents

    Chen, Barry Y.; Hanley, William G.; Lemmond, Tracy D.; Hiller, Lawrence J.; Knapp, David A.; Mugge, Marshall J.

    2012-11-06

    A hybrid machine learning methodology and system for classification that combines classical random forest (RF) methodology with discriminant analysis (DA) techniques to provide enhanced classification capability. A DA technique which uses feature measurements of an object to predict its class membership, such as linear discriminant analysis (LDA) or the Anderson-Bahadur linear discriminant technique (AB), is used to split the data at each node in each of its classification trees to train and grow the trees and the forest. When training is finished, a set of n DA-based decision trees of a discriminant forest is produced for use in predicting the classification of new samples of unknown class.

  17. Bearing fault component identification using information gain and machine learning algorithms

    NASA Astrophysics Data System (ADS)

    Vinay, Vakharia; Kumar, Gupta Vijay; Kumar, Kankar Pavan

    2015-04-01

    In the present study, an attempt has been made to identify various bearing faults using machine learning algorithms. Vibration signals obtained from faults in the inner race, outer race, rolling element, and combinations of these are considered. The raw vibration signal cannot be used directly, since it is masked by noise. To overcome this difficulty, a combined time-frequency domain method, the wavelet transform, is used. Further, a wavelet selection criterion based on minimum permutation entropy is employed to select the most appropriate base wavelet. Statistical features of the selected wavelet coefficients are calculated to form the feature vector. To reduce the size of the feature vector, the information gain attribute selection method is employed. The reduced feature set is fed into machine learning algorithms, namely random forest and self-organizing map, to maximize fault identification efficiency. The results reveal that attribute selection improves the fault identification accuracy for bearing components.
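
    A hedged sketch of the pipeline described above, on toy signals rather than real bearing data (mutual information stands in for information gain here; the two rank features analogously):

      import numpy as np
      import pywt
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.feature_selection import SelectKBest, mutual_info_classif
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline

      def wavelet_features(sig, wavelet="db4", level=3):
          feats = []
          for c in pywt.wavedec(sig, wavelet, level=level):
              feats += [c.mean(), c.std(), np.abs(c).max(), np.sum(c ** 2)]
          return feats                          # statistics per coefficient band

      rng = np.random.default_rng(0)
      X, y = [], []
      # Toy stand-ins for inner-race / outer-race / rolling-element signals.
      for label, freq in enumerate([5.0, 9.0, 13.0]):
          for _ in range(40):
              t = np.linspace(0, 1, 1024)
              sig = np.sin(2 * np.pi * freq * t) + rng.normal(0, 0.8, t.size)
              X.append(wavelet_features(sig))
              y.append(label)
      X, y = np.array(X), np.array(y)

      clf = make_pipeline(SelectKBest(mutual_info_classif, k=8),
                          RandomForestClassifier(n_estimators=200, random_state=0))
      print(cross_val_score(clf, X, y, cv=5).round(2))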

  18. Dispersal of forest insects

    NASA Technical Reports Server (NTRS)

    Mcmanus, M. L.

    1979-01-01

    Dispersal flights of selected species of forest insects which are associated with periodic outbreaks of pests that occur over large contiguous forested areas are discussed. Gypsy moths, spruce budworms, and forest tent caterpillars were studied for their massive migrations in forested areas. Results indicate that large dispersals into forested areas are due to the females, except in the case of the gypsy moth.

  19. Prediction of 5-Year Survival with Data Mining Algorithms.

    PubMed

    Sailer, Fabian; Pobiruchin, Monika; Bochum, Sylvia; Martens, Uwe M; Schramm, Wendelin

    2015-01-01

    Survival time prediction at the time of diagnosis is of great importance to make decisions about treatment and long-term follow-up care. However, predicting the outcome of cancer on the basis of clinical information is a challenging task. We examined the ability of ten different data mining algorithms (Perceptron, Rule Induction, Support Vector Machine, Linear Regression, Naïve Bayes, Decision Tree, k-nearest Neighbor, Logistic Regression, Neural Network, Random Forest) to predict the dichotomous attribute "5-year-survival" based on seven attributes (sex, UICC stage, etc.) which are available at the time of diagnosis. For this study we made use of the nationwide German research data set on colon cancer provided by the Robert Koch Institute. To assess the results, a comparison between the data mining algorithms and physicians' opinions was performed: physicians estimated the survival outcome from the same seven attributes. The average accuracy of the physicians' opinion was 59%, while the average accuracy of the machine learning algorithms was 67.7%. PMID:26152957
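
    A hedged sketch of this kind of comparison on synthetic data (not the RKI registry): several of the listed classifiers predicting a dichotomous survival label from seven diagnosis-time attributes, scored by cross-validated accuracy.

      from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.naive_bayes import GaussianNB
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.tree import DecisionTreeClassifier

      X, y = make_classification(n_samples=2000, n_features=7,
                                 n_informative=5, random_state=0)
      models = {
          "naive bayes":   GaussianNB(),
          "decision tree": DecisionTreeClassifier(max_depth=5),
          "k-NN":          KNeighborsClassifier(),
          "logistic regr": LogisticRegression(max_iter=1000),
          "random forest": RandomForestClassifier(n_estimators=300,
                                                  random_state=0),
      }
      for name, m in models.items():
          print(f"{name:14s} {cross_val_score(m, X, y, cv=5).mean():.3f}")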

  20. Random thoughts

    NASA Astrophysics Data System (ADS)

    ajansen; kwhitefoot; panteltje1; edprochak; sudhakar, the

    2014-07-01

    In reply to the physicsworld.com news story “How to make a quantum random-number generator from a mobile phone” (16 May, http://ow.ly/xFiYc, see also p5), which describes a way of delivering random numbers by counting the number of photons that impinge on each of the individual pixels in the camera of a Nokia N9 smartphone.

  2. Random sequential adsorption on fractals.

    PubMed

    Ciesla, Michal; Barbasz, Jakub

    2012-07-28

    Irreversible adsorption of spheres on flat collectors having dimension d < 2 is studied. Molecules are adsorbed on Sierpinski's triangle and carpet-like fractals (1 < d < 2), and on the general Cantor set (d < 1). The adsorption process is modeled numerically using the random sequential adsorption (RSA) algorithm. The paper concentrates on measuring fundamental properties of the coverages, i.e., the maximal random coverage ratio and the density autocorrelation function, as well as the RSA kinetics. The results obtained allow us to improve the phenomenological relation between the maximal random coverage ratio and the collector dimension. Moreover, the simulations show that, in general, most of the known dimensional properties of adsorbed monolayers are valid for non-integer dimensions. PMID:22852643
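
    A minimal RSA sketch on a plain unit square (the paper's collectors are fractal subsets; a square keeps the example short): disks arrive at uniformly random positions and stick only if they do not overlap previously adsorbed ones, so the coverage slowly approaches the jamming limit (about 0.547 for disks in two dimensions).

      import numpy as np

      def rsa_disks(radius=0.03, attempts=50_000, seed=5):
          rng = np.random.default_rng(seed)
          placed = []
          for _ in range(attempts):
              p = rng.random(2)
              if all((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 >= (2 * radius) ** 2
                     for q in placed):
                  placed.append(p)               # no overlap: adsorb the disk
          return np.array(placed)

      disks = rsa_disks()
      coverage = len(disks) * np.pi * 0.03 ** 2  # ignores edge effects
      print(len(disks), round(coverage, 3))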

  3. Quantum algorithms

    NASA Astrophysics Data System (ADS)

    Abrams, Daniel S.

    This thesis describes several new quantum algorithms. These include a polynomial time algorithm that uses a quantum fast Fourier transform to find eigenvalues and eigenvectors of a Hamiltonian operator, and that can be applied in cases (commonly found in ab initio physics and chemistry problems) for which all known classical algorithms require exponential time. Fast algorithms for simulating many body Fermi systems are also provided in both first and second quantized descriptions. An efficient quantum algorithm for anti-symmetrization is given as well as a detailed discussion of a simulation of the Hubbard model. In addition, quantum algorithms that calculate numerical integrals and various characteristics of stochastic processes are described. Two techniques are given, both of which obtain an exponential speed increase in comparison to the fastest known classical deterministic algorithms and a quadratic speed increase in comparison to classical Monte Carlo (probabilistic) methods. I derive a simpler and slightly faster version of Grover's mean algorithm, show how to apply quantum counting to the problem, develop some variations of these algorithms, and show how both (apparently distinct) approaches can be understood from the same unified framework. Finally, the relationship between physics and computation is explored in some more depth, and it is shown that computational complexity theory depends very sensitively on physical laws. In particular, it is shown that nonlinear quantum mechanics allows for the polynomial time solution of NP-complete and #P oracle problems. Using the Weinberg model as a simple example, the explicit construction of the necessary gates is derived from the underlying physics. Nonlinear quantum algorithms are also presented using Polchinski type nonlinearities which do not allow for superluminal communication. (Copies available exclusively from MIT Libraries, Rm. 14- 0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

  4. Computer generation of random deviates.

    PubMed

    Cormack, J; Shuter, B

    1991-06-01

    The need for random deviates arises in many scientific applications, such as the simulation of physical processes, numerical evaluation of complex mathematical formulae and the modeling of decision processes. In medical physics, Monte Carlo simulations have been used in radiology, radiation therapy and nuclear medicine. Specific instances include the modelling of x-ray scattering processes and the addition of random noise to images or curves in order to assess the effects of various processing procedures. Reliable sources of random deviates with statistical properties indistinguishable from true random deviates are a fundamental necessity for such tasks. This paper provides a review of computer algorithms which can be used to generate uniform random deviates and other distributions of interest to medical physicists, along with a few caveats relating to various problems and pitfalls which can occur. Source code listings for the generators discussed (in FORTRAN, Turbo-PASCAL and Data General ASSEMBLER) are available on request from the authors. PMID:1747086
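
    An illustrative example of one classic technique such a review covers: the Box-Muller transform, which turns pairs of uniform deviates into independent standard normal deviates.

      import math
      import random

      def box_muller():
          u1 = 1.0 - random.random()             # keep u1 in (0, 1], avoid log(0)
          u2 = random.random()
          r = math.sqrt(-2.0 * math.log(u1))
          return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

      sample = [box_muller()[0] for _ in range(10_000)]
      mean = sum(sample) / len(sample)
      var = sum((x - mean) ** 2 for x in sample) / len(sample)
      print(round(mean, 3), round(var, 3))       # close to 0 and 1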

  5. Effects of Forest Disturbances on Forest Structural Parameters Retrieval from Lidar Waveform Data

    NASA Technical Reports Server (NTRS)

    Ranson, K, Lon; Sun, G.

    2011-01-01

    The effect of forest disturbance on the lidar waveform and the forest biomass estimation was demonstrated by model simulation. The results show that the correlation between stand biomass and the lidar waveform indices changes when the stand spatial structure changes due to disturbances rather than the natural succession. This has to be considered in developing algorithms for regional or global mapping of biomass from lidar waveform data.

  6. Coloring random graphs.

    PubMed

    Mulet, R; Pagnani, A; Weigt, M; Zecchina, R

    2002-12-23

    We study the graph coloring problem over random graphs of finite average connectivity c. Given a number q of available colors, we find that graphs with low connectivity admit almost always a proper coloring, whereas graphs with high connectivity are uncolorable. Depending on q, we find the precise value of the critical average connectivity c(q). Moreover, we show that below c(q) there exists a clustering phase c in [c(d),c(q)] in which ground states spontaneously divide into an exponential number of clusters and where the proliferation of metastable states is responsible for the onset of complexity in local search algorithms. PMID:12484862
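
    A quick empirical companion to the statement above (a sketch, not the authors' cavity-method analysis): greedy colorings of G(n, p = c/n) need more colors as the average connectivity c grows. Greedy coloring only upper-bounds the chromatic number, so this shows the trend rather than the precise thresholds.

      import networkx as nx

      n = 2000
      for c in [1.0, 3.0, 5.0, 8.0, 12.0]:
          G = nx.fast_gnp_random_graph(n, c / n, seed=42)
          coloring = nx.coloring.greedy_color(G, strategy="largest_first")
          print(f"c = {c:4.1f}  greedy colors used: {max(coloring.values()) + 1}")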

  7. Haplotyping algorithms

    SciTech Connect

    Sobel, E.; Lange, K.; O'Connell, J.R.

    1996-12-31

    Haplotyping is the logical process of inferring gene flow in a pedigree based on phenotyping results at a small number of genetic loci. This paper formalizes the haplotyping problem and suggests four algorithms for haplotype reconstruction. These algorithms range from exhaustive enumeration of all haplotype vectors to combinatorial optimization by simulated annealing. Application of the algorithms to published genetic analyses shows that manual haplotyping is often erroneous. Haplotyping is employed in screening pedigrees for phenotyping errors and in positional cloning of disease genes from conserved haplotypes in population isolates. 26 refs., 6 figs., 3 tabs.

  8. Research on Routing Selection Algorithm Based on Genetic Algorithm

    NASA Astrophysics Data System (ADS)

    Gao, Guohong; Zhang, Baojian; Li, Xueyong; Lv, Jinna

    The genetic algorithm is a random search and optimization method modeled on natural selection and genetic inheritance in living organisms. In recent years, owing to its potential for solving complicated problems and its successful application in industrial projects, the genetic algorithm has attracted wide attention from scholars both domestically and internationally. Routing selection has been defined as a standard communication model of IP version 6. This paper proposes a service model for routing selection communication, and designs and implements a new routing selection algorithm based on the genetic algorithm. Experimental simulation results show that this algorithm obtains better solutions in less time and a more balanced network load, which enhances the search ratio and the availability of network resources, and improves the quality of service.
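
    A generic genetic-algorithm skeleton of the kind the paper builds on (illustration only; the paper's routing encoding and fitness function are not specified here, so the bit-string objective below is a toy stand-in for route quality):

      import random

      random.seed(1)
      L, POP, GENS, MUT = 20, 40, 60, 0.02

      def fitness(bits):                  # toy stand-in for route quality
          return sum(bits)

      def tournament(pop):                # binary tournament selection
          a, b = random.sample(pop, 2)
          return a if fitness(a) >= fitness(b) else b

      pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]
      for _ in range(GENS):
          nxt = []
          while len(nxt) < POP:
              p1, p2 = tournament(pop), tournament(pop)
              cut = random.randrange(1, L)              # one-point crossover
              child = p1[:cut] + p2[cut:]
              child = [1 - g if random.random() < MUT else g for g in child]
              nxt.append(child)
          pop = nxt
      print(max(fitness(ind) for ind in pop))           # approaches L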

  9. Applying various algorithms for species distribution modelling.

    PubMed

    Li, Xinhai; Wang, Yuan

    2013-06-01

    Species distribution models have been used extensively in many fields, including climate change biology, landscape ecology and conservation biology. In the past 3 decades, a number of new models have been proposed, yet researchers still find it difficult to select appropriate models for data and objectives. In this review, we aim to provide insight into the prevailing species distribution models for newcomers in the field of modelling. We compared 11 popular models, including regression models (the generalized linear model, the generalized additive model, the multivariate adaptive regression splines model and hierarchical modelling), classification models (mixture discriminant analysis, the generalized boosting model, and classification and regression tree analysis) and complex models (artificial neural network, random forest, genetic algorithm for rule set production and maximum entropy approaches). Our objectives are: (i) to compare the strengths and weaknesses of the models, their characteristics and identify suitable situations for their use (in terms of data type and species-environment relationships) and (ii) to provide guidelines for model application, including 3 steps: model selection, model formulation and parameter estimation. PMID:23731809

  11. MODIS Based Estimation of Forest Aboveground Biomass in China.

    PubMed

    Yin, Guodong; Zhang, Yuan; Sun, Yan; Wang, Tao; Zeng, Zhenzhong; Piao, Shilong

    2015-01-01

    Accurate estimation of forest biomass C stock is essential to understand carbon cycles. However, current estimates of Chinese forest biomass are mostly based on inventory-based timber volumes and empirical conversion factors at the provincial scale, which could introduce large uncertainties in forest biomass estimation. Here we provide a data-driven estimate of Chinese forest aboveground biomass from 2001 to 2013 at a spatial resolution of 1 km by integrating a recently reviewed plot-level ground-measured forest aboveground biomass database with geospatial information from 1-km Moderate-Resolution Imaging Spectroradiometer (MODIS) dataset in a machine learning algorithm (the model tree ensemble, MTE). We show that Chinese forest aboveground biomass is 8.56 Pg C, which is mainly contributed by evergreen needle-leaf forests and deciduous broadleaf forests. The mean forest aboveground biomass density is 56.1 Mg C ha⁻¹, with high values observed in temperate humid regions. The responses of forest aboveground biomass density to mean annual temperature are closely tied to water conditions; that is, negative responses dominate regions with mean annual precipitation less than 1300 mm y⁻¹ and positive responses prevail in regions with mean annual precipitation higher than 2800 mm y⁻¹. During the 2000s, the forests in China sequestered C by 61.9 Tg C y⁻¹, and this C sink is mainly distributed in north China and may be attributed to warming climate, rising CO2 concentration, N deposition, and growth of young forests. PMID:26115195

  12. Forests through the Eye of a Satellite: Understanding regional forest-cover dynamics using Landsat Imagery

    NASA Astrophysics Data System (ADS)

    Baumann, Matthias

    Forests are changing at an alarming pace worldwide. Forests are an important provider of ecosystem services that contribute to human wellbeing, including the provision of timber and non-timber products, habitat for biodiversity, and recreation amenities. Most prominently, forests serve as a sink for atmospheric carbon dioxide that ultimately helps to mitigate changes in the global climate. It is thus important to understand where, how and why forests change worldwide. My dissertation provides answers to these questions. The overarching goal of my dissertation is to improve our understanding of regional forest-cover dynamics by analyzing Landsat satellite imagery. I answer where forests change following drastic socio-economic shocks by using the breakdown of the Soviet Union as a natural experiment. My dissertation provides innovative algorithms to answer why forests change, whether because of human activities or because of natural events such as storms. Finally, I show how dynamic forests are within a single year by providing ways to characterize green-leaf phenology from satellite imagery. With my findings I directly contribute to a better understanding of the processes on the Earth's surface, and I highlight the importance of satellite imagery for learning about regional and local forest-cover dynamics.

  13. Alerts of forest disturbance from MODIS imagery

    NASA Astrophysics Data System (ADS)

    Hammer, Dan; Kraft, Robin; Wheeler, David

    2014-12-01

    This paper reports the methodology and computational strategy for a forest cover disturbance alerting system. Analytical techniques from time series econometrics are applied to imagery from the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor to detect temporal instability in vegetation indices. The characteristics from each MODIS pixel's spectral history are extracted and compared against historical data on forest cover loss to develop a geographically localized classification rule that can be applied across the humid tropical biome. The final output is a probability of forest disturbance for each 500 m pixel that is updated every 16 days. The primary objective is to provide high-confidence alerts of forest disturbance, while minimizing false positives. We find that the alerts serve this purpose exceedingly well in Pará, Brazil, with high probability alerts garnering a user accuracy of 98 percent over the training period and 93 percent after the training period (2000-2005) when compared against the PRODES deforestation data set, which is used to assess spatial accuracy. Implemented in Clojure and Java on the Hadoop distributed data processing platform, the algorithm is a fast, automated, and open source system for detecting forest disturbance. It is intended to be used in conjunction with higher-resolution imagery and data products that cannot be updated as quickly as MODIS-based data products. By highlighting hotspots of change, the algorithm and associated output can focus high-resolution data acquisition and aid in efforts to enforce local forest conservation efforts.
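
    A sketch of the flavour of per-pixel test involved (invented data; the paper uses econometric structural-break techniques rather than this simple rule): flag a pixel when its vegetation index drops well below what its own seasonal history predicts.

      import numpy as np

      rng = np.random.default_rng(8)
      t = np.arange(230)                       # 16-day composites, ~10 years
      ndvi = (0.7 + 0.1 * np.sin(2 * np.pi * t / 23)
              + rng.normal(0, 0.03, t.size))
      ndvi[180:] -= 0.25                       # simulated disturbance at step 180

      train = ndvi[:115]                       # assumed-stable training period
      season = np.array([train[i::23].mean() for i in range(23)])
      sd = (train - season[np.arange(115) % 23]).std()
      z = (ndvi - season[t % 23]) / sd         # departure from pixel's history

      print(np.where(z < -3)[0][:5])           # first alerts appear near step 180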

  14. Study of a global search algorithm for optimal control.

    NASA Technical Reports Server (NTRS)

    Brocker, D. H.; Kavanaugh, W. P.; Stewart, E. C.

    1967-01-01

    Adaptive random search algorithm utilizing boundary cost-function hypersurfaces measurement to implement Pontryagin maximum principle, discussing hybrid computer use, iterative solution and convergence properties

  15. Montana's forest resources. Forest Service resource bulletin

    SciTech Connect

    Conner, R.C.; O'Brien, R.A.

    1993-09-01

    The report includes highlights of the forest resource in Montana as of 1989. The study also describes the extent, condition, and location of the State's forests, with particular emphasis on timberland. Statistical tables cover area by land class, ownership, and forest type; growing-stock and sawtimber volumes; and growth, mortality, and removals for timberland.

  16. [Automated mapping of urban forests' disturbance and recovery in Nanjing, China].

    PubMed

    Lyu, Ying-ying; Zhuang, Yi-lin; Ren, Xin-yu; Li, Ming-shi; Xu, Wang-gu; Wang, Zhi

    2016-02-01

    Using dense Landsat TM/ETM time series observations spanning 1987 to 2011, with Laoshan forest farm and Purple Mountain as the study sites, the Landsat Ecosystem Disturbance Adaptive Processing System (LEDAPS) algorithm was used to generate surface reflectance datasets, which were fed to the vegetation change tracker (VCT) model to derive urban forest disturbance and recovery products over Nanjing, followed by an intensive validation of the products. The results showed relatively high spatial agreement for the forest disturbance products mapped by VCT, ranging from 65.4% to 95.0%. Forest disturbance and recovery rates fluctuated apparently over time; the disturbance trends at the two sites were roughly similar, but recovery differed markedly. Forest coverage on Purple Mountain was lower than in Laoshan forest farm, while the disturbance and recovery rates in Laoshan forest farm were larger than those on Purple Mountain. PMID:27396114

  17. Comparison of four machine learning algorithms for their applicability in satellite-based optical rainfall retrievals

    NASA Astrophysics Data System (ADS)

    Meyer, Hanna; Kühnlein, Meike; Appelhans, Tim; Nauss, Thomas

    2016-03-01

    Machine learning (ML) algorithms have successfully been demonstrated to be valuable tools in satellite-based rainfall retrievals which show the practicability of using ML algorithms when faced with high dimensional and complex data. Moreover, recent developments in parallel computing with ML present new possibilities for training and prediction speed and therefore make their usage in real-time systems feasible. This study compares four ML algorithms - random forests (RF), neural networks (NNET), averaged neural networks (AVNNET) and support vector machines (SVM) - for rainfall area detection and rainfall rate assignment using MSG SEVIRI data over Germany. Satellite-based proxies for cloud top height, cloud top temperature, cloud phase and cloud water path serve as predictor variables. The results indicate an overestimation of rainfall area delineation regardless of the ML algorithm (averaged bias = 1.8) but a high probability of detection ranging from 81% (SVM) to 85% (NNET). On a 24-hour basis, the performance of the rainfall rate assignment yielded R2 values between 0.39 (SVM) and 0.44 (AVNNET). Though the differences in the algorithms' performance were rather small, NNET and AVNNET were identified as the most suitable algorithms. On average, they demonstrated the best performance in rainfall area delineation as well as in rainfall rate assignment. NNET's computational speed is an additional advantage in work with large datasets such as in remote sensing based rainfall retrievals. However, since no single algorithm performed considerably better than the others we conclude that further research in providing suitable predictors for rainfall is of greater necessity than an optimization through the choice of the ML algorithm.

  18. Comparison of machine learning algorithms for their applicability in satellite-based optical rainfall retrievals

    NASA Astrophysics Data System (ADS)

    Meyer, Hanna; Kühnlein, Meike; Appelhans, Tim; Nauss, Thomas

    2015-04-01

    Machine learning (ML) algorithms have been successfully evaluated as valuable tools in satellite-based rainfall retrievals, which shows the high potential of ML algorithms when faced with high dimensional and complex data. Moreover, the recent developments in parallel computing with ML offer new possibilities in terms of training and predicting speed and therefore make their usage in real time systems feasible. The present study compares four ML algorithms for rainfall area detection and rainfall rate assignment during daytime, night-time and twilight using MSG SEVIRI data over Germany. Satellite-based proxies for cloud top height, cloud top temperature, cloud phase and cloud water path are applied as predictor variables. As machine learning algorithms, random forests (RF), neural networks (NNET), averaged neural networks (AVNNET) and support vector machines (SVM) are chosen. The comparison is realised in three steps. First, an extensive tuning study is carried out to customise each of the models. Secondly, the models are trained using the optimum values of model parameters found in the tuning study. Finally, the trained models are used to detect rainfall areas and to assign rainfall rates using an independent validation dataset, which is compared against ground-based radar data. To train and validate the models, the radar-based RADOLAN RW product from the German Weather Service (DWD) is used, which provides area-wide gauge-adjusted hourly precipitation information. Though the differences in the performance of the algorithms were rather small, NNET and AVNNET have been identified as the most suitable algorithms. On average, they showed the best performance in rainfall area delineation as well as in rainfall rate assignment. The fast computation time of NNET allows working with large datasets, as is required in remote sensing based rainfall retrievals. However, since none of the algorithms performed considerably better than the others, we conclude that further research into providing suitable predictors for rainfall is of greater necessity than optimisation through the choice of the ML algorithm.

  19. Random Vibrations

    NASA Technical Reports Server (NTRS)

    Messaro, Semma; Harrison, Phillip

    2010-01-01

    Ares I zonal random vibration environments due to acoustic impingement and combustion processes are developed for liftoff, ascent and reentry. Random vibration test criteria for Ares I Upper Stage pyrotechnic components are developed by enveloping the applicable zonal environments where each component is located. Random vibration tests will be conducted to assure that these components will survive and function appropriately after exposure to the expected vibration environments. Methodology: Random vibration test criteria for Ares I Upper Stage pyrotechnic components were desired that would envelop all the applicable environments where each component was located. Applicable Ares I vehicle drawings and design information needed to be assessed to determine the location(s) for each component on the Ares I Upper Stage. Design and test criteria needed to be developed by plotting and enveloping the applicable environments using Microsoft Excel spreadsheet software and documenting them in a report using Microsoft Word. Conclusion: Random vibration liftoff, ascent, and green run design and test criteria for the Upper Stage pyrotechnic components were developed by using Microsoft Excel to envelop the zonal environments applicable to each component. Results were transferred from Excel into a report using Microsoft Word. After the report is reviewed and edited by my mentor, it will be submitted for publication as an attachment to a memorandum. Pyrotechnic component designers will extract criteria from my report for incorporation into the design and test specifications for components. Eventually the hardware will be tested to the environments I developed to assure that the components will survive and function appropriately after exposure to the expected vibration environments.

  20. A simple and practical control of the authenticity of organic sugarcane samples based on the use of machine-learning algorithms and trace elements determination by inductively coupled plasma mass spectrometry.

    PubMed

    Barbosa, Rommel M; Batista, Bruno L; Barião, Camila V; Varrique, Renan M; Coelho, Vinicius A; Campiglia, Andres D; Barbosa, Fernando

    2015-10-01

    A practical and easy control of the authenticity of organic sugarcane samples based on the use of machine-learning algorithms and trace elements determination by inductively coupled plasma mass spectrometry is proposed. Reference ranges for 32 chemical elements in 22 samples of sugarcane (13 organic and 9 non-organic) were established and then two algorithms, Naive Bayes (NB) and Random Forest (RF), were evaluated to classify the samples. Accurate results (>90%) were obtained when using all variables (i.e., 32 elements). However, accuracy was improved (95.4% for NB) when only eight minerals (Rb, U, Al, Sr, Dy, Nb, Ta, Mo), chosen by a feature selection algorithm, were employed. Thus, a fingerprint based on trace element levels combined with machine-learning classification algorithms may be used as a simple alternative for authenticity evaluation of organic sugarcane samples. PMID:25872438
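
    A hedged sketch of the workflow with synthetic element concentrations (not the paper's ICP-MS data): classify organic versus conventional samples with NB and RF, before and after selecting a small subset of elements.

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.feature_selection import SelectKBest, f_classif
      from sklearn.model_selection import cross_val_score
      from sklearn.naive_bayes import GaussianNB
      from sklearn.pipeline import make_pipeline

      rng = np.random.default_rng(4)
      n, n_elements = 22, 32                   # sample counts as in the study
      X = rng.lognormal(0, 1, (n, n_elements)) # trace-element concentrations
      y = np.array([1] * 13 + [0] * 9)         # 13 organic, 9 conventional
      X[y == 1, :8] *= 1.8                     # pretend 8 elements differ

      for name, clf in [("NB", GaussianNB()),
                        ("RF", RandomForestClassifier(n_estimators=200,
                                                      random_state=0))]:
          full = cross_val_score(clf, X, y, cv=5).mean()
          sel = cross_val_score(make_pipeline(SelectKBest(f_classif, k=8), clf),
                                X, y, cv=5).mean()
          print(name, round(full, 2), "->", round(sel, 2))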

  1. The MCNP5 Random number generator

    SciTech Connect

    Brown, F. B.; Nagaya, Y.

    2002-01-01

    MCNP and other Monte Carlo particle transport codes use random number generators to produce random variates from a uniform distribution on the unit interval. These random variates are then used in subsequent sampling from probability distributions to simulate the physical behavior of particles during the transport process. This paper describes the new random number generator developed for MCNP Version 5. The new generator will optionally preserve the exact random sequence of previous versions and is entirely conformant to the Fortran-90 standard, hence completely portable. In addition, skip-ahead algorithms have been implemented to efficiently initialize the generator for new histories, a capability that greatly simplifies parallel algorithms. Further, the precision of the generator has been increased, extending the period by a factor of 10^5. Finally, the new generator has been subjected to 3 different sets of rigorous and extensive statistical tests to verify that it produces a sufficiently random sequence.
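
    A sketch of the skip-ahead idea for a linear congruential generator (the constants below are Knuth's MMIX values, used purely for illustration; they are not MCNP5's): since x_{k+1} = (g·x_k + c) mod 2^m, the k-step jump is itself affine, x_{n+k} = (G·x_n + C) mod 2^m, and G, C can be accumulated in O(log k) time by repeated squaring.

      M = 2 ** 64
      G_MULT = 6364136223846793005             # Knuth MMIX LCG multiplier
      C_ADD = 1442695040888963407              # Knuth MMIX LCG increment

      def skip_ahead(x, k, g=G_MULT, c=C_ADD, m=M):
          G, C = 1, 0                          # accumulated affine jump map
          h, f = g, c                          # current power-of-two step map
          while k > 0:
              if k & 1:
                  G = (G * h) % m
                  C = (C * h + f) % m
              f = (f * (h + 1)) % m            # square the step map
              h = (h * h) % m
              k >>= 1
          return (G * x + C) % m

      x = 1
      for _ in range(10 ** 6):                 # step one at a time...
          x = (G_MULT * x + C_ADD) % M
      print(x == skip_ahead(1, 10 ** 6))       # ...matches the O(log k) jump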

  2. Forest Health Detectives

    ERIC Educational Resources Information Center

    Bal, Tara L.

    2014-01-01

    "Forest health" is an important concept often not covered in tree, forest, insect, or fungal ecology and biology. With minimal, inexpensive equipment, students can investigate and conduct their own forest health survey to assess the percentage of trees with natural or artificial wounds or stress. Insects and diseases in the forest are…

  3. Modeling long-term suspended-sediment export from an undisturbed forest catchment

    NASA Astrophysics Data System (ADS)

    Zimmermann, Alexander; Francke, Till; Elsenbeer, Helmut

    2013-04-01

    Most estimates of suspended sediment yields from humid, undisturbed, and geologically stable forest environments fall within a range of 5 - 30 t km-2 a-1. These low natural erosion rates in small headwater catchments (≤ 1 km2) support the common impression that a well-developed forest cover prevents surface erosion. Interestingly, those estimates originate exclusively from areas with prevailing vertical hydrological flow paths. Forest environments dominated by (near-) surface flow paths (overland flow, pipe flow, and return flow) and a fast response to rainfall, however, are not an exceptional phenomenon, yet only very few sediment yields have been estimated for these areas. Not surprisingly, even fewer long-term (≥ 10 years) records exist. In this contribution we present our latest research which aims at quantifying long-term suspended-sediment export from an undisturbed rainforest catchment prone to frequent overland flow. A key aspect of our approach is the application of machine-learning techniques (Random Forest, Quantile Regression Forest) which allows not only the handling of non-Gaussian data, non-linear relations between predictors and response, and correlations between predictors, but also the assessment of prediction uncertainty. For the current study we provided the machine-learning algorithms exclusively with information from a high-resolution rainfall time series to reconstruct discharge and suspended sediment dynamics for a 21-year period. The significance of our results is threefold. First, our estimates clearly show that forest cover does not necessarily prevent erosion if wet antecedent conditions and large rainfalls coincide. During these situations, overland flow is widespread and sediment fluxes increase in a non-linear fashion due to the mobilization of new sediment sources. Second, our estimates indicate that annual suspended sediment yields of the undisturbed forest catchment show large fluctu