Currency crisis indication by using ensembles of support vector machine classifiers
NASA Astrophysics Data System (ADS)
Ramli, Nor Azuana; Ismail, Mohd Tahir; Wooi, Hooy Chee
2014-07-01
Many methods have been tried in the analysis of currency crises; however, not all of them provide accurate indications. This paper introduces an ensemble of Support Vector Machine classifiers, an approach that has not previously been applied to currency crisis analysis, with the aim of increasing indication accuracy. The performance of the proposed ensemble is measured using percentage accuracy, root mean squared error (RMSE), area under the Receiver Operating Characteristic (ROC) curve and Type II error. The ensemble of Support Vector Machine classifiers is compared with a single Support Vector Machine classifier, and both are tested on a data set covering 27 countries with 12 macroeconomic indicators for each country. The results show that the ensemble outperforms the single Support Vector Machine classifier at indicating a currency crisis across a range of standard measures of classifier performance.
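The four measures named in the abstract above can be sketched in plain Python. This is an illustrative sketch only, not the authors' evaluation code: the function names are hypothetical, labels are assumed to be 1 (crisis) and 0 (no crisis), and Type II error is assumed to mean the share of actual crises that the classifier misses.

```python
import math

def accuracy(y_true, y_pred):
    # Fraction of labels predicted correctly.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_score):
    # Root mean squared error between 0/1 labels and predicted scores.
    squared = [(t - s) ** 2 for t, s in zip(y_true, y_score)]
    return math.sqrt(sum(squared) / len(squared))

def type_ii_error(y_true, y_pred):
    # Assumed definition: share of actual positives predicted as negative.
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 0 for p in preds_for_positives) / len(preds_for_positives)

def auc(y_true, y_score):
    # Area under the ROC curve via pairwise comparison:
    # probability that a random positive is scored above a random negative.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a data set of this size but would be replaced by a rank-based formula for larger ones.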
Lysine acetylation sites prediction using an ensemble of support vector machine classifiers.
Xu, Yan; Wang, Xiao-Bo; Ding, Jun; Wu, Ling-Yun; Deng, Nai-Yang
2010-05-07
Lysine acetylation is an essentially reversible and highly regulated post-translational modification that regulates diverse protein properties. Experimental identification of acetylation sites is laborious and expensive. Hence, there is significant interest in developing computational methods for reliable prediction of acetylation sites from amino acid sequences. In this paper we use an ensemble of support vector machine classifiers for this task. The experimentally determined acetylated lysine sites are extracted from the Swiss-Prot database and the scientific literature. Experimental results show that an ensemble of support vector machine classifiers outperforms a single support vector machine classifier and other computational methods, such as PAIL and LysAcet, at predicting acetylated lysine sites. The resulting method has been implemented in EnsemblePail, a web server for lysine acetylation site prediction available at http://www.aporc.org/EnsemblePail/. Copyright (c) 2010 Elsevier Ltd. All rights reserved.
An ensemble of SVM classifiers based on gene pairs.
Tong, Muchenxuan; Liu, Kun-Hong; Xu, Chungui; Ju, Wenbin
2013-07-01
In this paper, a genetic algorithm (GA) based ensemble support vector machine (SVM) classifier built on gene pairs (GA-ESP) is proposed. The SVMs (base classifiers of the ensemble system) are trained on different informative gene pairs. These gene pairs are selected by the top scoring pair (TSP) criterion. Each of these pairs projects the original microarray expression onto a 2-D space. Extensive permutation of gene pairs may reveal more useful information and potentially lead to an ensemble classifier with satisfactory accuracy and interpretability. GA is further applied to select an optimized combination of base classifiers. The effectiveness of the GA-ESP classifier is evaluated on both binary-class and multi-class datasets. Copyright © 2013 Elsevier Ltd. All rights reserved.
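The top scoring pair (TSP) criterion mentioned above can be illustrated with a short sketch. It assumes the usual two-class TSP score, namely the absolute difference between classes in the frequency with which gene i is expressed below gene j; the function name and data layout (one list of expression values per sample) are hypothetical, and the GA selection step of GA-ESP is not reproduced here.

```python
from itertools import combinations

def tsp_scores(samples, labels):
    # samples: list of expression vectors, one per sample.
    # labels: class label per sample; exactly two classes are assumed.
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles the two-class case only")
    n_genes = len(samples[0])
    scores = {}
    for i, j in combinations(range(n_genes), 2):
        freqs = []
        for c in classes:
            rows = [s for s, y in zip(samples, labels) if y == c]
            # Frequency of the ordering "gene i < gene j" within class c.
            freqs.append(sum(r[i] < r[j] for r in rows) / len(rows))
        scores[(i, j)] = abs(freqs[0] - freqs[1])
    return scores
```

Pairs with the highest scores flip their relative ordering between the two classes most consistently, which is what makes each pair a candidate 2-D projection for a base classifier.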
Metal Oxide Gas Sensor Drift Compensation Using a Two-Dimensional Classifier Ensemble
Liu, Hang; Chu, Renzhi; Tang, Zhenan
2015-01-01
Sensor drift is the most challenging problem in gas sensing at present. We propose a novel two-dimensional classifier ensemble strategy to solve the gas discrimination problem, regardless of the gas concentration, with high accuracy over extended periods of time. This strategy is appropriate for multi-class classifiers that consist of combinations of pairwise classifiers, such as support vector machines. We compare the performance of the strategy with those of competing methods in an experiment based on a public dataset that was compiled over a period of three years. The experimental results demonstrate that the two-dimensional ensemble outperforms the other methods considered. Furthermore, we propose a pre-aging process inspired by that applied to the sensors to improve the stability of the classifier ensemble. The experimental results demonstrate that the weight of each multi-class classifier model in the ensemble remains fairly static before and after the addition of new classifier models to the ensemble, when a pre-aging procedure is applied. PMID:25942640
Ensemble of classifiers for ontology enrichment
NASA Astrophysics Data System (ADS)
Semenova, A. V.; Kureichik, V. M.
2018-05-01
A classifier is the basis of an ontology learning system. Classification of text documents is used in many applications, such as information retrieval, information extraction and spam detection. A new ensemble of classifiers based on SVM (support vector machine), LSTM (a neural network) and word embedding is proposed. An experiment conducted on open data allows us to conclude that the proposed classification method is promising. The proposed classifier is implemented in MATLAB using the functions of the Text Analytics Toolbox. The principal distinguishing feature of the proposed ensemble of classifiers is high classification quality at an acceptable time cost.
Ling, Qing-Hua; Song, Yu-Qing; Han, Fei; Yang, Dan; Huang, De-Shuang
2016-01-01
In ensemble learning, how to select and how to combine the candidate classifiers are two key issues that strongly influence the performance of the ensemble system. The random vector functional link network (RVFL) without direct input-to-output links is a suitable base classifier for ensemble systems because of its fast learning speed, simple structure and good generalization performance. In this paper, to obtain a more compact ensemble system with improved convergence performance, an improved ensemble of RVFLs based on attractive and repulsive particle swarm optimization (ARPSO) with a double optimization strategy is proposed. In the proposed method, ARPSO is applied to both select and combine the candidate RVFLs. When selecting the optimal base RVFLs, ARPSO considers both the convergence accuracy on the validation data and the diversity of the candidate ensemble system. When combining the RVFLs, the ensemble weights corresponding to the base RVFLs are initialized by the minimum norm least-squares method and then further optimized by ARPSO. Finally, a few redundant RVFLs are pruned, yielding a more compact ensemble. Moreover, a theoretical analysis and justification of how to prune the base classifiers on classification problems is presented, and a simple and practically feasible strategy for pruning redundant base classifiers on both classification and regression problems is proposed. Since the double optimization is performed on the basis of the single optimization, the ensemble of RVFLs built by the proposed method outperforms those built by some single optimization methods. Experimental results on function approximation and classification problems verify that the proposed method improves convergence accuracy and reduces the complexity of the ensemble system. PMID:27835638
Ensemble Methods for Classification of Physical Activities from Wrist Accelerometry.
Chowdhury, Alok Kumar; Tjondronegoro, Dian; Chandran, Vinod; Trost, Stewart G
2017-09-01
To investigate whether the use of ensemble learning algorithms improves physical activity recognition accuracy compared with single classifier algorithms, and to compare the classification accuracy achieved by three conventional ensemble machine learning methods (bagging, boosting, random forest) and a custom ensemble model comprising four algorithms commonly used for activity recognition (binary decision tree, k nearest neighbor, support vector machine, and neural network). The study used three independent data sets that included wrist-worn accelerometer data. For each data set, a four-step classification framework consisting of data preprocessing, feature extraction, normalization and feature selection, and classifier training and testing was implemented. For the custom ensemble, decisions from the single classifiers were aggregated using three decision fusion methods: weighted majority vote, naïve Bayes combination, and behavior knowledge space combination. Classifiers were evaluated using leave-one-subject-out cross-validation and compared on the basis of average F1 scores. In all three data sets, ensemble learning methods consistently outperformed the individual classifiers. Among the conventional ensemble methods, random forest models provided consistently high activity recognition; however, the custom ensemble model using weighted majority voting demonstrated the highest classification accuracy in two of the three data sets. Combining multiple individual classifiers using conventional or custom ensemble learning methods can improve activity recognition accuracy from wrist-worn accelerometer data.
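Of the three decision fusion methods listed above, weighted majority voting is the simplest to illustrate. The sketch below is a generic version under stated assumptions: the function name is hypothetical, and the weights are assumed to be supplied by the caller (for example, each classifier's validation accuracy); the abstract does not say how the study derived them.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    # predictions: one predicted label per base classifier.
    # weights: one non-negative weight per base classifier (assumed given).
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # The label with the largest accumulated weight wins;
    # on a tie, the label that was seen first is returned.
    return max(totals, key=totals.get)
```

With equal weights this reduces to plain majority voting, so a single strong classifier can only override two weaker ones that agree when its weight exceeds their combined weight.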
Zhou, Shenghan; Qian, Silin; Chang, Wenbing; Xiao, Yiyong; Cheng, Yang
2018-06-14
Timely and accurate state detection and fault diagnosis of rolling element bearings are very critical to ensuring the reliability of rotating machinery. This paper proposes a novel method of rolling bearing fault diagnosis based on a combination of ensemble empirical mode decomposition (EEMD), weighted permutation entropy (WPE) and an improved support vector machine (SVM) ensemble classifier. A hybrid voting (HV) strategy that combines SVM-based classifiers and cloud similarity measurement (CSM) was employed to improve the classification accuracy. First, the WPE value of the bearing vibration signal was calculated to detect the fault. Secondly, if a bearing fault occurred, the vibration signal was decomposed into a set of intrinsic mode functions (IMFs) by EEMD. The WPE values of the first several IMFs were calculated to form the fault feature vectors. Then, the SVM ensemble classifier was composed of binary SVM and the HV strategy to identify the bearing multi-fault types. Finally, the proposed model was fully evaluated by experiments and comparative studies. The results demonstrate that the proposed method can effectively detect bearing faults and maintain a high accuracy rate of fault recognition when a small number of training samples are available.
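The weighted permutation entropy (WPE) feature used above can be sketched as follows. This follows the commonly used definition in which each ordinal pattern is weighted by the variance of its embedding vector; the embedding `order`, the `delay` and the normalization by ln(order!) are assumptions of this sketch, since the abstract does not state the settings used in the paper.

```python
import math
from collections import defaultdict

def weighted_permutation_entropy(signal, order=3, delay=1, normalize=True):
    # Accumulate, for each ordinal pattern, the variance of the windows showing it.
    weight_per_pattern = defaultdict(float)
    n_vectors = len(signal) - (order - 1) * delay
    for start in range(n_vectors):
        window = [signal[start + k * delay] for k in range(order)]
        mean = sum(window) / order
        weight = sum((x - mean) ** 2 for x in window) / order
        # Ordinal pattern: indices of the window sorted by value.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        weight_per_pattern[pattern] += weight
    total = sum(weight_per_pattern.values())
    if total == 0:
        return 0.0  # constant signal: no variability to measure
    entropy = 0.0
    for weight in weight_per_pattern.values():
        if weight > 0:
            p = weight / total
            entropy -= p * math.log(p)
    if normalize:
        entropy /= math.log(math.factorial(order))
    return entropy
```

A regular signal concentrates its weight in few patterns and gives a value near 0, while an irregular one spreads the weight and pushes the normalized value toward 1, which is the property a fault detector could threshold on.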
Khalid, Shehzad; Arshad, Sannia; Jabbar, Sohail; Rho, Seungmin
2014-01-01
We present a classification framework that combines multiple heterogeneous classifiers in the presence of class label noise. An extension of m-Mediods based modeling is presented that generates models of the various classes while identifying and filtering noisy training data. This noise-free data is then used to learn models for other classifiers such as GMM and SVM. A weight learning method is then introduced to learn per-class weights for the different classifiers in order to construct an ensemble. For this purpose, we apply a genetic algorithm to search for an optimal weight vector with which the classifier ensemble is expected to give the best accuracy. The proposed approach is evaluated on a variety of real-life datasets. It is also compared with existing standard ensemble techniques such as AdaBoost, Bagging, and Random Subspace Methods. Experimental results show the superiority of the proposed ensemble method over its competitors, especially in the presence of class label noise and imbalanced classes. PMID:25295302
An efficient ensemble learning method for gene microarray classification.
Osareh, Alireza; Shadgar, Bita
2013-01-01
Gene microarray analysis and classification have proven to be an effective approach to diagnosing diseases and cancers. However, it has also been shown that basic classification techniques have intrinsic drawbacks in achieving accurate gene classification and cancer diagnosis. Classifier ensembles, on the other hand, have received increasing attention in various applications. Here, we address the gene classification problem using the RotBoost ensemble methodology. This method combines the Rotation Forest and AdaBoost techniques, which in turn preserves both desirable features of an ensemble architecture, that is, accuracy and diversity. To select a concise subset of informative genes, 5 different feature selection algorithms are considered. To assess the efficiency of RotBoost, other non-ensemble and ensemble techniques, including Decision Trees, Support Vector Machines, Rotation Forest, AdaBoost, and Bagging, are also deployed. Experimental results show that the combination of the fast correlation-based feature selection method with the ICA-based RotBoost ensemble is highly effective for gene classification. In fact, the proposed method can create ensemble classifiers that outperform not only the classifiers produced by conventional machine learning but also those generated by two widely used conventional ensemble learning methods, that is, Bagging and AdaBoost.
Force Sensor Based Tool Condition Monitoring Using a Heterogeneous Ensemble Learning Model
Wang, Guofeng; Yang, Yinwei; Li, Zhimeng
2014-11-14
Tool condition monitoring (TCM) plays an important role in improving machining efficiency and guaranteeing workpiece quality. To recognize the tool condition reliably, a robust classifier needs to be constructed to describe the relationship between tool wear states and sensory information. However, because of the complexity of the machining process and the uncertainty of tool wear evolution, it is hard for a single classifier to fit all the collected samples without sacrificing generalization ability. In this paper, heterogeneous ensemble learning is proposed for tool condition monitoring, in which the support vector machine (SVM), hidden Markov model (HMM) and radial basis function (RBF) are selected as base classifiers and a stacking ensemble strategy is used to capture the relationship between the outputs of these base classifiers and the tool wear states. Based on the heterogeneous ensemble learning classifier, an online monitoring system is constructed in which harmonic features are extracted from force signals and a minimal redundancy and maximal relevance (mRMR) algorithm is used to select the most prominent features. To verify the effectiveness of the proposed method, a titanium alloy milling experiment was carried out and samples with different tool wear states were collected to build the proposed heterogeneous ensemble learning classifier. A homogeneous ensemble learning model and a majority voting strategy are also adopted for comparison. The analysis and comparison results show that the proposed heterogeneous ensemble learning classifier performs better in both classification accuracy and stability. PMID:25405514
Sørensen, Lauge; Nielsen, Mads
2018-05-15
The International Challenge for Automated Prediction of MCI from MRI data offered independent, standardized comparison of machine learning algorithms for multi-class classification of normal control (NC), mild cognitive impairment (MCI), converting MCI (cMCI), and Alzheimer's disease (AD) using brain imaging and general cognition. We proposed to use an ensemble of support vector machines (SVMs) that combined bagging without replacement and feature selection. SVM is the most commonly used algorithm in multivariate classification of dementia, and it was therefore valuable to evaluate the potential benefit of ensembling this type of classifier. The ensemble SVM, using either a linear or a radial basis function (RBF) kernel, achieved multi-class classification accuracies of 55.6% and 55.0% in the challenge test set (60 NC, 60 MCI, 60 cMCI, 60 AD), resulting in a third place in the challenge. Similar feature subset sizes were obtained for both kernels, and the most frequently selected MRI features were the volumes of the two hippocampal subregions left presubiculum and right subiculum. Post-challenge analysis revealed that enforcing a minimum number of selected features and increasing the number of ensemble classifiers improved classification accuracy up to 59.1%. The ensemble SVM outperformed single SVM classifications consistently in the challenge test set. Ensemble methods using bagging and feature selection can improve the performance of the commonly applied SVM classifier in dementia classification. This resulted in competitive classification accuracies in the International Challenge for Automated Prediction of MCI from MRI data. Copyright © 2018 Elsevier B.V. All rights reserved.
SVM and SVM Ensembles in Breast Cancer Prediction
Huang, Min-Wei; Chen, Chih-Wen; Lin, Wei-Chao; Ke, Shih-Wen; Tsai, Chih-Fong
2017-01-01
Breast cancer is an all too common disease in women, making how to effectively predict it an active research problem. A number of statistical and machine learning techniques have been employed to develop various breast cancer prediction models. Among them, support vector machines (SVM) have been shown to outperform many related techniques. To construct the SVM classifier, it is first necessary to decide the kernel function, and different kernel functions can result in different prediction performance. However, there have been very few studies focused on examining the prediction performances of SVM based on different kernel functions. Moreover, it is unknown whether SVM classifier ensembles which have been proposed to improve the performance of single classifiers can outperform single SVM classifiers in terms of breast cancer prediction. Therefore, the aim of this paper is to fully assess the prediction performance of SVM and SVM ensembles over small and large scale breast cancer datasets. The classification accuracy, ROC, F-measure, and computational times of training SVM and SVM ensembles are compared. The experimental results show that linear kernel based SVM ensembles based on the bagging method and RBF kernel based SVM ensembles with the boosting method can be the better choices for a small scale dataset, where feature selection should be performed in the data pre-processing stage. For a large scale dataset, RBF kernel based SVM ensembles based on boosting perform better than the other classifiers. PMID:28060807
Yu, Hualong; Ni, Jun
2014-01-01
Training classifiers on skewed data is technically challenging, and the task becomes harder still when the data are also high-dimensional. Skewed data are common in biomedicine. In this study, we address this problem by combining the asymmetric bagging ensemble classifier (asBagging) presented in previous work with an improved random subspace (RS) generation strategy called feature subspace (FSS). Specifically, FSS is a novel method for improving the balance between accuracy and diversity of the base classifiers in asBagging. In view of the strong generalization capability of the support vector machine (SVM), we adopt it as the base classifier. Extensive experiments on four benchmark biomedical data sets indicate that the proposed ensemble learning method outperforms many baseline approaches in terms of accuracy, F-measure, G-mean and AUC; it can thus be regarded as an effective and efficient tool for high-dimensional and imbalanced biomedical data.
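The general idea of pairing asymmetric bagging with feature subspaces can be sketched as below. This is a simplified illustration, not the paper's asBagging or FSS procedure: it assumes each bag keeps every minority-class sample, draws an equally sized majority subset without replacement, and picks features uniformly at random (a plain random subspace). All names and default values are hypothetical.

```python
import random

def asymmetric_bags(labels, n_features, n_bags=5, subspace_size=3,
                    minority=1, seed=0):
    # Returns, for each bag, (sample indices, feature indices) on which
    # one base classifier (for example an SVM) would be trained.
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    bags = []
    for _ in range(n_bags):
        # Balance the bag: all minority samples plus as many majority samples.
        sampled_majority = rng.sample(majority_idx, len(minority_idx))
        # Plain random subspace; the paper's FSS strategy is not reproduced.
        features = sorted(rng.sample(range(n_features), subspace_size))
        bags.append((minority_idx + sampled_majority, features))
    return bags
```

Because every bag is class-balanced but sees a different majority subset and feature subset, the base classifiers differ from one another without any of them being trained on skewed data.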
Bashir, Saba; Qamar, Usman; Khan, Farhan Hassan
2015-06-01
Conventional clinical decision support systems are based on individual classifiers or simple combinations of these classifiers, which tend to show moderate performance. This research paper presents a novel classifier ensemble framework based on an enhanced bagging approach with a multi-objective weighted voting scheme for the prediction and analysis of heart disease. The proposed model overcomes the limitations of conventional approaches by utilizing an ensemble of five heterogeneous classifiers: Naïve Bayes, linear regression, quadratic discriminant analysis, instance-based learner and support vector machines. Five different datasets, obtained from publicly available data repositories, are used for experimentation, evaluation and validation. The effectiveness of the proposed ensemble is investigated by comparing its results with those of several classifiers. Prediction results of the proposed ensemble model are assessed by ten-fold cross-validation and ANOVA statistics. The experimental evaluation shows that the proposed framework deals with all types of attributes and achieves a high diagnosis accuracy of 84.16%, 93.29% sensitivity, 96.70% specificity, and 82.15% F-measure. An f-ratio higher than f-critical and a p value less than 0.05 for a 95% confidence interval indicate that the results are extremely statistically significant for most of the datasets.
Ensemble based on static classifier selection for automated diagnosis of Mild Cognitive Impairment.
Nanni, Loris; Lumini, Alessandra; Zaffonato, Nicolò
2018-05-15
Alzheimer's disease (AD) is the most common cause of neurodegenerative dementia in the elderly population. Scientific research is very active in the challenge of designing automated approaches to achieve an early and certain diagnosis. Recently an international competition among AD predictors was organized: "A Machine learning neuroimaging challenge for automated diagnosis of Mild Cognitive Impairment" (MLNeCh). This competition is based on pre-processed sets of T1-weighted Magnetic Resonance Images (MRI) to be classified into four categories: stable AD, individuals with MCI who converted to AD, individuals with MCI who did not convert to AD, and healthy controls. In this work, we propose a method for early diagnosis of AD, which is evaluated on the MLNeCh dataset. Since the automatic classification of AD relies on feature vectors of high dimensionality, different feature selection/reduction techniques are compared in order to avoid the curse-of-dimensionality problem. The classification method is then obtained as a combination of Support Vector Machines, each trained on a different cluster of data extracted from the whole training set. The multi-classifier approach proposed in this work outperforms all the stand-alone methods tested in our experiments. The proposed ensemble has the great advantage of performing well using a very reduced version of the data (the reduction factor is more than 90%). The MATLAB code for the ensemble of classifiers will be made publicly available to other researchers for future comparisons. Copyright © 2017 Elsevier B.V. All rights reserved.
Proposed hybrid-classifier ensemble algorithm to map snow cover area
NASA Astrophysics Data System (ADS)
Nijhawan, Rahul; Raman, Balasubramanian; Das, Josodhir
2018-01-01
A metaclassification ensemble approach is known to improve prediction performance for snow-covered area. The methodology adopted here is based on a neural network combined with four state-of-the-art machine learning algorithms (support vector machine, artificial neural network, spectral angle mapper and K-means clustering) and a snow index, the normalized difference snow index. An AdaBoost ensemble algorithm based on decision trees is also proposed for snow-cover mapping. According to the available literature, these methods have rarely been used for snow-cover mapping. Employing the above techniques, a study was conducted for the Raktavarn and Chaturangi Bamak glaciers, Uttarakhand, Himalaya, using a multispectral Landsat 7 ETM+ (enhanced thematic mapper) image. The study also compares the results with those obtained from statistical combination methods (majority rule and belief functions) and with the accuracies of the individual classifiers. Accuracy assessment is performed by computing the quantity and allocation disagreement and by analyzing statistical measures (accuracy, precision, specificity, AUC, and sensitivity) and receiver operating characteristic curves. A total of 225 combinations of parameters for the individual classifiers were trained and tested on the dataset, and the results were compared with the proposed approach. The proposed methodology produced the highest classification accuracy (95.21%), close to the 94.01% produced by the proposed AdaBoost ensemble algorithm. From these observations, it was concluded that the ensemble of classifiers produced better results than the individual classifiers.
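The normalized difference snow index used as one input above is conventionally computed as (Green - SWIR) / (Green + SWIR); for Landsat 7 ETM+ this usually means bands 2 and 5. A minimal per-pixel sketch follows. The function names are hypothetical, and the 0.4 threshold is a commonly quoted default assumed here for illustration, not a value taken from the study.

```python
def ndsi(green, swir):
    # Normalized difference snow index for one pixel's reflectances.
    total = green + swir
    if total == 0:
        return 0.0  # avoid division by zero on empty or masked pixels
    return (green - swir) / total

def is_snow(green, swir, threshold=0.4):
    # The threshold is an assumed default, to be tuned per scene.
    return ndsi(green, swir) >= threshold
```

Snow reflects strongly in the green band and weakly in the shortwave infrared, so snow pixels yield high positive index values while most other surfaces stay near or below zero.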
Ozcift, Akin
2012-08-01
Parkinson disease (PD) is an age-related deterioration of certain nerve systems that affects the movement, balance and muscle control of patients. PD is a common disease, affecting 1% of people older than 60 years. A new classification scheme that uses support vector machine (SVM) selected features to train rotation forest (RF) ensemble classifiers is presented for improving the diagnosis of PD. The dataset contains records of voice measurements from 31 people, 23 of them with PD, and each record is described by 22 features. The diagnosis model first uses a linear SVM to select the ten most relevant of the 22 features. In the second step, six different classifiers are trained on this feature subset. In the third step, the accuracies of the classifiers are improved by applying the RF ensemble classification strategy. The results are evaluated using three metrics: classification accuracy (ACC), Kappa Error (KE) and area under the Receiver Operating Characteristic (ROC) curve (AUC). Two base classifiers, KStar and IBk, showed an apparent increase in PD diagnosis accuracy compared with similar studies in the literature. Overall, applying the RF ensemble classification scheme significantly improved PD diagnosis for 5 of the 6 classifiers. We obtained about 97% accuracy with the RF ensemble of the IBk (a K-Nearest Neighbor variant) algorithm, which is quite high performance for Parkinson disease diagnosis.
Ensemble of One-Class Classifiers for Personal Risk Detection Based on Wearable Sensor Data.
Rodríguez, Jorge; Barrera-Animas, Ari Y; Trejo, Luis A; Medina-Pérez, Miguel Angel; Monroy, Raúl
2016-09-29
This study introduces the One-Class K-means with Randomly-projected features Algorithm (OCKRA). OCKRA is an ensemble of one-class classifiers built over multiple projections of a dataset according to random feature subsets. Algorithms found in the literature spread over a wide range of applications where ensembles of one-class classifiers have been satisfactorily applied; however, none is oriented to the area under our study: personal risk detection. OCKRA has been designed with the aim of improving the detection performance in the problem posed by the Personal RIsk DEtection(PRIDE) dataset. PRIDE was built based on 23 test subjects, where the data for each user were captured using a set of sensors embedded in a wearable band. The performance of OCKRA was compared against support vector machine and three versions of the Parzen window classifier. On average, experimental results show that OCKRA outperformed the other classifiers for at least 0.53% of the area under the curve (AUC). In addition, OCKRA achieved an AUC above 90% for more than 57% of the users.
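The idea of an ensemble of one-class classifiers built over random feature subsets can be illustrated with a deliberately simplified sketch. It is not OCKRA itself: each member here uses a single centroid and a distance radius instead of k-means clusters, and all names, sizes, and data are assumptions made for illustration.

```python
import math
import random


def train_ensemble(normal_rows, n_members=10, subset_size=2, seed=0):
    """Train one simple one-class model per random feature subset.

    Simplification (assumption): each member stores one centroid of the
    normal data and the largest training distance to it as its radius.
    """
    rng = random.Random(seed)
    n_features = len(normal_rows[0])
    members = []
    for _ in range(n_members):
        feats = rng.sample(range(n_features), subset_size)
        centroid = [sum(r[f] for r in normal_rows) / len(normal_rows) for f in feats]
        radius = max(math.dist([r[f] for f in feats], centroid) for r in normal_rows)
        members.append((feats, centroid, radius))
    return members


def anomaly_score(members, row):
    """Fraction of members that place the row outside their radius."""
    votes = sum(
        math.dist([row[f] for f in feats], centroid) > radius
        for feats, centroid, radius in members
    )
    return votes / len(members)


# Hypothetical sensor readings for normal behaviour (three features).
normal = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [-0.2, 0.1, 0.0], [0.1, -0.1, 0.1]]
model = train_ensemble(normal)
```

A reading far from all normal data, such as `[10, 10, 10]`, is flagged by every member, while any training row is flagged by none.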
Ensemble of One-Class Classifiers for Personal Risk Detection Based on Wearable Sensor Data
Rodríguez, Jorge; Barrera-Animas, Ari Y.; Trejo, Luis A.; Medina-Pérez, Miguel Angel; Monroy, Raúl
2016-01-01
This study introduces the One-Class K-means with Randomly-projected features Algorithm (OCKRA). OCKRA is an ensemble of one-class classifiers built over multiple projections of a dataset according to random feature subsets. Algorithms found in the literature spread over a wide range of applications where ensembles of one-class classifiers have been satisfactorily applied; however, none is oriented to the area under our study: personal risk detection. OCKRA has been designed with the aim of improving the detection performance in the problem posed by the Personal RIsk DEtection(PRIDE) dataset. PRIDE was built based on 23 test subjects, where the data for each user were captured using a set of sensors embedded in a wearable band. The performance of OCKRA was compared against support vector machine and three versions of the Parzen window classifier. On average, experimental results show that OCKRA outperformed the other classifiers for at least 0.53% of the area under the curve (AUC). In addition, OCKRA achieved an AUC above 90% for more than 57% of the users. PMID:27690054
Ensembles of novelty detection classifiers for structural health monitoring using guided waves
NASA Astrophysics Data System (ADS)
Dib, Gerges; Karpenko, Oleksii; Koricho, Ermias; Khomenko, Anton; Haq, Mahmoodul; Udpa, Lalita
2018-01-01
Guided wave structural health monitoring uses sparse sensor networks embedded in sophisticated structures for defect detection and characterization. The biggest challenge of those sensor networks is developing robust techniques for reliable damage detection under changing environmental and operating conditions (EOC). To address this challenge, we develop a novelty classifier for damage detection based on one class support vector machines. We identify appropriate features for damage detection and introduce a feature aggregation method which quadratically increases the number of available training observations. We adopt a two-level voting scheme by using an ensemble of classifiers and predictions. Each classifier is trained on a different segment of the guided wave signal, and each classifier makes an ensemble of predictions based on a single observation. Using this approach, the classifier can be trained using a small number of baseline signals. We study the performance using Monte-Carlo simulations of an analytical model and data from impact damage experiments on a glass fiber composite plate. We also demonstrate the classifier performance using two types of baseline signals: fixed and rolling baseline training set. The former requires prior knowledge of baseline signals from all EOC, while the latter does not and leverages the fact that EOC vary slowly over time and can be modeled as a Gaussian process.
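The two-level voting scheme described above can be sketched as two rounds of majority voting: first over the predictions each classifier makes for one observation, then over the classifiers. This is only an illustration under that assumption; the labels and function names are hypothetical, and the paper's actual voting rule may differ.

```python
from collections import Counter


def majority(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def two_level_vote(predictions_by_classifier):
    """Combine an ensemble of predictions per classifier, then the classifiers.

    predictions_by_classifier[i] holds the predictions that classifier i
    (trained on one signal segment) makes for a single observation.
    """
    first_level = [majority(preds) for preds in predictions_by_classifier]
    return majority(first_level)


# Three hypothetical segment classifiers, three predictions each.
decision = two_level_vote([
    ["damage", "baseline", "damage"],
    ["baseline", "baseline", "baseline"],
    ["damage", "damage", "baseline"],
])
```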
NASA Astrophysics Data System (ADS)
Lawi, Armin; Adhitya, Yudhi
2018-03-01
The objective of this research is to determine the quality of cocoa beans through the morphology of their digital images. Samples of cocoa beans were scattered on bright white paper under controlled lighting conditions. A compact digital camera was used to capture the images. The images were then processed to extract their morphological parameters. The classification process begins with an analysis of the cocoa bean images based on morphological feature extraction. The extracted morphological (physical) feature parameters are Area, Perimeter, Major Axis Length, Minor Axis Length, Aspect Ratio, Circularity, Roundness, and Feret Diameter. The cocoa beans are classified into four groups: Normal Beans, Broken Beans, Fractured Beans, and Skin Damaged Beans. The classification model used in this paper is the Multiclass Ensemble Least-Squares Support Vector Machine (MELS-SVM), a proposed improvement of SVM using an ensemble method in which the separating hyperplanes are obtained by a least-squares approach and the multiclass procedure uses the One-Against-All method. The proposed model, with morphological features as input parameters, achieved a classification accuracy of 99.705% for the four classes.
Data mining for the analysis of hippocampal zones in Alzheimer's disease
NASA Astrophysics Data System (ADS)
Ovando Vázquez, Cesaré M.
2012-02-01
In this work, a methodology to classify people with Alzheimer's Disease (AD), Healthy Controls (HC), and people with Mild Cognitive Impairment (MCI) is presented. This methodology consists of an ensemble of Support Vector Machines (SVM) with the hippocampal boxes (HB) as input data; these hippocampal zones are taken from Magnetic Resonance (MRI) and Positron Emission Tomography (PET) images. Two ways of constructing this ensemble are presented: the first consists of linear SVM models and the second of non-linear SVM models. Results demonstrate that the linear models classify HBs more accurately than the non-linear models between HC and MCI and that there are no differences between HC and AD.
Ensembles of novelty detection classifiers for structural health monitoring using guided waves
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dib, Gerges; Karpenko, Oleksii; Koricho, Ermias
Guided wave structural health monitoring uses sparse sensor networks embedded in sophisticated structures for defect detection and characterization. The biggest challenge of those sensor networks is developing robust techniques for reliable damage detection under changing environmental and operating conditions. To address this challenge, we develop a novelty classifier for damage detection based on one class support vector machines. We identify appropriate features for damage detection and introduce a feature aggregation method which quadratically increases the number of available training observations. We adopt a two-level voting scheme by using an ensemble of classifiers and predictions. Each classifier is trained on a different segment of the guided wave signal, and each classifier makes an ensemble of predictions based on a single observation. Using this approach, the classifier can be trained using a small number of baseline signals. We study the performance using Monte-Carlo simulations of an analytical model and data from impact damage experiments on a glass fiber composite plate. We also demonstrate the classifier performance using two types of baseline signals: fixed and rolling baseline training set. The former requires prior knowledge of baseline signals from all environmental and operating conditions, while the latter does not and leverages the fact that environmental and operating conditions vary slowly over time and can be modeled as a Gaussian process.
Bonet, Isis; Franco-Montero, Pedro; Rivero, Virginia; Teijeira, Marta; Borges, Fernanda; Uriarte, Eugenio; Morales Helguera, Aliuska
2013-12-23
A(2B) adenosine receptor antagonists may be beneficial in treating diseases like asthma, diabetes, diabetic retinopathy, and certain cancers. This has stimulated research for the development of potent ligands for this subtype, based on quantitative structure-affinity relationships. In this work, a new ensemble machine learning algorithm is proposed for classification and prediction of the ligand-binding affinity of A(2B) adenosine receptor antagonists. This algorithm is based on the training of different classifier models with multiple training sets (composed of the same compounds but represented by diverse features). The k-nearest neighbor, decision trees, neural networks, and support vector machines were used as single classifiers. To select the base classifiers for combining into the ensemble, several diversity measures were employed. The final multiclassifier prediction results were computed from the output obtained by using a combination of selected base classifiers output, by utilizing different mathematical functions including the following: majority vote, maximum and average probability. In this work, 10-fold cross- and external validation were used. The strategy led to the following results: i) the single classifiers, together with previous features selections, resulted in good overall accuracy, ii) a comparison between single classifiers, and their combinations in the multiclassifier model, showed that using our ensemble gave a better performance than the single classifier model, and iii) our multiclassifier model performed better than the most widely used multiclassifier models in the literature. The results and statistical analysis demonstrated the supremacy of our multiclassifier approach for predicting the affinity of A(2B) adenosine receptor antagonists, and it can be used to develop other QSAR models.
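The three combination functions named above (majority vote, maximum probability, and average probability) can give different answers on the same base-classifier outputs. The sketch below illustrates this on made-up class probabilities; the function names and numbers are hypothetical and are not taken from the study.

```python
def majority_vote(probs):
    """Each classifier votes for its most probable class; most votes wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return max(sorted(set(votes)), key=votes.count)


def maximum_probability(probs):
    """Pick the class with the single highest probability from any classifier."""
    n_classes = len(probs[0])
    best = [max(p[c] for p in probs) for c in range(n_classes)]
    return max(range(n_classes), key=best.__getitem__)


def average_probability(probs):
    """Pick the class with the highest mean probability across classifiers."""
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Three hypothetical base classifiers, two classes (0 = low, 1 = high affinity).
outputs = [[0.60, 0.40], [0.55, 0.45], [0.10, 0.90]]
```

Here two weakly confident classifiers outvote one strongly confident classifier under majority voting (class 0), whereas the maximum and average rules both select class 1.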
Chen, Zhiru; Hong, Wenxue
2016-02-01
Considering the low prediction accuracy on positive samples and the poor overall classification caused by the unbalanced sample data of microRNA (miRNA) targets, we propose a support vector machine (SVM)-integration of under-sampling and weight (IUSM) algorithm in this paper, an under-sampling algorithm based on ensemble learning. The algorithm adopts SVM as the learning algorithm and AdaBoost as the integration framework, and embeds clustering-based under-sampling into the iterative process, aiming to reduce the imbalance between positive and negative samples. Meanwhile, during the adaptive weight adjustment of the samples, the SVM-IUSM algorithm eliminates abnormal negative samples with a robust sample-weight smoothing mechanism so as to avoid over-learning. Finally, the integrated miRNA target classifier makes its prediction by combining multiple weak classifiers through a voting mechanism. The experiment revealed that SVM-IUSM, compared with other algorithms on a collection of unbalanced datasets, could not only improve the accuracy on positive targets and the overall classification, but also enhance the generalization ability of the miRNA target classifier.
Marucci-Wellman, Helen R; Corns, Helen L; Lehto, Mark R
2017-01-01
Injury narratives are now available in real time and include useful information for injury surveillance and prevention. However, manual classification of the cause or events leading to injury found in large batches of narratives, such as workers compensation claims databases, can be prohibitive. In this study we compare the utility of four machine learning algorithms (Naïve Bayes single-word and bi-gram models, Support Vector Machine, and Logistic Regression) for classifying narratives into Bureau of Labor Statistics Occupational Injury and Illness event leading to injury classifications for a large workers compensation database. These algorithms are known to do well classifying narrative text and are fairly easy to implement with off-the-shelf software packages such as Python. We propose human-machine learning ensemble approaches which maximize the power and accuracy of the algorithms for machine-assigned codes and allow for strategic filtering of rare, emerging, or ambiguous narratives for manual review. We compare human-machine approaches based on filtering on the prediction strength of the classifier vs. agreement between algorithms. Regularized Logistic Regression (LR) was the best performing algorithm alone. Using this algorithm and filtering out the bottom 30% of predictions for manual review resulted in high accuracy (overall sensitivity/positive predictive value of 0.89) of the final machine-human coded dataset. The best pairings of algorithms included Naïve Bayes with Support Vector Machine, whereby the triple ensemble NB_SW = NB_BI-GRAM = SVM had very high performance (0.93 overall sensitivity/positive predictive value and high accuracy (i.e. high sensitivity and positive predictive values)) across both large and small categories, leaving 41% of the narratives for manual review. Integrating LR into this ensemble mix improved performance only slightly. 
For large administrative datasets we propose incorporation of methods based on human-machine pairings such as we have done here, utilizing readily-available off-the-shelf machine learning techniques and resulting in only a fraction of narratives that require manual review. Human-machine ensemble methods are likely to improve performance over total manual coding. Copyright © 2016 The Authors. Published by Elsevier Ltd. All rights reserved.
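The two filtering strategies compared above (agreement between algorithms versus prediction strength) can be sketched as simple routing rules. This is an illustrative sketch only: the function names, narrative identifiers, event codes, and confidence values are hypothetical, and the study's actual thresholds and tie handling are not given in the abstract.

```python
def route_by_agreement(codes_by_model):
    """Accept the machine code only when every model assigns the same code."""
    first = codes_by_model[0]
    if all(code == first for code in codes_by_model):
        return ("machine", first)
    return ("manual", None)


def route_by_strength(scored, review_fraction=0.30):
    """Send the weakest predictions of a single model to manual review.

    scored is a list of (narrative_id, code, confidence) tuples.
    """
    ranked = sorted(scored, key=lambda item: item[2])
    n_manual = int(len(ranked) * review_fraction)
    manual = [item[0] for item in ranked[:n_manual]]
    machine = {item[0]: item[1] for item in ranked[n_manual:]}
    return machine, manual


# Hypothetical predictions: (narrative id, event code, classifier confidence).
scored = [("a", "62", 0.9), ("b", "31", 0.2), ("c", "62", 0.6), ("d", "11", 0.4)]
machine, manual = route_by_strength(scored, review_fraction=0.5)
```

The agreement rule needs no confidence calibration but requires several models, while the strength rule works with one model and lets the review workload be set directly.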
NASA Astrophysics Data System (ADS)
Pahlavani, Parham; Bigdeli, Behnaz
2017-12-01
Hyperspectral images contain extremely rich spectral information that offers great potential to discriminate between various land cover classes. However, these images are usually composed of tens or hundreds of spectrally close bands, which results in high redundancy and a great amount of computation time in hyperspectral classification. Furthermore, in the presence of mixed-coverage pixels, crisp classifiers produce omission and commission errors. This paper presents a mutual information-Dempster-Shafer system through an ensemble classification approach for classification of hyperspectral data. First, mutual information is applied to split the data into a few independent partitions to overcome high dimensionality. Then, a fuzzy maximum likelihood classifier classifies each band subset. Finally, Dempster-Shafer is applied to fuse the results of the fuzzy classifiers. In order to assess the proposed method, a crisp ensemble system based on a support vector machine as the crisp classifier and weighted majority voting as the crisp fusion method is applied to hyperspectral data. Furthermore, a dimension reduction system is utilized to assess the effectiveness of the mutual information band splitting of the proposed method. The proposed methodology provides interesting conclusions on the effectiveness and potential of mutual information-Dempster-Shafer based classification of hyperspectral data.
Jiang, Xiaoying; Wei, Rong; Zhao, Yanjun; Zhang, Tongliang
2008-05-01
Knowledge of subnuclear localization in eukaryotic cells is essential for understanding the life function of the nucleus. Developing prediction methods and tools for protein subnuclear localization has become an important research field in protein science because of the special characteristics of the cell nucleus. In this study, a novel approach is proposed to predict protein subnuclear localization. A protein sample is represented by a Pseudo Amino Acid (PseAA) composition based on the approximate entropy (ApEn) concept, which reflects the complexity of time series. A novel ensemble classifier is designed incorporating three AdaBoost classifiers. The base classifier algorithms in the three AdaBoost classifiers are decision stumps, a fuzzy K nearest neighbors classifier, and radial basis-support vector machines, respectively. Different PseAA compositions are used as input data for the different AdaBoost classifiers in the ensemble. A genetic algorithm is used to optimize the dimension and weight factor of the PseAA composition. Two datasets often used in published works are used to validate the performance of the proposed approach. The results obtained in the Jackknife cross-validation test are higher and more balanced than those of other methods on the same datasets. The promising results indicate that the proposed approach is effective and practical. It might become a useful tool in protein subnuclear localization. The software in Matlab and supplementary materials are available freely by contacting the corresponding author.
Using Support Vector Machine Ensembles for Target Audience Classification on Twitter
Lo, Siaw Ling; Chiong, Raymond; Cornforth, David
2015-01-01
The vast amount and diversity of the content shared on social media can pose a challenge for any business wanting to use it to identify potential customers. In this paper, our aim is to investigate the use of both unsupervised and supervised learning methods for target audience classification on Twitter with minimal annotation efforts. Topic domains were automatically discovered from contents shared by followers of an account owner using Twitter Latent Dirichlet Allocation (LDA). A Support Vector Machine (SVM) ensemble was then trained using contents from different account owners of the various topic domains identified by Twitter LDA. Experimental results show that the methods presented are able to successfully identify a target audience with high accuracy. In addition, we show that using a statistical inference approach such as bootstrapping in over-sampling, instead of using random sampling, to construct training datasets can achieve a better classifier in an SVM ensemble. We conclude that such an ensemble system can take advantage of data diversity, which enables real-world applications for differentiating prospective customers from the general audience, leading to business advantage in the crowded social media space. PMID:25874768
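The use of bootstrapping for over-sampling when constructing training sets for ensemble members can be sketched as sampling the smaller class with replacement until it matches the larger one. The abstract does not spell out the exact procedure, so the function names, the balancing target, and the example data below are assumptions.

```python
import random


def bootstrap_oversample(minority, target_size, rng):
    """Draw target_size items from the minority class with replacement."""
    return rng.choices(minority, k=target_size)


def build_training_sets(minority, majority, n_members=5, seed=0):
    """Create one balanced training set per ensemble member.

    Each member gets a different bootstrap resample of the minority class,
    which is one way to introduce diversity between members.
    """
    rng = random.Random(seed)
    return [
        (bootstrap_oversample(minority, len(majority), rng), list(majority))
        for _ in range(n_members)
    ]


# Hypothetical tweets: few on-topic examples, many general ones.
on_topic = ["t1", "t2", "t3"]
general = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
training_sets = build_training_sets(on_topic, general)
```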
Using support vector machine ensembles for target audience classification on Twitter.
Lo, Siaw Ling; Chiong, Raymond; Cornforth, David
2015-01-01
The vast amount and diversity of the content shared on social media can pose a challenge for any business wanting to use it to identify potential customers. In this paper, our aim is to investigate the use of both unsupervised and supervised learning methods for target audience classification on Twitter with minimal annotation efforts. Topic domains were automatically discovered from contents shared by followers of an account owner using Twitter Latent Dirichlet Allocation (LDA). A Support Vector Machine (SVM) ensemble was then trained using contents from different account owners of the various topic domains identified by Twitter LDA. Experimental results show that the methods presented are able to successfully identify a target audience with high accuracy. In addition, we show that using a statistical inference approach such as bootstrapping in over-sampling, instead of using random sampling, to construct training datasets can achieve a better classifier in an SVM ensemble. We conclude that such an ensemble system can take advantage of data diversity, which enables real-world applications for differentiating prospective customers from the general audience, leading to business advantage in the crowded social media space.
Recognition of medication information from discharge summaries using ensembles of classifiers.
Doan, Son; Collier, Nigel; Xu, Hua; Pham, Hoang Duy; Tu, Minh Phuong
2012-05-07
Extraction of clinical information such as medications or problems from clinical text is an important task of clinical natural language processing (NLP). Rule-based methods are often used in clinical NLP systems because they are easy to adapt and customize. Recently, supervised machine learning methods have proven to be effective in clinical NLP as well. However, combining different classifiers to further improve the performance of clinical entity recognition systems has not been investigated extensively. Combining classifiers into an ensemble classifier presents both challenges and opportunities to improve performance in such NLP tasks. We investigated ensemble classifiers that used different voting strategies to combine outputs from three individual classifiers: a rule-based system, a support vector machine (SVM) based system, and a conditional random field (CRF) based system. Three voting methods were proposed and evaluated using the annotated data sets from the 2009 i2b2 NLP challenge: simple majority, local SVM-based voting, and local CRF-based voting. Evaluation on 268 manually annotated discharge summaries from the i2b2 challenge showed that the local CRF-based voting method achieved the best F-score of 90.84% (94.11% Precision, 87.81% Recall) for 10-fold cross-validation. We then compared our systems with the first-ranked system in the challenge by using the same training and test sets. Our system based on majority voting achieved a better F-score of 89.65% (93.91% Precision, 85.76% Recall) than the previously reported F-score of 89.19% (93.78% Precision, 85.03% Recall) by the first-ranked system in the challenge. Our experimental results using the 2009 i2b2 challenge datasets showed that ensemble classifiers that combine individual classifiers into a voting system could achieve better performance than a single classifier in recognizing medication information from clinical text. 
It suggests that simple strategies that can be easily implemented such as majority voting could have the potential to significantly improve clinical entity recognition.
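As a quick consistency check on the figures quoted above, and assuming the reported F-score is the balanced F1 measure (the harmonic mean of precision P and recall R), the majority-voting and first-ranked results on the shared test set can be reproduced from their precision and recall:

```latex
F_1 = \frac{2PR}{P + R}, \qquad
F_1^{\text{majority}} = \frac{2 \times 93.91 \times 85.76}{93.91 + 85.76} \approx 89.65, \qquad
F_1^{\text{first-ranked}} = \frac{2 \times 93.78 \times 85.03}{93.78 + 85.03} \approx 89.19
```

Both values agree with the reported F-scores to two decimal places.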
Rodriguez-Diaz, Eladio; Castanon, David A; Singh, Satish K; Bigio, Irving J
2011-06-01
Optical spectroscopy has shown potential as a real-time, in vivo, diagnostic tool for identifying neoplasia during endoscopy. We present the development of a diagnostic algorithm to classify elastic-scattering spectroscopy (ESS) spectra as either neoplastic or non-neoplastic. The algorithm is based on pattern recognition methods, including ensemble classifiers, in which members of the ensemble are trained on different regions of the ESS spectrum, and misclassification-rejection, where the algorithm identifies and refrains from classifying samples that are at higher risk of being misclassified. These "rejected" samples can be reexamined by simply repositioning the probe to obtain additional optical readings or ultimately by sending the polyp for histopathological assessment, as per standard practice. Prospective validation using separate training and testing sets result in a baseline performance of sensitivity = .83, specificity = .79, using the standard framework of feature extraction (principal component analysis) followed by classification (with linear support vector machines). With the developed algorithm, performance improves to Se ∼ 0.90, Sp ∼ 0.90, at a cost of rejecting 20-33% of the samples. These results are on par with a panel of expert pathologists. For colonoscopic prevention of colorectal cancer, our system could reduce biopsy risk and cost, obviate retrieval of non-neoplastic polyps, decrease procedure time, and improve assessment of cancer risk.
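The misclassification-rejection idea described above can be illustrated with a simple rule that declines to classify samples whose decision score lies close to the boundary. This is a sketch under that assumption only; the abstract does not state the actual rejection criterion, and the scores, margin, and names below are hypothetical.

```python
def classify_with_rejection(score, margin):
    """Return a label, or None when the score is too close to the boundary."""
    if abs(score) < margin:
        return None  # rejected: re-measure or send for histopathology
    return "neoplastic" if score > 0 else "non-neoplastic"


def evaluate(scores, is_neoplastic, margin):
    """Sensitivity and specificity on accepted samples, plus rejection rate."""
    tp = fn = tn = fp = rejected = 0
    for score, truth in zip(scores, is_neoplastic):
        label = classify_with_rejection(score, margin)
        if label is None:
            rejected += 1
        elif truth:
            tp, fn = (tp + 1, fn) if label == "neoplastic" else (tp, fn + 1)
        else:
            tn, fp = (tn + 1, fp) if label == "non-neoplastic" else (tn, fp + 1)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "rejected_fraction": rejected / len(scores),
    }


# Hypothetical signed decision scores (e.g. from a linear classifier).
scores = [2.0, 0.2, -0.1, -1.5, 1.2, -0.3]
truth = [True, False, True, False, True, False]
baseline = evaluate(scores, truth, margin=0.0)
with_rejection = evaluate(scores, truth, margin=0.5)
```

On this toy data, rejecting the three borderline samples raises both sensitivity and specificity, which mirrors the trade-off between accuracy and rejection rate reported in the abstract.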
Rodriguez-Diaz, Eladio; Castanon, David A.; Singh, Satish K.; Bigio, Irving J.
2011-01-01
Optical spectroscopy has shown potential as a real-time, in vivo, diagnostic tool for identifying neoplasia during endoscopy. We present the development of a diagnostic algorithm to classify elastic-scattering spectroscopy (ESS) spectra as either neoplastic or non-neoplastic. The algorithm is based on pattern recognition methods, including ensemble classifiers, in which members of the ensemble are trained on different regions of the ESS spectrum, and misclassification-rejection, where the algorithm identifies and refrains from classifying samples that are at higher risk of being misclassified. These “rejected” samples can be reexamined by simply repositioning the probe to obtain additional optical readings or ultimately by sending the polyp for histopathological assessment, as per standard practice. Prospective validation using separate training and testing sets result in a baseline performance of sensitivity = .83, specificity = .79, using the standard framework of feature extraction (principal component analysis) followed by classification (with linear support vector machines). With the developed algorithm, performance improves to Se ∼ 0.90, Sp ∼ 0.90, at a cost of rejecting 20–33% of the samples. These results are on par with a panel of expert pathologists. For colonoscopic prevention of colorectal cancer, our system could reduce biopsy risk and cost, obviate retrieval of non-neoplastic polyps, decrease procedure time, and improve assessment of cancer risk. PMID:21721830
Prediction of Weather Impacted Airport Capacity using Ensemble Learning
NASA Technical Reports Server (NTRS)
Wang, Yao Xun
2011-01-01
Ensemble learning with the Bagging Decision Tree (BDT) model was used to assess the impact of weather on airport capacities at selected high-demand airports in the United States. The ensemble bagging decision tree models were developed and validated using the Federal Aviation Administration (FAA) Aviation System Performance Metrics (ASPM) data and weather forecast at these airports. The study examines the performance of BDT, along with traditional single Support Vector Machines (SVM), for airport runway configuration selection and airport arrival rates (AAR) prediction during weather impacts. Testing of these models was accomplished using observed weather, weather forecast, and airport operation information at the chosen airports. The experimental results show that ensemble methods are more accurate than a single SVM classifier. The airport capacity ensemble method presented here can be used as a decision support model that supports air traffic flow management to meet the weather impacted airport capacity in order to reduce costs and increase safety.
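Bagging of decision trees, as used above, trains each tree on a bootstrap resample of the data and combines the trees by voting. The sketch below uses one-split trees (decision stumps) to stay short, which is a simplification and not the study's model; the single feature and the capacity labels are invented for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority label
        lab = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), lab, lab)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def fit_bagging(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return stumps


def predict_bagging(stumps, row):
    """Majority vote over all stumps."""
    return Counter(predict_stump(s, row) for s in stumps).most_common(1)[0][0]


# Hypothetical data: one weather feature (visibility) and a capacity class.
rows = [[1], [2], [3], [4], [7], [8], [9], [10]]
labels = ["low"] * 4 + ["high"] * 4
model = fit_bagging(rows, labels)
```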
Ali, Safdar; Majid, Abdul; Khan, Asifullah
2014-04-01
Development of an accurate and reliable intelligent decision-making method for the construction of cancer diagnosis system is one of the fast growing research areas of health sciences. Such decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein foldings, interactions, structures, and sequence-order effects. Especially, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine amino acids offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. 
Our analysis demonstrates that ensemble-RF, ensemble-SVM and ensemble-KNN are more effective than their individual counterparts. The proposed 'IDM-PhyChm-Ens' method has shown improved performance compared to existing techniques.
Park, Jae Hong; Kim, Chang-Eop; Shin, Jaewoo; Im, Changkyun; Koh, Chin Su; Seo, In Seok; Kim, Sang Jeong; Shin, Hyung-Cheul
2013-10-01
Chronic monitoring of the state of the bladder can be used to notify patients with urinary dysfunction when the bladder should be voided. Given that many spinal neurons respond both to somatic and visceral inputs, it is necessary to extract bladder information selectively from the spinal cord. Here, we hypothesize that sensory information with distinct modalities should be represented by the distinct ensemble activity patterns within the neuronal population and, therefore, analyzing the activity patterns of the neuronal population could distinguish bladder fullness from somatic stimuli. We simultaneously recorded 26-27 single unit activities in response to bladder distension or tactile stimuli in the dorsal spinal cord of each Sprague-Dawley rat. In order to discriminate between bladder fullness and tactile stimulus inputs, we analyzed the ensemble activity patterns of the entire neuronal population. A support vector machine (SVM) was employed as a classifier, and discrimination performance was measured by k-fold cross-validation tests. Most of the units responding to bladder fullness also responded to the tactile stimuli (88.9-100%). The SVM classifier precisely distinguished the bladder fullness from the somatic input (100%), indicating that the ensemble activity patterns of the unit population in the spinal cord are distinct enough to identify the current input modality. Moreover, our ensemble activity pattern-based classifier showed high robustness against random losses of signals. This study is the first to demonstrate that the two main issues of electroneurographic monitoring of bladder fullness, low signals and selectiveness, can be solved by an ensemble activity pattern-based approach, improving the feasibility of chronic monitoring of bladder fullness by neural recording.
Ali, Safdar; Majid, Abdul
2015-04-01
The diagnosis of human breast cancer is an intricate process, and specific indicators may produce negative results. In order to avoid misleading results, an accurate and reliable diagnostic system for breast cancer is indispensable. Recently, several interesting machine-learning (ML) approaches have been proposed for the prediction of breast cancer. To this end, we developed a novel classifier-stacking-based evolutionary ensemble system, "Can-Evo-Ens", for predicting amino acid sequences associated with breast cancer. In this paper, we first selected four diverse types of ML algorithms, Naïve Bayes, K-Nearest Neighbor, Support Vector Machines, and Random Forest, as base-level classifiers. These classifiers are trained individually in different feature spaces using physicochemical properties of amino acids. In order to exploit the decision spaces, the preliminary predictions of the base-level classifiers are stacked. Genetic programming (GP) is then employed to develop a meta-classifier that optimally combines the predictions of the base classifiers. The most suitable threshold value of the best-evolved predictor is computed using the Particle Swarm Optimization technique. Our experiments have demonstrated the robustness of the Can-Evo-Ens system on an independent validation dataset. The proposed system achieved the highest Area Under the ROC Curve (AUC) value of 99.95% for cancer prediction. The comparative results revealed that the proposed approach is better than individual ML approaches and the conventional ensemble approaches AdaBoostM1, Bagging, GentleBoost, and Random Subspace. It is expected that the proposed novel system would have a major impact on the fields of Biomedical, Genomics, Proteomics, Bioinformatics, and Drug Development. Copyright © 2015 Elsevier Inc. All rights reserved.
Hussain, Lal
2018-06-01
Epilepsy is a neurological disorder caused by abnormal excitability of neurons in the brain. Brain activity of patients suffering from seizures is monitored through the electroencephalogram (EEG) to detect epileptic seizures. The performance of EEG-based epilepsy detection depends on feature extraction strategies. In this research, we extracted features using various strategies based on time- and frequency-domain characteristics, nonlinear measures, wavelet-based entropy, and a few statistical features. A deeper study was undertaken using novel machine learning classifiers by considering multiple factors. The support vector machine kernels are evaluated based on multiclass kernel and box constraint level. Likewise, for K-nearest neighbors (KNN), we evaluated different distance metrics, neighbor weights, and numbers of neighbors. Similarly, for decision trees we tuned the parameters based on maximum splits and split criteria, and ensemble classifiers are evaluated based on different ensemble methods and learning rates. For training/testing, tenfold cross-validation was employed, and performance was evaluated in terms of TPR, NPR, PPV, accuracy and AUC. In this research, a deeper analysis was performed using diverse feature extraction strategies and robust machine learning classifiers with more advanced tuning options. Support Vector Machine with a linear kernel and KNN with the city block distance metric give the overall highest accuracy of 99.5%, which was higher than that obtained using the default parameters for these classifiers. Moreover, the highest separation (AUC = 0.9991, 0.9990) was obtained at different kernel scales using SVM. Additionally, K-nearest neighbors with inverse squared distance weighting gives higher performance at different numbers of neighbors. Moreover, in distinguishing postictal heart rate oscillations from epileptic ictal subjects, the highest performance of 100% was obtained using different machine learning classifiers.
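Several of the reported measures follow directly from the confusion counts of a binary classifier. The sketch below uses the standard definitions of TPR (sensitivity), PPV (precision), and accuracy; the abstract does not define "NPR", so it is left out rather than guessed. The function name and the toy labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute TPR, PPV and accuracy from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "TPR": ratio(tp, tp + fn),            # true positive rate
        "PPV": ratio(tp, tp + fp),            # positive predictive value
        "accuracy": ratio(tp + tn, len(pairs)),
    }

# Toy labels: 1 = seizure segment, 0 = non-seizure segment (hypothetical data).
metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```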
Exploiting ensemble learning for automatic cataract detection and grading.
Yang, Ji-Jiang; Li, Jianqiang; Shen, Ruifang; Zeng, Yang; He, Jian; Bi, Jing; Li, Yong; Zhang, Qinyan; Peng, Lihui; Wang, Qing
2016-02-01
Cataract is defined as a lenticular opacity presenting usually with poor visual acuity. It is one of the most common causes of visual impairment worldwide. Early diagnosis demands the expertise of trained healthcare professionals, which may present a barrier to early intervention due to underlying costs. To date, studies reported in the literature utilize a single learning model for retinal image classification in grading cataract severity. We present an ensemble learning based approach as a means to improving diagnostic accuracy. Three independent feature sets, i.e., wavelet-, sketch-, and texture-based features, are extracted from each fundus image. For each feature set, two base learning models, i.e., Support Vector Machine and Back Propagation Neural Network, are built. Then, the ensemble methods, majority voting and stacking, are investigated to combine the multiple base learning models for final fundus image classification. Empirical experiments are conducted for cataract detection (two-class task, i.e., cataract or non-cataractous) and cataract grading (four-class task, i.e., non-cataractous, mild, moderate or severe) tasks. The best performance of the ensemble classifier is 93.2% and 84.5% in terms of the correct classification rates for cataract detection and grading tasks, respectively. The results demonstrate that the ensemble classifier outperforms the single learning model significantly, which also illustrates the effectiveness of the proposed approach. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
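Majority voting, one of the two combination methods named in the abstract, can be sketched in a few lines. This is an illustrative version only: the function name, the tie-breaking behaviour, and the example grades are assumptions, and the base models themselves are not reproduced.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per sample.

    `predictions` holds one list of labels per base model, all of equal
    length. Ties go to the label encountered first among the models.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Three hypothetical base models grading three fundus images.
model_outputs = [
    ["mild", "severe", "non-cataractous"],
    ["mild", "moderate", "non-cataractous"],
    ["severe", "moderate", "non-cataractous"],
]
final_grades = majority_vote(model_outputs)
```

Stacking differs in that a second-level learner is trained on the base models' outputs instead of counting votes.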
Improved method for predicting protein fold patterns with ensemble classifiers.
Chen, W; Liu, X; Huang, Y; Jiang, Y; Zou, Q; Lin, C
2012-01-27
Protein folding is recognized as a critical problem in the field of biophysics in the 21st century. Predicting protein-folding patterns is challenging due to the complex structure of proteins. In an attempt to solve this problem, we employed ensemble classifiers to improve prediction accuracy. In our experiments, 188-dimensional features were extracted based on the composition and physical-chemical property of proteins and 20-dimensional features were selected using a coupled position-specific scoring matrix. Compared with traditional prediction methods, these methods were superior in terms of prediction accuracy. The 188-dimensional feature-based method achieved 71.2% accuracy in five cross-validations. The accuracy rose to 77% when we used a 20-dimensional feature vector. These methods were used on recent data, with 54.2% accuracy. Source codes and dataset, together with web server and software tools for prediction, are available at: http://datamining.xmu.edu.cn/main/~cwc/ProteinPredict.html.
NASA Astrophysics Data System (ADS)
Li, Hui; Hong, Lu-Yao; Zhou, Qing; Yu, Hai-Jie
2015-08-01
The business failure of numerous companies results in financial crises. The high social costs associated with such crises have led people to search for effective tools for business risk prediction, among which the support vector machine is very effective. Several modelling means, including single-technique modelling, hybrid modelling, and ensemble modelling, have been suggested for forecasting business risk with support vector machines. However, existing literature seldom focuses on the general modelling frame for business risk prediction, and seldom investigates performance differences among different modelling means. We reviewed research on forecasting business risk with support vector machines, proposed the general assisted prediction modelling frame with hybridisation and ensemble (APMF-WHAE), and finally investigated the use of principal components analysis, support vector machine, random sampling, and group decision under the general frame in forecasting business risk. Under the APMF-WHAE frame with support vector machine as the base predictive model, four specific predictive models were produced, namely, a pure support vector machine, a hybrid support vector machine involving principal components analysis, a support vector machine ensemble involving random sampling and group decision, and an ensemble of hybrid support vector machines using group decision to integrate various hybrid support vector machines on variables produced from principal components analysis and samples from random sampling. The experimental results indicate that the hybrid support vector machine and the ensemble of hybrid support vector machines outperformed the pure support vector machine and the support vector machine ensemble.
A comparison of breeding and ensemble transform vectors for global ensemble generation
NASA Astrophysics Data System (ADS)
Deng, Guo; Tian, Hua; Li, Xiaoli; Chen, Jing; Gong, Jiandong; Jiao, Meiyan
2012-02-01
To compare the initial perturbation techniques using breeding vectors and ensemble transform vectors, three ensemble prediction systems using both initial perturbation methods but with different ensemble member sizes based on the spectral model T213/L31 are constructed at the National Meteorological Center, China Meteorological Administration (NMC/CMA). A series of ensemble verification scores such as forecast skill of the ensemble mean, ensemble resolution, and ensemble reliability are introduced to identify the most important attributes of ensemble forecast systems. The results indicate that the ensemble transform technique is superior to the breeding vector method in light of the evaluation of anomaly correlation coefficient (ACC), which is a deterministic character of the ensemble mean, the root-mean-square error (RMSE) and spread, which are probabilistic attributes, and the continuous ranked probability score (CRPS) and its decomposition. The advantage of the ensemble transform approach is attributed to its orthogonality among ensemble perturbations as well as its consistency with the data assimilation system. Therefore, this study may serve as a reference for configuration of the best ensemble prediction system to be used in operation.
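Two of the verification scores mentioned above, the RMSE of the ensemble mean and the ensemble spread, can be illustrated with a short sketch. The definitions used here are common ones (spread as the square root of the average ensemble variance, with an n-1 denominator) and may differ in detail from those used in the study; the function name and the toy values are assumptions.

```python
import math

def ensemble_rmse_and_spread(members, observations):
    """Return (RMSE of the ensemble mean, ensemble spread).

    `members` is a list of forecasts, one list of values per ensemble
    member; `observations` is the verifying list of the same length.
    """
    n_members = len(members)
    n_points = len(observations)
    means = [sum(m[i] for m in members) / n_members for i in range(n_points)]
    rmse = math.sqrt(
        sum((means[i] - observations[i]) ** 2 for i in range(n_points)) / n_points
    )
    # Sample variance of the members about the ensemble mean at each point.
    variances = [
        sum((m[i] - means[i]) ** 2 for m in members) / (n_members - 1)
        for i in range(n_points)
    ]
    spread = math.sqrt(sum(variances) / n_points)
    return rmse, spread

# Two hypothetical members verified at two grid points.
rmse, spread = ensemble_rmse_and_spread([[1.0, 2.0], [3.0, 4.0]], [2.0, 4.0])
```

In a statistically reliable ensemble the spread is expected to be close to the RMSE of the ensemble mean, which is why the two are usually reported together.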
Wang, Xiao; Zhang, Jun; Li, Guo-Zheng
2015-01-01
Predicting bacterial protein subcellular locations using computational methods has become an important and challenging task. Although many prediction methods exist for bacterial proteins, the majority of them can only deal with single-location proteins. Unfortunately, many proteins in bacterial cells are multi-location proteins. Moreover, multi-location proteins have special biological functions capable of helping the development of new drugs. It is therefore necessary to develop new computational methods for accurately predicting subcellular locations of multi-location bacterial proteins. In this article, two efficient multi-label predictors, Gpos-ECC-mPLoc and Gneg-ECC-mPLoc, are developed to predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively. The two multi-label predictors construct the GO vectors by using the GO terms of homologous proteins of query proteins and then adopt a powerful multi-label ensemble classifier to make the final multi-label prediction. The two multi-label predictors have the following advantages: (1) they improve the prediction performance of multi-label proteins by taking the correlations among different labels into account; (2) they ensemble multiple CC classifiers and further generate better prediction results by ensemble learning; and (3) they construct the GO vectors by using the frequency of occurrences of GO terms in the typical homologous set instead of using 0/1 values. Experimental results show that Gpos-ECC-mPLoc and Gneg-ECC-mPLoc can efficiently predict the subcellular locations of multi-label gram-positive and gram-negative bacterial proteins respectively. Gpos-ECC-mPLoc and Gneg-ECC-mPLoc can efficiently improve the prediction accuracy of subcellular localization of multi-location gram-positive and gram-negative bacterial proteins respectively.
The online web servers for Gpos-ECC-mPLoc and Gneg-ECC-mPLoc predictors are freely accessible at http://biomed.zzuli.edu.cn/bioinfo/gpos-ecc-mploc/ and http://biomed.zzuli.edu.cn/bioinfo/gneg-ecc-mploc/ respectively.
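The difference between frequency-based and 0/1 GO vectors, listed as advantage (3) above, can be shown with a small sketch. It counts in how many homologous proteins each term occurs; whether the predictors use raw counts or a normalized frequency is not stated in the abstract, so raw counts are an assumption, as are the function name and the placeholder term names.

```python
from collections import Counter

def go_vectors(homolog_terms, vocabulary):
    """Return (frequency_vector, binary_vector) over a fixed term vocabulary.

    `homolog_terms` holds one list of GO terms per homologous protein.
    """
    # Count each term at most once per homolog.
    counts = Counter(term for terms in homolog_terms for term in set(terms))
    frequency = [counts[term] for term in vocabulary]
    binary = [1 if c > 0 else 0 for c in frequency]
    return frequency, binary

# Placeholder term names stand in for real GO identifiers.
vocabulary = ["term_a", "term_b", "term_c"]
homologs = [["term_a", "term_b"], ["term_a"], ["term_a", "term_b"]]
frequency, binary = go_vectors(homologs, vocabulary)
```

The binary vector cannot distinguish a term found in every homolog from one found in a single homolog, while the frequency vector keeps that information.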
Biomedical named entity extraction: some issues of corpus compatibilities.
Ekbal, Asif; Saha, Sriparna; Sikdar, Utpal Kumar
2013-01-01
Named Entity (NE) extraction is one of the most fundamental and important tasks in biomedical information extraction. It involves identification of certain entities from text and their classification into some predefined categories. In the biomedical community, there is as yet no general consensus regarding named entity (NE) annotation; thus, it is very difficult to compare the existing systems due to corpus incompatibilities. Because of this problem, we also cannot exploit the advantages of using different corpora together. In our present work we address the issues of corpus compatibilities, and use a single objective optimization (SOO) based classifier ensemble technique that uses the search capability of genetic algorithm (GA) for NE extraction in biomedicine. We hypothesize that the reliability of predictions of each classifier differs among the various output classes. We use Conditional Random Field (CRF) and Support Vector Machine (SVM) frameworks to build a number of models depending upon the various representations of the set of features and/or feature templates. It is to be noted that we tried to extract the features without using any deep domain knowledge and/or resources. In order to assess the challenges of corpus compatibilities, we experiment with the different benchmark datasets and their various combinations. Comparison results with the existing approaches prove the efficacy of the used technique. The GA-based ensemble achieves around 2% performance improvement over the individual classifiers. Degradation in performance on the integrated corpus clearly shows the difficulties of the task. In summary, our ensemble-based approach attains state-of-the-art performance levels for entity extraction in three different kinds of biomedical datasets. The possible reasons behind the better performance of our approach are (i) the use of varied and rich features, as described in Subsection "Features for named entity extraction"; and (ii) the use of a GA-based classifier ensemble technique to combine the outputs of multiple classifiers.
An ensemble of dissimilarity based classifiers for Mackerel gender determination
NASA Astrophysics Data System (ADS)
Blanco, A.; Rodriguez, R.; Martinez-Maranon, I.
2014-03-01
Mackerel is an undervalued fish captured by European fishing vessels. One way to add value to this species is to classify it by sex. Colour measurements were performed on extracted gonads of female and male Mackerel (fresh and thawed) to obtain differences between the sexes. Several linear and nonlinear classifiers such as Support Vector Machines (SVM), k Nearest Neighbors (k-NN) or Diagonal Linear Discriminant Analysis (DLDA) can be applied to this problem. However, they are usually based on Euclidean distances that fail to reflect the sample proximities accurately. Classifiers based on non-Euclidean dissimilarities misclassify a different set of patterns. We combine different kinds of dissimilarity-based classifiers. The diversity is induced by considering a set of complementary dissimilarities for each model. The experimental results suggest that our algorithm helps to improve classifiers based on a single dissimilarity.
Fault Detection of Bearing Systems through EEMD and Optimization Algorithm
Lee, Dong-Han; Ahn, Jong-Hyo; Koh, Bong-Hwan
2017-01-01
This study proposes a fault detection and diagnosis method for bearing systems using ensemble empirical mode decomposition (EEMD) based feature extraction, in conjunction with particle swarm optimization (PSO), principal component analysis (PCA), and Isomap. First, a mathematical model is assumed to generate vibration signals from damaged bearing components, such as the inner-race, outer-race, and rolling elements. The process of decomposing vibration signals into intrinsic mode functions (IMFs) and extracting statistical features is introduced to develop a damage-sensitive parameter vector. Finally, PCA and Isomap algorithm are used to classify and visualize this parameter vector, to separate damage characteristics from healthy bearing components. Moreover, the PSO-based optimization algorithm improves the classification performance by selecting proper weightings for the parameter vector, to maximize the visualization effect of separating and grouping of parameter vectors in three-dimensional space. PMID:29143772
Active relearning for robust supervised classification of pulmonary emphysema
NASA Astrophysics Data System (ADS)
Raghunath, Sushravya; Rajagopalan, Srinivasan; Karwoski, Ronald A.; Bartholmai, Brian J.; Robb, Richard A.
2012-03-01
Radiologists are adept at recognizing the appearance of lung parenchymal abnormalities in CT scans. However, the inconsistent differential diagnosis, due to subjective aggregation, mandates supervised classification. Towards optimizing Emphysema classification, we introduce a physician-in-the-loop feedback approach in order to minimize uncertainty in the selected training samples. Using multi-view inductive learning with the training samples, an ensemble of Support Vector Machine (SVM) models, each based on a specific pair-wise dissimilarity metric, was constructed in less than six seconds. In the active relearning phase, the ensemble-expert label conflicts were resolved by an expert. This just-in-time feedback with unoptimized SVMs yielded 15% increase in classification accuracy and 25% reduction in the number of support vectors. The generality of relearning was assessed in the optimized parameter space of six different classifiers across seven dissimilarity metrics. The resultant average accuracy improved to 21%. The co-operative feedback method proposed here could enhance both diagnostic and staging throughput efficiency in chest radiology practice.
ANALYSIS OF SAMPLING TECHNIQUES FOR IMBALANCED DATA: AN N=648 ADNI STUDY
Dubey, Rashmi; Zhou, Jiayu; Wang, Yalin; Thompson, Paul M.; Ye, Jieping
2013-01-01
Many neuroimaging applications deal with imbalanced imaging data. For example, in Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, the mild cognitive impairment (MCI) cases eligible for the study are nearly two times the Alzheimer’s disease (AD) patients for structural magnetic resonance imaging (MRI) modality and six times the control cases for proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task. Traditional classifiers that aim to maximize the overall prediction accuracy tend to classify all data into the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and a combination of over and undersampling approaches. We thoroughly examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two different classifiers including Random Forest and Support Vector Machines based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity measures. Our extensive experimental results show that for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids technique based undersampling gives the best overall performance among different data sampling techniques and no sampling approach; and (2) sparse logistic regression with stability selection achieves competitive performance among various feature selection algorithms. Comprehensive experiments with various settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results. PMID:24176869
Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study.
Dubey, Rashmi; Zhou, Jiayu; Wang, Yalin; Thompson, Paul M; Ye, Jieping
2014-02-15
Many neuroimaging applications deal with imbalanced imaging data. For example, in Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the mild cognitive impairment (MCI) cases eligible for the study are nearly two times the Alzheimer's disease (AD) patients for structural magnetic resonance imaging (MRI) modality and six times the control cases for proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task. Traditional classifiers that aim to maximize the overall prediction accuracy tend to classify all data into the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and a combination of over and undersampling approaches. We thoroughly examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two different classifiers including Random Forest and Support Vector Machines based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity measures. Our extensive experimental results show that for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids technique based undersampling gives the best overall performance among different data sampling techniques and no sampling approach; and (2) sparse logistic regression with stability selection achieves competitive performance among various feature selection algorithms. Comprehensive experiments with various settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results. © 2013 Elsevier Inc. All rights reserved.
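The idea of building an ensemble from multiple undersampled datasets can be sketched as below. Note the substitution: the study reports that K-Medoids based undersampling performed best, whereas this sketch uses plain random undersampling because it needs only the standard library. The function name, seed, and class counts are assumptions.

```python
import random

def balanced_undersamples(labels, n_sets, seed=0):
    """Return `n_sets` index lists, each with equally many samples per class.

    Every class is randomly undersampled to the size of the smallest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    n_min = min(len(indices) for indices in by_class.values())

    subsets = []
    for _ in range(n_sets):
        chosen = []
        for indices in by_class.values():
            chosen.extend(rng.sample(indices, n_min))
        subsets.append(sorted(chosen))
    return subsets

# Hypothetical 2:1 imbalance between MCI and AD cases.
labels = ["MCI"] * 6 + ["AD"] * 3
subsets = balanced_undersamples(labels, n_sets=4)
```

One classifier would then be trained per subset and the predictions combined, so that no single random draw of the majority class dominates the result.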
Quantum ensembles of quantum classifiers.
Schuld, Maria; Petruccione, Francesco
2018-02-09
Quantum machine learning witnesses an increasing amount of quantum algorithms for data-driven decision making, a problem with potential applications ranging from automated image recognition to medical diagnosis. Many of those algorithms are implementations of quantum classifiers, or models for the classification of data inputs with a quantum computer. Following the success of collective decision making with ensembles in classical machine learning, this paper introduces the concept of quantum ensembles of quantum classifiers. Creating the ensemble corresponds to a state preparation routine, after which the quantum classifiers are evaluated in parallel and their combined decision is accessed by a single-qubit measurement. This framework naturally allows for exponentially large ensembles in which - similar to Bayesian learning - the individual classifiers do not have to be trained. As an example, we analyse an exponentially large quantum ensemble in which each classifier is weighed according to its performance in classifying the training data, leading to new results for quantum as well as classical machine learning.
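A purely classical analogue of the accuracy-weighted ensemble described above can be written in a few lines. This sketch only illustrates the weighting idea; it says nothing about the quantum state preparation or measurement, and the function names, the +1/-1 label convention, and the toy classifiers are all assumptions.

```python
def accuracy_weighted_vote(classifiers, train_x, train_y, x):
    """Classify `x` with untrained classifiers weighted by training accuracy.

    Each classifier is a function returning +1 or -1.
    """
    total = 0.0
    for classify in classifiers:
        correct = sum(1 for xi, yi in zip(train_x, train_y) if classify(xi) == yi)
        accuracy = correct / len(train_y)
        total += accuracy * classify(x)
    return 1 if total >= 0 else -1

def sign_at_zero(x):
    return 1 if x > 0 else -1

def always_negative(x):
    return -1

def sign_at_two(x):
    return 1 if x > 2 else -1

# Hypothetical one-dimensional training data.
train_x = [-1, 1, 3]
train_y = [-1, 1, 1]
models = [sign_at_zero, always_negative, sign_at_two]
high = accuracy_weighted_vote(models, train_x, train_y, 2.5)
low = accuracy_weighted_vote(models, train_x, train_y, -2)
```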
2012-01-01
Background: Biomarker panels derived separately from genomic and proteomic data and with a variety of computational methods have demonstrated promising classification performance in various diseases. An open question is how to create effective proteo-genomic panels. The framework of ensemble classifiers has been applied successfully in various analytical domains to combine classifiers so that the performance of the ensemble exceeds the performance of individual classifiers. Using blood-based diagnosis of acute renal allograft rejection as a case study, we address the following question in this paper: Can acute rejection classification performance be improved by combining individual genomic and proteomic classifiers in an ensemble? Results: The first part of the paper presents a computational biomarker development pipeline for genomic and proteomic data. The pipeline begins with data acquisition (e.g., from bio-samples to microarray data), quality control, statistical analysis and mining of the data, and finally various forms of validation. The pipeline ensures that the various classifiers to be combined later in an ensemble are diverse and adequate for clinical use. Five mRNA genomic and five proteomic classifiers were developed independently using single time-point blood samples from 11 acute-rejection and 22 non-rejection renal transplant patients. The second part of the paper examines five ensembles ranging in size from two to 10 individual classifiers. Performance of ensembles is characterized by area under the curve (AUC), sensitivity, and specificity, as derived from the probability of acute rejection for individual classifiers in the ensemble in combination with one of two aggregation methods: (1) Average Probability or (2) Vote Threshold.
One ensemble demonstrated superior performance and was able to improve sensitivity and AUC beyond the best values observed for any of the individual classifiers in the ensemble, while staying within the range of observed specificity. The Vote Threshold aggregation method achieved improved sensitivity for all 5 ensembles, but typically at the cost of decreased specificity. Conclusion: Proteo-genomic biomarker ensemble classifiers show promise in the diagnosis of acute renal allograft rejection and can improve classification performance beyond that of individual genomic or proteomic classifiers alone. Validation of our results in an international multicenter study is currently underway. PMID:23216969
Günther, Oliver P; Chen, Virginia; Freue, Gabriela Cohen; Balshaw, Robert F; Tebbutt, Scott J; Hollander, Zsuzsanna; Takhar, Mandeep; McMaster, W Robert; McManus, Bruce M; Keown, Paul A; Ng, Raymond T
2012-12-08
Biomarker panels derived separately from genomic and proteomic data and with a variety of computational methods have demonstrated promising classification performance in various diseases. An open question is how to create effective proteo-genomic panels. The framework of ensemble classifiers has been applied successfully in various analytical domains to combine classifiers so that the performance of the ensemble exceeds the performance of individual classifiers. Using blood-based diagnosis of acute renal allograft rejection as a case study, we address the following question in this paper: Can acute rejection classification performance be improved by combining individual genomic and proteomic classifiers in an ensemble? The first part of the paper presents a computational biomarker development pipeline for genomic and proteomic data. The pipeline begins with data acquisition (e.g., from bio-samples to microarray data), quality control, statistical analysis and mining of the data, and finally various forms of validation. The pipeline ensures that the various classifiers to be combined later in an ensemble are diverse and adequate for clinical use. Five mRNA genomic and five proteomic classifiers were developed independently using single time-point blood samples from 11 acute-rejection and 22 non-rejection renal transplant patients. The second part of the paper examines five ensembles ranging in size from two to 10 individual classifiers. Performance of ensembles is characterized by area under the curve (AUC), sensitivity, and specificity, as derived from the probability of acute rejection for individual classifiers in the ensemble in combination with one of two aggregation methods: (1) Average Probability or (2) Vote Threshold. One ensemble demonstrated superior performance and was able to improve sensitivity and AUC beyond the best values observed for any of the individual classifiers in the ensemble, while staying within the range of observed specificity. 
The Vote Threshold aggregation method achieved improved sensitivity for all 5 ensembles, but typically at the cost of decreased specificity. Proteo-genomic biomarker ensemble classifiers show promise in the diagnosis of acute renal allograft rejection and can improve classification performance beyond that of individual genomic or proteomic classifiers alone. Validation of our results in an international multicenter study is currently underway.
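The two aggregation methods named in the abstract can be sketched as follows. The abstract does not give the cutoffs or the number of votes required, so the `cutoff` and `min_votes` parameters, like the function names and the example probabilities, are assumptions made for illustration.

```python
def average_probability(probabilities, cutoff=0.5):
    """Call rejection when the mean classifier probability reaches `cutoff`."""
    mean = sum(probabilities) / len(probabilities)
    return mean, mean >= cutoff

def vote_threshold(probabilities, min_votes, cutoff=0.5):
    """Call rejection when at least `min_votes` classifiers exceed `cutoff`."""
    votes = sum(1 for p in probabilities if p >= cutoff)
    return votes, votes >= min_votes

# Hypothetical acute-rejection probabilities from three classifiers.
probs = [0.9, 0.4, 0.3]
mean, by_average = average_probability(probs)
votes, by_one_vote = vote_threshold(probs, min_votes=1)
_, by_two_votes = vote_threshold(probs, min_votes=2)
```

A low vote requirement flags a patient as soon as any one classifier is confident, which is one way the reported trade-off between higher sensitivity and lower specificity can arise.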
Overlapped Partitioning for Ensemble Classifiers of P300-Based Brain-Computer Interfaces
Onishi, Akinari; Natsume, Kiyohisa
2014-01-01
A P300-based brain-computer interface (BCI) enables a wide range of people to control devices that improve their quality of life. Ensemble classifiers with naive partitioning were recently applied to the P300-based BCI and these classification performances were assessed. However, they were usually trained on a large amount of training data (e.g., 15300). In this study, we evaluated ensemble linear discriminant analysis (LDA) classifiers with a newly proposed overlapped partitioning method using 900 training data. In addition, the classification performances of the ensemble classifier with naive partitioning and a single LDA classifier were compared. One of three conditions for dimension reduction was applied: the stepwise method, principal component analysis (PCA), or none. The results show that an ensemble stepwise LDA (SWLDA) classifier with overlapped partitioning achieved a better performance than the commonly used single SWLDA classifier and an ensemble SWLDA classifier with naive partitioning. This result implies that the performance of the SWLDA is improved by overlapped partitioning and the ensemble classifier with overlapped partitioning requires less training data than that with naive partitioning. This study contributes towards reducing the required amount of training data and achieving better classification performance. PMID:24695550
Overlapped partitioning for ensemble classifiers of P300-based brain-computer interfaces.
Onishi, Akinari; Natsume, Kiyohisa
2014-01-01
A P300-based brain-computer interface (BCI) enables a wide range of people to control devices that improve their quality of life. Ensemble classifiers with naive partitioning were recently applied to the P300-based BCI and these classification performances were assessed. However, they were usually trained on a large amount of training data (e.g., 15300). In this study, we evaluated ensemble linear discriminant analysis (LDA) classifiers with a newly proposed overlapped partitioning method using 900 training data. In addition, the classification performances of the ensemble classifier with naive partitioning and a single LDA classifier were compared. One of three conditions for dimension reduction was applied: the stepwise method, principal component analysis (PCA), or none. The results show that an ensemble stepwise LDA (SWLDA) classifier with overlapped partitioning achieved a better performance than the commonly used single SWLDA classifier and an ensemble SWLDA classifier with naive partitioning. This result implies that the performance of the SWLDA is improved by overlapped partitioning and the ensemble classifier with overlapped partitioning requires less training data than that with naive partitioning. This study contributes towards reducing the required amount of training data and achieving better classification performance.
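The contrast between naive and overlapped partitioning can be illustrated with index sets. The abstract does not specify how the overlap is constructed, so the scheme below (each block extended by a fixed number of samples into the next block, wrapping around at the end) is only one possible interpretation; the function names and sizes are assumptions.

```python
def naive_partitions(n_samples, k):
    """Split indices 0..n_samples-1 into k disjoint, contiguous blocks."""
    size = n_samples // k
    return [list(range(i * size, (i + 1) * size)) for i in range(k)]

def overlapped_partitions(n_samples, k, overlap):
    """Like naive partitioning, but each block also borrows `overlap`
    samples from the start of the following block (wrapping around)."""
    size = n_samples // k
    return [
        [(i * size + j) % n_samples for j in range(size + overlap)]
        for i in range(k)
    ]

# 900 training samples, three ensemble members, 100 shared samples.
naive = naive_partitions(900, 3)
overlapped = overlapped_partitions(900, 3, overlap=100)
```

With overlap, every ensemble member sees more training data than under naive partitioning of the same dataset, which matters most when the total amount of data is small.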
NASA Astrophysics Data System (ADS)
Löw, Fabian; Schorcht, Gunther; Michel, Ulrich; Dech, Stefan; Conrad, Christopher
2012-10-01
Accurate crop identification and crop area estimation are important for studies on irrigated agricultural systems, yield and water demand modeling, and agrarian policy development. In this study a novel combination of Random Forest (RF) and Support Vector Machine (SVM) classifiers is presented that (i) enhances crop classification accuracy and (ii) provides spatial information on map uncertainty. The methodology was implemented over four distinct irrigated sites in Middle Asia using RapidEye time series data. The RF feature importance statistics were used as a feature-selection strategy for the SVM to assess possible negative effects on classification accuracy caused by an oversized feature space. The results of the individual RF and SVM classifications were combined with rules based on posterior classification probability and estimates of classification probability entropy. SVM classification performance was increased by feature selection through RF. Further experimental results indicate that the hybrid classifier improves overall classification accuracy in comparison to the single classifiers, as well as user's and producer's accuracy.
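The entropy of a classifier's class-probability vector is a simple per-pixel uncertainty measure. The sketch below computes it and applies one plausible combination rule, taking the label from whichever classifier is less uncertain. The actual rules used in the study are not given in the abstract, so this rule, the function names, and the example probabilities are assumptions.

```python
import math

def probability_entropy(probs):
    """Shannon entropy (in bits) of one class-probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pick_less_uncertain(rf_probs, svm_probs):
    """Return (source, class_index) from the classifier with lower entropy."""
    if probability_entropy(rf_probs) <= probability_entropy(svm_probs):
        source, probs = "RF", rf_probs
    else:
        source, probs = "SVM", svm_probs
    return source, probs.index(max(probs))

# Hypothetical posterior probabilities for one pixel and two crop classes.
choice = pick_less_uncertain([0.6, 0.4], [0.1, 0.9])
```

Mapping the entropy value itself for every pixel gives the kind of spatial uncertainty information the abstract refers to.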
Cerruela García, G; García-Pedrajas, N; Luque Ruiz, I; Gómez-Nieto, M Á
2018-03-01
This paper proposes a method for molecular activity prediction in QSAR studies using ensembles of classifiers constructed by means of two supervised subspace projection methods, namely nonparametric discriminant analysis (NDA) and hybrid discriminant analysis (HDA). We studied the performance of the proposed ensembles compared to classical ensemble methods using four molecular datasets and eight different models for the representation of the molecular structure. Using several measures and statistical tests for classifier comparison, we observe that our proposal improves the classification results with respect to classical ensemble methods. Therefore, we show that ensembles constructed using supervised subspace projections offer an effective way of creating classifiers in cheminformatics.
Hahn, Micah; Jarnevich, Catherine S.; Monaghan, Andrew J.; Eisen, Rebecca J.
2016-01-01
In addition to serving as vectors of several other human pathogens, the black-legged tick, Ixodes scapularis Say, and western black-legged tick, Ixodes pacificus Cooley and Kohls, are the primary vectors of the spirochete (Borrelia burgdorferi ) that causes Lyme disease, the most common vector-borne disease in the United States. Over the past two decades, the geographic range of I. pacificus has changed modestly while, in contrast, the I. scapularis range has expanded substantially, which likely contributes to the concurrent expansion in the distribution of human Lyme disease cases in the Northeastern, North-Central and Mid-Atlantic states. Identifying counties that contain suitable habitat for these ticks that have not yet reported established vector populations can aid in targeting limited vector surveillance resources to areas where tick invasion and potential human risk are likely to occur. We used county-level vector distribution information and ensemble modeling to map the potential distribution of I. scapularis and I. pacificus in the contiguous United States as a function of climate, elevation, and forest cover. Results show that I. pacificus is currently present within much of the range classified by our model as suitable for establishment. In contrast, environmental conditions are suitable for I. scapularis to continue expanding its range into northwestern Minnesota, central and northern Michigan, within the Ohio River Valley, and inland from the southeastern and Gulf coasts. Overall, our ensemble models show suitable habitat for I. scapularis in 441 eastern counties and for I. pacificus in 11 western counties where surveillance records have not yet supported classification of the counties as established.
Image Change Detection via Ensemble Learning
DOE Office of Scientific and Technical Information (OSTI.GOV)
Martin, Benjamin W; Vatsavai, Raju
2013-01-01
The concept of geographic change detection is relevant in many areas. Changes in geography can reveal much information about a particular location. For example, analysis of changes in geography can identify regions of population growth, change in land use, and potential environmental disturbance. A common way to perform change detection is to use a simple method such as differencing to detect regions of change. Though these techniques are simple, often the application of these techniques is very limited. Recently, use of machine learning methods such as neural networks for change detection has been explored with great success. In this work, we explore the use of ensemble learning methodologies for detecting changes in bitemporal synthetic aperture radar (SAR) images. Ensemble learning uses a collection of weak machine learning classifiers to create a stronger classifier which has higher accuracy than the individual classifiers in the ensemble. The strength of the ensemble lies in the fact that the individual classifiers in the ensemble create a mixture of experts in which the final classification made by the ensemble classifier is calculated from the outputs of the individual classifiers. Our methodology leverages this aspect of ensemble learning by training collections of weak decision tree based classifiers to identify regions of change in SAR images collected of a region in the Staten Island, New York area during Hurricane Sandy. Preliminary studies show that the ensemble method has approximately 11.5% higher change detection accuracy than an individual classifier.
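For orientation, the simple differencing baseline mentioned in the abstract can be sketched as a thresholded absolute difference between two co-registered images. This illustrates the baseline only, not the ensemble method of the paper. The function name, the threshold and the pixel values are hypothetical.

```python
def difference_change_mask(before, after, threshold):
    """Flag a pixel as changed when |after - before| exceeds the threshold.

    before and after are equally sized 2-D lists of pixel intensities.
    """
    return [
        [abs(b - a) > threshold for a, b in zip(row_before, row_after)]
        for row_before, row_after in zip(before, after)
    ]

# Two tiny made-up "images" with two changed pixels.
before = [[10, 10, 10], [10, 10, 10]]
after = [[10, 30, 10], [12, 10, 50]]
mask = difference_change_mask(before, after, threshold=5)
print(mask)  # [[False, True, False], [False, False, True]]
```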
Improving ensemble decision tree performance using Adaboost and Bagging
NASA Astrophysics Data System (ADS)
Hasan, Md. Rajib; Siraj, Fadzilah; Sainin, Mohd Shamrie
2015-12-01
Ensemble classifier systems are considered among the most promising approaches in medical data classification, and the performance of a decision tree classifier can be increased by the ensemble method, as it has been shown to be better than single classifiers. However, in an ensemble setting the performance depends on the selection of a suitable base classifier. This research employed two prominent ensemble methods, namely Adaboost and Bagging, with base classifiers such as Random Forest, Random Tree, j48, j48grafts and Logistic Model Regression (LMT) that were selected independently. The empirical study shows that the performance varies when different base classifiers are selected, and in some cases overfitting was also noted. The evidence shows that ensemble decision tree classifiers using Adaboost and Bagging improve the performance on the selected medical data sets.
Training set extension for SVM ensemble in P300-speller with familiar face paradigm.
Li, Qi; Shi, Kaiyang; Gao, Ning; Li, Jian; Bai, Ou
2018-03-27
P300-spellers are brain-computer interface (BCI)-based character input systems. Support vector machine (SVM) ensembles are trained with large-scale training sets and used as classifiers in these systems. However, the required large-scale training data necessitate a prolonged collection time for each subject, which results in data collected toward the end of the period being contaminated by the subject's fatigue. This study aimed to develop a method for acquiring more training data based on a collected small training set. A new method was developed in which two corresponding training datasets in two sequences are superposed and averaged to extend the training set. The proposed method was tested offline on a P300-speller with the familiar face paradigm. The SVM ensemble with extended training set achieved 85% classification accuracy for the averaged results of four sequences, and 100% for 11 sequences in the P300-speller. In contrast, the conventional SVM ensemble with non-extended training set achieved only 65% accuracy for four sequences, and 92% for 11 sequences. The SVM ensemble with extended training set achieves higher classification accuracies than the conventional SVM ensemble, which verifies that the proposed method effectively improves the classification performance of BCI P300-spellers, thus enhancing their practicality.
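The training-set extension step can be illustrated with a minimal sketch that averages pairs of recorded epochs sample by sample and appends the averages to the original set. Which pairs the authors actually superpose is not stated in the abstract. Averaging every pair, as done here, is an assumption, and the function name and values are hypothetical.

```python
from itertools import combinations

def extend_by_pair_averaging(epochs):
    """Return the original epochs plus the sample-wise average of every pair.

    epochs is a list of equal-length signal segments (lists of floats).
    Averaging all pairs is an illustrative assumption.
    """
    extended = [list(e) for e in epochs]
    for a, b in combinations(epochs, 2):
        extended.append([(x + y) / 2 for x, y in zip(a, b)])
    return extended

# Three made-up epochs of three samples each.
epochs = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.0, 0.0, 6.0]]
extended = extend_by_pair_averaging(epochs)
print(len(extended))  # 6: three originals plus three pairwise averages
```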
Jun, Sanghoon; Kim, Namkug; Seo, Joon Beom; Lee, Young Kyung; Lynch, David A
2017-12-01
We propose the use of ensemble classifiers to overcome inter-scanner variations in the differentiation of regional disease patterns in high-resolution computed tomography (HRCT) images of diffuse interstitial lung disease patients obtained from different scanners. A total of 600 rectangular 20 × 20-pixel regions of interest (ROIs) on HRCT images obtained from two different scanners (GE and Siemens) and the whole lung area of 92 HRCT images were classified as one of six regional pulmonary disease patterns by two expert radiologists. Textual and shape features were extracted from each ROI and the whole lung parenchyma. For automatic classification, individual and ensemble classifiers were trained and tested with the ROI dataset. We designed the following three experimental sets: an intra-scanner study in which the training and test sets were from the same scanner, an integrated scanner study in which the data from the two scanners were merged, and an inter-scanner study in which the training and test sets were acquired from different scanners. In the ROI-based classification, the ensemble classifiers showed better (p < 0.001) accuracy (89.73%, SD = 0.43) than the individual classifiers (88.38%, SD = 0.31) in the integrated scanner test. The ensemble classifiers also showed partial improvements in the intra- and inter-scanner tests. In the whole lung classification experiment, the quantification accuracies of the ensemble classifiers with integrated training (49.57%) were higher (p < 0.001) than the individual classifiers (48.19%). Furthermore, the ensemble classifiers also showed better performance in both the intra- and inter-scanner experiments. We concluded that the ensemble classifiers provide better performance when using integrated scanner images.
Exploring diversity in ensemble classification: Applications in large area land cover mapping
NASA Astrophysics Data System (ADS)
Mellor, Andrew; Boukir, Samia
2017-07-01
Ensemble classifiers, such as random forests, are now commonly applied in the field of remote sensing, and have been shown to perform better than single classifier systems, resulting in reduced generalisation error. Diversity across the members of ensemble classifiers is known to have a strong influence on classification performance - whereby classifier errors are uncorrelated and more uniformly distributed across ensemble members. The relationship between ensemble diversity and classification performance has not yet been fully explored in the fields of information science and machine learning and has never been examined in the field of remote sensing. This study is a novel exploration of ensemble diversity and its link to classification performance, applied to a multi-class canopy cover classification problem using random forests and multisource remote sensing and ancillary GIS data, across seven million hectares of diverse dry-sclerophyll dominated public forests in Victoria Australia. A particular emphasis is placed on analysing the relationship between ensemble diversity and ensemble margin - two key concepts in ensemble learning. The main novelty of our work is on boosting diversity by emphasizing the contribution of lower margin instances used in the learning process. Exploring the influence of tree pruning on diversity is also a new empirical analysis that contributes to a better understanding of ensemble performance. Results reveal insights into the trade-off between ensemble classification accuracy and diversity, and through the ensemble margin, demonstrate how inducing diversity by targeting lower margin training samples is a means of achieving better classifier performance for more difficult or rarer classes and reducing information redundancy in classification problems. Our findings inform strategies for collecting training data and designing and parameterising ensemble classifiers, such as random forests. 
This is particularly important in large area remote sensing applications, for which training data is costly and resource intensive to collect.
Nanni, Loris; Lumini, Alessandra
2009-01-01
This work has two aims: to propose a novel method for building an ensemble of classifiers for peptide classification based on substitution matrices, and to show the importance of selecting a proper set of parameters for the classifiers that make up the ensemble. The HIV-1 protease cleavage site prediction problem is studied here. Results obtained with a blind testing protocol are reported; the comparison with other state-of-the-art approaches based on ensembles of classifiers quantifies the performance improvement obtained by the systems proposed in this paper. The simulation based on experimentally determined protease cleavage data has demonstrated the success of these new ensemble algorithms. It is particularly interesting that, even though the HIV-1 protease cleavage site prediction problem is considered linearly separable, we obtain the best performance using an ensemble of non-linear classifiers.
Novel layered clustering-based approach for generating ensemble of classifiers.
Rahman, Ashfaqur; Verma, Brijesh
2011-05-01
This paper introduces a novel concept for creating an ensemble of classifiers. The concept is based on generating an ensemble of classifiers through clustering of data at multiple layers. The ensemble classifier model generates a set of alternative clustering of a dataset at different layers by randomly initializing the clustering parameters and trains a set of base classifiers on the patterns at different clusters in different layers. A test pattern is classified by first finding the appropriate cluster at each layer and then using the corresponding base classifier. The decisions obtained at different layers are fused into a final verdict using majority voting. As the base classifiers are trained on overlapping patterns at different layers, the proposed approach achieves diversity among the individual classifiers. Identification of difficult-to-classify patterns through clustering as well as achievement of diversity through layering leads to better classification results as evidenced from the experimental results.
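The layered scheme described above can be sketched in a few lines. Each layer runs k-means from a different random initialisation, each cluster gets its own base classifier, and a test pattern is routed to one cluster per layer before the layer decisions are fused by majority voting. As a stand-in assumption, the base classifier here is simply the majority label of the cluster. The abstract does not say which base classifiers the authors use, and all names and data below are hypothetical.

```python
import random
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))

def kmeans(points, k, rng, iters=10):
    """Plain k-means started from k randomly chosen data points."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            [sum(c) / len(g) for c in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def train_layers(points, labels, n_layers=5, k=2, seed=0):
    """One clustering per layer. Each cluster stores its majority label."""
    rng = random.Random(seed)
    fallback = Counter(labels).most_common(1)[0][0]
    layers = []
    for _ in range(n_layers):
        centroids = kmeans(points, k, rng)
        votes = [Counter() for _ in range(k)]
        for p, y in zip(points, labels):
            votes[nearest(p, centroids)][y] += 1
        models = [v.most_common(1)[0][0] if v else fallback for v in votes]
        layers.append((centroids, models))
    return layers

def predict(layers, point):
    """Route the point to one cluster per layer, then majority-vote."""
    decisions = [models[nearest(point, cents)] for cents, models in layers]
    return Counter(decisions).most_common(1)[0][0]

# Two well-separated toy classes (made-up data).
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
labels = ["a", "a", "a", "b", "b", "b"]
layers = train_layers(points, labels)
print(predict(layers, [0.1, 0.1]), predict(layers, [5.0, 5.0]))
```

In a fuller version the per-cluster majority label would be replaced by a trained classifier, which is where the diversity described in the abstract would come from.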
Applying data fusion techniques for benthic habitat mapping and monitoring in a coral reef ecosystem
NASA Astrophysics Data System (ADS)
Zhang, Caiyun
2015-06-01
Accurate mapping and effective monitoring of benthic habitat in the Florida Keys are critical in developing management strategies for this valuable coral reef ecosystem. For this study, a framework was designed for automated benthic habitat mapping by combining multiple data sources (hyperspectral, aerial photography, and bathymetry data) and four contemporary imagery processing techniques (data fusion, Object-based Image Analysis (OBIA), machine learning, and ensemble analysis). In the framework, 1-m digital aerial photograph was first merged with 17-m hyperspectral imagery and 10-m bathymetry data using a pixel/feature-level fusion strategy. The fused dataset was then preclassified by three machine learning algorithms (Random Forest, Support Vector Machines, and k-Nearest Neighbor). Final object-based habitat maps were produced through ensemble analysis of outcomes from three classifiers. The framework was tested for classifying a group-level (3-class) and code-level (9-class) habitats in a portion of the Florida Keys. Informative and accurate habitat maps were achieved with an overall accuracy of 88.5% and 83.5% for the group-level and code-level classifications, respectively.
Dehzangi, Abdollah; Paliwal, Kuldip; Sharma, Alok; Dehzangi, Omid; Sattar, Abdul
2013-01-01
Better understanding of structural class of a given protein reveals important information about its overall folding type and its domain. It can also be directly used to provide critical information on general tertiary structure of a protein which has a profound impact on protein function determination and drug design. Despite tremendous enhancements made by pattern recognition-based approaches to solve this problem, it still remains as an unsolved issue for bioinformatics that demands more attention and exploration. In this study, we propose a novel feature extraction model that incorporates physicochemical and evolutionary-based information simultaneously. We also propose overlapped segmented distribution and autocorrelation-based feature extraction methods to provide more local and global discriminatory information. The proposed feature extraction methods are explored for 15 most promising attributes that are selected from a wide range of physicochemical-based attributes. Finally, by applying an ensemble of different classifiers namely, Adaboost.M1, LogitBoost, naive Bayes, multilayer perceptron (MLP), and support vector machine (SVM) we show enhancement of the protein structural class prediction accuracy for four popular benchmarks.
Competitive Learning Neural Network Ensemble Weighted by Predicted Performance
ERIC Educational Resources Information Center
Ye, Qiang
2010-01-01
Ensemble approaches have been shown to enhance classification by combining the outputs from a set of voting classifiers. Diversity in error patterns among base classifiers promotes ensemble performance. Multi-task learning is an important characteristic for Neural Network classifiers. Introducing a secondary output unit that receives different…
Ramírez, J; Górriz, J M; Ortiz, A; Martínez-Murcia, F J; Segovia, F; Salas-Gonzalez, D; Castillo-Barnes, D; Illán, I A; Puntonet, C G
2018-05-15
Alzheimer's disease (AD) is the most common cause of dementia in the elderly and affects approximately 30 million individuals worldwide. Mild cognitive impairment (MCI) is very frequently a prodromal phase of AD, and existing studies have suggested that people with MCI tend to progress to AD at a rate of about 10-15% per year. However, the ability of clinicians and machine learning systems to predict AD based on MRI biomarkers at an early stage is still a challenging problem that can have a great impact in improving treatments. The proposed system, developed by the SiPBA-UGR team for this challenge, is based on feature standardization, ANOVA feature selection, partial least squares feature dimension reduction and an ensemble of One vs. Rest random forest classifiers. With the aim of improving its performance when discriminating healthy controls (HC) from MCI, a second binary classification level was introduced that reconsiders the HC and MCI predictions of the first level. The system was trained and evaluated on an ADNI dataset that consists of T1-weighted MRI morphological measurements from HC, stable MCI, converter MCI and AD subjects. The proposed system yields a 56.25% classification score on the test subset, which consists of 160 real subjects. The classifier yielded the best performance when compared to: (i) One vs. One (OvO), One vs. Rest (OvR) and error correcting output codes (ECOC) as strategies for reducing the multiclass classification task to multiple binary classification problems, (ii) support vector machines, gradient boosting classifier and random forest as base binary classifiers, and (iii) bagging ensemble learning. A robust method has been proposed for the international challenge on MCI prediction based on MRI data. The system yielded the second best performance during the competition with an accuracy rate of 56.25% when evaluated on the real subjects of the test set. Copyright © 2017 Elsevier B.V. All rights reserved.
Building Diversified Multiple Trees for classification in high dimensional noisy biomedical data.
Li, Jiuyong; Liu, Lin; Liu, Jixue; Green, Ryan
2017-12-01
It is common that a trained classification model is applied to the operating data that is deviated from the training data because of noise. This paper will test an ensemble method, Diversified Multiple Tree (DMT), on its capability for classifying instances in a new laboratory using the classifier built on the instances of another laboratory. DMT is tested on three real world biomedical data sets from different laboratories in comparison with four benchmark ensemble methods, AdaBoost, Bagging, Random Forests, and Random Trees. Experiments have also been conducted on studying the limitation of DMT and its possible variations. Experimental results show that DMT is significantly more accurate than other benchmark ensemble classifiers on classifying new instances of a different laboratory from the laboratory where instances are used to build the classifier. This paper demonstrates that an ensemble classifier, DMT, is more robust in classifying noisy data than other widely used ensemble methods. DMT works on the data set that supports multiple simple trees.
Prediction of lysine ubiquitylation with ensemble classifier and feature selection.
Zhao, Xiaowei; Li, Xiangtao; Ma, Zhiqiang; Yin, Minghao
2011-01-01
Ubiquitylation is an important process of post-translational modification. Correct identification of protein lysine ubiquitylation sites is of fundamental importance to understand the molecular mechanism of lysine ubiquitylation in biological systems. This paper develops a novel computational method to effectively identify the lysine ubiquitylation sites based on the ensemble approach. In the proposed method, 468 ubiquitylation sites from 323 proteins retrieved from the Swiss-Prot database were encoded into feature vectors by using four kinds of protein sequences information. An effective feature selection method was then applied to extract informative feature subsets. After different feature subsets were obtained by setting different starting points in the search procedure, they were used to train multiple random forests classifiers and then aggregated into a consensus classifier by majority voting. Evaluated by jackknife tests and independent tests respectively, the accuracy of the proposed predictor reached 76.82% for the training dataset and 79.16% for the test dataset, indicating that this predictor is a useful tool to predict lysine ubiquitylation sites. Furthermore, site-specific feature analysis was performed and it was shown that ubiquitylation is intimately correlated with the features of its surrounding sites in addition to features derived from the lysine site itself. The feature selection method is available upon request.
Automatic Estimation of Osteoporotic Fracture Cases by Using Ensemble Learning Approaches.
Kilic, Niyazi; Hosgormez, Erkan
2016-03-01
Ensemble learning methods are among the most powerful tools for pattern classification problems. In this paper, the effects of ensemble learning methods and some physical bone densitometry parameters on osteoporotic fracture detection were investigated. Six feature set models were constructed from different physical parameters and fed into the ensemble classifiers as input features. Bagging, gradient boosting and random subspace (RSM) were used as ensemble learning techniques. Instance-based learning (IBk) and random forest (RF) classifiers were applied to the six feature set models. The patients were classified into three groups (osteoporosis, osteopenia and healthy control) using the ensemble classifiers. Total classification accuracy and F-measure were used to evaluate the diagnostic performance of the proposed ensemble classification system. The classification accuracy reached 98.85% with model 6 (five BMD + five T-score values) using the RSM-RF classifier. The findings of this paper suggest that patients could be warned before a bone fracture occurs, just by examining some physical parameters that can easily be measured without invasive operations.
Protein Kinase Classification with 2866 Hidden Markov Models and One Support Vector Machine
NASA Technical Reports Server (NTRS)
Weber, Ryan; New, Michael H.; Fonda, Mark (Technical Monitor)
2002-01-01
The main application considered in this paper is predicting true kinases from randomly permuted kinases that share the same length and amino acid distributions as the true kinases. Numerous methods already exist for this classification task, such as HMMs, motif-matchers, and sequence comparison algorithms. We build on some of these efforts by creating a vector from the output of thousands of structurally based HMMs, created offline with Pfam-A seed alignments using SAM-T99, which then must be combined into an overall classification for the protein. Then we use a Support Vector Machine for classifying this large ensemble Pfam-Vector, with polynomial and chi-squared kernels. In particular, the chi-squared kernel SVM performs better in some respects than the HMMs and the BLAST pairwise comparisons when predicting true from false kinases, but no one algorithm is best for all purposes or in all instances, so we consider the particular strengths and weaknesses of each.
Pairwise Classifier Ensemble with Adaptive Sub-Classifiers for fMRI Pattern Analysis.
Kim, Eunwoo; Park, HyunWook
2017-02-01
The multi-voxel pattern analysis technique is applied to fMRI data for classification of high-level brain functions using pattern information distributed over multiple voxels. In this paper, we propose a classifier ensemble for multiclass classification in fMRI analysis, exploiting the fact that specific neighboring voxels can contain spatial pattern information. The proposed method converts the multiclass classification to a pairwise classifier ensemble, and each pairwise classifier consists of multiple sub-classifiers using an adaptive feature set for each class-pair. Simulated and real fMRI data were used to verify the proposed method. Intra- and inter-subject analyses were performed to compare the proposed method with several well-known classifiers, including single and ensemble classifiers. The comparison results showed that the proposed method can be generally applied to multiclass classification in both simulations and real fMRI analyses.
Comparing ensemble learning methods based on decision tree classifiers for protein fold recognition.
Bardsiri, Mahshid Khatibi; Eftekhari, Mahdi
2014-01-01
In this paper, some methods for ensemble learning of protein fold recognition based on a decision tree (DT) are compared and contrasted against each other over three datasets taken from the literature. According to previously reported studies, the features of the datasets are divided into some groups. Then, for each of these groups, three ensemble classifiers, namely, random forest, rotation forest and AdaBoost.M1 are employed. Also, some fusion methods are introduced for combining the ensemble classifiers obtained in the previous step. After this step, three classifiers are produced based on the combination of classifiers of types random forest, rotation forest and AdaBoost.M1. Finally, the three different classifiers achieved are combined to make an overall classifier. Experimental results show that the overall classifier obtained by the genetic algorithm (GA) weighting fusion method, is the best one in comparison to previously applied methods in terms of classification accuracy.
Identifying interactions between chemical entities in biomedical text.
Lamurias, Andre; Ferreira, João D; Couto, Francisco M
2014-10-23
Interactions between chemical compounds described in biomedical text can be of great importance to drug discovery and design, as well as pharmacovigilance. We developed a novel system, "Identifying Interactions between Chemical Entities" (IICE), to identify chemical interactions described in text. Kernel-based Support Vector Machines first identify the interactions and then an ensemble classifier validates and classifies the type of each interaction. This relation extraction module was evaluated with the corpus released for the DDI Extraction task of SemEval 2013, obtaining results comparable to state-of-the-art methods for this type of task. We integrated this module with our chemical named entity recognition module and made the whole system available as a web tool at www.lasige.di.fc.ul.pt/webtools/iice.
EnsembleGASVR: a novel ensemble method for classifying missense single nucleotide polymorphisms.
Rapakoulia, Trisevgeni; Theofilatos, Konstantinos; Kleftogiannis, Dimitrios; Likothanasis, Spiros; Tsakalidis, Athanasios; Mavroudi, Seferina
2014-08-15
Single nucleotide polymorphisms (SNPs) are considered the most frequently occurring DNA sequence variations. Several computational methods have been proposed for the classification of missense SNPs to neutral and disease associated. However, existing computational approaches fail to select relevant features by choosing them arbitrarily without sufficient documentation. Moreover, they are limited to the problem of missing values, imbalance between the learning datasets and most of them do not support their predictions with confidence scores. To overcome these limitations, a novel ensemble computational methodology is proposed. EnsembleGASVR facilitates a two-step algorithm, which in its first step applies a novel evolutionary embedded algorithm to locate close to optimal Support Vector Regression models. In its second step, these models are combined to extract a universal predictor, which is less prone to overfitting issues, systematizes the rebalancing of the learning sets and uses an internal approach for solving the missing values problem without loss of information. Confidence scores support all the predictions and the model becomes tunable by modifying the classification thresholds. An extensive study was performed for collecting the most relevant features for the problem of classifying SNPs, and a superset of 88 features was constructed. Experimental results show that the proposed framework outperforms well-known algorithms in terms of classification performance in the examined datasets. Finally, the proposed algorithmic framework was able to uncover the significant role of certain features such as the solvent accessibility feature, and the top-scored predictions were further validated by linking them with disease phenotypes. Datasets and codes are freely available on the Web at http://prlab.ceid.upatras.gr/EnsembleGASVR/dataset-codes.zip. All the required information about the article is available through http://prlab.ceid.upatras.gr/EnsembleGASVR/site.html. 
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Ensemble of classifiers for confidence-rated classification of NDE signal
NASA Astrophysics Data System (ADS)
Banerjee, Portia; Safdarnejad, Seyed; Udpa, Lalita; Udpa, Satish
2016-02-01
An ensemble of classifiers, in general, aims to improve classification accuracy by combining results from multiple weak hypotheses into a single strong classifier through weighted majority voting. Improved versions of ensembles of classifiers generate self-rated confidence scores, which estimate the reliability of each prediction, and boost the classifier using these confidence-rated predictions. However, such a confidence metric is based only on the rate of correct classification. In existing works, although ensembles of classifiers have been widely used in computational intelligence, the effect of all factors of unreliability on the confidence of classification is largely overlooked. In NDE, classification results are affected by the inherent ambiguity of classification, non-discriminative features, inadequate training samples and measurement noise. In this paper, we extend the existing ensemble classification by maximizing the confidence of every classification decision in addition to minimizing the classification error. Initial results of the approach on data from eddy current inspection show improvement in classification performance of defect and non-defect indications.
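Weighted majority voting, the fusion rule this abstract starts from, can be sketched as follows. The confidence returned here is simply the winning label's share of the total vote weight, which is an illustrative assumption and not the confidence measure proposed in the paper. The labels and weights are made up.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """Fuse per-classifier labels using per-classifier weights.

    Returns (label, confidence), where confidence is the winning
    label's share of the total weight (an assumed, simple measure).
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    winner = max(totals, key=totals.get)
    confidence = totals[winner] / sum(totals.values())
    return winner, confidence

# Two weak classifiers say "defect", one stronger one says "no_defect".
label, confidence = weighted_majority_vote(
    ["defect", "no_defect", "defect"], [0.5, 1.5, 0.5]
)
print(label, round(confidence, 2))  # no_defect 0.6
```

The example shows why the weights matter: an unweighted vote over the same three predictions would return "defect".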
Huang, Nantian; Qi, Jiajin; Li, Fuqing; Yang, Dongfeng; Cai, Guowei; Huang, Guilin; Zheng, Jian; Li, Zhenxin
2017-09-16
In order to improve the classification accuracy of recognizing short-circuit faults in electric transmission lines, a novel detection and diagnosis method based on empirical wavelet transform (EWT) and local energy (LE) is proposed. First, EWT is used to deal with the original short-circuit fault signals from photoelectric voltage transformers, before the amplitude modulated-frequency modulated (AM-FM) mode with a compactly supported Fourier spectrum is extracted. Subsequently, the fault occurrence time is detected according to the modulus maxima of intrinsic mode function (IMF₂) from three-phase voltage signals processed by EWT. After this process, the feature vectors are constructed by calculating the LE of the fundamental frequency based on the three-phase voltage signals of one period after the fault occurred. Finally, the classifier based on support vector machine (SVM) which was constructed with the LE feature vectors is used to classify 10 types of short-circuit fault signals. Compared with complementary ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and improved CEEMDAN methods, the new method using EWT has a better ability to present the frequency in time. The difference in the characteristics of the energy distribution in the time domain between different types of short-circuit faults can be presented by the feature vectors of LE. Together, simulation and real signals experiment demonstrate the validity and effectiveness of the new approach.
Huang, Nantian; Qi, Jiajin; Li, Fuqing; Yang, Dongfeng; Cai, Guowei; Huang, Guilin; Zheng, Jian; Li, Zhenxin
2017-01-01
In order to improve the classification accuracy of recognizing short-circuit faults in electric transmission lines, a novel detection and diagnosis method based on empirical wavelet transform (EWT) and local energy (LE) is proposed. First, EWT is used to deal with the original short-circuit fault signals from photoelectric voltage transformers, before the amplitude modulated-frequency modulated (AM-FM) mode with a compactly supported Fourier spectrum is extracted. Subsequently, the fault occurrence time is detected according to the modulus maxima of intrinsic mode function (IMF2) from three-phase voltage signals processed by EWT. After this process, the feature vectors are constructed by calculating the LE of the fundamental frequency based on the three-phase voltage signals of one period after the fault occurred. Finally, the classifier based on support vector machine (SVM) which was constructed with the LE feature vectors is used to classify 10 types of short-circuit fault signals. Compared with complementary ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and improved CEEMDAN methods, the new method using EWT has a better ability to present the frequency in time. The difference in the characteristics of the energy distribution in the time domain between different types of short-circuit faults can be presented by the feature vectors of LE. Together, simulation and real signals experiment demonstrate the validity and effectiveness of the new approach. PMID:28926953
Ozcift, Akin; Gulten, Arif
2011-12-01
Improving the accuracy of machine learning algorithms is vital in designing high performance computer-aided diagnosis (CADx) systems. Research has shown that the performance of a base classifier might be enhanced by ensemble classification strategies. In this study, we construct rotation forest (RF) ensemble classifiers of 30 machine learning algorithms to evaluate their classification performances using Parkinson's, diabetes and heart disease datasets from the literature. In the experiments, first the feature dimension of the three datasets is reduced using the correlation based feature selection (CFS) algorithm. Second, classification performances of 30 machine learning algorithms are calculated for the three datasets. Third, 30 classifier ensembles are constructed based on the RF algorithm to assess performances of the respective classifiers with the same disease data. All the experiments are carried out with a leave-one-out validation strategy and the performances of the 60 algorithms are evaluated using three metrics: classification accuracy (ACC), kappa error (KE) and area under the receiver operating characteristic (ROC) curve (AUC). Base classifiers achieved average accuracies of 72.15%, 77.52% and 84.43% for the diabetes, heart and Parkinson's datasets, respectively. The RF classifier ensembles produced average accuracies of 74.47%, 80.49% and 87.13% for the respective diseases. RF, a newly proposed classifier ensemble algorithm, might be used to improve the accuracy of miscellaneous machine learning algorithms to design advanced CADx systems. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
Automated detection of pulmonary nodules in CT images with support vector machines
NASA Astrophysics Data System (ADS)
Liu, Lu; Liu, Wanyu; Sun, Xiaoming
2008-10-01
Many methods have been proposed to help radiologists avoid missing small pulmonary nodules. Recently, support vector machines (SVMs) have received increasing attention for pattern recognition. In this paper, we present a computerized system aimed at pulmonary nodule detection; it identifies the lung field, extracts a set of candidate regions with a high sensitivity ratio and then classifies candidates by the use of SVMs. The Computer Aided Diagnosis (CAD) system presented in this paper supports the diagnosis of pulmonary nodules from Computed Tomography (CT) images as inflammation, tuberculoma, granuloma, sclerosing hemangioma, and malignant tumor. Five texture feature sets were extracted for each lesion, while a genetic algorithm based feature selection method was applied to identify the most robust features. The selected feature set was fed into an ensemble of SVM classifiers. The achieved classification performance was 100%, 92.75% and 90.23% in the training, validation and testing set, respectively. It is concluded that computerized analysis of medical images in combination with artificial intelligence can be used in clinical practice and may contribute to more efficient diagnosis.
NASA Technical Reports Server (NTRS)
Tumer, Kagan; Oza, Nikunj C.; Clancy, Daniel (Technical Monitor)
2001-01-01
Using an ensemble of classifiers instead of a single classifier has been shown to improve generalization performance in many pattern recognition problems. However, the extent of such improvement depends greatly on the amount of correlation among the errors of the base classifiers. Therefore, reducing those correlations while keeping the classifiers' performance levels high is an important area of research. In this article, we explore input decimation (ID), a method which selects feature subsets for their ability to discriminate among the classes and uses them to decouple the base classifiers. We provide a summary of the theoretical benefits of correlation reduction, along with results of our method on two underwater sonar data sets, three benchmarks from the Proben1/UCI repositories, and two synthetic data sets. The results indicate that input decimated ensembles (IDEs) outperform ensembles whose base classifiers use all the input features; randomly selected subsets of features; and features created using principal components analysis, on a wide range of domains.
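A rough Python sketch of the feature-subset selection idea behind input decimation follows. The abstract does not state how discriminative ability is scored, so the score used here (absolute Pearson correlation between a feature and a one-vs-rest class indicator) is an assumption, and the names `pearson` and `decimate_inputs` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def decimate_inputs(rows, labels, target_class, keep):
    """Pick the `keep` feature indices most correlated with one class.

    rows: list of feature vectors; labels: list of class labels.
    The score (absolute correlation with a one-vs-rest indicator) is an
    assumption made for this sketch.
    """
    indicator = [1.0 if label == target_class else 0.0 for label in labels]
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, indicator)), j))
    # Highest score first; the feature index breaks ties deterministically.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return sorted(j for _, j in scores[:keep])


# Toy data: feature j is "on" only for the j-th class.
rows = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
labels = ["a", "a", "b", "b", "c", "c"]
subsets = {c: decimate_inputs(rows, labels, c, keep=1) for c in ("a", "b", "c")}
print(subsets)
```

Each base classifier would then be trained only on its own subset, which is what decouples the classifiers' errors in this kind of scheme.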
Abraham, Leandro; Bromberg, Facundo; Forradellas, Raymundo
2018-04-01
Muscle activation level is currently being captured using impractical and expensive devices which make their use in telemedicine settings extremely difficult. To address this issue, a prototype is presented of a non-invasive, easy-to-install system for the estimation of a discrete level of muscle activation of the biceps muscle from 3D point clouds captured with RGB-D cameras. A methodology is proposed that uses the ensemble of shape functions point cloud descriptor for the geometric characterization of 3D point clouds, together with support vector machines to learn a classifier that, based on this geometric characterization for some points of view of the biceps, provides a model for the estimation of muscle activation for all neighboring points of view. This results in a classifier that is robust to small perturbations in the point of view of the capturing device, greatly simplifying the installation process for end-users. In the discrimination of five levels of effort with values up to the maximum voluntary contraction (MVC) of the biceps muscle (3800 g), the best variant of the proposed methodology achieved mean absolute errors of about 9.21% MVC - an acceptable performance for telemedicine settings where the electric measurement of muscle activation is impractical. The results prove that the correlations between the external geometry of the arm and biceps muscle activation are strong enough to consider computer vision and supervised learning an alternative with great potential for practical applications in tele-physiotherapy. Copyright © 2018 Elsevier Ltd. All rights reserved.
Bashir, Saba; Qamar, Usman; Khan, Farhan Hassan
2016-02-01
Accuracy plays a vital role in the medical field as it concerns the life of an individual. Extensive research has been conducted on disease classification and prediction using machine learning techniques. However, there is no agreement on which classifier produces the best results. A specific classifier may be better than others for a specific dataset, but another classifier could perform better for some other dataset. Ensembles of classifiers have proved to be an effective way to improve classification accuracy. In this research we present an ensemble framework with multi-layer classification using enhanced bagging and optimized weighting. The proposed model, called "HM-BagMoov", overcomes the limitations of conventional performance bottlenecks by utilizing an ensemble of seven heterogeneous classifiers. The framework is evaluated on five different heart disease datasets, four breast cancer datasets, two diabetes datasets, two liver disease datasets and one hepatitis dataset obtained from public repositories. The analysis of the results shows that the ensemble framework achieved the highest accuracy, sensitivity and F-Measure when compared with individual classifiers for all the diseases. In addition, the ensemble framework also achieved the highest accuracy when compared with state-of-the-art techniques. An application named "IntelliHealth" has also been developed based on the proposed model and may be used by hospitals and doctors for diagnostic advice. Copyright © 2015 Elsevier Inc. All rights reserved.
MSEBAG: a dynamic classifier ensemble generation based on 'minimum-sufficient ensemble' and bagging
NASA Astrophysics Data System (ADS)
Chen, Lei; Kamel, Mohamed S.
2016-01-01
In this paper, we propose a dynamic classifier system, MSEBAG, which is characterised by searching for the 'minimum-sufficient ensemble' and bagging at the ensemble level. It adopts an 'over-generation and selection' strategy and aims to achieve a good bias-variance trade-off. In the training phase, MSEBAG first searches for the 'minimum-sufficient ensemble', which maximises the in-sample fitness with the minimal number of base classifiers. Then, starting from the 'minimum-sufficient ensemble', a backward stepwise algorithm is employed to generate a collection of ensembles. The objective is to create a collection of ensembles with a descending fitness on the data, as well as a descending complexity in the structure. MSEBAG dynamically selects the ensembles from the collection for the decision aggregation. The extended adaptive aggregation (EAA) approach, a bagging-style algorithm performed at the ensemble level, is employed for this task. EAA searches for the competent ensembles using a score function, which takes into consideration both the in-sample fitness and the confidence of the statistical inference, and averages the decisions of the selected ensembles to label the test pattern. The experimental results show that the proposed MSEBAG outperforms the benchmarks on average.
Ensemble stump classifiers and gene expression signatures in lung cancer.
Frey, Lewis; Edgerton, Mary; Fisher, Douglas; Levy, Shawn
2007-01-01
Microarray data sets for cancer tumor tissue generally have very few samples, each sample having thousands of probes (i.e., continuous variables). The sparsity of samples makes it difficult for machine learning techniques to discover probes relevant to the classification of tumor tissue. By combining data from different platforms (i.e., data sources), data sparsity is reduced, but this typically requires normalizing data from the different platforms, which can be non-trivial. This paper proposes a variant on the idea of ensemble learners to circumvent the need for normalization. To facilitate comprehension we build ensembles of very simple classifiers known as decision stumps--decision trees of one test each. The Ensemble Stump Classifier (ESC) identifies an mRNA signature having three probes and high accuracy for distinguishing between adenocarcinoma and squamous cell carcinoma of the lung across four data sets. In terms of accuracy, ESC outperforms a decision tree classifier on all four data sets, outperforms ensemble decision trees on three data sets, and simple stump classifiers on two data sets.
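To make the notion of a decision stump concrete, here is a minimal Python sketch of a one-test tree on a single feature and a majority vote over several stumps. It is a generic illustration, not the ESC algorithm itself; the function names and the toy numbers are hypothetical and the labels are assumed to be 0/1.

```python
def fit_stump(values, labels):
    """Fit a one-test decision tree on a single feature.

    values: list of floats; labels: list of 0/1 class labels.
    Returns (threshold, label_if_above) minimising training errors.
    """
    best = None
    for threshold in sorted(set(values)):
        for label_if_above in (0, 1):
            errors = sum(
                (label_if_above if v > threshold else 1 - label_if_above) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, label_if_above)
    return best[1], best[2]


def stump_predict(stump, value):
    """Apply the single test: compare one feature value with the threshold."""
    threshold, label_if_above = stump
    return label_if_above if value > threshold else 1 - label_if_above


def ensemble_predict(stumps, sample):
    """Majority vote of stumps; stumps maps feature index -> fitted stump."""
    votes = sum(stump_predict(s, sample[j]) for j, s in stumps.items())
    return 1 if 2 * votes > len(stumps) else 0


# Toy data: three probes (columns), four samples, binary tumour label.
rows = [[1.0, 7.0, 0.3], [2.0, 6.5, 0.1], [8.0, 2.0, 0.9], [9.0, 1.5, 0.8]]
labels = [0, 0, 1, 1]
stumps = {j: fit_stump([r[j] for r in rows], labels) for j in range(3)}
print(ensemble_predict(stumps, [8.5, 1.0, 0.2]))
```

Because each stump only compares one probe against its own threshold, no probe has to be put on a common scale with the others, which is one plausible reading of why stumps help avoid cross-platform normalization.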
DOE Office of Scientific and Technical Information (OSTI.GOV)
Timofeev, Andrey V.; Egorov, Dmitry V.
This paper presents new results concerning the selection of an optimal information fusion formula for an ensemble of Lipschitz classifiers. The goal of information fusion is to create an integral classifier which could provide better generalization ability of the ensemble while achieving a practically acceptable level of effectiveness. The problem of information fusion is very relevant for data processing in multi-channel C-OTDR-monitoring systems. In this case we have to effectively classify targeted events which appear in the vicinity of the monitored object. The solution of this problem is based on an ensemble of Lipschitz classifiers, each of which corresponds to a respective channel. We suggest a new method for information fusion for an ensemble of Lipschitz classifiers. This method is called “The Weighing of Inversely as Lipschitz Constants” (WILC). Results of practical usage of the WILC method in multichannel C-OTDR monitoring systems are presented.
Yu, Hualong; Hong, Shufang; Yang, Xibei; Ni, Jun; Dan, Yuanyuan; Qin, Bin
2013-01-01
DNA microarray technology can measure the activities of tens of thousands of genes simultaneously, which provides an efficient way to diagnose cancer at the molecular level. Although this strategy has attracted significant research attention, most studies neglect an important problem, namely, that most DNA microarray datasets are skewed, which causes traditional learning algorithms to produce inaccurate results. Some studies have considered this problem, yet they focus only on binary-class problems. In this paper, we address the multiclass imbalanced classification problem, as encountered in cancer DNA microarray data, by using ensemble learning. We utilized a one-against-all coding strategy to transform the multiclass problem into multiple binary problems, each of them carrying out feature subspace, which is an evolving version of random subspace that generates multiple diverse training subsets. Next, we introduced one of two different correction technologies, namely, decision threshold adjustment or random undersampling, into each training subset to alleviate the damage of class imbalance. Specifically, a support vector machine was used as the base classifier, and a novel voting rule called counter voting was presented for making a final decision. Experimental results on eight skewed multiclass cancer microarray datasets indicate that, unlike many traditional classification approaches, our methods are insensitive to class imbalance.
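Two of the building blocks named in this abstract, one-against-all coding and random undersampling, can be sketched in a few lines of Python. This is only an illustration under simple assumptions (undersampling down to exact class balance, a fixed seed for repeatability); the function names are hypothetical, and the feature subspace and counter voting steps are not shown.

```python
import random


def one_against_all(labels, positive_class):
    """Relabel a multiclass problem as binary: 1 for positive_class, else 0."""
    return [1 if label == positive_class else 0 for label in labels]


def random_undersample(rows, binary_labels, seed=0):
    """Drop random majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(binary_labels) if y == 1]
    neg = [i for i, y in enumerate(binary_labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row plus an equally sized random majority sample.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [binary_labels[i] for i in kept]


# Toy skewed data: 2 samples of class "a", 5 of "b", 3 of "c".
labels = ["a"] * 2 + ["b"] * 5 + ["c"] * 3
rows = [[float(i)] for i in range(len(labels))]
binary = one_against_all(labels, "a")
balanced_rows, balanced_labels = random_undersample(rows, binary)
print(len(balanced_rows), sum(balanced_labels))
```

One binary training set of this kind would be built per class, so a problem with k classes yields k balanced subsets for the base classifiers.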
Haque, Mohammad Nazmul; Noman, Nasimul; Berretta, Regina; Moscato, Pablo
2016-01-01
Classification of datasets with imbalanced sample distributions has always been a challenge. In general, a popular approach for enhancing classification performance is the construction of an ensemble of classifiers. However, the performance of an ensemble is dependent on the choice of constituent base classifiers. Therefore, we propose a genetic algorithm-based search method for finding the optimum combination from a pool of base classifiers to form a heterogeneous ensemble. The algorithm, called GA-EoC, utilises 10-fold cross-validation on training data for evaluating the quality of each candidate ensemble. In order to combine the base classifiers' decisions into the ensemble's output, we used the simple and widely used majority voting approach. The proposed algorithm, along with the random sub-sampling approach to balance the class distribution, has been used for classifying class-imbalanced datasets. Additionally, if a feature set was not available, we used the (α, β) - k Feature Set method to select a better subset of features for classification. We have tested GA-EoC with three benchmarking datasets from the UCI-Machine Learning repository, one Alzheimer's disease dataset and a subset of the PubFig database of Columbia University. In general, the performance of the proposed method on the chosen datasets is robust and better than that of the constituent base classifiers and many other well-known ensembles. Based on our empirical study we claim that a genetic algorithm is a superior and reliable approach to heterogeneous ensemble construction and we expect that the proposed GA-EoC would perform consistently in other cases.
Argumentation Based Joint Learning: A Novel Ensemble Learning Approach
Xu, Junyi; Yao, Li; Li, Le
2015-01-01
Recently, ensemble learning methods have been widely used to improve classification performance in machine learning. In this paper, we present a novel ensemble learning method: argumentation based multi-agent joint learning (AMAJL), which integrates ideas from multi-agent argumentation, ensemble learning, and association rule mining. In AMAJL, argumentation technology is introduced as an ensemble strategy to integrate multiple base classifiers and generate a high performance ensemble classifier. We design an argumentation framework named Arena as a communication platform for knowledge integration. Through argumentation based joint learning, high quality individual knowledge can be extracted, and thus a refined global knowledge base can be generated and used independently for classification. We perform numerous experiments on multiple public datasets using AMAJL and other benchmark methods. The results demonstrate that our method can effectively extract high quality knowledge for ensemble classifier and improve the performance of classification. PMID:25966359
Gridded Calibration of Ensemble Wind Vector Forecasts Using Ensemble Model Output Statistics
NASA Astrophysics Data System (ADS)
Lazarus, S. M.; Holman, B. P.; Splitt, M. E.
2017-12-01
A computationally efficient method is developed that performs gridded post processing of ensemble wind vector forecasts. An expansive set of idealized WRF model simulations is generated to provide physically consistent high resolution winds over a coastal domain characterized by an intricate land / water mask. Ensemble model output statistics (EMOS) is used to calibrate the ensemble wind vector forecasts at observation locations. The local EMOS predictive parameters (mean and variance) are then spread throughout the grid utilizing flow-dependent statistical relationships extracted from the downscaled WRF winds. Using data withdrawal and 28 east central Florida stations, the method is applied to one year of 24 h wind forecasts from the Global Ensemble Forecast System (GEFS). Compared to the raw GEFS, the approach improves both the deterministic and probabilistic forecast skill. Analysis of multivariate rank histograms indicates the post processed forecasts are calibrated. Two downscaling case studies are presented, a quiescent easterly flow event and a frontal passage. Strengths and weaknesses of the approach are presented and discussed.
Wang, Ying; Coiera, Enrico; Runciman, William; Magrabi, Farah
2017-06-12
Approximately 10% of admissions to acute-care hospitals are associated with an adverse event. Analysis of incident reports helps to understand how and why incidents occur and can inform policy and practice for safer care. Unfortunately our capacity to monitor and respond to incident reports in a timely manner is limited by the sheer volumes of data collected. In this study, we aim to evaluate the feasibility of using multiclass classification to automate the identification of patient safety incidents in hospitals. Text based classifiers were applied to identify 10 incident types and 4 severity levels. Using the one-versus-one (OvsO) and one-versus-all (OvsA) ensemble strategies, we evaluated regularized logistic regression, linear support vector machine (SVM) and SVM with a radial-basis function (RBF) kernel. Classifiers were trained and tested with "balanced" datasets (n_Type = 2860, n_SeverityLevel = 1160) from a state-wide incident reporting system. Testing was also undertaken with imbalanced "stratified" datasets (n_Type = 6000, n_SeverityLevel = 5950) from the state-wide system and an independent hospital reporting system. Classifier performance was evaluated using a confusion matrix, as well as F-score, precision and recall. The most effective combination was an OvsO ensemble of binary SVM RBF classifiers with binary count feature extraction. For incident type, classifiers performed well on balanced and stratified datasets (F-score: 78.3, 73.9%), but were worse on independent datasets (68.5%). Reports about falls, medications, pressure injury, aggression and blood products were identified with high recall and precision. "Documentation" was the hardest type to identify. For severity level, F-score for severity assessment code (SAC) 1 (extreme risk) was 87.3 and 64% for SAC4 (low risk) on balanced data. With stratified data, high recall was achieved for SAC1 (82.8-84%) but precision was poor (6.8-11.2%). High risk incidents (SAC2) were confused with medium risk incidents (SAC3). Binary classifier ensembles appear to be a feasible method for identifying incidents by type and severity level. Automated identification should enable safety problems to be detected and addressed in a more timely manner. Multi-label classifiers may be necessary for reports that relate to more than one incident type.
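The one-versus-one (OvsO) strategy used in this study combines one binary classifier per pair of classes by voting. The Python sketch below shows only that voting step; the pairwise "classifiers" are hypothetical threshold rules on a single number rather than trained SVMs, and the function name `ovo_predict` is an assumption of this example.

```python
from collections import Counter
from itertools import combinations


def ovo_predict(classes, pairwise, x):
    """Predict by one-versus-one voting.

    classes: list of class labels.
    pairwise: dict mapping each pair (a, b), in the order given by
              itertools.combinations(classes, 2), to a function that
              returns either a or b for input x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    # Counter.most_common keeps first-seen order among equal counts.
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained binary classifiers.
classes = ["fall", "medication", "documentation"]
pairwise = {
    ("fall", "medication"): lambda x: "fall" if x < 5 else "medication",
    ("fall", "documentation"): lambda x: "fall" if x < 5 else "documentation",
    ("medication", "documentation"): lambda x: "medication" if x < 8 else "documentation",
}
print(ovo_predict(classes, pairwise, 6))
```

For k classes this scheme needs k(k-1)/2 binary classifiers, so the 10 incident types in the study correspond to 45 pairwise classifiers.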
Wang, Xueyi; Davidson, Nicholas J.
2011-01-01
Ensemble methods have been widely used to improve prediction accuracy over individual classifiers. In this paper, we establish a few results about the prediction accuracies of ensemble methods for binary classification that are missed or misinterpreted in previous literature. First we show the upper and lower bounds of the prediction accuracies (i.e. the best and worst possible prediction accuracies) of ensemble methods. Next we show that an ensemble method can achieve > 0.5 prediction accuracy, while individual classifiers have < 0.5 prediction accuracies. Furthermore, for individual classifiers with different prediction accuracies, the average of the individual accuracies determines the upper and lower bounds. We perform two experiments to verify the results and show that it is hard to achieve the upper and lower bound accuracies with random individual classifiers and that better algorithms need to be developed. PMID:21853162
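The claim that an ensemble can exceed 0.5 accuracy while every member is below 0.5 can be checked on a small worked example. The Python sketch below assumes simple majority voting over an odd number of classifiers and derives the bounds from a vote-counting argument; the formulas are this example's own derivation and are not quoted from the paper, and the function names are hypothetical.

```python
from fractions import Fraction


def majority_vote_bounds(k, mean_accuracy):
    """Best and worst majority-vote accuracy for k classifiers (k odd).

    A sample is classified correctly only if at least (k + 1) / 2 votes
    are correct, so the pool of correct votes limits how many samples can
    be right; the same argument on wrong votes gives the lower bound.
    Pass mean_accuracy as a Fraction or a string such as "2/5".
    """
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd number of classifiers")
    p = Fraction(mean_accuracy)
    needed = Fraction(k + 1, 2)
    upper = min(Fraction(1), k * p / needed)
    lower = max(Fraction(0), 1 - k * (1 - p) / needed)
    return lower, upper


def ensemble_accuracy(correct):
    """correct[i][j] is True if classifier i is right on sample j."""
    k, n = len(correct), len(correct[0])
    right = sum(1 for j in range(n) if 2 * sum(correct[i][j] for i in range(k)) > k)
    return Fraction(right, n)


# Three classifiers, each right on only 2 of 5 samples (accuracy 0.4),
# arranged so that their correct votes overlap in pairs.
T, F = True, False
correct = [
    [T, F, T, F, F],
    [T, T, F, F, F],
    [F, T, T, F, F],
]
print(ensemble_accuracy(correct), majority_vote_bounds(3, "2/5"))
```

In this arrangement the ensemble is right on 3 of 5 samples, which matches the upper bound for three classifiers with mean accuracy 0.4; the same vote pool spread differently could instead push the ensemble towards the lower bound.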
Decimated Input Ensembles for Improved Generalization
NASA Technical Reports Server (NTRS)
Tumer, Kagan; Oza, Nikunj C.; Norvig, Peter (Technical Monitor)
1999-01-01
Recently, many researchers have demonstrated that using classifier ensembles (e.g., averaging the outputs of multiple classifiers before reaching a classification decision) leads to improved performance for many difficult generalization problems. However, in many domains there are serious impediments to such "turnkey" classification accuracy improvements. Most notable among these is the deleterious effect of highly correlated classifiers on the ensemble performance. One particular solution to this problem is generating "new" training sets by sampling the original one. However, with a finite number of patterns, this causes a reduction in the training patterns each classifier sees, often resulting in considerably worsened generalization performance (particularly for high dimensional data domains) for each individual classifier. Generally, this drop in the accuracy of the individual classifier performance more than offsets any potential gains due to combining, unless diversity among classifiers is actively promoted. In this work, we introduce a method that: (1) reduces the correlation among the classifiers; (2) reduces the dimensionality of the data, thus lessening the impact of the 'curse of dimensionality'; and (3) improves the classification performance of the ensemble.
A mechatronics platform to study prosthetic hand control using EMG signals.
Geethanjali, P
2016-09-01
In this paper, a low-cost mechatronics platform for the design and development of robotic hands as well as a surface electromyogram (EMG) pattern recognition system is proposed. This paper also explores various EMG classification techniques using a low-cost electronics system in prosthetic hand applications. The proposed platform involves the development of a four channel EMG signal acquisition system; pattern recognition of acquired EMG signals; and development of a digital controller for a robotic hand. Four-channel surface EMG signals, acquired from ten healthy subjects for six different movements of the hand, were used to analyse pattern recognition in prosthetic hand control. Various time domain features were extracted and grouped into five ensembles to compare the influence of features in feature-selective classifiers (SLR) with widely considered non-feature-selective classifiers, such as neural networks (NN), linear discriminant analysis (LDA) and support vector machines (SVM) applied with different kernels. The results showed that the average classification accuracy of the SVM, with a linear kernel function, outperforms other classifiers with feature ensembles, Hudgin's feature set and auto regression (AR) coefficients. However, the slight improvement in classification accuracy of SVM incurs more processing time and memory space in the low-level controller. The Kruskal-Wallis (KW) test also shows that there is no significant difference between the classification performance of SLR with Hudgin's feature set and that of SVM with Hudgin's features along with AR coefficients. In addition, the KW test shows that SLR was found to be better with respect to computation time and memory space, which is vital in a low-level controller. Similar to SVM with a linear kernel function, the other non-feature-selective LDA and NN classifiers also show a slight improvement in performance using twice the features but with the drawback of increased memory space requirement and time. This prototype facilitated the study of various issues of pattern recognition and identified an efficient classifier, along with a feature ensemble, in the implementation of EMG controlled prosthetic hands in a laboratory setting at low cost. This platform may help to motivate and facilitate prosthetic hand research in developing countries.
NASA Astrophysics Data System (ADS)
Pazó, Diego; Rodríguez, Miguel A.; López, Juan M.
2010-05-01
We study the evolution of finite perturbations in the Lorenz ‘96 model, a meteorological toy model of the atmosphere. The initial perturbations are chosen to be aligned along different dynamic vectors: bred, Lyapunov, and singular vectors. Using a particular vector determines not only the amplification rate of the perturbation but also the spatial structure of the perturbation and its stability under the evolution of the flow. The evolution of perturbations is systematically studied by means of the so-called mean-variance of logarithms diagram that provides in a very compact way the basic information to analyse the spatial structure. We discuss the corresponding advantages of using those different vectors for preparing initial perturbations to be used in ensemble prediction systems, focusing on key properties: dynamic adaptation to the flow, robustness, equivalence between members of the ensemble, etc. Among all the vectors considered here, the so-called characteristic Lyapunov vectors are possibly optimal, in the sense that they are both perfectly adapted to the flow and extremely robust.
NASA Astrophysics Data System (ADS)
Pazó, Diego; Rodríguez, Miguel A.; López, Juan M.
2010-01-01
We study the evolution of finite perturbations in the Lorenz ‘96 model, a meteorological toy model of the atmosphere. The initial perturbations are chosen to be aligned along different dynamic vectors: bred, Lyapunov, and singular vectors. Using a particular vector determines not only the amplification rate of the perturbation but also the spatial structure of the perturbation and its stability under the evolution of the flow. The evolution of perturbations is systematically studied by means of the so-called mean-variance of logarithms diagram that provides in a very compact way the basic information to analyse the spatial structure. We discuss the corresponding advantages of using those different vectors for preparing initial perturbations to be used in ensemble prediction systems, focusing on key properties: dynamic adaptation to the flow, robustness, equivalence between members of the ensemble, etc. Among all the vectors considered here, the so-called characteristic Lyapunov vectors are possibly optimal, in the sense that they are both perfectly adapted to the flow and extremely robust.
Hybrid Disease Diagnosis Using Multiobjective Optimization with Evolutionary Parameter Optimization
Nalluri, MadhuSudana Rao; K., Kannan; M., Manisha
2017-01-01
With the widespread adoption of e-Healthcare and telemedicine applications, accurate, intelligent disease diagnosis systems are in high demand. In recent years, numerous individual machine learning-based classifiers have been proposed and tested, and it is now widely accepted that a single classifier cannot effectively classify and diagnose all diseases. This has led to a number of recent research attempts to arrive at a consensus using ensemble classification techniques. In this paper, a hybrid system is proposed to diagnose ailments by optimizing individual classifier parameters for two classifier techniques, namely, support vector machine (SVM) and multilayer perceptron (MLP). We employ three recent evolutionary algorithms to optimize the parameters of the classifiers above, leading to six alternative hybrid disease diagnosis systems, also referred to as hybrid intelligent systems (HISs). Multiple objectives, namely, prediction accuracy, sensitivity, and specificity, have been considered to compare the efficacy of the proposed hybrid systems with existing ones. The proposed model is evaluated on 11 benchmark datasets, and the obtained results demonstrate that our proposed hybrid diagnosis systems perform better in terms of disease prediction accuracy, sensitivity, and specificity. Pertinent statistical tests were carried out to substantiate the efficacy of the obtained results. PMID:29065626
Elsawy, Amr S; Eldawlatly, Seif; Taher, Mohamed; Aly, Gamal M
2014-01-01
The current trend to use Brain-Computer Interfaces (BCIs) with mobile devices mandates the development of efficient EEG data processing methods. In this paper, we demonstrate the performance of a Principal Component Analysis (PCA) ensemble classifier for P300-based spellers. We recorded EEG data from multiple subjects using the Emotiv neuroheadset in the context of a classical oddball P300 speller paradigm. We compare the performance of the proposed ensemble classifier to the performance of traditional feature extraction and classifier methods. Our results demonstrate the capability of the PCA ensemble classifier to classify P300 data recorded using the Emotiv neuroheadset with an average accuracy of 86.29% on cross-validation data. In addition, offline testing of the recorded data reveals an average classification accuracy of 73.3% that is significantly higher than that achieved using traditional methods. Finally, we demonstrate the effect of the parameters of the P300 speller paradigm on the performance of the method.
Ensemble Classifier Strategy Based on Transient Feature Fusion in Electronic Nose
NASA Astrophysics Data System (ADS)
Bagheri, Mohammad Ali; Montazer, Gholam Ali
2011-09-01
In this paper, we test the performance of several ensembles of classifiers in which each base learner is trained on a different type of extracted feature. Experimental results show the potential benefits of using simple ensemble classification systems to integrate different types of transient features.
Ensemble Sparse Classification of Alzheimer’s Disease
Liu, Manhua; Zhang, Daoqiang; Shen, Dinggang
2012-01-01
High-dimensional pattern classification methods, e.g., support vector machines (SVM), have been widely investigated for analysis of structural and functional brain images (such as magnetic resonance imaging (MRI)) to assist the diagnosis of Alzheimer’s disease (AD) including its prodromal stage, i.e., mild cognitive impairment (MCI). Most existing classification methods extract features from neuroimaging data and then construct a single classifier to perform classification. However, due to noise and the small sample size of neuroimaging data, it is challenging to train only a global classifier that can be robust enough to achieve good classification performance. In this paper, instead of building a single global classifier, we propose a local patch-based subspace ensemble method which builds multiple individual classifiers based on different subsets of local patches and then combines them for more accurate and robust classification. Specifically, to capture the local spatial consistency, each brain image is partitioned into a number of local patches and a subset of patches is randomly selected from the patch pool to build a weak classifier. Here, the sparse representation-based classification (SRC) method, which has been shown to be effective for classification of image data (e.g., face), is used to construct each weak classifier. Then, multiple weak classifiers are combined to make the final decision. We evaluate our method on 652 subjects (including 198 AD patients, 225 MCI and 229 normal controls) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database using MR images. The experimental results show that our method achieves an accuracy of 90.8% and an area under the ROC curve (AUC) of 94.86% for AD classification and an accuracy of 87.85% and an AUC of 92.90% for MCI classification, respectively, demonstrating a very promising performance of our method compared with the state-of-the-art methods for AD/MCI classification using MR images. PMID:22270352
Cacha, L A; Parida, S; Dehuri, S; Cho, S-B; Poznanski, R R
2016-12-01
The huge number of voxels in fMRI over time poses a major challenge for effective analysis. Fast, accurate, and reliable classifiers are required for estimating the decoding accuracy of brain activities. Although machine-learning classifiers seem promising, individual classifiers have their own limitations. To address these limitations, the present paper proposes a method based on an ensemble of neural networks to analyze fMRI data for cognitive state classification for application across multiple subjects. Similarly, the fuzzy integral (FI) approach has been employed as an efficient tool for combining different classifiers. The FI approach led to the development of a classifier ensemble technique that performs better than any single classifier by reducing misclassification, bias, and variance. The proposed method successfully classified the different cognitive states for multiple subjects with high classification accuracy. Comparison of the performance of the ensemble neural network method with that of the individual neural networks strongly points toward the usefulness of the proposed method.
Neural system for heartbeats recognition using genetically integrated ensemble of classifiers.
Osowski, Stanislaw; Siwek, Krzysztof; Siroic, Robert
2011-03-01
This paper presents the application of a genetic algorithm for the integration of neural classifiers combined in an ensemble for the accurate recognition of heartbeat types on the basis of ECG registration. The idea presented in this paper is that using many classifiers arranged in the form of an ensemble leads to increased recognition accuracy. In such an ensemble the important problem is the integration of all classifiers into one effective classification system. This paper proposes the use of a genetic algorithm for this purpose. It was shown that application of the genetic algorithm is very efficient and significantly reduces the total error of heartbeat recognition. This was confirmed by the numerical experiments performed on the MIT BIH Arrhythmia Database. Copyright © 2011 Elsevier Ltd. All rights reserved.
Haque, Mohammad Nazmul; Noman, Nasimul; Berretta, Regina; Moscato, Pablo
2016-01-01
Classification of datasets with imbalanced sample distributions has always been a challenge. In general, a popular approach for enhancing classification performance is the construction of an ensemble of classifiers. However, the performance of an ensemble is dependent on the choice of constituent base classifiers. Therefore, we propose a genetic algorithm-based search method for finding the optimum combination from a pool of base classifiers to form a heterogeneous ensemble. The algorithm, called GA-EoC, utilises 10-fold cross-validation on training data for evaluating the quality of each candidate ensemble. In order to combine the base classifiers' decisions into the ensemble’s output, we used the simple and widely used majority voting approach. The proposed algorithm, along with the random sub-sampling approach to balance the class distribution, has been used for classifying class-imbalanced datasets. Additionally, if a feature set was not available, we used the (α, β) − k Feature Set method to select a better subset of features for classification. We have tested GA-EoC with three benchmarking datasets from the UCI-Machine Learning repository, one Alzheimer’s disease dataset and a subset of the PubFig database of Columbia University. In general, the performance of the proposed method on the chosen datasets is robust and better than that of the constituent base classifiers and many other well-known ensembles. Based on our empirical study we claim that a genetic algorithm is a superior and reliable approach to heterogeneous ensemble construction and we expect that the proposed GA-EoC would perform consistently in other cases. PMID:26764911
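The random sub-sampling step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not describe the exact procedure, so this assumes the simplest variant: every class is randomly under-sampled to the size of the smallest class. The function name `balance_by_subsampling` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_subsampling(samples, labels, seed=0):
    """Randomly under-sample every class to the size of the smallest class.

    Assumption: this is one common way to balance classes, not necessarily
    the procedure used in GA-EoC.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # rng.sample draws n_min distinct items without replacement
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data: eight samples of class "a", two of class "b".
samples = list(range(10))
labels = ["a"] * 8 + ["b"] * 2
balanced = balance_by_subsampling(samples, labels)
class_counts = Counter(y for _, y in balanced)
```

Each base classifier in an ensemble could be trained on a different balanced draw (by varying `seed`), which is one way sub-sampling also adds diversity among members.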
Genetic Programming Based Ensemble System for Microarray Data Classification
Liu, Kun-Hong; Tong, Muchenxuan; Xie, Shu-Tong; Yee Ng, Vincent To
2015-01-01
Recently, more and more machine learning techniques have been applied to microarray data analysis. The aim of this study is to propose a genetic programming (GP) based new ensemble system (named GPES), which can be used to effectively classify different types of cancers. Decision trees are deployed as base classifiers in this ensemble framework with three operators: Min, Max, and Average. Each individual of the GP is an ensemble system, and they become more and more accurate in the evolutionary process. The feature selection technique and balanced subsampling technique are applied to increase the diversity in each ensemble system. The final ensemble committee is selected by a forward search algorithm, which is shown to be capable of fitting data automatically. The performance of GPES is evaluated using five binary class and six multiclass microarray datasets, and results show that the algorithm can achieve better results in most cases compared with some other ensemble systems. By using elaborate base classifiers or applying other sampling techniques, the performance of GPES may be further improved. PMID:25810748
NASA Astrophysics Data System (ADS)
Roberge, S.; Chokmani, K.; De Sève, D.
2012-04-01
The snow cover plays an important role in the hydrological cycle of Quebec (Eastern Canada). Consequently, evaluating its spatial extent interests the authorities responsible for the management of water resources, especially hydropower companies. The main objective of this study is the development of a snow-cover mapping strategy using remote sensing data and ensemble-based systems techniques. Planned to be tested in a near real-time operational mode, this snow-cover mapping strategy has the advantage of providing the probability of a pixel being snow covered and its uncertainty. Ensemble systems are made of two key components. First, a method is needed to build an ensemble of classifiers that is as diverse as possible. Second, an approach is required to combine the outputs of the individual classifiers that make up the ensemble in such a way that correct decisions are amplified and incorrect ones are cancelled out. In this study, we demonstrate the potential of ensemble systems for snow-cover mapping using remote sensing data. The chosen classifier is a sequential thresholds algorithm using NOAA-AVHRR data adapted to conditions over Eastern Canada. Its special feature is the use of a combination of six sequential thresholds varying according to the day in the winter season. Two versions of the snow-cover mapping algorithm have been developed: one is specific for autumn (from October 1st to December 31st) and the other for spring (from March 16th to May 31st). In order to build the ensemble-based system, different versions of the algorithm are created by randomly varying its parameters. One hundred of the versions are included in the ensemble. The probability of a pixel being snow, no-snow or cloud covered corresponds to the number of votes by which the pixel has been classified as such across all classifiers. The overall performance of ensemble-based mapping is compared to the overall performance of the chosen classifier, and also with ground observations at meteorological stations.
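The voting scheme described in the abstract above (one hundred randomly perturbed versions of a threshold classifier, with class probability taken as the share of votes) can be sketched as follows. This is a toy illustration only: the two-threshold rule, the threshold names and values, and the 10% perturbation range are all hypothetical and stand in for the six sequential thresholds of the actual algorithm, which the abstract does not specify.

```python
import random
from collections import Counter

LABELS = ("snow", "no-snow", "cloud")


def classify(pixel, thresholds):
    """Toy sequential-threshold rule (hypothetical, not the study's algorithm)."""
    if pixel["brightness_temp"] < thresholds["cloud_temp"]:
        return "cloud"
    if pixel["reflectance"] > thresholds["snow_reflectance"]:
        return "snow"
    return "no-snow"


def perturb(base, rel_spread, rng):
    """Copy the thresholds, varying each one randomly within +/- rel_spread."""
    return {k: v * (1.0 + rng.uniform(-rel_spread, rel_spread))
            for k, v in base.items()}


def vote_probabilities(pixel, base, n_members=100, rel_spread=0.1, seed=0):
    """Class probability = fraction of ensemble members voting for that class."""
    rng = random.Random(seed)
    votes = Counter(classify(pixel, perturb(base, rel_spread, rng))
                    for _ in range(n_members))
    return {label: votes[label] / n_members for label in LABELS}


# Hypothetical base thresholds and a pixel close to the snow/no-snow boundary.
base_thresholds = {"cloud_temp": 250.0, "snow_reflectance": 0.4}
pixel = {"brightness_temp": 280.0, "reflectance": 0.41}
probs = vote_probabilities(pixel, base_thresholds)
```

A pixel near a decision boundary receives split votes, so the vote fractions double as a simple uncertainty indicator; a pixel far from every threshold receives a unanimous vote.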
NASA Astrophysics Data System (ADS)
Young, Jonathan; Ridgway, Gerard; Leung, Kelvin; Ourselin, Sebastien
2012-02-01
It is well known that hippocampal atrophy is a marker of the onset of Alzheimer's disease (AD) and as a result hippocampal volumetry has been used in a number of studies to provide early diagnosis of AD and predict conversion of mild cognitive impairment patients to AD. However, rates of atrophy are not uniform across the hippocampus, making shape analysis a potentially more accurate biomarker. This study examines the hippocampi from 226 healthy controls, 148 AD patients and 330 MCI patients obtained from T1 weighted structural MRI images from the ADNI database. The hippocampi are anatomically segmented using the MAPS multi-atlas segmentation method, and the resulting binary images are then processed with SPHARM software to decompose their shapes as a weighted sum of spherical harmonic basis functions. The resulting parameterizations are then used as feature vectors in Support Vector Machine (SVM) classification. A wrapper based feature selection method was used as this considers the utility of features in discriminating classes in combination, fully exploiting the multivariate nature of the data and optimizing the selected set of features for the type of classifier that is used. The leave-one-out cross validated accuracy obtained on training data is 88.6% for classifying AD vs controls and 74% for classifying MCI-converters vs MCI-stable with very compact feature sets, showing that this is a highly promising method. There is currently a considerable fall in accuracy on unseen data, indicating that the feature selection is sensitive to the data used; however, feature ensemble methods may overcome this.
2014-01-01
Background: Left bundle branch block (LBBB) and right bundle branch block (RBBB) not only mask electrocardiogram (ECG) changes that reflect diseases but also indicate important underlying pathology. The timely detection of LBBB and RBBB is critical in the treatment of cardiac diseases. Inter-patient heartbeat classification is based on independent training and testing sets to construct and evaluate a heartbeat classification system. Therefore, a heartbeat classification system with a high performance evaluation possesses a strong predictive capability for unknown data. The aim of this study was to propose a method for inter-patient classification of heartbeats to accurately detect LBBB and RBBB from the normal beat (NORM). Methods: This study proposed a heartbeat classification method through a combination of three different types of classifiers: a minimum distance classifier constructed between NORM and LBBB; a weighted linear discriminant classifier between NORM and RBBB based on Bayesian decision making using posterior probabilities; and a linear support vector machine (SVM) between LBBB and RBBB. Each classifier was used with matching features to obtain better classification performance. The final types of the test heartbeats were determined using a majority voting strategy through the combination of class labels from the three classifiers. The optimal parameters for the classifiers were selected using cross-validation on the training set. The effects of different lead configurations on the classification results were assessed, and the performance of these three classifiers was compared for the detection of each pair of heartbeat types. Results: The study results showed that a two-lead configuration exhibited better classification results compared with a single-lead configuration. The construction of a classifier with good performance between each pair of heartbeat types significantly improved the heartbeat classification performance. The results showed a sensitivity of 91.4% and a positive predictive value of 37.3% for LBBB and a sensitivity of 92.8% and a positive predictive value of 88.8% for RBBB. Conclusions: A multi-classifier ensemble method was proposed based on inter-patient data and demonstrated a satisfactory classification performance. This approach has the potential for application in clinical practice to distinguish LBBB and RBBB from NORM of unknown patients. PMID:24903422
Huang, Huifang; Liu, Jie; Zhu, Qiang; Wang, Ruiping; Hu, Guangshu
2014-06-05
Left bundle branch block (LBBB) and right bundle branch block (RBBB) not only mask electrocardiogram (ECG) changes that reflect diseases but also indicate important underlying pathology. The timely detection of LBBB and RBBB is critical in the treatment of cardiac diseases. Inter-patient heartbeat classification is based on independent training and testing sets to construct and evaluate a heartbeat classification system. Therefore, a heartbeat classification system with a high performance evaluation possesses a strong predictive capability for unknown data. The aim of this study was to propose a method for inter-patient classification of heartbeats to accurately detect LBBB and RBBB from the normal beat (NORM). This study proposed a heartbeat classification method through a combination of three different types of classifiers: a minimum distance classifier constructed between NORM and LBBB; a weighted linear discriminant classifier between NORM and RBBB based on Bayesian decision making using posterior probabilities; and a linear support vector machine (SVM) between LBBB and RBBB. Each classifier was used with matching features to obtain better classification performance. The final types of the test heartbeats were determined using a majority voting strategy through the combination of class labels from the three classifiers. The optimal parameters for the classifiers were selected using cross-validation on the training set. The effects of different lead configurations on the classification results were assessed, and the performance of these three classifiers was compared for the detection of each pair of heartbeat types. The study results showed that a two-lead configuration exhibited better classification results compared with a single-lead configuration. The construction of a classifier with good performance between each pair of heartbeat types significantly improved the heartbeat classification performance. 
The results showed a sensitivity of 91.4% and a positive predictive value of 37.3% for LBBB and a sensitivity of 92.8% and a positive predictive value of 88.8% for RBBB. A multi-classifier ensemble method was proposed based on inter-patient data and demonstrated a satisfactory classification performance. This approach has the potential for application in clinical practice to distinguish LBBB and RBBB from NORM of unknown patients.
Can single classifiers be as useful as model ensembles to produce benthic seabed substratum maps?
NASA Astrophysics Data System (ADS)
Turner, Joseph A.; Babcock, Russell C.; Hovey, Renae; Kendrick, Gary A.
2018-05-01
Numerous machine-learning classifiers are available for benthic habitat map production, which can lead to different results. This study highlights the performance of the Random Forest (RF) classifier, which was significantly better than Classification Trees (CT), Naïve Bayes (NB), and a multi-model ensemble in terms of overall accuracy, Balanced Error Rate (BER), Kappa, and area under the curve (AUC) values. RF accuracy was often higher than 90% for each substratum class, even at the most detailed level of the substratum classification and AUC values also indicated excellent performance (0.8-1). Total agreement between classifiers was high at the broadest level of classification (75-80%) when differentiating between hard and soft substratum. However, this sharply declined as the number of substratum categories increased (19-45%) including a mix of rock, gravel, pebbles, and sand. The model ensemble, produced from the results of all three classifiers by majority voting, did not show any increase in predictive performance when compared to the single RF classifier. This study shows how a single classifier may be sufficient to produce benthic seabed maps and model ensembles of multiple classifiers.
Constructing better classifier ensemble based on weighted accuracy and diversity measure.
Zeng, Xiaodong; Wong, Derek F; Chao, Lidia S
2014-01-01
A weighted accuracy and diversity (WAD) method is presented, a novel measure used to evaluate the quality of the classifier ensemble, assisting in the ensemble selection task. The proposed measure is motivated by a commonly accepted hypothesis; that is, a robust classifier ensemble should not only be accurate but also different from every other member. In fact, accuracy and diversity are mutual restraint factors; that is, an ensemble with high accuracy may have low diversity, and an overly diverse ensemble may negatively affect accuracy. This study proposes a method to find the balance between accuracy and diversity that enhances the predictive ability of an ensemble for unknown data. The quality assessment for an ensemble is performed such that the final score is achieved by computing the harmonic mean of accuracy and diversity, where two weight parameters are used to balance them. The measure is compared to two representative measures, Kappa-Error and GenDiv, and two threshold measures that consider only accuracy or diversity, with two heuristic search algorithms, genetic algorithm, and forward hill-climbing algorithm, in ensemble selection tasks performed on 15 UCI benchmark datasets. The empirical results demonstrate that the WAD measure is superior to others in most cases.
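The scoring idea in the abstract above (a harmonic mean of accuracy and diversity, balanced by two weights) can be sketched as follows. The abstract does not give the formula, so this assumes the standard weighted harmonic mean, and it assumes mean pairwise disagreement as the diversity measure; the function names, weights, and toy predictions are hypothetical.

```python
def weighted_harmonic_mean(accuracy, diversity, w_acc=1.0, w_div=1.0):
    """Standard weighted harmonic mean of two positive scores.

    Assumption: the exact form of the WAD score may differ from this.
    """
    if accuracy <= 0 or diversity <= 0:
        return 0.0
    return (w_acc + w_div) / (w_acc / accuracy + w_div / diversity)


def pairwise_disagreement(predictions):
    """Mean fraction of samples on which two ensemble members disagree."""
    n = len(predictions)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = 0.0
    for i, j in pairs:
        differing = sum(a != b for a, b in zip(predictions[i], predictions[j]))
        total += differing / len(predictions[i])
    return total / len(pairs)


# Toy predictions of three ensemble members on four samples.
member_predictions = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
diversity = pairwise_disagreement(member_predictions)
score = weighted_harmonic_mean(accuracy=0.8, diversity=diversity)
```

Because a harmonic mean is pulled toward the smaller of its two arguments, an ensemble cannot obtain a high score by being very accurate but nearly uniform, or very diverse but inaccurate, which matches the balance the abstract describes.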
Abuassba, Adnan O M; Zhang, Dezheng; Luo, Xiong; Shaheryar, Ahmad; Ali, Hazrat
2017-01-01
Extreme Learning Machine (ELM) is a fast-learning algorithm for a single-hidden layer feedforward neural network (SLFN). It often has good generalization performance. However, there are chances that it might overfit the training data due to having more hidden nodes than needed. To address the generalization performance, we use a heterogeneous ensemble approach. We propose an Advanced ELM Ensemble (AELME) for classification, which includes Regularized-ELM, L2-norm-optimized ELM (ELML2), and Kernel-ELM. The ensemble is constructed by training a randomly chosen ELM classifier on a subset of training data selected through random resampling. The proposed AELM-Ensemble is evolved by employing an objective function of increasing diversity and accuracy among the final ensemble. Finally, the class label of unseen data is predicted using majority vote approach. Splitting the training data into subsets and incorporation of heterogeneous ELM classifiers result in higher prediction accuracy, better generalization, and a lower number of base classifiers, as compared to other models (Adaboost, Bagging, Dynamic ELM ensemble, data splitting ELM ensemble, and ELM ensemble). The validity of AELME is confirmed through classification on several real-world benchmark datasets.
Abuassba, Adnan O. M.; Ali, Hazrat
2017-01-01
Extreme Learning Machine (ELM) is a fast-learning algorithm for a single-hidden layer feedforward neural network (SLFN). It often has good generalization performance. However, there are chances that it might overfit the training data due to having more hidden nodes than needed. To address the generalization performance, we use a heterogeneous ensemble approach. We propose an Advanced ELM Ensemble (AELME) for classification, which includes Regularized-ELM, L2-norm-optimized ELM (ELML2), and Kernel-ELM. The ensemble is constructed by training a randomly chosen ELM classifier on a subset of training data selected through random resampling. The proposed AELM-Ensemble is evolved by employing an objective function of increasing diversity and accuracy among the final ensemble. Finally, the class label of unseen data is predicted using majority vote approach. Splitting the training data into subsets and incorporation of heterogeneous ELM classifiers result in higher prediction accuracy, better generalization, and a lower number of base classifiers, as compared to other models (Adaboost, Bagging, Dynamic ELM ensemble, data splitting ELM ensemble, and ELM ensemble). The validity of AELME is confirmed through classification on several real-world benchmark datasets. PMID:28546808
Constructing Better Classifier Ensemble Based on Weighted Accuracy and Diversity Measure
Chao, Lidia S.
2014-01-01
A weighted accuracy and diversity (WAD) method is presented, a novel measure used to evaluate the quality of the classifier ensemble, assisting in the ensemble selection task. The proposed measure is motivated by a commonly accepted hypothesis; that is, a robust classifier ensemble should not only be accurate but also different from every other member. In fact, accuracy and diversity are mutual restraint factors; that is, an ensemble with high accuracy may have low diversity, and an overly diverse ensemble may negatively affect accuracy. This study proposes a method to find the balance between accuracy and diversity that enhances the predictive ability of an ensemble for unknown data. The quality assessment for an ensemble is performed such that the final score is achieved by computing the harmonic mean of accuracy and diversity, where two weight parameters are used to balance them. The measure is compared to two representative measures, Kappa-Error and GenDiv, and two threshold measures that consider only accuracy or diversity, with two heuristic search algorithms, genetic algorithm, and forward hill-climbing algorithm, in ensemble selection tasks performed on 15 UCI benchmark datasets. The empirical results demonstrate that the WAD measure is superior to others in most cases. PMID:24672402
Sankari, E Siva; Manimegalai, D
2017-12-21
Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types. Owing to the large number of uncharacterized protein sequences in databases, traditional methods are very time consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Since imbalanced datasets are used, the performance of various decision tree classifiers such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree, and REP (Reduced Error Pruning) tree, and of ensemble methods such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest, is analysed. Among the various decision tree classifiers, Random forest performs well in less time with a good accuracy of 96.35%. Another inference is that the RUS boost decision tree classifier is able to classify one or two samples in the class with very few samples, while the other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive to the classes with fewer samples. The performance of the decision tree classifiers is also compared with SVM (Support Vector Machine) and Naive Bayes classifiers. Copyright © 2017 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Uslu, Faruk Sukru
2017-07-01
Oil spills on the ocean surface cause serious environmental, political, and economic problems. Therefore, these catastrophic threats to marine ecosystems require detection and monitoring. Hyperspectral sensors are powerful optical sensors used for oil spill detection with the help of detailed spectral information of materials. However, huge amounts of data in hyperspectral imaging (HSI) require fast and accurate computation methods for detection problems. Support vector data description (SVDD) is one of the most suitable methods for detection, especially for large data sets. Nevertheless, the selection of kernel parameters is one of the main problems in SVDD. This paper presents a method, inspired by ensemble learning, for improving performance of SVDD without tuning its kernel parameters. Additionally, a classifier selection technique is proposed to get more gain. The proposed approach also aims to solve the small sample size problem, which is very important for processing high-dimensional data in HSI. The algorithm is applied to two HSI data sets for detection problems. In the first HSI data set, various targets are detected; in the second HSI data set, oil spill detection in situ is realized. The experimental results demonstrate the feasibility and performance improvement of the proposed algorithm for oil spill detection problems.
A Theoretical Analysis of Why Hybrid Ensembles Work.
Hsu, Kuo-Wei
2017-01-01
Inspired by the group decision making process, ensembles or combinations of classifiers have been found favorable in a wide variety of application domains. Some researchers propose to use the mixture of two different types of classification algorithms to create a hybrid ensemble. Why does such an ensemble work? The question remains. Following the concept of diversity, which is one of the fundamental elements of the success of ensembles, we conduct a theoretical analysis of why hybrid ensembles work, connecting using different algorithms to accuracy gain. We also conduct experiments on classification performance of hybrid ensembles of classifiers created by decision tree and naïve Bayes classification algorithms, each of which is a top data mining algorithm and often used to create non-hybrid ensembles. Therefore, through this paper, we provide a complement to the theoretical foundation of creating and using hybrid ensembles.
Accurate determination of imaging modality using an ensemble of text- and image-based classifiers.
Kahn, Charles E; Kalpathy-Cramer, Jayashree; Lam, Cesar A; Eldredge, Christina E
2012-02-01
Imaging modality can aid retrieval of medical images for clinical practice, research, and education. We evaluated whether an ensemble classifier could outperform its constituent individual classifiers in determining the modality of figures from radiology journals. Seventeen automated classifiers analyzed 77,495 images from two radiology journals. Each classifier assigned one of eight imaging modalities (computed tomography, graphic, magnetic resonance imaging, nuclear medicine, positron emission tomography, photograph, ultrasound, or radiograph) to each image based on visual and/or textual information. Three physicians determined the modality of 5,000 randomly selected images as a reference standard. A "Simple Vote" ensemble classifier assigned each image to the modality that received the greatest number of individual classifiers' votes. A "Weighted Vote" classifier weighted each individual classifier's vote based on performance over a training set. For each image, this classifier's output was the imaging modality that received the greatest weighted vote score. We measured precision, recall, and F score (the harmonic mean of precision and recall) for each classifier. Individual classifiers' F scores ranged from 0.184 to 0.892. The simple vote and weighted vote classifiers correctly assigned 4,565 images (F score, 0.913; 95% confidence interval, 0.905-0.921) and 4,672 images (F score, 0.934; 95% confidence interval, 0.927-0.941), respectively. The weighted vote classifier performed significantly better than all individual classifiers. An ensemble classifier correctly determined the imaging modality of 93% of figures in our sample. The imaging modality of figures published in radiology journals can be determined with high accuracy, which will improve systems for image retrieval.
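The difference between the "Simple Vote" and "Weighted Vote" schemes described in the abstract above can be shown with a minimal sketch. The abstract says only that weights were based on performance over a training set, so the weights below (imagined as per-classifier training F scores) and the three example votes are hypothetical.

```python
from collections import Counter, defaultdict


def simple_vote(votes):
    """Return the label chosen by the largest number of classifiers.

    Ties are broken in favour of the label encountered first.
    """
    return Counter(votes).most_common(1)[0][0]


def weighted_vote(votes, weights):
    """Return the label with the greatest sum of classifier weights."""
    scores = defaultdict(float)
    for label, weight in zip(votes, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Two weak classifiers agree with each other; one strong classifier disagrees.
votes = ["radiograph", "radiograph", "ultrasound"]
weights = [0.20, 0.25, 0.90]  # hypothetical training-set F scores

simple_result = simple_vote(votes)
weighted_result = weighted_vote(votes, weights)
```

Here the simple vote follows the two weak classifiers, while the weighted vote follows the single strong one (0.90 against 0.45), which illustrates how weighting can change the ensemble's decision when individual classifiers differ widely in quality.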
A Novel Data-Driven Learning Method for Radar Target Detection in Nonstationary Environments
2016-05-01
Classifier ensembles for changing environments,” in Multiple Classifier Systems, vol. 3077, F. Roli, J. Kittler and T. Windeatt, Eds. New York, NY...Dec. 2006, pp. 1113–1118. [21] J. Z. Kolter and M. A. Maloof, “Dynamic weighted majority: An ensemble method for drifting concepts,” J. Mach. Learn...Trans. Neural Netw., vol. 22, no. 10, pp. 1517–1531, Oct. 2011. [23] R. Polikar, “ Ensemble learning,” in Ensemble Machine Learning: Methods and
Bennet, Jaison; Ganaprakasam, Chilambuchelvan Arul; Arputharaj, Kannan
2014-01-01
In the past, cancer classification by doctors and radiologists was based on morphological and clinical features and had limited diagnostic ability. The recent arrival of DNA microarray technology has enabled the concurrent monitoring of thousands of gene expressions on a single chip, which has stimulated progress in cancer classification. In this paper, we propose a hybrid approach for microarray data classification based on nearest neighbor (KNN), naive Bayes, and support vector machine (SVM). Feature selection prior to classification plays a vital role, and a feature selection technique which combines discrete wavelet transform (DWT) and moving window technique (MWT) is used. The performance of the proposed method is compared with conventional classifiers such as support vector machine, nearest neighbor, and naive Bayes. Experiments have been conducted on both real and benchmark datasets and the results indicate that the ensemble approach produces higher classification accuracy than conventional classifiers. The proposed method serves as an automated system for the classification of cancer and can be applied by doctors in real cases, which would benefit the medical community. This work further reduces the misclassification of cancers, which is unacceptable in cancer detection.
Jin, Mingwu; Deng, Weishu
2018-05-15
There is a spectrum of progression from healthy control (HC) to mild cognitive impairment (MCI) without conversion to Alzheimer's disease (AD), to MCI with conversion to AD (cMCI), and to AD. This study aims to predict the different disease stages using brain structural information provided by magnetic resonance imaging (MRI) data. Neighborhood component analysis (NCA) is applied to select the most powerful features for prediction. An ensemble decision tree classifier is built to predict which group the subject belongs to. The best features and model parameters are determined by cross validation of the training data. Our results show that 16 out of a total of 429 features were selected by NCA using 240 training subjects, including MMSE score and structural measures in memory-related regions. The boosting tree model with NCA features can achieve prediction accuracy of 56.25% on 160 test subjects. For comparison, principal component analysis (PCA) and sequential feature selection (SFS) are also used for feature selection, and support vector machine (SVM) is also used for classification. The boosting tree model with NCA features outperforms all other combinations of feature selection and classification methods. The results suggest that NCA is a better feature selection strategy than PCA and SFS for the data used in this study. The ensemble tree classifier with boosting is more powerful than SVM in predicting the subject group. However, more advanced feature selection and classification methods or additional measures besides structural MRI may be needed to improve the prediction performance. Copyright © 2018 Elsevier B.V. All rights reserved.
A Theoretical Analysis of Why Hybrid Ensembles Work
2017-01-01
Inspired by the group decision making process, ensembles or combinations of classifiers have been found favorable in a wide variety of application domains. Some researchers propose to use the mixture of two different types of classification algorithms to create a hybrid ensemble. Why does such an ensemble work? The question remains. Following the concept of diversity, which is one of the fundamental elements of the success of ensembles, we conduct a theoretical analysis of why hybrid ensembles work, connecting using different algorithms to accuracy gain. We also conduct experiments on classification performance of hybrid ensembles of classifiers created by decision tree and naïve Bayes classification algorithms, each of which is a top data mining algorithm and often used to create non-hybrid ensembles. Therefore, through this paper, we provide a complement to the theoretical foundation of creating and using hybrid ensembles. PMID:28255296
A Random Forest-based ensemble method for activity recognition.
Feng, Zengtao; Mo, Lingfei; Li, Meng
2015-01-01
This paper presents a multi-sensor ensemble approach to human physical activity (PA) recognition, using random forest. We designed an ensemble learning algorithm, which integrates several independent Random Forest classifiers based on different sensor feature sets to build a more stable, more accurate and faster classifier for human activity recognition. To evaluate the algorithm, PA data collected from the PAMAP (Physical Activity Monitoring for Aging People), which is a standard, publicly available database, was utilized to train and test. The experimental results show that the algorithm is able to correctly recognize 19 PA types with an accuracy of 93.44%, while the training is faster than others. The ensemble classifier system based on the RF (Random Forest) algorithm can achieve high recognition accuracy and fast calculation.
2017-01-01
In this paper, we propose a new automatic hyperparameter selection approach for determining the optimal network configuration (network structure and hyperparameters) for deep neural networks using particle swarm optimization (PSO) in combination with a steepest gradient descent algorithm. In the proposed approach, network configurations were coded as a set of real-number m-dimensional vectors as the individuals of the PSO algorithm in the search procedure. During the search procedure, the PSO algorithm is employed to search for optimal network configurations via the particles moving in a finite search space, and the steepest gradient descent algorithm is used to train the DNN classifier with a few training epochs (to find a local optimal solution) during the population evaluation of PSO. After the optimization scheme, the steepest gradient descent algorithm is performed with more epochs and the final solutions (pbest and gbest) of the PSO algorithm to train a final ensemble model and individual DNN classifiers, respectively. The local search ability of the steepest gradient descent algorithm and the global search capabilities of the PSO algorithm are exploited to determine an optimal solution that is close to the global optimum. We conducted several experiments on hand-written characters and biological activity prediction datasets to show that the DNN classifiers trained by the network configurations expressed by the final solutions of the PSO algorithm, employed to construct an ensemble model and individual classifier, outperform the random approach in terms of the generalization performance. Therefore, the proposed approach can be regarded as an alternative tool for automatic network structure and parameter selection for deep neural networks. PMID:29236718
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa
2018-07-01
Automatic text classification techniques are useful for classifying plaintext medical documents. This study aims to automatically predict the cause of death from free text forensic autopsy reports by comparing various schemes for feature extraction, term weighting or feature value representation, text classification, and feature reduction. For experiments, the autopsy reports belonging to eight different causes of death were collected, preprocessed and converted into 43 master feature vectors using various schemes for feature extraction, representation, and reduction. Six different text classification techniques were applied on these 43 master feature vectors to construct a classification model that can predict the cause of death. Finally, classification model performance was evaluated using four performance measures, i.e., overall accuracy, macro precision, macro-F-measure, and macro recall. From experiments, it was found that unigram features obtained the highest performance compared to bigram, trigram, and hybrid-gram features. Furthermore, in feature representation schemes, term frequency, and term frequency with inverse document frequency obtained similar and better results when compared with binary frequency, and normalized term frequency with inverse document frequency. Furthermore, the chi-square feature reduction approach outperformed Pearson correlation and information gain approaches. Finally, in text classification algorithms, the support vector machine classifier outperforms random forest, Naive Bayes, k-nearest neighbor, decision tree, and ensemble-voted classifier. Our results and comparisons hold practical importance and serve as references for future works. Moreover, the comparison outputs will act as state-of-the-art techniques to compare future proposals with existing automated text classification techniques. Copyright © 2017 Elsevier Ltd and Faculty of Forensic and Legal Medicine. All rights reserved.
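The four performance measures named in the abstract above can be illustrated with a short sketch. It assumes the common convention that each macro measure is the unweighted mean of the per-class values; the function name `macro_scores` and the toy labels are hypothetical and not taken from the study.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1).

    Assumption: macro measures are plain means of per-class values.
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example with two hypothetical classes "a" and "b".
acc, macro_p, macro_r, macro_f = macro_scores(
    ["a", "a", "b", "b"], ["a", "b", "b", "b"]
)
```

Because every class counts equally, a rare class influences the macro measures as much as a frequent one, which is one reason they are often reported alongside overall accuracy.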
A new approach to human microRNA target prediction using ensemble pruning and rotation forest.
Mousavi, Reza; Eftekhari, Mahdi; Haghighi, Mehdi Ghezelbash
2015-12-01
MicroRNAs (miRNAs) are small non-coding RNAs that have important functions in gene regulation. Since finding miRNA targets experimentally is costly and time-consuming, the use of machine learning methods is a growing research area for miRNA target prediction. In this paper, a new approach is proposed by using two popular ensemble strategies, i.e. Ensemble Pruning and Rotation Forest (EP-RTF), to predict human miRNA targets. For EP, the approach utilizes a Genetic Algorithm (GA). In other words, a subset of classifiers from the heterogeneous ensemble is first selected by GA. Next, the selected classifiers are trained based on the RTF method and then are combined using weighted majority voting. In addition to seeking a better subset of classifiers, the parameter of RTF is also optimized by GA. Findings of the present study confirm that the newly developed EP-RTF outperforms (in terms of classification accuracy, sensitivity, and specificity) the previously applied methods over four datasets in the field of human miRNA target prediction. Diversity-error diagrams reveal that the proposed ensemble approach constructs individual classifiers which are more accurate and usually more diverse than those of the other ensemble approaches. Given these experimental results, we highly recommend EP-RTF for improving the performance of miRNA target prediction.
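The weighted majority voting step mentioned in the abstract above can be sketched as follows. The abstract does not state how the weights are chosen, so the weights here are assumed to be supplied by the caller (for example, validation accuracies); the function name and labels are hypothetical.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Combine one predicted label per classifier using per-classifier weights.

    Assumption: a larger weight means a more trusted classifier.
    Ties are broken by the sorted order of the labels.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(sorted(totals), key=lambda label: totals[label])


# One strong classifier says "target", two weaker ones say "non-target".
decision_a = weighted_majority_vote(
    ["target", "non-target", "non-target"], [0.9, 0.3, 0.4]
)
decision_b = weighted_majority_vote(
    ["target", "non-target", "non-target"], [0.5, 0.3, 0.4]
)
```

In the first call the single strong vote (0.9) outweighs the two weaker ones (0.7 combined); in the second call it does not (0.5 against 0.7), so the decision flips.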
Ranking and combining multiple predictors without labeled data
Parisi, Fabio; Strino, Francesco; Nadler, Boaz; Kluger, Yuval
2014-01-01
In a broad range of classification and decision-making problems, one is given the advice or predictions of several classifiers, of unknown reliability, over multiple questions or queries. This scenario is different from the standard supervised setting, where each classifier’s accuracy can be assessed using available labeled data, and raises two questions: Given only the predictions of several classifiers over a large set of unlabeled test data, is it possible to (i) reliably rank them and (ii) construct a metaclassifier more accurate than most classifiers in the ensemble? Here we present a spectral approach to address these questions. First, assuming conditional independence between classifiers, we show that the off-diagonal entries of their covariance matrix correspond to a rank-one matrix. Moreover, the classifiers can be ranked using the leading eigenvector of this covariance matrix, because its entries are proportional to their balanced accuracies. Second, via a linear approximation to the maximum likelihood estimator, we derive the Spectral Meta-Learner (SML), an unsupervised ensemble classifier whose weights are equal to these eigenvector entries. On both simulated and real data, SML typically achieves a higher accuracy than most classifiers in the ensemble and can provide a better starting point than majority voting for estimating the maximum likelihood solution. Furthermore, SML is robust to the presence of small malicious groups of classifiers designed to veer the ensemble prediction away from the (unknown) ground truth. PMID:24474744
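The spectral idea in the abstract above can be sketched on simulated data. Predictions are coded as +1/-1, the sample covariance matrix is computed, and a vector `r` is fitted so that `r[i] * r[j]` matches the off-diagonal entries; `r` then plays the role of the leading eigenvector and supplies the voting weights. The coordinate-wise least-squares fit used here is an assumption chosen for brevity, not the estimation procedure of the paper, and all names and the simulated accuracies are hypothetical.

```python
import random


def covariance_matrix(preds):
    """preds[i][k] is classifier i's prediction (+1 or -1) on instance k."""
    m, n = len(preds), len(preds[0])
    means = [sum(row) / n for row in preds]
    return [[sum((preds[i][k] - means[i]) * (preds[j][k] - means[j])
                 for k in range(n)) / (n - 1)
             for j in range(m)] for i in range(m)]


def rank_one_weights(cov, iterations=100):
    """Fit cov[i][j] ~ r[i] * r[j] using off-diagonal entries only."""
    m = len(cov)
    r = [1.0] * m  # assumes most classifiers are better than chance
    for _ in range(iterations):
        for i in range(m):
            num = sum(cov[i][j] * r[j] for j in range(m) if j != i)
            den = sum(r[j] ** 2 for j in range(m) if j != i)
            if den > 0:
                r[i] = num / den  # exact least-squares update for r[i]
    return r


def sml_predict(preds, weights):
    """Unsupervised weighted vote: sign of the weighted sum of predictions."""
    n = len(preds[0])
    return [1 if sum(w * row[k] for w, row in zip(weights, preds)) >= 0 else -1
            for k in range(n)]


# Simulated, conditionally independent classifiers with hypothetical accuracies.
rng = random.Random(0)
truth = [rng.choice([-1, 1]) for _ in range(2000)]
accuracies = [0.9, 0.8, 0.7, 0.6, 0.55]
preds = [[y if rng.random() < a else -y for y in truth] for a in accuracies]

weights = rank_one_weights(covariance_matrix(preds))  # no labels used here
sml_accuracy = sum(p == y for p, y in zip(sml_predict(preds, weights), truth)) / len(truth)
```

The true labels are used only to simulate the classifiers and to score the result; the weights themselves are estimated from unlabeled predictions alone, which is the point of the approach.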
You, Zhu-Hong; Lei, Ying-Ke; Zhu, Lin; Xia, Junfeng; Wang, Bing
2013-01-01
Protein-protein interactions (PPIs) play crucial roles in the execution of various cellular processes and form the basis of biological mechanisms. Although a large amount of PPI data for different species has been generated by high-throughput experimental techniques, current PPI pairs obtained with experimental methods cover only a fraction of the complete PPI networks, and further, the experimental methods for identifying PPIs are both time-consuming and expensive. Hence, it is urgent and challenging to develop automated computational methods to efficiently and accurately predict PPIs. We present here a novel hierarchical PCA-EELM (principal component analysis-ensemble extreme learning machine) model to predict protein-protein interactions using only the information of protein sequences. In the proposed method, 11188 protein pairs retrieved from the DIP database were encoded into feature vectors by using four kinds of protein sequence information. Focusing on dimension reduction, an effective feature extraction method, PCA, was then employed to construct the most discriminative new feature set. Finally, multiple extreme learning machines were trained and then aggregated into a consensus classifier by majority voting. The ensembling of extreme learning machines removes the dependence of results on initial random weights and improves the prediction performance. When performed on the PPI data of Saccharomyces cerevisiae, the proposed method achieved 87.00% prediction accuracy with 86.15% sensitivity at the precision of 87.59%. Extensive experiments are performed to compare our method with the state-of-the-art Support Vector Machine (SVM) technique. Experimental results demonstrate that the proposed PCA-EELM outperforms the SVM method under 5-fold cross-validation. Besides, PCA-EELM runs faster than the PCA-SVM based method. Consequently, the proposed approach can be considered a promising and powerful new tool for predicting PPIs with excellent performance and less time.
NASA Astrophysics Data System (ADS)
Zheng, Jinde; Pan, Haiyang; Cheng, Junsheng
2017-02-01
To detect incipient rolling bearing failures in a timely manner and locate the fault accurately, a novel rolling bearing fault diagnosis method is proposed based on the composite multiscale fuzzy entropy (CMFE) and ensemble support vector machines (ESVMs). Fuzzy entropy (FuzzyEn), as an improvement of sample entropy (SampEn), is a new nonlinear method for measuring the complexity of time series. Since FuzzyEn (or SampEn) at a single scale cannot reflect the complexity effectively, multiscale fuzzy entropy (MFE) is developed by defining the FuzzyEns of coarse-grained time series, which represents the system dynamics at different scales. However, the MFE values will be affected by the data length, especially when the data are not long enough. By combining information of multiple coarse-grained time series at the same scale, the CMFE algorithm is proposed in this paper to enhance MFE, as well as FuzzyEn. Compared with MFE, as the scale factor increases, CMFE obtains much more stable and consistent values for a short-term time series. In this paper CMFE is employed to measure the complexity of vibration signals of rolling bearings and is applied to extract the nonlinear features hidden in the vibration signals. The physical meaning of CMFE being suitable for rolling bearing fault diagnosis is also explored. Based on these, to achieve automatic fault diagnosis, the ensemble SVM-based multi-classifier is constructed for the intelligent classification of fault features. Finally, the proposed rolling bearing fault diagnosis method is applied to experimental data analysis and the results indicate that the proposed method could effectively distinguish different fault categories and severities of rolling bearings.
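The coarse-graining step that underlies multiscale entropy measures can be sketched as below. The sketch assumes the usual construction in which the series is averaged over non-overlapping windows of length `scale`, and in which the composite variant builds one coarse-grained series per starting offset at the same scale. The entropy computation itself is omitted, and the function names are hypothetical.

```python
def coarse_grain(series, scale, offset=0):
    """Average non-overlapping windows of length `scale`, starting at `offset`."""
    n_windows = (len(series) - offset) // scale
    return [
        sum(series[offset + j * scale: offset + (j + 1) * scale]) / scale
        for j in range(n_windows)
    ]


def composite_coarse_grain(series, scale):
    """Return the `scale` coarse-grained series obtained from offsets 0..scale-1.

    Assumption: a composite entropy would be formed by computing an entropy
    on each returned series and combining the results (e.g. by averaging).
    """
    return [coarse_grain(series, scale, offset) for offset in range(scale)]


signal = [1, 2, 3, 4, 5, 6]  # stand-in for a vibration signal
grained = composite_coarse_grain(signal, 2)
# grained[0] uses windows (1,2), (3,4), (5,6); grained[1] uses (2,3), (4,5)
```

Using every offset at a given scale draws on more of the available samples than a single coarse-grained series, which is consistent with the abstract's remark that the composite measure is more stable for short series.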
Zou, Lingyun; Nan, Chonghan; Hu, Fuquan
2013-12-15
Various human pathogens secrete effector proteins into host cells via the type IV secretion system (T4SS). These proteins play important roles in the interaction between bacteria and hosts. Computational methods for T4SS effector prediction have been developed for screening experimental targets in several isolated bacterial species; however, widely applicable prediction approaches are still unavailable. In this work, four types of distinctive features, namely, amino acid composition, dipeptide composition, position-specific scoring matrix composition and auto covariance transformation of position-specific scoring matrix, were calculated from primary sequences. A classifier, T4EffPred, was developed using the support vector machine with these features and their different combinations for effector prediction. Various theoretical tests were performed in a newly established dataset, and the results were measured with four indexes. We demonstrated that T4EffPred can discriminate IVA and IVB effectors in benchmark datasets with positive rates of 76.7% and 89.7%, respectively. The overall accuracy of 95.9% shows that the present method is accurate for distinguishing the T4SS effector in unidentified sequences. A classifier ensemble was designed to synthesize all single classifiers. Notable performance improvement was observed using this ensemble system in benchmark tests. To demonstrate the model's application, a genome-scale prediction of effectors was performed in Bartonella henselae, an important zoonotic pathogen. A number of putative candidates were distinguished. A web server implementing the prediction method and the source code are both available at http://bioinfo.tmmu.edu.cn/T4EffPred.
Glyph-based analysis of multimodal directional distributions in vector field ensembles
NASA Astrophysics Data System (ADS)
Jarema, Mihaela; Demir, Ismail; Kehrer, Johannes; Westermann, Rüdiger
2015-04-01
Ensemble simulations are increasingly often performed in the geosciences in order to study the uncertainty and variability of model predictions. Describing ensemble data by mean and standard deviation can be misleading in case of multimodal distributions. We present first results of a glyph-based visualization of multimodal directional distributions in 2D and 3D vector ensemble data. Directional information on the circle/sphere is modeled using mixtures of probability density functions (pdfs), which enables us to characterize the distributions with relatively few parameters. The resulting mixture models are represented by 2D and 3D lobular glyphs showing direction, spread and strength of each principal mode of the distributions. A 3D extension of our approach is realized by means of an efficient GPU rendering technique. We demonstrate our method in the context of ensemble weather simulations.
Cai, Suxian; Yang, Shanshan; Zheng, Fang; Lu, Meng; Wu, Yunfeng; Krishnan, Sridhar
2013-01-01
Analysis of knee joint vibration (VAG) signals can provide quantitative indices for detection of knee joint pathology at an early stage. In addition to the statistical features developed in the related previous studies, we extracted two separable features, that is, the number of atoms derived from the wavelet matching pursuit decomposition and the number of significant signal turns detected with the fixed threshold in the time domain. To perform a better classification over the data set of 89 VAG signals, we applied a novel classifier fusion system based on the dynamic weighted fusion (DWF) method to ameliorate the classification performance. For comparison, a single least-squares support vector machine (LS-SVM) and the Bagging ensemble were used for the classification task as well. The results in terms of overall accuracy in percentage and area under the receiver operating characteristic curve obtained with the DWF-based classifier fusion method reached 88.76% and 0.9515, respectively, which demonstrated the effectiveness and superiority of the DWF method with two distinct features for the VAG signal analysis. PMID:23573175
Using ensemble of classifiers for predicting HIV protease cleavage sites in proteins.
Nanni, Loris; Lumini, Alessandra
2009-03-01
The focus of this work is the use of ensembles of classifiers for predicting HIV protease cleavage sites in proteins. Due to the complex relationships in the biological data, several recent works show that often ensembles of learning algorithms outperform stand-alone methods. We show that the fusion of approaches based on different encoding models can be useful for improving the performance of this classification problem. In particular, in this work four different feature encodings for peptides are described and tested. An extensive evaluation on a large dataset according to a blind testing protocol is reported which demonstrates how different feature extraction methods and classifiers can be combined for obtaining a robust and reliable system. The comparison with other stand-alone approaches allows quantifying the performance improvement obtained by the ensembles proposed in this work.
Large Unbalanced Credit Scoring Using Lasso-Logistic Regression Ensemble
Wang, Hong; Xu, Qingsong; Zhou, Lifeng
2015-01-01
Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data. PMID:25706988
Feature selection and classification of multiparametric medical images using bagging and SVM
NASA Astrophysics Data System (ADS)
Fan, Yong; Resnick, Susan M.; Davatzikos, Christos
2008-03-01
This paper presents a framework for brain classification based on multi-parametric medical images. This method takes advantage of multi-parametric imaging to provide a set of discriminative features for classifier construction by using a regional feature extraction method which takes into account joint correlations among different image parameters; in the experiments herein, MRI and PET images of the brain are used. Support vector machine classifiers are then trained based on the most discriminative features selected from the feature set. To facilitate robust classification and optimal selection of parameters involved in classification, in view of the well-known "curse of dimensionality", base classifiers are constructed in a bagging (bootstrap aggregating) framework for building an ensemble classifier and the classification parameters of these base classifiers are optimized by means of maximizing the area under the ROC (receiver operating characteristic) curve estimated from their prediction performance on left-out samples of bootstrap sampling. This classification system is tested on a sex classification problem, where it yields over 90% classification rates for unseen subjects. The proposed classification method is also compared with other commonly used classification algorithms, with favorable results. These results illustrate that the methods built upon information jointly extracted from multi-parametric images have the potential to perform individual classification with high sensitivity and specificity.
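The evaluation idea in the abstract above, scoring a base classifier on the samples left out of its bootstrap draw and summarising that performance by the area under the ROC curve, can be sketched in plain Python. The sketch covers only the sampling and the AUC computation; no classifier is trained, and the function names and toy scores are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered
    (positive, negative) pairs; ties count one half. Labels are 1 or 0."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


in_bag, out_of_bag = bootstrap_split(20, random.Random(1))
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In a bagging framework each base classifier would be fitted on its `in_bag` indices and scored on its `out_of_bag` indices, so that candidate parameter values can be compared without setting aside a separate validation set.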
A Machine Learning Ensemble Classifier for Early Prediction of Diabetic Retinopathy.
S K, Somasundaram; P, Alli
2017-11-09
Diabetic retinopathy (DR), a retinal vascular disease, is the main complication of diabetes and can lead to blindness. Regular screening for early DR detection is a labor- and resource-intensive task, so automatic detection of DR using computational techniques is an attractive solution. An automatic method is more reliable for determining the presence of an abnormality in fundus images (FI), but the classification process is often performed poorly. Recently, a few research works have been designed for analyzing texture discrimination capacity in FI to distinguish healthy images. However, the feature extraction (FE) process was not performed well, due to the high dimensionality. Therefore, to identify retinal features for DR diagnosis and early detection, a machine learning and ensemble classification method called Machine Learning Bagging Ensemble Classifier (ML-BEC) is designed. The ML-BEC method comprises two stages. The first stage comprises extraction of the candidate objects from retinal images (RI). The candidate objects, or features, for DR diagnosis include blood vessels, optic nerve, neural tissue, neuroretinal rim, optic disc size, thickness and variance. These features are initially extracted by applying a machine learning technique called t-distributed Stochastic Neighbor Embedding (t-SNE). t-SNE generates a probability distribution across high-dimensional images in which the images are separated into similar and dissimilar pairs. Then, t-SNE describes a similar probability distribution across the points in the low-dimensional map. This reduces the Kullback-Leibler divergence between the two distributions with respect to the locations of the points on the map. The second stage comprises the application of ensemble classifiers to the extracted features to provide accurate analysis of digital FI using machine learning. In this stage, an automatic DR screening system using a Bagging Ensemble Classifier (BEC) is investigated. Through the voting process in ML-BEC, bagging minimizes the error due to variance of the base classifier. Using publicly available retinal image databases, our classifier is trained with 25% of the RI. Results show that the ensemble classifier can achieve better classification accuracy (CA) than single classification models. Empirical experiments suggest that the machine learning-based ensemble classifier is efficient for further reducing DR classification time (CT).
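For reference, the Kullback-Leibler objective mentioned in the abstract above takes the following form in the standard t-SNE formulation. Here $p_{ij}$ denotes the pairwise similarity of items $i$ and $j$ in the high-dimensional space and $q_{ij}$ the corresponding similarity in the low-dimensional map; these symbols follow the usual t-SNE notation and are not taken from the abstract itself.

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The map points are moved, typically by gradient descent, so that $C$ decreases, which is what brings the low-dimensional distribution $Q$ closer to $P$.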
Farhan, Saima; Fahiem, Muhammad Abuzar; Tauseef, Huma
2014-01-01
Structural brain imaging is playing a vital role in identification of changes that occur in brain associated with Alzheimer's disease. This paper proposes an automated image processing based approach for the identification of AD from MRI of the brain. The proposed approach is novel in a sense that it has higher specificity/accuracy values despite the use of smaller feature set as compared to existing approaches. Moreover, the proposed approach is capable of identifying AD patients in early stages. The dataset selected consists of 85 age and gender matched individuals from OASIS database. The features selected are volume of GM, WM, and CSF and size of hippocampus. Three different classification models (SVM, MLP, and J48) are used for identification of patients and controls. In addition, an ensemble of classifiers, based on majority voting, is adopted to overcome the error caused by an independent base classifier. Ten-fold cross validation strategy is applied for the evaluation of our scheme. Moreover, to evaluate the performance of proposed approach, individual features and combination of features are fed to individual classifiers and ensemble based classifier. Using size of left hippocampus as feature, the accuracy achieved with ensemble of classifiers is 93.75%, with 100% specificity and 87.5% sensitivity.
Classification of large-sized hyperspectral imagery using fast machine learning algorithms
NASA Astrophysics Data System (ADS)
Xia, Junshi; Yokoya, Naoto; Iwasaki, Akira
2017-07-01
We present a framework of fast machine learning algorithms in the context of large-sized hyperspectral images classification from the theoretical to a practical viewpoint. In particular, we assess the performance of random forest (RF), rotation forest (RoF), and extreme learning machine (ELM) and the ensembles of RF and ELM. These classifiers are applied to two large-sized hyperspectral images and compared to the support vector machines. To give the quantitative analysis, we pay attention to comparing these methods when working with high input dimensions and a limited/sufficient training set. Moreover, other important issues such as the computational cost and robustness against the noise are also discussed.
Online breakage detection of multitooth tools using classifier ensembles for imbalanced data
NASA Astrophysics Data System (ADS)
Bustillo, Andrés; Rodríguez, Juan J.
2014-12-01
Cutting tool breakage detection is an important task, due to its economic impact on mass production lines in the automobile industry. This task presents a central limitation: real data-sets are extremely imbalanced because breakage occurs in very few cases compared with normal operation of the cutting process. In this paper, we present an analysis of different data-mining techniques applied to the detection of insert breakage in multitooth tools. The analysis applies only one experimental variable: the electrical power consumption of the tool drive. This restriction profiles real industrial conditions more accurately than other physical variables, such as acoustic or vibration signals, which are not so easily measured. Many efforts have been made to design a method that is able to identify breakages with a high degree of reliability within a short period of time. The solution is based on classifier ensembles for imbalanced data-sets. Classifier ensembles are combinations of classifiers, which in many situations are more accurate than individual classifiers. Six different base classifiers are tested: Decision Trees, Rules, Naïve Bayes, Nearest Neighbour, Multilayer Perceptrons and Logistic Regression. Three different balancing strategies are tested with each of the classifier ensembles and compared to their performance with the original data-set: Synthetic Minority Over-Sampling Technique (SMOTE), undersampling and a combination of SMOTE and undersampling. To identify the most suitable data-mining solution, Receiver Operating Characteristics (ROC) graph and Recall-precision graph are generated and discussed. The performance of logistic regression ensembles on the balanced data-set using the combination of SMOTE and undersampling turned out to be the most suitable technique. 
Finally, a comparison using industrial performance measures is presented, which concludes that this technique is also better suited to this industrial problem than the other techniques presented in the bibliography.
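The two balancing strategies named in the abstract above can be sketched in plain Python. The SMOTE part follows the generic idea of the technique: pick a minority sample, pick one of its k nearest minority neighbours, and interpolate between the two. It is a simplified illustration under those assumptions, not the configuration used in the study, and all names and data values are hypothetical.

```python
import random


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            others, key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m))
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class, without replacement."""
    return rng.sample(majority, n_keep)


rng = random.Random(0)
breakage = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # hypothetical minority samples
normal = [[float(i), float(i % 7)] for i in range(100)]  # hypothetical majority
new_breakage = smote(breakage, 5, 2, rng)
kept_normal = random_undersample(normal, 10, rng)
```

Combining the two, as in the best-performing configuration reported above, means growing the minority class with synthetic samples while shrinking the majority class, so neither step has to carry the whole rebalancing on its own.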
Simultaneous-Fault Diagnosis of Gearboxes Using Probabilistic Committee Machine
Zhong, Jian-Hua; Wong, Pak Kin; Yang, Zhi-Xin
2016-01-01
This study combines signal de-noising, feature extraction, two pairwise-coupled relevance vector machines (PCRVMs) and particle swarm optimization (PSO) for parameter optimization to form an intelligent diagnostic framework for gearbox fault detection. Firstly, the noises of sensor signals are de-noised by using the wavelet threshold method to lower the noise level. Then, the Hilbert-Huang transform (HHT) and energy pattern calculation are applied to extract the fault features from de-noised signals. After that, an eleven-dimension vector, which consists of the energies of nine intrinsic mode functions (IMFs), maximum value of HHT marginal spectrum and its corresponding frequency component, is obtained to represent the features of each gearbox fault. The two PCRVMs serve as two different fault detection committee members, and they are trained by using vibration and sound signals, respectively. The individual diagnostic result from each committee member is then combined by applying a new probabilistic ensemble method, which can improve the overall diagnostic accuracy and increase the number of detectable faults as compared to individual classifiers acting alone. The effectiveness of the proposed framework is experimentally verified by using test cases. The experimental results show the proposed framework is superior to existing single classifiers in terms of diagnostic accuracies for both single- and simultaneous-faults in the gearbox. PMID:26848665
Caetano dos Santos, Florentino Luciano; Skottman, Heli; Juuti-Uusitalo, Kati; Hyttinen, Jari
2016-01-01
Aims A fast, non-invasive and observer-independent method to analyze the homogeneity and maturity of human pluripotent stem cell (hPSC) derived retinal pigment epithelial (RPE) cells is warranted to assess the suitability of hPSC-RPE cells for implantation or in vitro use. The aim of this work was to develop and validate methods to create ensembles of state-of-the-art texture descriptors and to provide a robust classification tool to separate three different maturation stages of RPE cells by using phase contrast microscopy images. The same methods were also validated on a wide variety of biological image classification problems, such as histological or virus image classification. Methods For image classification we used different texture descriptors, descriptor ensembles and preprocessing techniques. Also, three new methods were tested. The first approach was an ensemble of preprocessing methods, to create an additional set of images. The second was the region-based approach, where saliency detection and wavelet decomposition divide each image in two different regions, from which features were extracted through different descriptors. The third method was an ensemble of Binarized Statistical Image Features, based on different sizes and thresholds. A Support Vector Machine (SVM) was trained for each descriptor histogram and the set of SVMs combined by sum rule. The accuracy of the computer vision tool was verified in classifying the hPSC-RPE cell maturation level. Dataset and Results The RPE dataset contains 1862 subwindows from 195 phase contrast images. The final descriptor ensemble outperformed the most recent stand-alone texture descriptors, obtaining, for the RPE dataset, an area under ROC curve (AUC) of 86.49% with the 10-fold cross validation and 91.98% with the leave-one-image-out protocol. The generality of the three proposed approaches was ascertained with 10 more biological image datasets, obtaining an average AUC greater than 97%. 
Conclusions Here we showed that the developed ensembles of texture descriptors are able to classify the RPE cell maturation stage. Moreover, we proved that preprocessing and region-based decomposition improve many descriptors’ accuracy in biological dataset classification. Finally, we built the first public dataset of stem cell-derived RPE cells, which is available to the scientific community for classification studies. The proposed tool is available at https://www.dei.unipd.it/node/2357 and the RPE dataset at http://www.biomeditech.fi/data/RPE_dataset/. Both are available at https://figshare.com/s/d6fb591f1beb4f8efa6f. PMID:26895509
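The sum rule named in the Methods can be illustrated with a minimal Python sketch. It assumes each SVM returns one score per class (for example a calibrated probability); the function name `sum_rule` and the score values are hypothetical and are not taken from the paper.

```python
def sum_rule(score_lists):
    """Fuse classifiers by summing their per-class scores.

    score_lists: one list of class scores per classifier (assumed layout).
    Returns the index of the winning class and the summed scores.
    """
    n_classes = len(score_lists[0])
    totals = [sum(scores[c] for scores in score_lists) for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: totals[c])
    return best, totals


# Hypothetical scores from three descriptor-specific SVMs for one image,
# over three maturation stages.
scores = [
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.3, 0.5, 0.2],
]
label, totals = sum_rule(scores)
print(label, totals)
```

The second classifier alone would pick class 2, but the summed scores favour class 1; this is the sense in which the fused decision can differ from any single descriptor.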
Recognition and classification of colon cells applying the ensemble of classifiers.
Kruk, M; Osowski, S; Koktysz, R
2009-02-01
The paper presents the application of an ensemble of classifiers for the recognition of colon cells on the basis of the microscope colon image. The solved tasks include: segmentation of the individual cells from the image using morphological operations; the preprocessing stages leading to the extraction of features; selection of the most important features; and the classification stage, which applies classifiers arranged in the form of an ensemble. The paper presents and discusses the results concerning the recognition of the four most important colon cell types: eosinophilic granulocyte, neutrophilic granulocyte, lymphocyte and plasmocyte. The proposed system is able to recognize the cells with an accuracy comparable to that of a human expert (around 5% discrepancy between the two results).
The wisdom of the commons: ensemble tree classifiers for prostate cancer prognosis.
Koziol, James A; Feng, Anne C; Jia, Zhenyu; Wang, Yipeng; Goodison, Seven; McClelland, Michael; Mercola, Dan
2009-01-01
Classification and regression trees have long been used for cancer diagnosis and prognosis. Nevertheless, instability and variable selection bias, as well as overfitting, are well-known problems of tree-based methods. In this article, we investigate whether ensemble tree classifiers can ameliorate these difficulties, using data from two recent studies of radical prostatectomy in prostate cancer. Using time to progression following prostatectomy as the relevant clinical endpoint, we found that ensemble tree classifiers robustly and reproducibly identified three subgroups of patients in the two clinical datasets: non-progressors, early progressors and late progressors. Moreover, the consensus classifications were independent predictors of time to progression compared to known clinical prognostic factors.
The role of model dynamics in ensemble Kalman filter performance for chaotic systems
Ng, G.-H.C.; McLaughlin, D.; Entekhabi, D.; Ahanin, A.
2011-01-01
The ensemble Kalman filter (EnKF) is susceptible to losing track of observations, or 'diverging', when applied to large chaotic systems such as atmospheric and ocean models. Past studies have demonstrated the adverse impact of sampling error during the filter's update step. We examine how system dynamics affect EnKF performance, and whether the absence of certain dynamic features in the ensemble may lead to divergence. The EnKF is applied to a simple chaotic model, and ensembles are checked against singular vectors of the tangent linear model, corresponding to short-term growth and Lyapunov vectors, corresponding to long-term growth. Results show that the ensemble strongly aligns itself with the subspace spanned by unstable Lyapunov vectors. Furthermore, the filter avoids divergence only if the full linearized long-term unstable subspace is spanned. However, short-term dynamics also become important as non-linearity in the system increases. Non-linear movement prevents errors in the long-term stable subspace from decaying indefinitely. If these errors then undergo linear intermittent growth, a small ensemble may fail to properly represent all important modes, causing filter divergence. A combination of long and short-term growth dynamics are thus critical to EnKF performance. These findings can help in developing practical robust filters based on model dynamics. © 2011 The Authors Tellus A © 2011 John Wiley & Sons A/S.
Embedded feature ranking for ensemble MLP classifiers.
Windeatt, Terry; Duangsoithong, Rakkrit; Smith, Raymond
2011-06-01
A feature ranking scheme for multilayer perceptron (MLP) ensembles is proposed, along with a stopping criterion based upon the out-of-bootstrap estimate. To solve multi-class problems feature ranking is combined with modified error-correcting output coding. Experimental results on benchmark data demonstrate the versatility of the MLP base classifier in removing irrelevant features.
NASA Astrophysics Data System (ADS)
Durner, Maximilian; Márton, Zoltán.; Hillenbrand, Ulrich; Ali, Haider; Kleinsteuber, Martin
2017-03-01
In this work, a new ensemble method for the task of category recognition in different environments is presented. The focus is on service robotic perception in an open environment, where the robot's task is to recognize previously unseen objects of predefined categories, based on training on a public dataset. We propose an ensemble learning approach to be able to flexibly combine complementary sources of information (different state-of-the-art descriptors computed on color and depth images), based on a Markov Random Field (MRF). By exploiting its specific characteristics, the MRF ensemble method can also be executed as a Dynamic Classifier Selection (DCS) system. In the experiments, the committee- and topology-dependent performance boost of our ensemble is shown. Despite reduced computational costs and using less information, our strategy performs on the same level as common ensemble approaches. Finally, the impact of large differences between datasets is analyzed.
An ensemble predictive modeling framework for breast cancer classification.
Nagarajan, Radhakrishnan; Upreti, Meenakshi
2017-12-01
Molecular changes often precede clinical presentation of diseases and can be useful surrogates with potential to assist in informed clinical decision making. Recent studies have demonstrated the usefulness of modeling approaches such as classification that can predict the clinical outcomes from molecular expression profiles. While useful, a majority of these approaches implicitly use all molecular markers as features in the classification process often resulting in sparse high-dimensional projection of the samples often comparable to that of the sample size. In this study, a variant of the recently proposed ensemble classification approach is used for predicting good and poor-prognosis breast cancer samples from their molecular expression profiles. In contrast to traditional single and ensemble classifiers, the proposed approach uses multiple base classifiers with varying feature sets obtained from two-dimensional projection of the samples in conjunction with a majority voting strategy for predicting the class labels. In contrast to our earlier implementation, base classifiers in the ensembles are chosen based on maximal sensitivity and minimal redundancy by choosing only those with low average cosine distance. The resulting ensemble sets are subsequently modeled as undirected graphs. Performance of four different classification algorithms is shown to be better within the proposed ensemble framework in contrast to using them as traditional single classifier systems. Significance of a subset of genes with high-degree centrality in the network abstractions across the poor-prognosis samples is also discussed. Copyright © 2017 Elsevier Inc. All rights reserved.
Xia, Jiaqi; Peng, Zhenling; Qi, Dawei; Mu, Hongbo; Yang, Jianyi
2017-03-15
Protein fold classification is a critical step in protein structure prediction. There are two possible ways to classify protein folds. One is through template-based fold assignment and the other is ab-initio prediction using machine learning algorithms. Combining both solutions to improve the prediction accuracy has not been explored before. We developed two algorithms, HH-fold and SVM-fold, for protein fold classification. HH-fold is a template-based fold assignment algorithm using the HHsearch program. SVM-fold is a support vector machine-based ab-initio classification algorithm, in which a comprehensive set of features is extracted from three complementary sequence profiles. These two algorithms are then combined, resulting in the ensemble approach TA-fold. We performed a comprehensive assessment of the proposed methods by comparing them with ab-initio methods and template-based threading methods on six benchmark datasets. An accuracy of 0.799 was achieved by TA-fold on the DD dataset, which consists of proteins from 27 folds. This represents an improvement of 5.4-11.7% over ab-initio methods. After updating this dataset to include more proteins in the same folds, the accuracy increased to 0.971. In addition, TA-fold achieved >0.9 accuracy on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA-fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA-fold is attributed to the combination of template-based fold assignment and ab-initio classification using features from complementary sequence profiles that contain rich evolution information. http://yanglab.nankai.edu.cn/TA-fold/. yangjy@nankai.edu.cn or mhb-506@163.com. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Selecting a Classification Ensemble and Detecting Process Drift in an Evolving Data Stream
DOE Office of Scientific and Technical Information (OSTI.GOV)
Heredia-Langner, Alejandro; Rodriguez, Luke R.; Lin, Andy
2015-09-30
We characterize the commercial behavior of a group of companies in a common line of business using a small ensemble of classifiers on a stream of records containing commercial activity information. This approach is able to effectively find a subset of classifiers that can be used to predict company labels with reasonable accuracy. Performance of the ensemble, its error rate under stable conditions, can be characterized using an exponentially weighted moving average (EWMA) statistic. The behavior of the EWMA statistic can be used to monitor a record stream from the commercial network and determine when significant changes have occurred. Results indicate that larger classification ensembles may not necessarily be optimal, pointing to the need to search the combinatorial classifier space in a systematic way. Results also show that current and past performance of an ensemble can be used to detect when statistically significant changes in the activity of the network have occurred. The dataset used in this work contains tens of thousands of high level commercial activity records with continuous and categorical variables and hundreds of labels, making classification challenging.
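As a rough illustration of how an EWMA statistic can flag drift in an ensemble's error rate, the following Python sketch applies the standard recursion z_t = lam * x_t + (1 - lam) * z_(t-1). The smoothing weight `lam`, the `baseline`, the fixed alarm `limit` and the helper names are all assumptions made for this example; the report's actual control limits are not given in the abstract.

```python
def ewma(values, lam=0.2, start=None):
    """Exponentially weighted moving average of a sequence.

    lam is the smoothing weight (assumed value); start is the initial
    EWMA level, defaulting to the first observation.
    """
    z = values[0] if start is None else start
    smoothed = []
    for x in values:
        z = lam * x + (1 - lam) * z
        smoothed.append(z)
    return smoothed


def first_alarm(error_rates, baseline, limit, lam=0.2):
    """Index of the first batch whose EWMA deviates from baseline by more
    than limit (a hypothetical fixed threshold), or None if none does."""
    for i, z in enumerate(ewma(error_rates, lam, start=baseline)):
        if abs(z - baseline) > limit:
            return i
    return None


# Hypothetical per-batch error rates: stable at 10%, then jumping to 30%.
rates = [0.10] * 5 + [0.30] * 5
print(first_alarm(rates, baseline=0.10, limit=0.05))
```

Because the EWMA smooths the stream, the alarm is raised one batch after the jump rather than at the jump itself; a larger `lam` reacts faster but is more sensitive to noise.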
Designing boosting ensemble of relational fuzzy systems.
Scherer, Rafał
2010-10-01
A method frequently used in classification systems for improving classification accuracy is to combine outputs of several classifiers. Among various types of classifiers, fuzzy ones are tempting because they use intelligible fuzzy if-then rules. In the paper we build an AdaBoost ensemble of relational neuro-fuzzy classifiers. Relational fuzzy systems bond input and output fuzzy linguistic values by a binary relation; thus, compared to traditional fuzzy systems, fuzzy rules have additional weights, namely the elements of a fuzzy relation matrix. Thanks to this, the system adjusts better to data during learning. In the paper an ensemble of relational fuzzy systems is proposed. The problem is that such an ensemble contains separate rule bases which cannot be directly merged. As systems are separate, we cannot treat fuzzy rules coming from different systems as rules from the same (single) system. In the paper, the problem is addressed by a novel design of fuzzy systems constituting the ensemble, resulting in normalization of individual rule bases during learning. The method described in the paper is tested on several known benchmarks and compared with other machine learning solutions from the literature.
Mitosis detection using generic features and an ensemble of cascade adaboosts.
Tek, F Boray
2013-01-01
Mitosis count is one of the factors that pathologists use to assess the risk of metastasis and the survival of patients affected by breast cancer. We investigate an application of a set of generic features and an ensemble of cascade adaboosts to automated mitosis detection. Calculation of the features relies minimally on object-level descriptions and thus requires minimal segmentation. The proposed work was developed and tested on International Conference on Pattern Recognition (ICPR) 2012 mitosis detection contest data. We plotted receiver operating characteristics curves of true positive versus false positive rates, and calculated recall, precision, F-measure, and region overlap ratio measures. We tested our features with two different classifier configurations: 1) an ensemble of single adaboosts, 2) an ensemble of cascade adaboosts. On the ICPR 2012 mitosis detection contest evaluation, the cascade ensemble scored 54, 62.7, and 58, whereas the non-cascade version scored 68, 28.1, and 39.7 for the recall, precision, and F-measure measures, respectively. The features most used in the adaboost classifier rules were a shape-based feature, which counted granularity, and a color-based feature, which relied on Red, Green, and Blue channel statistics. The features that express the granular structure and color variations are found useful for mitosis detection. The ensemble of adaboosts performs better than the individual adaboost classifiers. Moreover, the ensemble of cascaded adaboosts was better than the ensemble of single adaboosts for mitosis detection.
Pairwise diversity ranking of polychotomous features for ensemble physiological signal classifiers.
Gupta, Lalit; Kota, Srinivas; Molfese, Dennis L; Vaidyanathan, Ravi
2013-06-01
It is well known that fusion classifiers for physiological signal classification with diverse components (classifiers or data sets) outperform those with less diverse components. Determining component diversity, therefore, is of the utmost importance in the design of fusion classifiers that are often employed in clinical diagnostic and numerous other pattern recognition problems. In this article, a new pairwise diversity-based ranking strategy is introduced to select a subset of ensemble components, which when combined will be more diverse than any other component subset of the same size. The strategy is unified in the sense that the components can be classifiers or data sets. Moreover, the classifiers and data sets can be polychotomous. Classifier-fusion and data-fusion systems are formulated based on the diversity-based selection strategy, and the application of the two fusion strategies is demonstrated through the classification of multichannel event-related potentials. It is observed that for both classifier and data fusion, the classification accuracy tends to increase/decrease when the diversity of the component ensemble increases/decreases. For the four sets of 14-channel event-related potentials considered, it is shown that data fusion outperforms classifier fusion. Furthermore, it is demonstrated that the combination of data components that yields the best performance, in a relative sense, can be determined through the diversity-based selection strategy.
Vision-based posture recognition using an ensemble classifier and a vote filter
NASA Astrophysics Data System (ADS)
Ji, Peng; Wu, Changcheng; Xu, Xiaonong; Song, Aiguo; Li, Huijun
2016-10-01
Posture recognition is a very important means of Human-Robot Interaction (HRI). To segment the effective posture from an image, we propose an improved region grow algorithm combined with the Single Gauss Color Model. The experiment shows that the improved region grow algorithm obtains a more complete and accurate posture than the traditional Single Gauss Model and region grow algorithm, and that it eliminates similar regions from the background at the same time. In the posture recognition part, we propose a CNN ensemble classifier to improve the recognition rate, and a vote filter, applied to the sequence of recognition results, to reduce misjudgments during continuous gesture control. The proposed CNN ensemble classifier yields a 96.27% recognition rate, which is better than that of a single CNN classifier, and the proposed vote filter improves the recognition result and reduces misjudgments during consecutive gesture switches.
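The abstract does not describe how the vote filter works internally. One plausible reading, sketched below in Python, is a sliding-window majority vote over the most recent recognition results; the window length and the function name `vote_filter` are assumptions for illustration only and may differ from the authors' design.

```python
from collections import Counter


def vote_filter(labels, window=3):
    """Smooth a sequence of per-frame labels with a sliding majority vote.

    For each frame, the output is the most frequent label among the last
    `window` raw results (fewer at the start of the sequence). Ties go to
    the label seen first inside the window.
    """
    filtered = []
    for i in range(len(labels)):
        recent = labels[max(0, i - window + 1): i + 1]
        filtered.append(Counter(recent).most_common(1)[0][0])
    return filtered


# A single misrecognised frame ("B") inside a run of "A" is suppressed.
print(vote_filter(["A", "A", "A", "B", "A", "A"]))
# A genuine switch from "A" to "B" is passed through with a short delay.
print(vote_filter(["A", "A", "B", "B", "B"]))
```

The trade-off in this kind of filter is latency: a longer window removes more isolated errors but delays the response to a real gesture change.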
New machine-learning algorithms for prediction of Parkinson's disease
NASA Astrophysics Data System (ADS)
Mandal, Indrajit; Sairam, N.
2014-03-01
This article presents an enhanced prediction accuracy of diagnosis of Parkinson's disease (PD) to prevent the delay and misdiagnosis of patients using the proposed robust inference system. New machine-learning methods are proposed and performance comparisons are based on specificity, sensitivity, accuracy and other measurable parameters. The robust methods of treating Parkinson's disease (PD) include sparse multinomial logistic regression, rotation forest ensemble with support vector machines and principal components analysis, artificial neural networks, and boosting methods. A new ensemble method comprising the Bayesian network optimised by Tabu search algorithm as classifier and Haar wavelets as projection filter is used for relevant feature selection and ranking. The highest accuracy obtained by linear logistic regression and sparse multinomial logistic regression is 100%, with sensitivity and specificity of 0.983 and 0.996, respectively. All the experiments are conducted over 95% and 99% confidence levels and establish the results with corrected t-tests. This work shows a high degree of advancement in software reliability and quality of the computer-aided diagnosis system and experimentally shows best results with supportive statistical inference.
2017-01-01
Decoding neural activities related to voluntary and involuntary movements is fundamental to understanding human brain motor circuits and neuromotor disorders and can lead to the development of neuromotor prosthetic devices for neurorehabilitation. This study explores using recorded deep brain local field potentials (LFPs) for robust movement decoding of Parkinson's disease (PD) and Dystonia patients. The LFP data from voluntary movement activities such as left and right hand index finger clicking were recorded from patients who underwent surgeries for implantation of deep brain stimulation electrodes. Movement-related LFP signal features were extracted by computing instantaneous power related to motor response in different neural frequency bands. An innovative neural network ensemble classifier has been proposed and developed for accurate prediction of finger movement and its forthcoming laterality. The ensemble classifier contains three base neural network classifiers, namely, feedforward, radial basis, and probabilistic neural networks. The majority voting rule is used to fuse the decisions of the three base classifiers to generate the final decision of the ensemble classifier. The overall decoding performance reaches a level of agreement (kappa value) at about 0.729 ± 0.16 for decoding movement from the resting state and about 0.671 ± 0.14 for decoding left and right visually cued movements. PMID:29201041
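The majority voting rule and the kappa agreement measure mentioned above can be sketched in a few lines of Python. The sketch assumes the reported kappa is Cohen's kappa between decoded and true labels; the label values and helper names are hypothetical, and the three lists merely stand in for the feedforward, radial basis and probabilistic network outputs.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most base classifiers for one trial."""
    return Counter(votes).most_common(1)[0][0]


def cohen_kappa(truth, predicted):
    """Cohen's kappa: agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement equals 1,
    e.g. when both sequences contain a single identical label.
    """
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    t_counts, p_counts = Counter(truth), Counter(predicted)
    expected = sum(t_counts[c] * p_counts[c] for c in t_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical decisions of three base networks on four trials (L/R hand).
feedforward = ["L", "R", "L", "R"]
radial_basis = ["L", "R", "R", "R"]
probabilistic = ["L", "L", "L", "R"]
fused = [majority_vote(v) for v in zip(feedforward, radial_basis, probabilistic)]

truth = ["L", "R", "L", "L"]
print(fused, cohen_kappa(truth, fused))
```

With three voters and two classes a strict majority always exists, which is one practical reason to use an odd number of base classifiers.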
The wisdom of the commons: ensemble tree classifiers for prostate cancer prognosis
Koziol, James A.; Feng, Anne C.; Jia, Zhenyu; Wang, Yipeng; Goodison, Seven; McClelland, Michael; Mercola, Dan
2009-01-01
Motivation: Classification and regression trees have long been used for cancer diagnosis and prognosis. Nevertheless, instability and variable selection bias, as well as overfitting, are well-known problems of tree-based methods. In this article, we investigate whether ensemble tree classifiers can ameliorate these difficulties, using data from two recent studies of radical prostatectomy in prostate cancer. Results: Using time to progression following prostatectomy as the relevant clinical endpoint, we found that ensemble tree classifiers robustly and reproducibly identified three subgroups of patients in the two clinical datasets: non-progressors, early progressors and late progressors. Moreover, the consensus classifications were independent predictors of time to progression compared to known clinical prognostic factors. Contact: dmercola@uci.edu PMID:18628288
Wavelet images and Chou's pseudo amino acid composition for protein classification.
Nanni, Loris; Brahnam, Sheryl; Lumini, Alessandra
2012-08-01
The last decade has seen an explosion in the collection of protein data. To actualize the potential offered by this wealth of data, it is important to develop machine systems capable of classifying and extracting features from proteins. Reliable machine systems for protein classification offer many benefits, including the promise of finding novel drugs and vaccines. In developing our system, we analyze and compare several feature extraction methods used in protein classification that are based on the calculation of texture descriptors starting from a wavelet representation of the protein. We then feed these texture-based representations of the protein into an Adaboost ensemble of neural network or a support vector machine classifier. In addition, we perform experiments that combine our feature extraction methods with a standard method that is based on the Chou's pseudo amino acid composition. Using several datasets, we show that our best approach outperforms standard methods. The Matlab code of the proposed protein descriptors is available at http://bias.csr.unibo.it/nanni/wave.rar .
Mazurowski, Maciej A; Zurada, Jacek M; Tourassi, Georgia D
2009-07-01
Ensemble classifiers have been shown efficient in multiple applications. In this article, the authors explore the effectiveness of ensemble classifiers in a case-based computer-aided diagnosis system for detection of masses in mammograms. They evaluate two general ways of constructing subclassifiers by resampling of the available development dataset: Random division and random selection. Furthermore, they discuss the problem of selecting the ensemble size and propose two adaptive incremental techniques that automatically select the size for the problem at hand. All the techniques are evaluated with respect to a previously proposed information-theoretic CAD system (IT-CAD). The experimental results show that the examined ensemble techniques provide a statistically significant improvement (AUC = 0.905 +/- 0.024) in performance as compared to the original IT-CAD system (AUC = 0.865 +/- 0.029). Some of the techniques allow for a notable reduction in the total number of examples stored in the case base (to 1.3% of the original size), which, in turn, results in lower storage requirements and a shorter response time of the system. Among the methods examined in this article, the two proposed adaptive techniques are by far the most effective for this purpose. Furthermore, the authors provide some discussion and guidance for choosing the ensemble parameters.
Predicting hepatotoxicity using ToxCast in vitro bioactivity and ...
Background: The U.S. EPA ToxCast™ program is screening thousands of environmental chemicals for bioactivity using hundreds of high-throughput in vitro assays to build predictive models of toxicity. We represented chemicals based on bioactivity and chemical structure descriptors then used supervised machine learning to predict their hepatotoxic effects. Results: A set of 677 chemicals were represented by 711 in vitro bioactivity descriptors (from ToxCast assays), 4,376 chemical structure descriptors (from QikProp, OpenBabel, PADEL, and PubChem), and three hepatotoxicity categories (from animal studies). Hepatotoxicants were defined by rat liver histopathology observed after chronic chemical testing and grouped into hypertrophy (161), injury (101) and proliferative lesions (99). Classifiers were built using six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector classification (SVM), classification and regression trees (CART), k-nearest neighbors (KNN) and an ensemble of classifiers (ENSMB). Classifiers of hepatotoxicity were built using chemical structure, ToxCast bioactivity, and a hybrid representation. Predictive performance was evaluated using 10-fold cross-validation testing and in-loop, filter-based, feature subset selection. Hybrid classifiers had the best balanced accuracy for predicting hypertrophy (0.78±0.08), injury (0.73±0.10) and proliferative lesions (0.72±0.09). Though chemical and bioactivity class
Accurate Identification of Cancerlectins through Hybrid Machine Learning Technology.
Zhang, Jieru; Ju, Ying; Lu, Huijuan; Xuan, Ping; Zou, Quan
2016-01-01
Cancerlectins are cancer-related proteins that function as lectins. They have been identified through computational identification techniques, but these techniques have sometimes failed to identify proteins because of sequence diversity among the cancerlectins. Advanced machine learning identification methods, such as support vector machine and basic sequence features (n-gram), have also been used to identify cancerlectins. In this study, various protein fingerprint features and advanced classifiers, including ensemble learning techniques, were utilized to identify this group of proteins. We improved the prediction accuracy of the original feature extraction methods and classification algorithms by more than 10% on average. Our work provides a basis for the computational identification of cancerlectins and reveals the power of hybrid machine learning techniques in computational proteomics.
Class-specific Error Bounds for Ensemble Classifiers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prenger, R; Lemmond, T; Varshney, K
2009-10-06
The generalization error, or probability of misclassification, of ensemble classifiers has been shown to be bounded above by a function of the mean correlation between the constituent (i.e., base) classifiers and their average strength. This bound suggests that increasing the strength and/or decreasing the correlation of an ensemble's base classifiers may yield improved performance under the assumption of equal error costs. However, this and other existing bounds do not directly address application spaces in which error costs are inherently unequal. For applications involving binary classification, Receiver Operating Characteristic (ROC) curves, performance curves that explicitly trade off false alarms and missed detections, are often utilized to support decision making. To address performance optimization in this context, we have developed a lower bound for the entire ROC curve that can be expressed in terms of the class-specific strength and correlation of the base classifiers. We present empirical analyses demonstrating the efficacy of these bounds in predicting relative classifier performance. In addition, we specify performance regions of the ROC curve that are naturally delineated by the class-specific strengths of the base classifiers and show that each of these regions can be associated with a unique set of guidelines for performance optimization of binary classifiers within unequal error cost regimes.
Silva, Fabrício R; Vidotti, Vanessa G; Cremasco, Fernanda; Dias, Marcelo; Gomi, Edson S; Costa, Vital P
2013-01-01
To evaluate the sensitivity and specificity of machine learning classifiers (MLCs) for glaucoma diagnosis using Spectral Domain OCT (SD-OCT) and standard automated perimetry (SAP). Observational cross-sectional study. Sixty-two glaucoma patients and 48 healthy individuals were included. All patients underwent a complete ophthalmologic examination, achromatic standard automated perimetry (SAP) and retinal nerve fiber layer (RNFL) imaging with SD-OCT (Cirrus HD-OCT; Carl Zeiss Meditec Inc., Dublin, California). Receiver operating characteristic (ROC) curves were obtained for all SD-OCT parameters and global indices of SAP. Subsequently, the following MLCs were tested using parameters from the SD-OCT and SAP: Bagging (BAG), Naive-Bayes (NB), Multilayer Perceptron (MLP), Radial Basis Function (RBF), Random Forest (RAN), Ensemble Selection (ENS), Classification Tree (CTREE), Ada Boost M1 (ADA), Support Vector Machine Linear (SVML) and Support Vector Machine Gaussian (SVMG). Areas under the receiver operating characteristic curves (aROC) obtained for isolated SAP and OCT parameters were compared with MLCs using OCT+SAP data. Combining OCT and SAP data, MLCs' aROCs varied from 0.777 (CTREE) to 0.946 (RAN). The best OCT+SAP aROC obtained with RAN (0.946) was significantly larger than the best single OCT parameter (p<0.05), but was not significantly different from the aROC obtained with the best single SAP parameter (p=0.19). Machine learning classifiers trained on OCT and SAP data can successfully discriminate between healthy and glaucomatous eyes. The combination of OCT and SAP measurements improved the diagnostic accuracy compared with OCT data alone.
Zhang, Cuicui; Liang, Xuefeng; Matsuyama, Takashi
2014-12-08
Multi-camera networks have gained great interest in video-based surveillance systems for security monitoring, access control, etc. Person re-identification is an essential and challenging task in multi-camera networks, which aims to determine if a given individual has already appeared over the camera network. Individual recognition often uses faces as a trial and requires a large number of samples during the training phase. This is difficult to fulfill due to the limitation of the camera hardware system and the unconstrained image capturing conditions. Conventional face recognition algorithms often encounter the "small sample size" (SSS) problem arising from the small number of training samples compared to the high dimensionality of the sample space. To overcome this problem, interest in the combination of multiple base classifiers has sparked research efforts in ensemble methods. However, existing ensemble methods still leave two questions open: (1) how to define diverse base classifiers from the small data; (2) how to avoid the diversity/accuracy dilemma occurring during ensemble. To address these problems, this paper proposes a novel generic learning-based ensemble framework, which augments the small data by generating new samples based on a generic distribution and introduces a tailored 0-1 knapsack algorithm to alleviate the diversity/accuracy dilemma. More diverse base classifiers can be generated from the expanded face space, and more appropriate base classifiers are selected for ensemble. Extensive experimental results on four benchmarks demonstrate the higher ability of our system to cope with the SSS problem compared to the state-of-the-art system.
Zhang, Cuicui; Liang, Xuefeng; Matsuyama, Takashi
2014-01-01
Multi-camera networks have gained great interest in video-based surveillance systems for security monitoring, access control, etc. Person re-identification is an essential and challenging task in multi-camera networks, which aims to determine if a given individual has already appeared over the camera network. Individual recognition often uses faces as a trial and requires a large number of samples during the training phase. This is difficult to fulfill due to the limitation of the camera hardware system and the unconstrained image capturing conditions. Conventional face recognition algorithms often encounter the “small sample size” (SSS) problem arising from the small number of training samples compared to the high dimensionality of the sample space. To overcome this problem, interest in the combination of multiple base classifiers has sparked research efforts in ensemble methods. However, existing ensemble methods still leave two questions open: (1) how to define diverse base classifiers from the small data; (2) how to avoid the diversity/accuracy dilemma occurring during ensemble. To address these problems, this paper proposes a novel generic learning-based ensemble framework, which augments the small data by generating new samples based on a generic distribution and introduces a tailored 0–1 knapsack algorithm to alleviate the diversity/accuracy dilemma. More diverse base classifiers can be generated from the expanded face space, and more appropriate base classifiers are selected for ensemble. Extensive experimental results on four benchmarks demonstrate the higher ability of our system to cope with the SSS problem compared to the state-of-the-art system. PMID:25494350
Detection of chewing from piezoelectric film sensor signals using ensemble classifiers.
Farooq, Muhammad; Sazonov, Edward
2016-08-01
Selection and use of pattern recognition algorithms is application dependent. In this work, we explored the use of several ensembles of weak classifiers to classify signals captured from a wearable sensor system to detect food intake based on chewing. Three sensor signals (piezoelectric sensor, accelerometer, and hand-to-mouth gesture) were collected from 12 subjects in free-living conditions for 24 hours. Sensor signals were divided into 10-second epochs, and for each epoch a combination of time- and frequency-domain features was computed. In this work, we present a comparison of three different ensemble techniques: boosting (AdaBoost), bootstrap aggregation (bagging), and stacking, each trained with three different weak classifiers (Decision Trees, Linear Discriminant Analysis (LDA), and Logistic Regression). The type of feature normalization used can also impact the classification results. For each ensemble method, three feature normalization techniques (no normalization, z-score normalization, and min-max normalization) were tested. A 12-fold cross-validation scheme was used to evaluate each model, with performance measured in terms of precision, recall, and accuracy. The best results achieved here show an improvement of about 4% over our previous algorithms.
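The two non-trivial normalization schemes named in this abstract can be sketched for a single feature column as follows. This is a minimal illustration using the standard definitions; the function names are hypothetical, and the choice of the population standard deviation for the z-score is an assumption, since the abstract does not specify it.

```python
from statistics import mean, pstdev


def zscore(column):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(x - mu) / sigma for x in column]


def minmax(column):
    """Min-max normalization: rescale the column linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature: nothing to scale
    return [(x - lo) / (hi - lo) for x in column]


scaled = minmax([2, 4, 6])
standardized = zscore([1, 2, 3])
```

In a cross-validated evaluation the mean, standard deviation, minimum, and maximum would normally be estimated on the training folds only and then applied to the held-out fold, so that no information leaks from the test data.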
Piao, Yongjun; Piao, Minghao; Ryu, Keun Ho
2017-01-01
Cancer classification has been a crucial topic of research in cancer treatment. In the last decade, messenger RNA (mRNA) expression profiles have been widely used to classify different types of cancers. With the discovery of a new class of small non-coding RNAs, known as microRNAs (miRNAs), various studies have shown that the expression patterns of miRNA can also accurately classify human cancers. Therefore, there is a great demand for the development of machine learning approaches to accurately classify various types of cancers using miRNA expression data. In this article, we propose a feature subset-based ensemble method in which each model is learned from a different projection of the original feature space to classify multiple cancers. In our method, the feature relevance and redundancy are considered to generate multiple feature subsets, the base classifiers are learned from each independent miRNA subset, and the average posterior probability is used to combine the base classifiers. To test the performance of our method, we used bead-based and sequence-based miRNA expression datasets and conducted 10-fold and leave-one-out cross validations. The experimental results show that the proposed method yields good results and has higher prediction accuracy than popular ensemble methods. The Java program and source code of the proposed method and the datasets in the experiments are freely available at https://sourceforge.net/projects/mirna-ensemble/. Copyright © 2016 Elsevier Ltd. All rights reserved.
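Combining base classifiers by average posterior probability, as mentioned in this abstract, can be illustrated with a short sketch. The data layout (one dictionary of class probabilities per base classifier) and the class labels are assumptions for illustration; the authors' own implementation is the Java program referenced above.

```python
def average_posteriors(posteriors):
    """Combine base classifiers by averaging their class posteriors.

    posteriors: list with one dict per base classifier, each mapping
                class label -> estimated posterior probability.
    Returns (predicted_label, averaged_posteriors).
    """
    n = len(posteriors)
    labels = posteriors[0].keys()
    averaged = {label: sum(p[label] for p in posteriors) / n for label in labels}
    # The ensemble predicts the class with the highest averaged posterior.
    predicted = max(averaged, key=averaged.get)
    return predicted, averaged


# Three hypothetical base classifiers, each trained on a different feature subset.
label, avg = average_posteriors([
    {"A": 0.6, "B": 0.4},
    {"A": 0.3, "B": 0.7},
    {"A": 0.2, "B": 0.8},
])
```

Unlike majority voting, this rule lets a confident base classifier outweigh several weakly confident ones.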
NASA Astrophysics Data System (ADS)
Onojeghuo, Alex Okiemute; Onojeghuo, Ajoke Ruth
2017-07-01
This study investigated the combined use of multispectral/hyperspectral imagery and LiDAR data for habitat mapping across parts of south Cumbria, North West England. The methodology adopted in this study integrated spectral information contained in pansharp QuickBird multispectral/AISA Eagle hyperspectral imagery and LiDAR-derived measures with object-based machine learning classifiers and ensemble analysis techniques. Using the LiDAR point cloud data, elevation models (such as the Digital Surface Model and Digital Terrain Model raster) and intensity features were extracted directly. The LiDAR-derived measures exploited in this study included Canopy Height Model, intensity and topographic information (i.e. mean, maximum and standard deviation). These three LiDAR measures were combined with spectral information contained in the pansharp QuickBird and Eagle MNF transformed imagery for image classification experiments. A fusion of pansharp QuickBird multispectral and Eagle MNF hyperspectral imagery with all LiDAR-derived measures generated the best classification accuracies, 89.8 and 92.6% respectively. These results were generated with the Support Vector Machine and Random Forest machine learning algorithms respectively. The ensemble analysis of all three learning machine classifiers for the pansharp QuickBird and Eagle MNF fused data outputs did not significantly increase the overall classification accuracy. Results of the study demonstrate the potential of combining either very high spatial resolution multispectral or hyperspectral imagery with LiDAR data for habitat mapping.
Random ensemble learning for EEG classification.
Hosseini, Mohammad-Parsa; Pompili, Dario; Elisevich, Kost; Soltanian-Zadeh, Hamid
2018-01-01
Real-time detection of seizure activity in epilepsy patients is critical in averting seizure activity and improving patients' quality of life. Accurate evaluation, presurgical assessment, seizure prevention, and emergency alerts all depend on the rapid detection of seizure onset. A new method of feature selection and classification for rapid and precise seizure detection is discussed wherein informative components of electroencephalogram (EEG)-derived data are extracted and an automatic method is presented using infinite independent component analysis (I-ICA) to select independent features. The feature space is divided into subspaces via random selection and multichannel support vector machines (SVMs) are used to classify these subspaces. The result of each classifier is then combined by majority voting to establish the final output. In addition, a random subspace ensemble using a combination of SVM, multilayer perceptron (MLP) neural network and an extended k-nearest neighbors (k-NN), called extended nearest neighbor (ENN), is developed for the EEG and electrocorticography (ECoG) big data problem. To evaluate the solution, a benchmark ECoG of eight patients with temporal and extratemporal epilepsy was implemented in a distributed computing framework as a multitier cloud-computing architecture. Using leave-one-out cross-validation, the accuracy, sensitivity, specificity, and both false positive and false negative ratios of the proposed method were found to be 0.97, 0.98, 0.96, 0.04, and 0.02, respectively. Application of the solution to cases under investigation with ECoG has also been effected to demonstrate its utility. Copyright © 2017 Elsevier B.V. All rights reserved.
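The five figures reported in this abstract are mutually consistent under the standard confusion-matrix definitions: a false positive ratio of 0.04 is 1 minus the specificity of 0.96, and a false negative ratio of 0.02 is 1 minus the sensitivity of 0.98. The sketch below assumes the abstract uses those standard definitions. The counts in the example are hypothetical, chosen only to reproduce the reported ratios, and are not the paper's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
        "false_negative_rate": fn / (fn + tp),  # equals 1 - sensitivity
    }


# Hypothetical counts: 100 positive and 100 negative test cases.
metrics = binary_metrics(tp=98, fp=4, tn=96, fn=2)
```

With equal numbers of positive and negative cases, accuracy is the mean of sensitivity and specificity, which is why these counts also give 0.97.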
A new Method for the Estimation of Initial Condition Uncertainty Structures in Mesoscale Models
NASA Astrophysics Data System (ADS)
Keller, J. D.; Bach, L.; Hense, A.
2012-12-01
The estimation of fast growing error modes of a system is a key interest of ensemble data assimilation when assessing uncertainty in initial conditions. Over the last two decades three methods (and variations of these methods) have evolved for global numerical weather prediction models: ensemble Kalman filter, singular vectors and breeding of growing modes (or now ensemble transform). While the former incorporates a priori model error information and observation error estimates to determine ensemble initial conditions, the latter two techniques directly address the error structures associated with Lyapunov vectors. However, in global models these structures are mainly associated with transient global wave patterns. When assessing initial condition uncertainty in mesoscale limited area models, several problems regarding the aforementioned techniques arise: (a) additional sources of uncertainty on the smaller scales contribute to the error and (b) error structures from the global scale may quickly move through the model domain (depending on the size of the domain). To address the latter problem, perturbation structures from global models are often included in the mesoscale predictions as perturbed boundary conditions. However, the initial perturbations (when used) are often generated with a variant of an ensemble Kalman filter which does not necessarily focus on the large scale error patterns. In the framework of the European regional reanalysis project of the Hans-Ertel-Center for Weather Research we use a mesoscale model with an implemented nudging data assimilation scheme which does not support ensemble data assimilation at all. In preparation for an ensemble-based regional reanalysis and for the estimation of three-dimensional atmospheric covariance structures, we implemented a new method for the assessment of fast growing error modes for mesoscale limited area models. The so-called self-breeding is a development based on the breeding of growing modes technique.
Initial perturbations are integrated forward for a short time period and then rescaled and added to the initial state again. Iterating this rapid breeding cycle provides estimates for the initial uncertainty structure (or local Lyapunov vectors) given a specific norm. To prevent all ensemble perturbations from converging towards the leading local Lyapunov vector, we apply an ensemble transform variant to orthogonalize the perturbations in the sub-space spanned by the ensemble. By choosing different kinds of norms to measure perturbation growth, this technique allows for estimating uncertainty patterns targeted at specific sources of errors (e.g. convection, turbulence). With case study experiments we show applications of the self-breeding method for different sources of uncertainty and different horizontal scales.
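The breeding cycle described here (integrate a perturbed state forward, take the difference from the unperturbed run, rescale it, and add it back to the same initial state) can be sketched for a single perturbation. Everything concrete below is an assumption for illustration: the Euclidean norm, the function names, and the two-variable linear `toy_step` that stands in for a short model integration. The ensemble transform orthogonalization mentioned in the abstract is not included.

```python
import math


def euclidean_norm(v):
    return math.sqrt(sum(x * x for x in v))


def self_breed(step, initial_state, perturbation, amplitude, cycles):
    """Repeat a short breeding cycle from one fixed initial state.

    step          : function advancing a state vector over the short breeding interval
    initial_state : the state every cycle restarts from
    perturbation  : first-guess perturbation vector (must be non-zero)
    amplitude     : norm that the perturbation is rescaled to in every cycle
    """
    control_end = step(initial_state)  # unperturbed short integration
    for _ in range(cycles):
        # Rescale the perturbation and add it to the initial state again.
        scale = amplitude / euclidean_norm(perturbation)
        perturbed_start = [s + scale * p for s, p in zip(initial_state, perturbation)]
        # Integrate forward and measure how the perturbation has evolved.
        perturbed_end = step(perturbed_start)
        perturbation = [a - b for a, b in zip(perturbed_end, control_end)]
    scale = amplitude / euclidean_norm(perturbation)
    return [scale * p for p in perturbation]


def toy_step(state):
    """Hypothetical toy dynamics: the first component grows, the second decays."""
    return [2.0 * state[0], 0.5 * state[1]]


bred = self_breed(toy_step, [1.0, 1.0], [1.0, 1.0], amplitude=0.01, cycles=20)
```

In this toy system the bred vector aligns with the growing direction (the first component), which is the behaviour that motivates the orthogonalization step when several perturbations are bred at once.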
Unsupervised Learning in an Ensemble of Spiking Neural Networks Mediated by ITDP.
Shim, Yoonsik; Philippides, Andrew; Staras, Kevin; Husbands, Phil
2016-10-01
We propose a biologically plausible architecture for unsupervised ensemble learning in a population of spiking neural network classifiers. A mixture of experts type organisation is shown to be effective, with the individual classifier outputs combined via a gating network whose operation is driven by input timing dependent plasticity (ITDP). The ITDP gating mechanism is based on recent experimental findings. An abstract, analytically tractable model of the ITDP driven ensemble architecture is derived from a logical model based on the probabilities of neural firing events. A detailed analysis of this model provides insights that allow it to be extended into a full, biologically plausible, computational implementation of the architecture which is demonstrated on a visual classification task. The extended model makes use of a style of spiking network, first introduced as a model of cortical microcircuits, that is capable of Bayesian inference, effectively performing expectation maximization. The unsupervised ensemble learning mechanism, based around such spiking expectation maximization (SEM) networks whose combined outputs are mediated by ITDP, is shown to perform the visual classification task well and to generalize to unseen data. The combined ensemble performance is significantly better than that of the individual classifiers, validating the ensemble architecture and learning mechanisms. The properties of the full model are analysed in the light of extensive experiments with the classification task, including an investigation into the influence of different input feature selection schemes and a comparison with a hierarchical STDP based ensemble architecture.
Unsupervised Learning in an Ensemble of Spiking Neural Networks Mediated by ITDP
Staras, Kevin
2016-01-01
We propose a biologically plausible architecture for unsupervised ensemble learning in a population of spiking neural network classifiers. A mixture of experts type organisation is shown to be effective, with the individual classifier outputs combined via a gating network whose operation is driven by input timing dependent plasticity (ITDP). The ITDP gating mechanism is based on recent experimental findings. An abstract, analytically tractable model of the ITDP driven ensemble architecture is derived from a logical model based on the probabilities of neural firing events. A detailed analysis of this model provides insights that allow it to be extended into a full, biologically plausible, computational implementation of the architecture which is demonstrated on a visual classification task. The extended model makes use of a style of spiking network, first introduced as a model of cortical microcircuits, that is capable of Bayesian inference, effectively performing expectation maximization. The unsupervised ensemble learning mechanism, based around such spiking expectation maximization (SEM) networks whose combined outputs are mediated by ITDP, is shown to perform the visual classification task well and to generalize to unseen data. The combined ensemble performance is significantly better than that of the individual classifiers, validating the ensemble architecture and learning mechanisms. The properties of the full model are analysed in the light of extensive experiments with the classification task, including an investigation into the influence of different input feature selection schemes and a comparison with a hierarchical STDP based ensemble architecture. PMID:27760125
NASA Astrophysics Data System (ADS)
Ceylan Koydemir, Hatice; Feng, Steve; Liang, Kyle; Nadkarni, Rohan; Tseng, Derek; Benien, Parul; Ozcan, Aydogan
2017-03-01
Giardia lamblia causes a disease known as giardiasis, which results in diarrhea, abdominal cramps, and bloating. Although conventional pathogen detection methods used in water analysis laboratories offer high sensitivity and specificity, they are time consuming, and need experts to operate bulky equipment and analyze the samples. Here we present a field-portable and cost-effective smartphone-based waterborne pathogen detection platform that can automatically classify Giardia cysts using machine learning. Our platform enables the detection and quantification of Giardia cysts in one hour, including sample collection, labeling, filtration, and automated counting steps. We evaluated the performance of three prototypes using Giardia-spiked water samples from different sources (e.g., reagent-grade, tap, non-potable, and pond water samples). We populated a training database with >30,000 cysts and estimated our detection sensitivity and specificity using 20 different classifier models, including decision trees, nearest neighbor classifiers, support vector machines (SVMs), and ensemble classifiers, and compared their speed of training and classification, as well as predicted accuracies. Among them, cubic SVM, medium Gaussian SVM, and bagged-trees were the most promising classifier types with accuracies of 94.1%, 94.2%, and 95%, respectively; we selected the latter as our preferred classifier for the detection and enumeration of Giardia cysts that are imaged using our mobile-phone fluorescence microscope. Without the need for any experts or microbiologists, this field-portable pathogen detection platform can present a useful tool for water quality monitoring in resource-limited-settings.
Minimalist ensemble algorithms for genome-wide protein localization prediction.
Lin, Jhih-Rong; Mondal, Ananda Mohan; Liu, Rong; Hu, Jianjun
2012-07-03
Computational prediction of protein subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve the prediction performance, which usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by the running complexity and redundancy among individual prediction algorithms. This paper proposed a novel method for rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature selection based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of individual predictors of current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that the high performance ensemble algorithms are usually composed of the predictors that together cover most of available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted voting based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from inclusion of too many individual predictors. 
We proposed a method for rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of individual predictors compared to other ensemble algorithms. The results also suggested that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi.
Minimalist ensemble algorithms for genome-wide protein localization prediction
2012-01-01
Background Computational prediction of protein subcellular localization can greatly help to elucidate its functions. Despite the existence of dozens of protein localization prediction algorithms, the prediction accuracy and coverage are still low. Several ensemble algorithms have been proposed to improve the prediction performance, which usually include as many as 10 or more individual localization algorithms. However, their performance is still limited by the running complexity and redundancy among individual prediction algorithms. Results This paper proposed a novel method for rational design of minimalist ensemble algorithms for practical genome-wide protein subcellular localization prediction. The algorithm is based on combining a feature selection based filter and a logistic regression classifier. Using a novel concept of contribution scores, we analyzed issues of algorithm redundancy, consensus mistakes, and algorithm complementarity in designing ensemble algorithms. We applied the proposed minimalist logistic regression (LR) ensemble algorithm to two genome-wide datasets of Yeast and Human and compared its performance with current ensemble algorithms. Experimental results showed that the minimalist ensemble algorithm can achieve high prediction accuracy with only 1/3 to 1/2 of individual predictors of current ensemble algorithms, which greatly reduces computational complexity and running time. It was found that the high performance ensemble algorithms are usually composed of the predictors that together cover most of available features. Compared to the best individual predictor, our ensemble algorithm improved the prediction accuracy from AUC score of 0.558 to 0.707 for the Yeast dataset and from 0.628 to 0.646 for the Human dataset. Compared with popular weighted voting based ensemble algorithms, our classifier-based ensemble algorithms achieved much better performance without suffering from inclusion of too many individual predictors. 
Conclusions We proposed a method for rational design of minimalist ensemble algorithms using feature selection and classifiers. The proposed minimalist ensemble algorithm based on logistic regression can achieve equal or better prediction performance while using only half or one-third of individual predictors compared to other ensemble algorithms. The results also suggested that meta-predictors that take advantage of a variety of features by combining individual predictors tend to achieve the best performance. The LR ensemble server and related benchmark datasets are available at http://mleg.cse.sc.edu/LRensemble/cgi-bin/predict.cgi. PMID:22759391
Multiple-instance ensemble learning for hyperspectral images
NASA Astrophysics Data System (ADS)
Ergul, Ugur; Bilgin, Gokhan
2017-10-01
An ensemble framework for multiple-instance (MI) learning (MIL) is introduced for use in hyperspectral images (HSIs), inspired by the bagging (bootstrap aggregation) method in ensemble learning. Ensemble-based bagging is performed on a small percentage of training samples, and MI bags are formed by a local windowing process with variable window sizes on selected instances. In addition to bootstrap aggregation, random subspace is another method used to diversify base classifiers. The proposed method is implemented using four MIL classification algorithms. The classifier model learning phase is carried out with MI bags, and the estimation phase is performed over single-test instances. In the experimental part of the study, two different HSIs that have ground-truth information are used, and comparative results are demonstrated with state-of-the-art classification methods. In general, the MI ensemble approach produces more compact results in terms of both diversity and error compared to equipollent non-MIL algorithms.
A target recognition method for maritime surveillance radars based on hybrid ensemble selection
NASA Astrophysics Data System (ADS)
Fan, Xueman; Hu, Shengliang; He, Jingbo
2017-11-01
In order to improve the generalisation ability of the maritime surveillance radar, a novel ensemble selection technique, termed Optimisation and Dynamic Selection (ODS), is proposed. During the optimisation phase, the non-dominated sorting genetic algorithm II for multi-objective optimisation is used to find the Pareto front, i.e. a set of ensembles of classifiers representing different tradeoffs between the classification error and diversity. During the dynamic selection phase, the meta-learning method is used to predict whether a candidate ensemble is competent enough to classify a query instance based on three different aspects, namely, feature space, decision space and the extent of consensus. The classification performance and time complexity of ODS are compared against nine other ensemble methods using a self-built full polarimetric high resolution range profile data-set. The experimental results clearly show the effectiveness of ODS. In addition, the influence of the selection of diversity measures is studied concurrently.
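The Pareto front mentioned in this abstract is the set of candidate ensembles that no other candidate beats on both objectives at once. The sketch below shows only that non-dominated filter, not NSGA-II itself, and assumes each candidate is summarised by a hypothetical `(error, diversity)` pair in which lower error and higher diversity are better.

```python
def pareto_front(candidates):
    """Return the indices of the non-dominated candidates.

    candidates: list of (error, diversity) pairs; lower error and
                higher diversity are both considered better.
    """
    def dominates(a, b):
        # a dominates b if it is no worse on both objectives and strictly better on one.
        no_worse = a[0] <= b[0] and a[1] >= b[1]
        strictly_better = a[0] < b[0] or a[1] > b[1]
        return no_worse and strictly_better

    return [
        i for i, c in enumerate(candidates)
        if not any(dominates(o, c) for j, o in enumerate(candidates) if j != i)
    ]


# Five hypothetical candidate ensembles.
front = pareto_front([(0.10, 0.2), (0.12, 0.5), (0.15, 0.4), (0.20, 0.9), (0.10, 0.1)])
```

The surviving candidates represent different tradeoffs between classification error and diversity, from which a later selection stage can choose.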
NASA Astrophysics Data System (ADS)
Taniguchi, Kenji
2018-04-01
To investigate future variations in high-impact weather events, numerous samples are required. For the detailed assessment in a specific region, a high spatial resolution is also required. A simple ensemble simulation technique is proposed in this paper. In the proposed technique, new ensemble members were generated from one basic state vector and two perturbation vectors, which were obtained by lagged average forecasting simulations. Sensitivity experiments with different numbers of ensemble members, different simulation lengths, and different perturbation magnitudes were performed. Experimental application to a global warming study was also implemented for a typhoon event. Ensemble-mean results and ensemble spreads of total precipitation and atmospheric conditions showed similar characteristics across the sensitivity experiments. The frequencies of the maximum total and hourly precipitation also showed similar distributions. These results indicate the robustness of the proposed technique. On the other hand, considerable ensemble spread was found in each ensemble experiment. In addition, the results of the application to a global warming study showed possible variations in the future. These results indicate that the proposed technique is useful for investigating various meteorological phenomena and the impacts of global warming. The results of the ensemble simulations also enable the stochastic evaluation of differences in high-impact weather events. In addition, the impacts of a spectral nudging technique were also examined. The tracks of a typhoon were quite different between cases with and without spectral nudging; however, the ranges of the tracks among ensemble members were comparable. This indicates that spectral nudging does not necessarily suppress ensemble spread.
A Proposed Methodology to Classify Frontier Capital Markets
2011-07-31
but because it is the surest route to our common good.” -Inaugural Speech by President Barack Obama, Jan 2009 This project involves basic...machine learning. The algorithm consists of a unique binary classifier mechanism that combines three methods: k-Nearest Neighbors ( kNN ), ensemble...Through kNN Ensemble Classification Techniques E. Capital Market Classification Based on Capital Flows and Trading Architecture F. Horizontal
Mazurowski, Maciej A.; Zurada, Jacek M.; Tourassi, Georgia D.
2009-01-01
Ensemble classifiers have been shown efficient in multiple applications. In this article, the authors explore the effectiveness of ensemble classifiers in a case-based computer-aided diagnosis system for detection of masses in mammograms. They evaluate two general ways of constructing subclassifiers by resampling of the available development dataset: Random division and random selection. Furthermore, they discuss the problem of selecting the ensemble size and propose two adaptive incremental techniques that automatically select the size for the problem at hand. All the techniques are evaluated with respect to a previously proposed information-theoretic CAD system (IT-CAD). The experimental results show that the examined ensemble techniques provide a statistically significant improvement (AUC=0.905±0.024) in performance as compared to the original IT-CAD system (AUC=0.865±0.029). Some of the techniques allow for a notable reduction in the total number of examples stored in the case base (to 1.3% of the original size), which, in turn, results in lower storage requirements and a shorter response time of the system. Among the methods examined in this article, the two proposed adaptive techniques are by far the most effective for this purpose. Furthermore, the authors provide some discussion and guidance for choosing the ensemble parameters. PMID:19673196
2013-01-01
Background Protein-protein interactions (PPIs) play crucial roles in the execution of various cellular processes and form the basis of biological mechanisms. Although a large amount of PPI data for different species has been generated by high-throughput experimental techniques, current PPI pairs obtained with experimental methods cover only a fraction of the complete PPI networks, and further, the experimental methods for identifying PPIs are both time-consuming and expensive. Hence, it is urgent and challenging to develop automated computational methods to efficiently and accurately predict PPIs. Results We present here a novel hierarchical PCA-EELM (principal component analysis-ensemble extreme learning machine) model to predict protein-protein interactions using only the information of protein sequences. In the proposed method, 11188 protein pairs retrieved from the DIP database were encoded into feature vectors by using four kinds of protein sequence information. Focusing on dimension reduction, an effective feature extraction method, PCA, was then employed to construct the most discriminative new feature set. Finally, multiple extreme learning machines were trained and then aggregated into a consensus classifier by majority voting. The ensembling of extreme learning machines removes the dependence of results on initial random weights and improves the prediction performance. Conclusions When performed on the PPI data of Saccharomyces cerevisiae, the proposed method achieved 87.00% prediction accuracy with 86.15% sensitivity at the precision of 87.59%. Extensive experiments are performed to compare our method with the state-of-the-art Support Vector Machine (SVM) technique. Experimental results demonstrate that the proposed PCA-EELM outperforms the SVM method under 5-fold cross-validation. Besides, PCA-EELM performs faster than the PCA-SVM based method.
Consequently, the proposed approach can be considered a promising and powerful new tool for predicting PPIs with excellent performance and less time. PMID:23815620
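Aggregating several trained models into a consensus classifier by majority voting, as this abstract describes, can be sketched in a few lines. The function names and the nested-list layout of the predictions are assumptions for illustration, and the base models themselves (extreme learning machines in the paper) are not reproduced here.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label predicted by the most base classifiers for one sample."""
    # On a tie, Counter.most_common keeps the label that was counted first.
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_classifier_predictions):
    """Combine predictions laid out as one list of labels per base classifier."""
    # zip(*...) regroups the labels so that each tuple holds all votes for one sample.
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Three hypothetical base classifiers, three samples, binary labels.
consensus = ensemble_predict([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
])
```

Using an odd number of base classifiers on a binary problem avoids ties altogether.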
Bayesian Hierarchical Model Characterization of Model Error in Ocean Data Assimilation and Forecasts
2013-09-30
wind ensemble with the increments in the surface momentum flux control vector in a four-dimensional variational (4dvar) assimilation system. The...stability effects? surface stress Surface Momentum Flux Ensembles from Summaries of BHM Winds (Mediterranean...surface wind speed given ensemble winds from a Bayesian Hierarchical Model to provide surface momentum flux ensembles. 3 Figure 2: Domain of
Ensemble positive unlabeled learning for disease gene identification.
Yang, Peng; Li, Xiaoli; Chua, Hon-Nian; Kwoh, Chee-Keong; Ng, See-Kiong
2014-01-01
An increasing number of genes have been experimentally confirmed in recent years as causative genes to various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data and a single machine learning predictor prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method EPU is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method provides an effective framework to integrate the additional biological and computational resources for better disease gene predictions.
Ozçift, Akin
2011-05-01
Supervised classification algorithms are commonly used in the design of computer-aided diagnosis systems. In this study, we present a resampling-strategy-based Random Forests (RF) ensemble classifier to improve diagnosis of cardiac arrhythmia. Random forests is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. In this way, an RF ensemble classifier performs better than a single tree in terms of classification performance. In general, multiclass datasets having unbalanced distribution of sample sizes are difficult to analyze in terms of class discrimination. The cardiac arrhythmia dataset is one such dataset, with multiple classes of small sample size, and it is therefore suitable for testing our resampling-based training strategy. The dataset contains 452 samples in fourteen types of arrhythmias, and eleven of these classes have sample sizes less than 15. Our diagnosis strategy consists of two parts: (i) a correlation-based feature selection algorithm is used to select relevant features from the cardiac arrhythmia dataset; (ii) the RF machine learning algorithm is used to evaluate the performance of the selected features with and without simple random sampling, in order to evaluate the efficiency of the proposed training strategy. The resultant accuracy of the classifier is found to be 90.0%, which is quite high diagnosis performance for cardiac arrhythmia. Furthermore, three case studies, i.e., thyroid, cardiotocography and audiology, are used to benchmark the effectiveness of the proposed method. The results of experiments demonstrated the efficiency of the random sampling strategy in training the RF ensemble classification algorithm. Copyright © 2011 Elsevier Ltd. All rights reserved.
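The abstract does not spell out the resampling procedure, so the sketch below shows one common reading of simple random sampling for imbalanced classes: drawing extra samples with replacement from each smaller class until every class matches the largest one. This is an assumption for illustration and may differ from the authors' strategy; the function name and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Balance classes by resampling smaller classes with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Keep every original sample, then draw extras until the class reaches the target size.
        extras = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extras:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy data: class "a" has four samples, class "b" only two.
balanced_x, balanced_y = random_oversample([1, 2, 3, 4, 5, 6], ["a", "a", "a", "a", "b", "b"])
```

Resampling of this kind would normally be applied to the training portion only, so that duplicated samples never appear in both training and test data.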
Stanescu, Ana; Caragea, Doina
2015-01-01
Background Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. Results Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. Conclusions In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework. PMID:26356316
NASA Astrophysics Data System (ADS)
Imamura, Seigo; Ono, Kenji; Yokokawa, Mitsuo
2016-07-01
Ensemble computing, which is an instance of capacity computing, is an effective computing scenario for exascale parallel supercomputers. In ensemble computing, there are multiple linear systems associated with a common coefficient matrix. We improve the performance of iterative solvers for multiple vectors by solving them at the same time, that is, by solving for the product of the matrices. We implemented several iterative methods and compared their performance. The maximum performance on Sparc VIIIfx was 7.6 times higher than that of a naïve implementation. Finally, to deal with the different convergence processes of linear systems, we introduced a control method to eliminate the calculation of already converged vectors.
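As a rough illustration of the two ideas in this abstract (iterating on several right-hand sides at once, and skipping systems that have already converged), here is a small pure-Python sketch. Jacobi iteration is used only as a stand-in because it is short; the abstract does not name the iterative methods that were implemented, and `jacobi_multi_rhs` is a hypothetical name.

```python
def jacobi_multi_rhs(A, B, tol=1e-10, max_iter=500):
    """Solve A x_k = b_k for every column b_k of B with Jacobi sweeps.

    Each row of A is read once per sweep and reused for all active
    right-hand sides; columns whose residual falls below tol are frozen.
    """
    n, m = len(A), len(B[0])
    X = [[0.0] * m for _ in range(n)]
    active = list(range(m))  # indices of systems still iterating
    for _ in range(max_iter):
        if not active:
            break
        X_new = [row[:] for row in X]
        for i in range(n):
            for k in active:
                s = sum(A[i][j] * X[j][k] for j in range(n) if j != i)
                X_new[i][k] = (B[i][k] - s) / A[i][i]
        X = X_new
        # Control step: drop converged systems from the active set.
        still_active = []
        for k in active:
            residual = max(
                abs(B[i][k] - sum(A[i][j] * X[j][k] for j in range(n)))
                for i in range(n)
            )
            if residual >= tol:
                still_active.append(k)
        active = still_active
    return X

# One diagonally dominant matrix, two right-hand sides (stored as columns).
A = [[4.0, 1.0], [1.0, 3.0]]
B = [[1.0, 0.0], [2.0, 0.0]]
X = jacobi_multi_rhs(A, B)
```

In this toy case the second system (zero right-hand side) converges after the first sweep and is skipped afterwards, while the first continues until its residual drops below `tol`.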
ElGokhy, Sherin M; ElHefnawi, Mahmoud; Shoukry, Amin
2014-05-06
MicroRNAs (miRNAs) are endogenous ∼22 nt RNAs that are identified in many species as powerful regulators of gene expression. Experimental identification of miRNAs is still slow since miRNAs are difficult to isolate by cloning due to their low expression, low stability, tissue specificity and the high cost of the cloning procedure. Thus, computational identification of miRNAs from genomic sequences provides a valuable complement to cloning. Different approaches for identification of miRNAs have been proposed based on homology, thermodynamic parameters, and cross-species comparisons. The present paper focuses on the integration of miRNA classifiers in a meta-classifier and the identification of miRNAs from metagenomic sequences collected from different environments. An ensemble of classifiers is proposed for miRNA hairpin prediction based on four well-known classifiers (Triplet-SVM, MiPred, Virgo and EumiR), which use non-identical features and have been trained on different data. Their decisions are combined using a single hidden layer neural network to increase the accuracy of the predictions. Our ensemble classifier achieved 89.3% accuracy, 82.2% f-measure, 74% sensitivity, 97% specificity, 92.5% precision and 88.2% negative predictive value when tested on real miRNA and pseudo sequence data. The area under the receiver operating characteristic curve of our classifier is 0.9, which represents a high performance index. The proposed classifier yields a significant performance improvement relative to Triplet-SVM, Virgo and EumiR and a minor refinement over MiPred. The developed ensemble classifier is used for miRNA prediction in mine drainage, groundwater and marine metagenomic sequences downloaded from the NCBI sequence read archive. By consulting the miRBase repository, 179 miRNAs have been identified as highly probable miRNAs. Our new approach could thus be used for mining metagenomic sequences and finding new and homologous miRNAs.
The paper investigates a computational tool for miRNA prediction in genomic or metagenomic data. It has been applied to three metagenomic samples from different environments (mine drainage, groundwater and marine metagenomic sequences). The prediction results provide a set of highly promising miRNA hairpin candidates for cloning prediction methods. Among the ensemble's predictions are pre-miRNA candidates that have been validated using miRBase although they were not recognized by some of the base classifiers.
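The reported f-measure can be cross-checked against the reported precision and sensitivity. The short Python sketch below assumes the standard definition of the f-measure as the harmonic mean of precision and sensitivity (recall); the abstract does not state which definition was used.

```python
def f_measure(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)

# Precision and sensitivity as quoted in the abstract above.
f1 = f_measure(precision=0.925, sensitivity=0.74)
print(round(f1, 3))  # 0.822
```

Under that assumption the result agrees with the 82.2% f-measure quoted in the abstract.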
Dimensionality Reduction Through Classifier Ensembles
NASA Technical Reports Server (NTRS)
Oza, Nikunj C.; Tumer, Kagan; Norwig, Peter (Technical Monitor)
1999-01-01
In data mining, one often needs to analyze datasets with a very large number of attributes. Performing machine learning directly on such data sets is often impractical because of extensive run times, excessive complexity of the fitted model (often leading to overfitting), and the well-known "curse of dimensionality." In practice, to avoid such problems, feature selection and/or extraction are often used to reduce data dimensionality prior to the learning step. However, existing feature selection/extraction algorithms either evaluate features by their effectiveness across the entire data set or simply disregard class information altogether (e.g., principal component analysis). Furthermore, feature extraction algorithms such as principal components analysis create new features that are often meaningless to human users. In this article, we present input decimation, a method that provides "feature subsets" that are selected for their ability to discriminate among the classes. These features are subsequently used in ensembles of classifiers, yielding results superior to single classifiers, ensembles that use the full set of features, and ensembles based on principal component analysis on both real and synthetic datasets.
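A minimal Python sketch of class-specific feature subsets follows. The abstract says only that subsets are "selected for their ability to discriminate among the classes"; ranking features by the absolute Pearson correlation with each class's 0/1 indicator is an assumption made here, and `decimate_inputs` and `keep` are hypothetical names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def decimate_inputs(X, y, keep):
    """For each class, pick the `keep` features most correlated with its indicator."""
    n_features = len(X[0])
    subsets = {}
    for c in sorted(set(y)):
        indicator = [1.0 if label == c else 0.0 for label in y]
        scores = [
            abs(pearson([row[f] for row in X], indicator))
            for f in range(n_features)
        ]
        ranked = sorted(range(n_features), key=lambda f: scores[f], reverse=True)
        subsets[c] = sorted(ranked[:keep])
    return subsets

# Toy data: feature 0 marks class "a", feature 1 marks class "b",
# feature 2 is constant and carries no information.
X = [[1, 0, 5], [1, 0, 5], [0, 1, 5], [0, 1, 5], [0, 0, 5], [0, 0, 5]]
y = ["a", "a", "b", "b", "c", "c"]
subsets = decimate_inputs(X, y, keep=1)
```

Each subset would then feed its own base classifier, and the base classifiers' outputs would be combined as in any other ensemble.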
Xu, Zhiming; So, Rosa Q; Toe, Kyaw Kyar; Ang, Kai Keng; Guan, Cuntai
2014-01-01
This paper presents an asynchronous intracortical brain-computer interface (BCI) which allows the subject to continuously drive a mobile robot. Such a system has great implications for helping disabled patients to move around. By carefully designing a multiclass support vector machine (SVM), the subject's self-paced instantaneous movement intents are continuously decoded to control the mobile robot. In particular, we studied the stability of the neural representation of the movement directions. Experimental results on the nonhuman primate showed that the overt movement directions were stably represented in the ensemble of recorded units, and our SVM classifier could successfully decode such movements continuously along the desired movement path. However, the neural representation of the stop state for the self-paced control was not stable and could drift.
Tartar, A; Akan, A; Kilic, N
2014-01-01
Computer-aided detection systems can help radiologists to detect pulmonary nodules at an early stage. In this paper, a novel Computer-Aided Diagnosis (CAD) system is proposed for the classification of pulmonary nodules as malignant or benign. The proposed CAD system, which uses ensemble learning classifiers, provides important support to radiologists in the diagnosis process and achieves high classification performance. The proposed approach with a bagging classifier yields classification sensitivities of 94.7%, 90.0% and 77.8% for the benign, malignant and undetermined classes, respectively (89.5% accuracy).
Yousef, Malik; Khalifa, Waleed; AbedAllah, Loai
2016-12-22
The performance of many learning and data mining algorithms depends critically on suitable metrics to assess efficiency over the input space. Learning a suitable metric from examples may, therefore, be the key to successful application of these algorithms. We have demonstrated that the k-nearest neighbor (kNN) classification can be significantly improved by learning a distance metric from labeled examples. The clustering ensemble is used to define the distance between points with respect to how they co-cluster. This distance is then used within the framework of the kNN algorithm to define a classifier named ensemble clustering kNN classifier (EC-kNN). In many instances in our experiments we achieved the highest accuracy while SVM failed to perform as well. In this study, we compare the performance of a two-class classifier using EC-kNN with different one-class and two-class classifiers. The comparison was applied to seven different plant microRNA species considering eight feature selection methods. In this study, the averaged results show that EC-kNN outperforms all other methods employed here and previously published results for the same data. In conclusion, this study shows that the chosen classifier shows high performance when the distance metric is carefully chosen.
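The co-clustering distance described in this abstract can be sketched in a few lines of Python. The abstract does not give the exact formula; the version below uses the fraction of clusterings in which two points are separated (a common co-association measure) and assumes that the query point was clustered together with the training points. The function names are hypothetical.

```python
from collections import Counter

def coassociation_distance(i, j, clusterings):
    """Fraction of clusterings in which points i and j land in different clusters."""
    disagreements = sum(1 for labels in clusterings if labels[i] != labels[j])
    return disagreements / len(clusterings)

def ec_knn_predict(query, train_idx, train_labels, clusterings, k=3):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(
        train_idx,
        key=lambda t: coassociation_distance(query, t, clusterings),
    )
    votes = Counter(train_labels[t] for t in ranked[:k])
    return votes.most_common(1)[0][0]

# Three clusterings of five points; points 0-3 are labeled, point 4 is the query.
clusterings = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1],
]
train_labels = {0: "pos", 1: "pos", 2: "neg", 3: "neg"}
prediction = ec_knn_predict(4, [0, 1, 2, 3], train_labels, clusterings, k=3)
```

Here point 4 is never separated from point 1 and is separated from point 0 in only one of three clusterings, so two of its three nearest neighbours are "pos".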
Ensemble candidate classification for the LOTAAS pulsar survey
NASA Astrophysics Data System (ADS)
Tan, C. M.; Lyon, R. J.; Stappers, B. W.; Cooper, S.; Hessels, J. W. T.; Kondratiev, V. I.; Michilli, D.; Sanidas, S.
2018-03-01
One of the biggest challenges arising from modern large-scale pulsar surveys is the number of candidates generated. Here, we implemented several improvements to the machine learning (ML) classifier previously used by the LOFAR Tied-Array All-Sky Survey (LOTAAS) to look for new pulsars via filtering the candidates obtained during periodicity searches. To assist the ML algorithm, we have introduced new features which capture the frequency and time evolution of the signal and improved the signal-to-noise calculation accounting for broad profiles. We enhanced the ML classifier by including a third class characterizing RFI instances, allowing candidates arising from RFI to be isolated, reducing the false positive return rate. We also introduced a new training data set used by the ML algorithm that includes a large sample of pulsars misclassified by the previous classifier. Lastly, we developed an ensemble classifier comprised of five different Decision Trees. Taken together these updates improve the pulsar recall rate by 2.5 per cent, while also improving the ability to identify pulsars with wide pulse profiles, often misclassified by the previous classifier. The new ensemble classifier is also able to reduce the percentage of false positive candidates identified from each LOTAAS pointing from 2.5 per cent (~500 candidates) to 1.1 per cent (~220 candidates).
Raposo, Letícia M; Nobre, Flavio F
2017-08-30
Resistance to antiretrovirals (ARVs) is a major problem faced by HIV-infected individuals. Different rule-based algorithms were developed to infer HIV-1 susceptibility to antiretrovirals from genotypic data. However, there is discordance between them, resulting in difficulties for clinical decisions about which treatment to use. Here, we developed ensemble classifiers integrating three interpretation algorithms: Agence Nationale de Recherche sur le SIDA (ANRS), Rega, and the genotypic resistance interpretation system from Stanford HIV Drug Resistance Database (HIVdb). Three approaches were applied to develop a classifier with a single resistance profile: stacked generalization, a simple plurality vote scheme and the selection of the interpretation system with the best performance. The strategies were compared with Friedman's test and the performance of the classifiers was evaluated using the F-measure, sensitivity and specificity values. We found that the three strategies had similar performances for the selected antiretrovirals. In some cases, the stacking technique with naïve Bayes as the learning algorithm showed a statistically superior F-measure. This study demonstrates that ensemble classifiers can be an alternative tool for clinical decision-making since they provide a single resistance profile from the most commonly used resistance interpretation systems.
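The plurality vote scheme mentioned in this abstract could look roughly like the Python sketch below. The three resistance levels and the tie-breaking rule (defer to one nominated system when all three disagree) are assumptions made for illustration; the abstract does not say how ties were resolved, and `plurality_vote` and `fallback` are hypothetical names.

```python
from collections import Counter

def plurality_vote(calls, fallback):
    """Return the most frequent call; on a tie for first place, use calls[fallback]."""
    counts = Counter(calls.values())
    (top, n_top), *rest = counts.most_common()
    if rest and rest[0][1] == n_top:
        return calls[fallback]
    return top

# Two systems agree, so their call wins.
agreed = plurality_vote(
    {"ANRS": "resistant", "Rega": "resistant", "HIVdb": "intermediate"},
    fallback="HIVdb",
)

# All three systems disagree, so the nominated fallback system decides.
tied = plurality_vote(
    {"ANRS": "susceptible", "Rega": "intermediate", "HIVdb": "resistant"},
    fallback="HIVdb",
)
```

With three systems and three possible levels, a three-way split is the only tie that can occur, which is why a single fallback rule is enough in this sketch.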
DrugECs: An Ensemble System with Feature Subspaces for Accurate Drug-Target Interaction Prediction
Jiang, Jinjian; Wang, Nian; Zhang, Jun
2017-01-01
Background Drug-target interaction is key in drug discovery, especially in the design of new lead compounds. However, the work to find a new lead compound for a specific target is complicated and hard, and it often leads to many mistakes. Therefore, computational techniques are commonly adopted in drug design, which can save time and costs to a significant extent. Results To address the issue, a new prediction system is proposed in this work to identify drug-target interaction. First, drug-target pairs are encoded with a fragment technique and the software “PaDEL-Descriptor.” The fragment technique is for encoding target proteins: it divides each protein sequence into several fragments in order and encodes each fragment with several physicochemical properties of amino acids. The software “PaDEL-Descriptor” creates encoding vectors for drug molecules. Second, the dataset of drug-target pairs is resampled and several overlapped subsets are obtained, which are then input into a kNN (k-Nearest Neighbor) classifier to build an ensemble system. Conclusion Experimental results on the drug-target dataset showed that our method performs better and runs faster than the state-of-the-art predictors. PMID:28744468
An ensemble-based approach for breast mass classification in mammography images
NASA Astrophysics Data System (ADS)
Ribeiro, Patricia B.; Papa, João P.; Romero, Roseli A. F.
2017-03-01
Mammography analysis is an important tool that helps detect breast cancer at the very early stages of the disease, thus increasing the quality of life of hundreds of thousands of patients worldwide. In Computer-Aided Detection systems, the identification of mammograms with and without masses (without clinical findings) is highly needed to reduce the false positive rates regarding the automatic selection of regions of interest that may contain some suspicious content. In this work, we introduce a variant of the Optimum-Path Forest (OPF) classifier for breast mass identification, and we employ an ensemble-based approach that can enhance the effectiveness of individual classifiers for this purpose. The experiments also include the naïve OPF and a traditional neural network; the most accurate results were obtained with the ensemble of classifiers, with an accuracy of nearly 86%.
iEnhancer-EL: Identifying enhancers and their strength with ensemble learning approach.
Liu, Bin; Li, Kai; Huang, De-Shuang; Chou, Kuo-Chen
2018-06-07
Identification of enhancers and their strength is important because they play a critical role in controlling gene expression. Although some bioinformatics tools have been developed, they are limited to discriminating enhancers from non-enhancers. Recently, a two-layer predictor called "iEnhancer-2L" was developed that can be used to predict the enhancer's strength as well. However, its prediction quality needs further improvement to enhance the practical application value. A new predictor called "iEnhancer-EL" is proposed that contains two layers of predictors: the first (for identifying enhancers) is formed by fusing an array of six key individual classifiers, and the second (for their strength) is formed by fusing an array of ten key individual classifiers. All these key classifiers were selected from 171 elementary classifiers formed by SVM (Support Vector Machine) based on kmer, subsequence profile, and PseKNC (Pseudo K-tuple Nucleotide Composition), respectively. Rigorous cross-validations have indicated that the proposed predictor is remarkably superior to the existing state-of-the-art one in this area. A web server for the iEnhancer-EL has been established at http://bioinformatics.hitsz.edu.cn/iEnhancer-EL/, by which users can easily get their desired results without the need to go through the mathematical details. Contact: bliu@hit.edu.cn, dshuang@tongji.edu.cn or kcchou@gordonlifescience.org. Supplementary data are available at Bioinformatics online.
An Ensemble Framework Coping with Instability in the Gene Selection Process.
Castellanos-Garzón, José A; Ramos, Juan; López-Sánchez, Daniel; de Paz, Juan F; Corchado, Juan M
2018-03-01
This paper proposes an ensemble framework for gene selection, which is aimed at addressing instability problems presented in the gene filtering task. The complex process of gene selection from gene expression data faces different instability problems from the informative gene subsets found by different filter methods. This makes the identification of significant genes by the experts difficult. The instability of results can come from filter methods, gene classifier methods, different datasets of the same disease and multiple valid groups of biomarkers. Even though there is a wide number of proposals, the complexity imposed by this problem remains a challenge today. This work proposes a framework involving five stages of gene filtering to discover biomarkers for diagnosis and classification tasks. This framework performs a process of stable feature selection, facing the problems above and, thus, providing a more suitable and reliable solution for clinical and research purposes. Our proposal involves a process of multistage gene filtering, in which several ensemble strategies for gene selection were added in such a way that different classifiers simultaneously assess gene subsets to face instability. Firstly, we apply an ensemble of recent gene selection methods to obtain diversity in the genes found (stability according to filter methods). Next, we apply an ensemble of known classifiers to filter genes relevant to all classifiers at a time (stability according to classification methods). The achieved results were evaluated in two different datasets of the same disease (pancreatic ductal adenocarcinoma), in search of stability according to the disease, for which promising results were achieved.
Pai, Priyadarshini P; Mondal, Sukanta
2016-10-01
Proteins interact with carbohydrates to perform various cellular interactions. Of the many carbohydrate ligands that proteins bind with, mannose constitutes an important class, playing important roles in host defense mechanisms. Accurate identification of mannose-interacting residues (MIR) may provide important clues to decipher the underlying mechanisms of protein-mannose interactions during infections. This study proposes an approach using an ensemble of base classifiers for prediction of MIR using their evolutionary information in the form of a position-specific scoring matrix. The base classifiers are random forests trained on different subsets of the training data set Dset128 using 10-fold cross-validation. The optimized ensemble of base classifiers, MOWGLI, is then used to predict MIR on protein chains of the test data set Dtestset29, where it showed a promising performance with 92.0% accurate prediction. An overall improvement of 26.6% in precision was observed upon comparison with the state of the art. It is hoped that this approach, yielding enhanced predictions, could eventually be used for applications in drug design and vaccine development.
Perceptron ensemble of graph-based positive-unlabeled learning for disease gene identification.
Jowkar, Gholam-Hossein; Mansoori, Eghbal G
2016-10-01
Identification of disease genes, using computational methods, is an important issue in biomedical and bioinformatics research. Based on observations that diseases with the same or similar phenotype have the same biological characteristics, researchers have tried to identify genes by using machine learning tools. In recent attempts, some semi-supervised learning methods, called positive-unlabeled learning, have been used for disease gene identification. In this paper, we present a Perceptron ensemble of graph-based positive-unlabeled learning (PEGPUL) on three types of biological attributes: gene ontologies, protein domains and protein-protein interaction networks. In our method, a reliable set of positive and negative genes is extracted using a co-training schema. Then, the similarity graph of genes is built using metric learning by concentrating on the multi-rank-walk method to perform inference from labeled genes. Finally, a Perceptron ensemble is learned from three weighted classifiers: multilevel support vector machine, k-nearest neighbor and decision tree. The main contributions of this paper are: (i) incorporating the statistical properties of gene data through choosing proper metrics, (ii) statistical evaluation of biological features, and (iii) the noise robustness of PEGPUL via the multilevel schema. In order to assess PEGPUL, we have applied it to 12950 genes, comprising 949 positive genes from six classes of diseases and 12001 unlabeled genes. Compared with some popular disease gene identification methods, the experimental results show that PEGPUL has reasonable performance. Copyright © 2016 Elsevier Ltd. All rights reserved.
Good Models Gone Bad: Quantifying and Predicting Parameter-Induced Climate Model Simulation Failures
NASA Astrophysics Data System (ADS)
Lucas, D. D.; Klein, R.; Tannahill, J.; Brandon, S.; Covey, C. C.; Domyancic, D.; Ivanova, D. P.
2012-12-01
Simulations using IPCC-class climate models are subject to fail or crash for a variety of reasons. Statistical analysis of the failures can yield useful insights to better understand and improve the models. During the course of uncertainty quantification (UQ) ensemble simulations to assess the effects of ocean model parameter uncertainties on climate simulations, we experienced a series of simulation failures of the Parallel Ocean Program (POP2). About 8.5% of our POP2 runs failed for numerical reasons at certain combinations of parameter values. We apply support vector machine (SVM) classification from the fields of pattern recognition and machine learning to quantify and predict the probability of failure as a function of the values of 18 POP2 parameters. The SVM classifiers readily predict POP2 failures in an independent validation ensemble, and are subsequently used to determine the causes of the failures via a global sensitivity analysis. Four parameters related to ocean mixing and viscosity are identified as the major sources of POP2 failures. Our method can be used to improve the robustness of complex scientific models to parameter perturbations and to better steer UQ ensembles. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and was funded by the Uncertainty Quantification Strategic Initiative Laboratory Directed Research and Development Project at LLNL under project tracking code 10-SI-013 (UCRL LLNL-ABS-569112).
NASA Astrophysics Data System (ADS)
Caminade, Cyril; Morse, Andy
2010-05-01
Climate variability is an important component in determining the incidence of a number of diseases with significant human/animal health and socioeconomic impacts. The most important diseases affecting health are vector-borne, such as malaria and Rift Valley Fever, and including those that are tick-borne, with over 3 billion of the world population at risk. Malaria alone is responsible for at least one million deaths annually, with 80% of malaria deaths occurring in sub-Saharan Africa. The climate has a large impact upon the incidence of vector-borne diseases: directly via the development rates and survival of both the pathogen and the vector, and indirectly through changes in the environmental conditions. A large ensemble of regional climate model simulations has been produced within the ENSEMBLES project framework for both the European and African continents. This work will present recent progress in human and animal disease modelling, based on high resolution climate observations and regional climate simulations. Preliminary results will be given as an illustration, including the impact of climate change upon bluetongue (a disease affecting cattle) over Europe and upon malaria and Rift Valley Fever over Africa. Malaria scenarios based on RCM ensemble simulations have been produced for West Africa. These simulations have been carried out using the Liverpool Malaria Model. Future projections highlight that the malaria incidence decreases at the northern edge of the Sahel and that the epidemic belt is shifted southward in autumn. This could lead to significant public health problems in the future as the population of Africa is expected to rise dramatically over the 21st century.
Cascianelli, Silvia; Scialpi, Michele; Amici, Serena; Forini, Nevio; Minestrini, Matteo; Fravolini, Mario Luca; Sinzinger, Helmut; Schillaci, Orazio; Palumbo, Barbara
2017-01-01
Artificial Intelligence (AI) is a very active Computer Science research field that aims to develop systems that mimic human intelligence, and it is helpful in many human activities, including Medicine. In this review we present some examples of the use of AI techniques, in particular automatic classifiers such as Artificial Neural Network (ANN), Support Vector Machine (SVM), Classification Tree (ClT) and ensemble methods like Random Forest (RF), that are able to analyze findings obtained by positron emission tomography (PET) or single-photon emission tomography (SPECT) scans of patients with Neurodegenerative Diseases, in particular Alzheimer's Disease. We also focus on techniques applied to preprocess data and reduce their dimensionality via feature selection or projection into a more representative domain (Principal Component Analysis - PCA - or Partial Least Squares - PLS - are examples of such methods); this is a crucial step when dealing with medical data, since it is necessary to compress patient information and retain only what is most useful for discriminating subjects into normal and pathological classes. The main literature on the application of these techniques to classify patients with neurodegenerative disease using data extracted from molecular imaging modalities is reported, showing that the increasing development of computer-aided diagnosis systems is a very promising contribution to the diagnostic process.
DCS-SVM: a novel semi-automated method for human brain MR image segmentation.
Ahmadvand, Ali; Daliri, Mohammad Reza; Hajiali, Mohammadtaghi
2017-11-27
In this paper, a novel method is proposed which appropriately segments magnetic resonance (MR) brain images into three main tissues. This paper proposes an extension of our previous work in which we suggested a combination of multiple classifiers (CMC)-based method named dynamic classifier selection-dynamic local training local Tanimoto index (DCS-DLTLTI) for MR brain image segmentation into three main cerebral tissues. This idea is used here and a novel method is developed that tries to use more complex and accurate classifiers like support vector machine (SVM) in the ensemble. This work is challenging because the CMC-based methods are time consuming, especially on huge datasets like three-dimensional (3D) brain MR images. Moreover, SVM is a powerful method for modeling datasets with a complex feature space, but it also has a huge computational cost for big datasets, especially those with strong interclass variability problems and with more than two classes, such as 3D brain images; SVM therefore cannot be used in DCS-DLTLTI as it stands. We therefore propose a novel approach named "DCS-SVM" to use SVM in DCS-DLTLTI to improve the accuracy of segmentation results. The proposed method is applied to well-known datasets of the Internet Brain Segmentation Repository (IBSR) and promising results are obtained.
Ali, Safdar; Majid, Abdul; Javed, Syed Gibran; Sattar, Mohsin
2016-06-01
Early prediction of breast cancer is important for effective treatment and survival. We developed an effective Cost-Sensitive Classifier with GentleBoost Ensemble (Can-CSC-GBE) for the classification of breast cancer using protein amino acid features. In this work, first, discriminant information of the protein sequences related to breast tissue is extracted. Then, the physicochemical properties hydrophobicity and hydrophilicity of amino acids are employed to generate molecule descriptors in different feature spaces. For comparison, we obtained results by combining Cost-Sensitive learning with conventional ensemble of AdaBoostM1 and Bagging. The proposed Can-CSC-GBE system has effectively reduced the misclassification costs and thereby improved the overall classification performance. Our novel approach has highlighted promising results as compared to the state-of-the-art ensemble approaches. Copyright © 2016 Elsevier Ltd. All rights reserved.
Yang, Runtao; Zhang, Chengjin; Gao, Rui; Zhang, Lina
2015-09-07
Antifreeze proteins (AFPs) play a pivotal role in the antifreeze effect of overwintering organisms. They have a wide range of applications in numerous fields, such as improving the production of crops and the quality of frozen foods. Accurate identification of AFPs may provide important clues to decipher the underlying mechanisms of AFPs in ice-binding and to facilitate the selection of the most appropriate AFPs for several applications. Based on an ensemble learning technique, this study proposes an AFP identification system called AFP-Ensemble. In this system, random forest classifiers are trained by different training subsets and then aggregated into a consensus classifier by majority voting. The resulting predictor yields a sensitivity of 0.892, a specificity of 0.940, an accuracy of 0.938 and a balanced accuracy of 0.916 on an independent dataset, which are far better than the results obtained by previous methods. These results reveal that AFP-Ensemble is an effective and promising predictor for large-scale determination of AFPs. The detailed feature analysis in this study may give useful insights into the molecular mechanisms of AFP-ice interactions and provide guidance for the related experimental validation. A web server has been designed to implement the proposed method.
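The aggregation step described above (base classifiers trained on different training subsets, then combined by majority voting) can be sketched as follows. This is a minimal illustration, not the AFP-Ensemble implementation: a trivial one-feature threshold classifier stands in for the random forests, and the toy data and all function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(samples):
    """Fit a one-feature threshold classifier: predict 1 when x >= threshold."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    if not pos or not neg:
        # Degenerate subset with a single class: fall back to the overall mean.
        return sum(x for x, _ in samples) / len(samples)
    # Midpoint between the two class means.
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def majority_vote(votes):
    """Return the most common label among the base classifiers' votes."""
    return Counter(votes).most_common(1)[0][0]

def train_ensemble(data, n_members=11, seed=0):
    """Train each member on a different bootstrap subset of the data."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_members)]

def predict(thresholds, x):
    """Aggregate the members' votes into a consensus label."""
    return majority_vote([1 if x >= t else 0 for t in thresholds])

# Hypothetical one-dimensional training data: (feature, label).
data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
ensemble = train_ensemble(data)
print(predict(ensemble, 0.05), predict(ensemble, 0.95))
```

An odd number of members avoids tied votes in the two-class case. As a side check on the figures quoted in the abstract, the balanced accuracy of 0.916 is the mean of the reported sensitivity and specificity: (0.892 + 0.940) / 2 = 0.916.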
Testing of the Support Vector Machine for Binary-Class Classification
NASA Technical Reports Server (NTRS)
Scholten, Matthew
2011-01-01
The Support Vector Machine is a powerful algorithm, useful in classifying data into species. The Support Vector Machines implemented in this research were used as classifiers for the final stage in a Multistage Autonomous Target Recognition system. A single kernel SVM known as SVMlight, and a modified version known as a Support Vector Machine with K-Means Clustering were used. These SVM algorithms were tested as classifiers under varying conditions. Image noise levels varied, and the orientation of the targets changed. The classifiers were then optimized to demonstrate their maximum potential as classifiers. Results demonstrate the reliability of SVM as a method for classification. From trial to trial, SVM produces consistent results.
Linear Reconstruction of Non-Stationary Image Ensembles Incorporating Blur and Noise Models
1998-03-01
for phase distortions due to noise which leads to less deblurring as noise increases [41]. In contrast, the vector Wiener filter incorporates some a...AFIT/DS/ENG/98- 06 Linear Reconstruction of Non-Stationary Image Ensembles Incorporating Blur and Noise Models DISSERTATION Stephen D. Ford Captain...Dissertation 4. TITLE AND SUBTITLE 5. FUNDING NUMBERS LINEAR RECONSTRUCTION OF NON-STATIONARY IMAGE ENSEMBLES INCORPORATING BLUR AND NOISE MODELS 6. AUTHOR(S
Ensemble Semi-supervised Frame-work for Brain Magnetic Resonance Imaging Tissue Segmentation.
Azmi, Reza; Pishgoo, Boshra; Norozi, Narges; Yeganeh, Samira
2013-04-01
Tissue segmentation of brain magnetic resonance images (MRIs) is one of the most important parts of clinical diagnostic tools. Pixel classification methods, both supervised and unsupervised, have frequently been used for image segmentation. Supervised segmentation methods lead to high accuracy, but they need a large amount of labeled data, which is hard, expensive, and slow to obtain. Moreover, they cannot use unlabeled data to train classifiers. Unsupervised segmentation methods, on the other hand, have no prior knowledge and lead to a low level of performance. Semi-supervised learning, which uses a small amount of labeled data together with a large amount of unlabeled data, achieves higher accuracy with less trouble. In this paper, we propose an ensemble semi-supervised frame-work for segmenting brain magnetic resonance imaging (MRI) tissues that uses the results of several semi-supervised classifiers simultaneously. Selecting appropriate classifiers plays a significant role in the performance of this frame-work. Hence, we present two semi-supervised algorithms, expectation filtering maximization and MCo_Training, which are improved versions of the semi-supervised methods expectation maximization and Co_Training and which increase segmentation accuracy. We then use these improved classifiers together with a graph-based semi-supervised classifier as components of the ensemble frame-work. Experimental results show that the segmentation performance of this approach is higher than that of both supervised methods and the individual semi-supervised classifiers.
Hwang, Yoo Na; Lee, Ju Hwan; Kim, Ga Young; Shin, Eun Seok; Kim, Sung Min
2018-01-01
The purpose of this study was to propose a hybrid ensemble classifier to characterize coronary plaque regions in intravascular ultrasound (IVUS) images. Pixels were allocated to one of four tissues (fibrous tissue (FT), fibro-fatty tissue (FFT), necrotic core (NC), and dense calcium (DC)) through processes of border segmentation, feature extraction, feature selection, and classification. Grayscale IVUS images and their corresponding virtual histology images were acquired from 11 patients with known or suspected coronary artery disease using a 20 MHz catheter. A total of 102 hybrid textural features including first order statistics (FOS), gray level co-occurrence matrix (GLCM), extended gray level run-length matrix (GLRLM), Laws, local binary pattern (LBP), intensity, and discrete wavelet features (DWF) were extracted from IVUS images. To select optimal feature sets, a genetic algorithm was implemented. A hybrid ensemble classifier based on histogram and texture information was then used for plaque characterization in this study. The optimal feature set was used as input to this ensemble classifier. After tissue characterization, parameters including sensitivity, specificity, and accuracy were calculated to validate the proposed approach. A ten-fold cross validation approach was used to determine the statistical significance of the proposed method. Our experimental results showed that the proposed method had reliable performance for tissue characterization in IVUS images. The hybrid ensemble classification method outperformed other existing methods by achieving characterization accuracy of 81% for FFT and 75% for NC. In addition, this study showed that Laws features (SSV and SAV) were key indicators for coronary tissue characterization. The proposed method had high clinical applicability for image-based tissue characterization. Copyright © 2017 Elsevier B.V. All rights reserved.
Sequence Based Prediction of Antioxidant Proteins Using a Classifier Selection Strategy
Zhang, Lina; Zhang, Chengjin; Gao, Rui; Yang, Runtao; Song, Qing
2016-01-01
Antioxidant proteins perform significant functions in maintaining oxidation/antioxidation balance and have therapeutic potential for some diseases. Accurate identification of antioxidant proteins could contribute to revealing physiological processes of oxidation/antioxidation balance and developing novel antioxidation-based drugs. In this study, an ensemble method is presented to predict antioxidant proteins with hybrid features, incorporating SSI (Secondary Structure Information), PSSM (Position Specific Scoring Matrix), RSA (Relative Solvent Accessibility), and CTD (Composition, Transition, Distribution). The prediction results of the ensemble predictor are determined by an average of prediction results of multiple base classifiers. Based on a classifier selection strategy, we obtain an optimal ensemble classifier composed of RF (Random Forest), SMO (Sequential Minimal Optimization), NNA (Nearest Neighbor Algorithm), and J48 with an accuracy of 0.925. A method combining Relief with IFS (Incremental Feature Selection) is adopted to obtain optimal features from hybrid features. With the optimal features, the ensemble method achieves improved performance with a sensitivity of 0.95, a specificity of 0.93, an accuracy of 0.94, and an MCC (Matthews Correlation Coefficient) of 0.880, far better than the existing method. To evaluate the prediction performance objectively, the proposed method is compared with existing methods on the same independent testing dataset. Encouragingly, our method performs better than previous studies. In addition, our method achieves more balanced performance with a sensitivity of 0.878 and a specificity of 0.860. These results suggest that the proposed ensemble method can be a potential candidate for antioxidant protein prediction. For public access, we develop a user-friendly web server for antioxidant protein identification that is freely accessible at http://antioxidant.weka.cc. PMID:27662651
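The averaging rule and the evaluation measures named in this abstract can be illustrated with a short sketch. It assumes each base classifier outputs a positive-class score in [0, 1] and that a 0.5 cut-off is applied to the average; the scores, labels, and function names are hypothetical and are not taken from the paper.

```python
import math

def average_ensemble(score_lists, threshold=0.5):
    """Average per-sample positive-class scores from several base classifiers."""
    n = len(score_lists)
    mean = [sum(col) / n for col in zip(*score_lists)]
    return [1 if p >= threshold else 0 for p in mean]

def confusion(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient.

    Assumes both classes occur in y_true, so the first two ratios are defined.
    """
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical scores from four base classifiers (rows) on five samples (columns).
scores = [
    [0.9, 0.2, 0.6, 0.4, 0.1],
    [0.8, 0.3, 0.4, 0.7, 0.2],
    [0.7, 0.1, 0.7, 0.6, 0.3],
    [0.6, 0.4, 0.5, 0.1, 0.2],
]
y_true = [1, 0, 1, 0, 0]
y_pred = average_ensemble(scores)
print(y_pred, metrics(y_true, y_pred))
```

Averaging lets a confident minority outweigh a hesitant majority: on the fourth sample two of the four classifiers score above 0.5, yet the mean stays below the cut-off.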
Deep multi-spectral ensemble learning for electronic cleansing in dual-energy CT colonography
NASA Astrophysics Data System (ADS)
Tachibana, Rie; Näppi, Janne J.; Hironaka, Toru; Kim, Se Hyung; Yoshida, Hiroyuki
2017-03-01
We developed a novel electronic cleansing (EC) method for dual-energy CT colonography (DE-CTC) based on an ensemble deep convolution neural network (DCNN) and multi-spectral multi-slice image patches. In the method, an ensemble DCNN is used to classify each voxel of a DE-CTC image volume into five classes: luminal air, soft tissue, tagged fecal materials, and partial-volume boundaries between air and tagging and those between soft tissue and tagging. Each DCNN acts as a voxel classifier, where an input image patch centered at the voxel is generated as input to the DCNNs. An image patch has three channels that are mapped from a region-of-interest containing the image plane of the voxel and the two adjacent image planes. Six different types of spectral input image datasets were derived using two dual-energy CT images, two virtual monochromatic images, and two material images. An ensemble DCNN was constructed by use of a meta-classifier that combines the output of multiple DCNNs, each of which was trained with a different type of multi-spectral image patches. The electronically cleansed CTC images were calculated by removal of regions classified as other than soft tissue, followed by a colon surface reconstruction. For pilot evaluation, 359 volumes of interest (VOIs) representing sources of subtraction artifacts observed in current EC schemes were sampled from 30 clinical CTC cases. Preliminary results showed that the ensemble DCNN can yield high accuracy in labeling of the VOIs, indicating that deep learning of multi-spectral EC with multi-slice imaging could accurately remove residual fecal materials from CTC images without generating major EC artifacts.
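The three-channel input patch described above (the image plane of the voxel plus its two adjacent planes) can be sketched with plain nested lists. The volume layout `volume[z][y][x]`, the patch size, and the clamping of indices at the volume border are assumptions of this sketch, not details given in the abstract.

```python
def extract_patch(volume, z, y, x, half=1):
    """Return a 3-channel patch centered at voxel (z, y, x).

    The channels are the plane below, the voxel's own plane, and the plane
    above. Indices outside the volume are clamped to the nearest valid index
    (an assumption made here for border voxels).
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])

    def clamp(i, n):
        return max(0, min(n - 1, i))

    patch = []
    for dz in (-1, 0, 1):
        plane = volume[clamp(z + dz, depth)]
        patch.append([
            [plane[clamp(y + dy, height)][clamp(x + dx, width)]
             for dx in range(-half, half + 1)]
            for dy in range(-half, half + 1)
        ])
    return patch  # shape: 3 x (2*half + 1) x (2*half + 1)

# Toy 4x4x4 volume whose voxel value encodes its own coordinates.
volume = [[[100 * z + 10 * y + x for x in range(4)] for y in range(4)] for z in range(4)]
patch = extract_patch(volume, 2, 1, 1)
print(patch[1][1][1])
```

Each such patch would then be passed to a voxel classifier; in the abstract that role is played by the DCNNs, which are not reproduced here.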
A Fuzzy Integral Ensemble Method in Visual P300 Brain-Computer Interface.
Cavrini, Francesco; Bianchi, Luigi; Quitadamo, Lucia Rita; Saggio, Giovanni
2016-01-01
We evaluate the possibility of application of combination of classifiers using fuzzy measures and integrals to Brain-Computer Interface (BCI) based on electroencephalography. In particular, we present an ensemble method that can be applied to a variety of systems and evaluate it in the context of a visual P300-based BCI. Offline analysis of data relative to 5 subjects lets us argue that the proposed classification strategy is suitable for BCI. Indeed, the achieved performance is significantly greater than the average of the base classifiers and, broadly speaking, similar to that of the best one. Thus the proposed methodology allows realizing systems that can be used by different subjects without the need for a preliminary configuration phase in which the best classifier for each user has to be identified. Moreover, the ensemble is often capable of detecting uncertain situations and turning them from misclassifications into abstentions, thereby improving the level of safety in BCI for environmental or device control.
Using Multimodal Input for Autonomous Decision Making for Unmanned Systems
NASA Technical Reports Server (NTRS)
Neilan, James H.; Cross, Charles; Rothhaar, Paul; Tran, Loc; Motter, Mark; Qualls, Garry; Trujillo, Anna; Allen, B. Danette
2016-01-01
Autonomous decision making in the presence of uncertainty is a deeply studied problem space, particularly in the area of autonomous systems operations for land, air, sea, and space vehicles. Various techniques ranging from single algorithm solutions to complex ensemble classifier systems have been utilized in a research context in solving mission critical flight decisions. Realizing systems on actual autonomous hardware, however, is a difficult systems integration problem that constitutes a majority of applied robotics development timelines. The ability to reliably and repeatedly classify objects during a vehicle's mission execution is vital for the vehicle to mitigate both static and dynamic environmental concerns such that the mission may be completed successfully and the vehicle may operate and return safely. In this paper, the Autonomy Incubator proposes and discusses an ensemble learning and recognition system planned for our autonomous framework, AEON, in selected domains, which fuses decision criteria, using prior experience on both the individual classifier layer and the ensemble layer to mitigate environmental uncertainty during operation.
Learning ensemble classifiers for diabetic retinopathy assessment.
Saleh, Emran; Błaszczyński, Jerzy; Moreno, Antonio; Valls, Aida; Romero-Aroca, Pedro; de la Riva-Fernández, Sofia; Słowiński, Roman
2018-04-01
Diabetic retinopathy is one of the most common comorbidities of diabetes. Unfortunately, the recommended annual screening of the eye fundus of diabetic patients is too resource-consuming. Therefore, it is necessary to develop tools that may help doctors to determine each patient's risk of developing this condition, so that patients with a low risk may be screened less frequently and the use of resources can be improved. This paper explores the use of two kinds of ensemble classifiers learned from data: fuzzy random forest and dominance-based rough set balanced rule ensemble. These classifiers use a small set of attributes which represent main risk factors to determine whether a patient is at risk of developing diabetic retinopathy. The levels of specificity and sensitivity obtained in the presented study are over 80%. This study is thus a first successful step towards the construction of a personalized decision support system that could help physicians in daily clinical practice. Copyright © 2017 Elsevier B.V. All rights reserved.
Bayesian Hierarchical Model Characterization of Model Error in Ocean Data Assimilation and Forecasts
2013-09-30
proof-of-concept results comparing a BHM surface wind ensemble with the increments in the surface momentum flux control vector in a four-dimensional...Surface Momentum Flux Ensembles from Summaries of BHM Winds (Mediterranean) include ocean current effect Td...Bayesian Hierarchical Model to provide surface momentum flux ensembles. 3 Figure 2: Domain of interest : squares indicate spatial locations where
NASA Technical Reports Server (NTRS)
Walker, H. F.
1979-01-01
In many pattern recognition problems, data vectors are classified although one or more of the data vector elements are missing. This problem occurs in remote sensing when the ground is obscured by clouds. Optimal linear discrimination procedures for classifying incomplete data vectors are discussed.
A transposase strategy for creating libraries of circularly permuted proteins.
Mehta, Manan M; Liu, Shirley; Silberg, Jonathan J
2012-05-01
A simple approach for creating libraries of circularly permuted proteins is described that is called PERMutation Using Transposase Engineering (PERMUTE). In PERMUTE, the transposase MuA is used to randomly insert a minitransposon that can function as a protein expression vector into a plasmid that contains the open reading frame (ORF) being permuted. A library of vectors that express different permuted variants of the ORF-encoded protein is created by: (i) using bacteria to select for target vectors that acquire an integrated minitransposon; (ii) excising the ensemble of ORFs that contain an integrated minitransposon from the selected vectors; and (iii) circularizing the ensemble of ORFs containing integrated minitransposons using intramolecular ligation. Construction of a Thermotoga neapolitana adenylate kinase (AK) library using PERMUTE revealed that this approach produces vectors that express circularly permuted proteins with distinct sequence diversity from existing methods. In addition, selection of this library for variants that complement the growth of Escherichia coli with a temperature-sensitive AK identified functional proteins with novel architectures, suggesting that PERMUTE will be useful for the directed evolution of proteins with new functions.
A transposase strategy for creating libraries of circularly permuted proteins
Mehta, Manan M.; Liu, Shirley; Silberg, Jonathan J.
2012-01-01
A simple approach for creating libraries of circularly permuted proteins is described that is called PERMutation Using Transposase Engineering (PERMUTE). In PERMUTE, the transposase MuA is used to randomly insert a minitransposon that can function as a protein expression vector into a plasmid that contains the open reading frame (ORF) being permuted. A library of vectors that express different permuted variants of the ORF-encoded protein is created by: (i) using bacteria to select for target vectors that acquire an integrated minitransposon; (ii) excising the ensemble of ORFs that contain an integrated minitransposon from the selected vectors; and (iii) circularizing the ensemble of ORFs containing integrated minitransposons using intramolecular ligation. Construction of a Thermotoga neapolitana adenylate kinase (AK) library using PERMUTE revealed that this approach produces vectors that express circularly permuted proteins with distinct sequence diversity from existing methods. In addition, selection of this library for variants that complement the growth of Escherichia coli with a temperature-sensitive AK identified functional proteins with novel architectures, suggesting that PERMUTE will be useful for the directed evolution of proteins with new functions. PMID:22319214
A deep learning-based multi-model ensemble method for cancer prediction.
Xiao, Yawen; Wu, Jun; Lin, Zongli; Zhao, Xiaodong
2018-01-01
Cancer is a complex worldwide health problem associated with high mortality. With the rapid development of the high-throughput sequencing technology and the application of various machine learning methods that have emerged in recent years, progress in cancer prediction has been increasingly made based on gene expression, providing insight into effective and accurate treatment decision making. Thus, developing machine learning methods, which can successfully distinguish cancer patients from healthy persons, is of great current interest. However, among the classification methods applied to cancer prediction so far, no one method outperforms all the others. In this paper, we demonstrate a new strategy, which applies deep learning to an ensemble approach that incorporates multiple different machine learning models. We supply informative gene data selected by differential gene expression analysis to five different classification models. Then, a deep learning method is employed to ensemble the outputs of the five classifiers. The proposed deep learning-based multi-model ensemble method was tested on three public RNA-seq data sets of three kinds of cancers, Lung Adenocarcinoma, Stomach Adenocarcinoma and Breast Invasive Carcinoma. The test results indicate that it increases the prediction accuracy of cancer for all the tested RNA-seq data sets as compared to using a single classifier or the majority voting algorithm. By taking full advantage of different classifiers, the proposed deep learning-based multi-model ensemble method is shown to be accurate and effective for cancer prediction. Copyright © 2017 Elsevier B.V. All rights reserved.
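The two-stage idea in this abstract, where the outputs of five base classifiers become the inputs of a second-stage learner, is commonly called stacking. The sketch below swaps the paper's deep learning combiner for a plain logistic regression trained by gradient descent, purely to keep the example self-contained; the base outputs, labels, and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta(base_outputs, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights on the base classifiers' outputs."""
    n_feat = len(base_outputs[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for row, y in zip(base_outputs, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, row)]
            grad_b += err
        # Batch gradient step on the mean log-loss.
        w = [wi - lr * g / len(labels) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(labels)
    return w, b

def predict_meta(w, b, row):
    """Combine one sample's base outputs into a final 0/1 decision."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) >= 0.5 else 0

# Hypothetical 0/1 outputs of five base classifiers (columns) on six samples (rows).
# The fifth classifier is deliberately unreliable.
base_outputs = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
labels = [1, 1, 1, 0, 0, 0]
w, b = fit_meta(base_outputs, labels)
print([predict_meta(w, b, row) for row in base_outputs])
```

Unlike majority voting, a learned combiner can weight the base classifiers unequally. On the fourth sample a simple majority of the five outputs would be wrong, whereas the trained combiner recovers the correct label. In practice the combiner should be fitted on held-out predictions rather than on the same data used to train the base classifiers.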
Ensemble Semi-supervised Frame-work for Brain Magnetic Resonance Imaging Tissue Segmentation
Azmi, Reza; Pishgoo, Boshra; Norozi, Narges; Yeganeh, Samira
2013-01-01
Brain magnetic resonance images (MRIs) tissue segmentation is one of the most important parts of the clinical diagnostic tools. Pixel classification methods have been frequently used in the image segmentation with two supervised and unsupervised approaches up to now. Supervised segmentation methods lead to high accuracy, but they need a large amount of labeled data, which is hard, expensive, and slow to obtain. Moreover, they cannot use unlabeled data to train classifiers. On the other hand, unsupervised segmentation methods have no prior knowledge and lead to low level of performance. However, semi-supervised learning which uses a few labeled data together with a large amount of unlabeled data causes higher accuracy with less trouble. In this paper, we propose an ensemble semi-supervised frame-work for segmenting of brain magnetic resonance imaging (MRI) tissues that it has been used results of several semi-supervised classifiers simultaneously. Selecting appropriate classifiers has a significant role in the performance of this frame-work. Hence, in this paper, we present two semi-supervised algorithms expectation filtering maximization and MCo_Training that are improved versions of semi-supervised methods expectation maximization and Co_Training and increase segmentation accuracy. Afterward, we use these improved classifiers together with graph-based semi-supervised classifier as components of the ensemble frame-work. Experimental results show that performance of segmentation in this approach is higher than both supervised methods and the individual semi-supervised classifiers. PMID:24098863
Gantner, Melisa E; Peroni, Roxana N; Morales, Juan F; Villalba, María L; Ruiz, María E; Talevi, Alan
2017-08-28
Breast Cancer Resistance Protein (BCRP) is an ATP-dependent efflux transporter linked to the multidrug resistance phenomenon in many diseases such as epilepsy and cancer and a potential source of drug interactions. For these reasons, the early identification of substrates and nonsubstrates of this transporter during the drug discovery stage is of great interest. We have developed a computational nonlinear model ensemble based on conformation-independent molecular descriptors using a combined strategy of genetic algorithms, J48 decision tree classifiers, and data fusion. The best model ensemble consists of averaging the ranking of the 12 decision trees that showed the best performance on the training set, which also demonstrated a good performance for the test set. It was experimentally validated using the ex vivo everted rat intestinal sac model. Five anticonvulsant drugs classified as nonsubstrates for BCRP by the model ensemble were experimentally evaluated, and none of them proved to be a BCRP substrate under the experimental conditions used, thus confirming the predictive ability of the model ensemble. The model ensemble reported here is a potentially valuable tool to be used as an in silico ADME filter in computer-aided drug discovery campaigns intended to overcome BCRP-mediated multidrug resistance issues and to prevent drug-drug interactions.
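The data fusion step described above, averaging the ranking produced by each model, can be sketched as follows. Three hypothetical models stand in for the 12 decision trees, and the convention that a higher score means a higher rank is an assumption of this example.

```python
def ranks(scores):
    """Rank items by score; rank 1 is the highest score (ties broken by position)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def average_rank(score_table):
    """Average each compound's rank over all models (one row per model)."""
    per_model = [ranks(row) for row in score_table]
    return [sum(col) / len(per_model) for col in zip(*per_model)]

# Hypothetical scores from three models (rows) for four compounds (columns).
score_table = [
    [0.9, 0.1, 0.5, 0.3],
    [0.8, 0.2, 0.4, 0.6],
    [0.7, 0.3, 0.6, 0.1],
]
print(average_rank(score_table))
```

Working with ranks rather than raw scores makes the fusion insensitive to differences in how each model scales its output.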
Classification of Odours for Mobile Robots Using an Ensemble of Linear Classifiers
NASA Astrophysics Data System (ADS)
Trincavelli, Marco; Coradeschi, Silvia; Loutfi, Amy
2009-05-01
This paper investigates the classification of odours using an electronic nose mounted on a mobile robot. The samples are collected as the robot explores the environment. Under such conditions, the sensor response differs from typical three phase sampling processes. In this paper, we focus particularly on the classification problem and how it is influenced by the movement of the robot. To cope with these influences, an algorithm consisting of an ensemble of classifiers is presented. Experimental results show that this algorithm increases classification performance compared to other traditional classification methods.
Chemical entity recognition in patents by combining dictionary-based and statistical approaches
Akhondi, Saber A.; Pons, Ewoud; Afzal, Zubair; van Haagen, Herman; Becker, Benedikt F.H.; Hettne, Kristina M.; van Mulligen, Erik M.; Kors, Jan A.
2016-01-01
We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents PMID:27141091
Georgiadis, Pantelis; Cavouras, Dionisis; Kalatzis, Ioannis; Glotsos, Dimitris; Athanasiadis, Emmanouil; Kostopoulos, Spiros; Sifaki, Koralia; Malamas, Menelaos; Nikiforidis, George; Solomou, Ekaterini
2009-01-01
Three-dimensional (3D) texture analysis of volumetric brain magnetic resonance (MR) images has been identified as an important indicator for discriminating among different brain pathologies. The purpose of this study was to evaluate the efficiency of 3D textural features using a pattern recognition system in the task of discriminating benign, malignant and metastatic brain tissues on T1 postcontrast MR imaging (MRI) series. The dataset consisted of 67 brain MRI series obtained from patients with verified and untreated intracranial tumors. The pattern recognition system was designed as an ensemble classification scheme employing a support vector machine classifier, specially modified in order to integrate the least squares features transformation logic in its kernel function. The latter, in conjunction with using 3D textural features, enabled boosting up the performance of the system in discriminating metastatic, malignant and benign brain tumors with 77.14%, 89.19% and 93.33% accuracy, respectively. The method was evaluated using an external cross-validation process; thus, results might be considered indicative of the generalization performance of the system to "unseen" cases. The proposed system might be used as an assisting tool for brain tumor characterization on volumetric MRI series.
Muhlbaier, Michael D; Topalis, Apostolos; Polikar, Robi
2009-01-01
We have previously introduced an incremental learning algorithm Learn(++), which learns novel information from consecutive data sets by generating an ensemble of classifiers with each data set, and combining them by weighted majority voting. However, Learn(++) suffers from an inherent "outvoting" problem when asked to learn a new class omega(new) introduced by a subsequent data set, as earlier classifiers not trained on this class are guaranteed to misclassify omega(new) instances. The collective votes of earlier classifiers, for an inevitably incorrect decision, then outweigh the votes of the new classifiers' correct decision on omega(new) instances--until there are enough new classifiers to counteract the unfair outvoting. This forces Learn(++) to generate an unnecessarily large number of classifiers. This paper describes Learn(++).NC, specifically designed for efficient incremental learning of multiple new classes using significantly fewer classifiers. To do so, Learn(++).NC introduces dynamically weighted consult and vote (DW-CAV), a novel voting mechanism for combining classifiers: individual classifiers consult with each other to determine which ones are most qualified to classify a given instance, and decide how much weight, if any, each classifier's decision should carry. Experiments on real-world problems indicate that the new algorithm performs remarkably well with substantially fewer classifiers, not only as compared to its predecessor Learn(++), but also as compared to several other algorithms recently proposed for similar problems.
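The outvoting problem described above can be reproduced with a few lines of weighted majority voting. The weight log(1/beta), with beta = error / (1 - error), is assumed here as the AdaBoost-style weighting; the class names and error rates are hypothetical, and the DW-CAV mechanism itself is not implemented.

```python
import math
from collections import defaultdict

def voting_weight(error):
    """Weight log(1/beta) with beta = error / (1 - error); assumes 0 < error < 0.5."""
    return math.log((1.0 - error) / error)

def weighted_majority_vote(predictions, errors):
    """Return the label that collects the largest total voting weight."""
    totals = defaultdict(float)
    for label, error in zip(predictions, errors):
        totals[label] += voting_weight(error)
    return max(totals, key=totals.get)

# Three earlier classifiers never saw class "C", so they can only vote for a
# class they know; the single new classifier recognises "C" but is outvoted.
early = weighted_majority_vote(["A", "A", "A", "C"], [0.10] * 4)

# Only once the new classifiers outnumber the old ones does "C" prevail.
later = weighted_majority_vote(["A", "A", "A", "C", "C", "C", "C"], [0.10] * 7)
print(early, later)
```

With equal error rates every vote carries the same weight, so four new classifiers are needed to overturn three old ones. This is the growth in ensemble size that the abstract says Learn(++).NC is designed to avoid.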
Dynamic Dimensionality Selection for Bayesian Classifier Ensembles
2015-03-19
learning of weights in an otherwise generatively learned naive Bayes classifier. WANBIA-C is very competitive with Logistic Regression but much more...classifier, Generative learning, Discriminative learning, Naïve Bayes, Feature selection, Logistic regression, higher order attribute independence 16...discriminative learning of weights in an otherwise generatively learned naive Bayes classifier. WANBIA-C is very competitive with Logistic Regression but
A novel method for in silico identification of regulatory SNPs in human genome.
Li, Rong; Zhong, Dexing; Liu, Ruiling; Lv, Hongqiang; Zhang, Xinman; Liu, Jun; Han, Jiuqiang
2017-02-21
Regulatory single nucleotide polymorphisms (rSNPs), a kind of functional noncoding genetic variant, can affect gene expression in a regulatory way, and they are thought to be associated with increased susceptibility to complex diseases. Here a novel computational approach to identify potential rSNPs is presented. Unlike most other rSNP-finding methods, which are based on the hypothesis that SNPs causing large allele-specific changes in transcription factor binding affinities are more likely to play regulatory functions, we use a set of documented, experimentally verified rSNPs and nonfunctional background SNPs to train classifiers, so that the discriminating features are found. To characterize variants, an extensive range of characteristics, such as sequence context, DNA structure and evolutionary conservation, are analyzed. A support vector machine is adopted to build the classifier model, together with an ensemble method to deal with unbalanced data. The 10-fold cross-validation result shows that our method achieves a sensitivity of ~78% and a specificity of ~82%. Furthermore, our method performs better than some other algorithms based on the aforementioned hypothesis in handling false positives. The original data and the MATLAB source code are available at https://sourceforge.net/projects/rsnppredict/. Copyright © 2016 Elsevier Ltd. All rights reserved.
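The abstract does not say which ensemble scheme is used to handle the unbalanced data. One common option, shown here only as an assumed illustration, is to split the majority class into chunks the size of the minority class and train one base classifier per balanced subset. The sample names are hypothetical and the SVM training itself is omitted.

```python
import random

def balanced_subsets(minority, majority, seed=0):
    """Pair all minority samples with successive chunks of the majority class.

    Every returned subset is balanced, so each base classifier trained on one
    of them sees both classes equally often.
    """
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    size = len(minority)
    chunks = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]
    # Drop a trailing chunk that is too small to balance the minority class.
    chunks = [chunk for chunk in chunks if len(chunk) == size]
    return [minority + chunk for chunk in chunks]

# Hypothetical data: 3 verified rSNPs against 10 background SNPs.
minority = [("rSNP", i) for i in range(3)]
majority = [("background", i) for i in range(10)]
subsets = balanced_subsets(minority, majority)
print(len(subsets), [len(s) for s in subsets])
```

The base classifiers trained on these subsets would then be combined, for example by majority voting, so that most of the majority-class data is still used without any single classifier being dominated by it.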
Ensemble of sparse classifiers for high-dimensional biological data.
Kim, Sunghan; Scalzo, Fabien; Telesca, Donatello; Hu, Xiao
2015-01-01
Biological data are often high in dimension while the number of samples is small. In such cases, the performance of classification can be improved by reducing the dimension of data, which is referred to as feature selection. Recently, a novel feature selection method has been proposed utilising the sparsity of high-dimensional biological data where a small subset of features accounts for most variance of the dataset. In this study we propose a new classification method for high-dimensional biological data, which performs both feature selection and classification within a single framework. Our proposed method utilises a sparse linear solution technique and the bootstrap aggregating algorithm. We tested its performance on four public mass spectrometry cancer datasets along with two other conventional classification techniques such as Support Vector Machines and Adaptive Boosting. The results demonstrate that our proposed method performs more accurate classification across various cancer datasets than those conventional classification techniques.
A thyroid nodule classification method based on TI-RADS
NASA Astrophysics Data System (ADS)
Wang, Hao; Yang, Yang; Peng, Bo; Chen, Qin
2017-07-01
Thyroid Imaging Reporting and Data System(TI-RADS) is a valuable tool for differentiating the benign and the malignant thyroid nodules. In clinic, doctors can determine the extent of being benign or malignant in terms of different classes by using TI-RADS. Classification represents the degree of malignancy of thyroid nodules. TI-RADS as a classification standard can be used to guide the ultrasonic doctor to examine thyroid nodules more accurately and reliably. In this paper, we aim to classify the thyroid nodules with the help of TI-RADS. To this end, four ultrasound signs, i.e., cystic and solid, echo pattern, boundary feature and calcification of thyroid nodules are extracted and converted into feature vectors. Then semi-supervised fuzzy C-means ensemble (SS-FCME) model is applied to obtain the classification results. The experimental results demonstrate that the proposed method can help doctors diagnose the thyroid nodules effectively.
Artificial Diversity and Defense Security (ADDSec) Final Report.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chavez, Adrian R.; Hamlet, Jason; Stout, William M.S.
Critical infrastructure systems continue to foster predictable communication patterns and static configurations over extended periods of time. The static nature of these systems eases the process of gathering reconnaissance information that can be used to design, develop, and launch attacks by adversaries. In this research effort, the early phases of an attack vector will be disrupted by randomizing application port numbers, IP addresses, and communication paths dynamically through the use of overlay networks within Industrial Control Systems (ICS). These protective measures convert static systems into "moving targets," adding an additional layer of defense. Additionally, we have developed a framework that automatically detects and defends against threats within these systems using an ensemble of machine learning algorithms that classify and categorize abnormal behavior. Our proof-of-concept has been demonstrated within a representative ICS environment. Performance metrics of our proof-of-concept have been captured with latency impacts of less than a millisecond, on average.
Prediction of in vivo hepatotoxicity effects using in vitro ...
High-throughput in vitro transcriptomics data support molecular understanding of chemical-induced toxicity. Here, we evaluated the utility of such data to predict liver toxicity. First, in vitro gene expression data for 93 genes was generated following exposure of metabolically competent HepaRG cells to 1060 environmental chemicals from the US EPA ToxCast library. The empirical relationship between these data and rat chronic liver endpoints from animal studies in the Toxicity Reference Database (ToxRefDB) was then evaluated using machine learning techniques. Chemicals were classified as positive (242) or negative (135) based on observed hepatic histopathologic effects, and divided into three categories: hypertrophy (183), injury (112) and proliferative lesions (101). Hepatotoxicants were classified on the basis of the bioactivity of 93 genes (descriptors) using six machine learning algorithms: linear discriminant analysis, naïve Bayes, support vector classification, classification and regression trees, k-nearest neighbors, and an ensemble of classifiers. Classification performance was evaluated using 10-fold cross-validation testing, and in-loop, filter-based, feature subset selection. The best balanced accuracy for prediction of hypertrophy, injury and proliferative lesions were 0.81 ± 0.07, 0.79 ± 0.08 and 0.77 ± 0.09, respectively. Gene specific perturbation of xenobiotic metabolism enzymes (CYP7A1/2E1/4A11/1A1/4A22) and transporters (ABCG2, ABCB11, SLC22
Singla, Neeru; Srivastava, Vishal; Mehta, Dalip Singh
2018-05-01
Malaria is a life-threatening infectious blood disease affecting humans and other animals, caused by parasitic protozoans of the genus Plasmodium, especially in developing countries. The gold standard method for the detection of malaria is microscopic examination of chemically treated blood smears. We developed an automated optical spatial coherence tomographic system using a machine learning approach for fast identification of malaria cells. In this study, 28 samples (15 healthy, 13 malaria infected stages of red blood cells) were imaged by the developed system and 13 features were extracted. We designed a multilevel ensemble-based classifier for the quantitative prediction of different stages of the malaria cells. The proposed classifier was evaluated with repeated k-fold cross-validation and achieved a high average accuracy of 97.9% for identifying the malaria-infected late trophozoite stage of cells. Overall, our proposed system and multilevel ensemble model have substantial quantifiable potential to detect the different stages of malaria infection without staining or an expert. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Optical vector network analysis of ultranarrow transitions in 166Er3+ : 7LiYF4 crystal.
Kukharchyk, N; Sholokhov, D; Morozov, O; Korableva, S L; Cole, J H; Kalachev, A A; Bushev, P A
2018-02-15
We present optical vector network analysis (OVNA) of an isotopically purified 166Er3+:7LiYF4 crystal. The OVNA method is based on generation and detection of a modulated optical sideband by using a radio-frequency vector network analyzer. This technique is widely used in the field of microwave photonics for the characterization of optical responses of optical devices such as filters and high-Q resonators. However, dense solid-state atomic ensembles induce a large phase shift on one of the optical sidebands that results in the appearance of extra features on the measured transmission response. We present a simple theoretical model that accurately describes the observed spectra and helps to reconstruct the absorption profile of a solid-state atomic ensemble as well as corresponding change of the refractive index in the vicinity of atomic resonances.
Visualization and classification of physiological failure modes in ensemble hemorrhage simulation
NASA Astrophysics Data System (ADS)
Zhang, Song; Pruett, William Andrew; Hester, Robert
2015-01-01
In an emergency situation such as hemorrhage, doctors need to predict which patients need immediate treatment and care. This task is difficult because of the diverse response to hemorrhage in human population. Ensemble physiological simulations provide a means to sample a diverse range of subjects and may have a better chance of containing the correct solution. However, to reveal the patterns and trends from the ensemble simulation is a challenging task. We have developed a visualization framework for ensemble physiological simulations. The visualization helps users identify trends among ensemble members, classify ensemble member into subpopulations for analysis, and provide prediction to future events by matching a new patient's data to existing ensembles. We demonstrated the effectiveness of the visualization on simulated physiological data. The lessons learned here can be applied to clinically-collected physiological data in the future.
A study of fuzzy logic ensemble system performance on face recognition problem
NASA Astrophysics Data System (ADS)
Polyakova, A.; Lipinskiy, L.
2017-02-01
Some problems are difficult to solve by using a single intelligent information technology (IIT). An ensemble of various data mining (DM) techniques is a set of models, each of which is able to solve the problem by itself, but whose combination increases the efficiency of the system as a whole. Using IIT ensembles can improve the reliability and efficiency of the final decision, since an ensemble exploits the diversity of its components. A new method of designing intelligent information technology ensembles is considered in this paper. It is based on fuzzy logic and is designed to solve classification and regression problems. The ensemble consists of several data mining algorithms: artificial neural network, support vector machine and decision trees. These algorithms and their ensemble have been tested by solving face recognition problems. Principal components analysis (PCA) is used for feature selection.
NASA Astrophysics Data System (ADS)
Liu, Di; Mishra, Ashok K.; Yu, Zhongbo
2016-07-01
This paper examines the combination of support vector machines (SVM) and the dual ensemble Kalman filter (EnKF) technique to estimate root zone soil moisture at different soil layers up to 100 cm depth. Multiple experiments are conducted in a data rich environment to construct and validate the SVM model and to explore the effectiveness and robustness of the EnKF technique. It was observed that the performance of SVM relies more on the initial length of training set than other factors (e.g., cost function, regularization parameter, and kernel parameters). The dual EnKF technique proved efficient in improving SVM with observed data either at each time step or at flexible time steps. The EnKF technique can reach its maximum efficiency when the updating ensemble size approaches a certain threshold. It was observed that the SVM model performance for the multi-layer soil moisture estimation can be influenced by the rainfall magnitude (e.g., dry and wet spells).
Shao, Wei; Liu, Mingxia; Zhang, Daoqiang
2016-01-01
The systematic study of subcellular location pattern is very important for fully characterizing the human proteome. Nowadays, with the great advances in automated microscopic imaging, accurate bioimage-based classification methods to predict protein subcellular locations are highly desired. All existing models were constructed on the independent parallel hypothesis, where the cellular component classes are positioned independently in a multi-class classification engine. The important structural information of cellular compartments is missed. To deal with this problem for developing more accurate models, we proposed a novel cell structure-driven classifier construction approach (SC-PSorter) by employing the prior biological structural information in the learning model. Specifically, the structural relationship among the cellular components is reflected by a new codeword matrix under the error correcting output coding framework. Then, we construct multiple SC-PSorter-based classifiers corresponding to the columns of the error correcting output coding codeword matrix using a multi-kernel support vector machine classification approach. Finally, we perform the classifier ensemble by combining those multiple SC-PSorter-based classifiers via majority voting. We evaluate our method on a collection of 1636 immunohistochemistry images from the Human Protein Atlas database. The experimental results show that our method achieves an overall accuracy of 89.0%, which is 6.4% higher than the state-of-the-art method. The dataset and code can be downloaded from https://github.com/shaoweinuaa/. dqzhang@nuaa.edu.cn Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
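The error correcting output coding (ECOC) step described above can be illustrated with a minimal sketch. It assumes one binary classifier per codeword column and decodes by nearest codeword in Hamming distance, a common ECOC decoding rule. The three class names and the 5-bit codewords are hypothetical placeholders; the abstract does not give the structure-driven codeword matrix or the exact voting scheme used by SC-PSorter.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(codewords, bits):
    """Return the class whose codeword is closest to the predicted bits.

    codewords: dict mapping class label -> tuple of 0/1 (one row of the matrix)
    bits: tuple of 0/1, one output per column-wise binary classifier
    """
    return min(codewords, key=lambda label: hamming(codewords[label], bits))


# Hypothetical codeword matrix: rows are classes, columns are binary problems.
# Every pair of rows differs in at least 3 positions, so one wrong bit is tolerated.
CODEWORDS = {
    "nucleus": (1, 1, 0, 0, 1),
    "cytoplasm": (0, 1, 1, 0, 0),
    "membrane": (1, 0, 1, 1, 0),
}

# The last classifier is wrong for a "nucleus" sample, yet decoding recovers it.
predicted = ecoc_decode(CODEWORDS, (1, 1, 0, 0, 0))
```

The design point is that redundancy in the codewords lets the ensemble absorb errors from individual binary classifiers.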
NASA Astrophysics Data System (ADS)
Cao, Kunlin; Bhagalia, Roshni; Sood, Anup; Brogi, Edi; Mellinghoff, Ingo K.; Larson, Steven M.
2015-03-01
Positron emission tomography (PET) using fluorodeoxyglucose (18F-FDG) is commonly used in the assessment of breast lesions by computing voxel-wise standardized uptake value (SUV) maps. Simple metrics derived from ensemble properties of SUVs within each identified breast lesion are routinely used for disease diagnosis. The maximum SUV within the lesion (SUVmax) is the most popular of these metrics. However these simple metrics are known to be error-prone and are susceptible to image noise. Finding reliable SUV map-based features that correlate to established molecular phenotypes of breast cancer (viz. estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2) expression) will enable non-invasive disease management. This study investigated 36 SUV features based on first and second order statistics, local histograms and texture of segmented lesions to predict ER and PR expression in 51 breast cancer patients. True ER and PR expression was obtained via immunohistochemistry (IHC) of tissue samples from each lesion. A supervised learning, adaptive boosting-support vector machine (AdaBoost-SVM), framework was used to select a subset of features to classify breast lesions into distinct phenotypes. Performance of the trained multi-feature classifier was compared against the baseline single-feature SUVmax classifier using receiver operating characteristic (ROC) curves. Results show that texture features encoding local lesion homogeneity extracted from gray-level co-occurrence matrices are the strongest discriminator of lesion ER expression. In particular, classifiers including these features increased prediction accuracy from 0.75 (baseline) to 0.82 and the area under the ROC curve from 0.64 (baseline) to 0.75.
NASA Astrophysics Data System (ADS)
Tehrany, Mahyat Shafapour; Pradhan, Biswajeet; Jebur, Mustafa Neamah
2014-05-01
Flood is one of the most devastating natural disasters that occur frequently in Terengganu, Malaysia. Recently, ensemble based techniques are getting extremely popular in flood modeling. In this paper, weights-of-evidence (WoE) model was utilized first, to assess the impact of classes of each conditioning factor on flooding through bivariate statistical analysis (BSA). Then, these factors were reclassified using the acquired weights and entered into the support vector machine (SVM) model to evaluate the correlation between flood occurrence and each conditioning factor. Through this integration, the weak point of WoE can be solved and the performance of the SVM will be enhanced. The spatial database included flood inventory, slope, stream power index (SPI), topographic wetness index (TWI), altitude, curvature, distance from the river, geology, rainfall, land use/cover (LULC), and soil type. Four kernel types of SVM (linear kernel (LN), polynomial kernel (PL), radial basis function kernel (RBF), and sigmoid kernel (SIG)) were used to investigate the performance of each kernel type. The efficiency of the new ensemble WoE and SVM method was tested using area under curve (AUC) which measured the prediction and success rates. The validation results proved the strength and efficiency of the ensemble method over the individual methods. The best results were obtained from RBF kernel when compared with the other kernel types. Success rate and prediction rate for ensemble WoE and RBF-SVM method were 96.48% and 95.67% respectively. The proposed ensemble flood susceptibility mapping method could assist researchers and local governments in flood mitigation strategies.
Aita, Takuyo; Nishigaki, Koichi
2012-11-01
To visualize a bird's-eye view of an ensemble of mitochondrial genome sequences for various species, we recently developed a novel method of mapping a biological sequence ensemble into Three-Dimensional (3D) vector space. First, we represented a biological sequence of a species s by a word-composition vector x(s), where its length |x(s)| represents the sequence length, and its unit vector x(s)/|x(s)| represents the relative composition of the K-tuple words through the sequence and the size of the dimension, N = 4^K, is the number of all possible words with the length of K. Second, we mapped the vector x(s) to the 3D position vector y(s), based on the two following simple principles: (1) |y(s)| = |x(s)| and (2) the angle between y(s) and y(t) maximally correlates with the angle between x(s) and x(t). The mitochondrial genome sequences for 311 species, including 177 Animalia, 85 Fungi and 49 Green plants, were mapped into 3D space by using K=7. The mapping was successful because the angles between vectors before and after the mapping highly correlated with each other (correlation coefficients were 0.92-0.97). Interestingly, the Animalia kingdom is distributed along a single arc belt (just like the Milky Way on a Celestial Globe), and the Fungi and Green plant kingdoms are distributed in a similar arc belt. These two arc belts intersect at their respective middle regions and form a cross structure just like a jet aircraft fuselage and its wings. This new mapping method will allow researchers to intuitively interpret the visual information presented in the maps in a highly effective manner. Copyright © 2012 Elsevier Inc. All rights reserved.
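The word-composition vector described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: words are counted with an overlapping sliding window over the alphabet ACGT, and the angle uses the Euclidean norm. The abstract does not say which norm or windowing the paper uses, and the 3D mapping step itself is not reproduced here.

```python
from collections import Counter
from itertools import product
from math import acos, sqrt


def word_composition_vector(sequence, k):
    """Count each of the 4**k possible K-tuple words in a DNA sequence.

    Assumption: overlapping windows, alphabet ACGT, words in lexicographic order.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get(word, 0) for word in words]


def euclidean_norm(vector):
    """Euclidean length of a vector (an assumed choice of norm)."""
    return sqrt(sum(c * c for c in vector))


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    cosine = sum(a * b for a, b in zip(u, v)) / (euclidean_norm(u) * euclidean_norm(v))
    return acos(max(-1.0, min(1.0, cosine)))  # clamp against rounding error


# Two short hypothetical sequences with K = 2, so N = 4**2 = 16 dimensions.
x_s = word_composition_vector("ACGTACGTAC", 2)
x_t = word_composition_vector("AAAACCCCGG", 2)
theta = angle_between(x_s, x_t)
```

With K=7, as in the abstract, each vector would have 4^7 = 16,384 components.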
Cuéllar, Ana Carolina; Kjær, Lene Jung; Kirkeby, Carsten; Skovgard, Henrik; Nielsen, Søren Achim; Stockmarr, Anders; Andersson, Gunnar; Lindstrom, Anders; Chirico, Jan; Lühken, Renke; Steinke, Sonja; Kiel, Ellen; Gethmann, Jörn; Conraths, Franz J; Larska, Magdalena; Hamnes, Inger; Sviland, Ståle; Hopp, Petter; Brugger, Katharina; Rubel, Franz; Balenghien, Thomas; Garros, Claire; Rakotoarivony, Ignace; Allène, Xavier; Lhoir, Jonathan; Chavernac, David; Delécolle, Jean-Claude; Mathieu, Bruno; Delécolle, Delphine; Setier-Rio, Marie-Laure; Venail, Roger; Scheid, Bethsabée; Chueca, Miguel Ángel Miranda; Barceló, Carlos; Lucientes, Javier; Estrada, Rosa; Mathis, Alexander; Tack, Wesley; Bødker, Rene
2018-02-27
Biting midges of the genus Culicoides (Diptera: Ceratopogonidae) are vectors of bluetongue virus (BTV), African horse sickness virus and Schmallenberg virus (SBV). Outbreaks of both BTV and SBV have affected large parts of Europe. The spread of these diseases depends largely on vector distribution and abundance. The aim of this analysis was to identify and quantify major spatial patterns and temporal trends in the distribution and seasonal variation of observed Culicoides abundance in nine countries in Europe. We gathered existing Culicoides data from Spain, France, Germany, Switzerland, Austria, Denmark, Sweden, Norway and Poland. In total, 31,429 Culicoides trap collections were available from 904 ruminant farms across these countries between 2007 and 2013. The Obsoletus ensemble was distributed widely in Europe and accounted for 83% of all 8,842,998 Culicoides specimens in the dataset, with the highest mean monthly abundance recorded in France, Germany and southern Norway. The Pulicaris ensemble accounted for only 12% of the specimens and had a relatively southerly and easterly spatial distribution compared to the Obsoletus ensemble. Culicoides imicola Kieffer was only found in Spain and the southernmost part of France. There was a clear spatial trend in the accumulated annual abundance from southern to northern Europe, with the Obsoletus ensemble steadily increasing from 4000 per year in southern Europe to 500,000 in Scandinavia. The Pulicaris ensemble showed a very different pattern, with an increase in the accumulated annual abundance from 1600 in Spain, peaking at 41,000 in northern Germany and then decreasing again toward northern latitudes. For the two species ensembles and C. imicola, the season began between January and April, with later start dates and increasingly shorter vector seasons at more northerly latitudes. 
We present the first maps of seasonal Culicoides abundance in large parts of Europe covering a gradient from southern Spain to northern Scandinavia. The identified temporal trends and spatial patterns are useful for planning the allocation of resources for international prevention and surveillance programmes in the European Union.
Minimal ensemble based on subset selection using ECG to diagnose categories of CAN.
Abawajy, Jemal; Kelarev, Andrei; Yi, Xun; Jelinek, Herbert F
2018-07-01
Early diagnosis of cardiac autonomic neuropathy (CAN) is critical for reversing or decreasing its progression and prevent complications. Diagnostic accuracy or precision is one of the core requirements of CAN detection. As the standard Ewing battery tests suffer from a number of shortcomings, research in automating and improving the early detection of CAN has recently received serious attention in identifying additional clinical variables and designing advanced ensembles of classifiers to improve the accuracy or precision of CAN diagnostics. Although large ensembles are commonly proposed for the automated diagnosis of CAN, large ensembles are characterized by slow processing speed and computational complexity. This paper applies ECG features and proposes a new ensemble-based approach for diagnosis of CAN progression. We introduce a Minimal Ensemble Based On Subset Selection (MEBOSS) for the diagnosis of all categories of CAN including early, definite and atypical CAN. MEBOSS is based on a novel multi-tier architecture applying classifier subset selection as well as the training subset selection during several steps of its operation. Our experiments determined the diagnostic accuracy or precision obtained in 5 × 2 cross-validation for various options employed in MEBOSS and other classification systems. The experiments demonstrate the operation of the MEBOSS procedure invoking the most effective classifiers available in the open source software environment SageMath. The results of our experiments show that for the large DiabHealth database of CAN related parameters MEBOSS outperformed other classification systems available in SageMath and achieved 94% to 97% precision in 5 × 2 cross-validation correctly distinguishing any two CAN categories to a maximum of five categorizations including control, early, definite, severe and atypical CAN. 
These results show that MEBOSS architecture is effective and can be recommended for practical implementations in systems for the diagnosis of CAN progression. Copyright © 2018 Elsevier B.V. All rights reserved.
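For readers unfamiliar with the 5 × 2 cross-validation protocol mentioned above, here is a minimal sketch of how its splits are usually generated: the data are shuffled five times, each time halved, and each half serves once for training and once for testing. The function name and the seed parameter are illustrative assumptions; the abstract does not describe how the splits were produced in SageMath.

```python
import random


def five_by_two_splits(n_samples, seed=0):
    """Yield 10 (train_indices, test_indices) pairs for 5 x 2 cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    half = n_samples // 2
    for _ in range(5):
        rng.shuffle(indices)
        first, second = indices[:half], indices[half:]
        yield first, second  # train on the first half, test on the second
        yield second, first  # then swap the roles of the two halves


splits = list(five_by_two_splits(10))
```

Each of the ten pairs gives one accuracy or precision estimate, and the reported figure is typically their average.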
Oliveira, Roberta B; Pereira, Aledir S; Tavares, João Manuel R S
2017-10-01
The number of deaths worldwide due to melanoma has risen in recent times, in part because melanoma is the most aggressive type of skin cancer. Computational systems have been developed to assist dermatologists in early diagnosis of skin cancer, or even to monitor skin lesions. However, there still remains a challenge to improve classifiers for the diagnosis of such skin lesions. The main objective of this article is to evaluate different ensemble classification models based on input feature manipulation to diagnose skin lesions. Input feature manipulation processes are based on feature subset selections from shape properties, colour variation and texture analysis to generate diversity for the ensemble models. Three subset selection models are presented here: (1) a subset selection model based on specific feature groups, (2) a correlation-based subset selection model, and (3) a subset selection model based on feature selection algorithms. Each ensemble classification model is generated using an optimum-path forest classifier and integrated with a majority voting strategy. The proposed models were applied on a set of 1104 dermoscopic images using a cross-validation procedure. The best results were obtained by the first ensemble classification model that generates a feature subset ensemble based on specific feature groups. The skin lesion diagnosis computational system achieved 94.3% accuracy, 91.8% sensitivity and 96.7% specificity. The input feature manipulation process based on specific feature subsets generated the greatest diversity for the ensemble classification model with very promising results. Copyright © 2017 Elsevier B.V. All rights reserved.
Absolute cosine-based SVM-RFE feature selection method for prostate histopathological grading.
Sahran, Shahnorbanun; Albashish, Dheeb; Abdullah, Azizi; Shukor, Nordashima Abd; Hayati Md Pauzi, Suria
2018-04-18
Feature selection (FS) methods are widely used in grading and diagnosing prostate histopathological images. In this context, FS is based on the texture features obtained from the lumen, nuclei, cytoplasm and stroma, all of which are important tissue components. However, it is difficult to represent the high-dimensional textures of these tissue components. To solve this problem, we propose a new FS method that enables the selection of features with minimal redundancy in the tissue components. We categorise tissue images based on the texture of individual tissue components via the construction of a single classifier and also construct an ensemble learning model by merging the values obtained by each classifier. Another issue that arises is overfitting due to the high-dimensional texture of individual tissue components. We propose a new FS method, SVM-RFE(AC), that integrates a Support Vector Machine-Recursive Feature Elimination (SVM-RFE) embedded procedure with an absolute cosine (AC) filter method to prevent redundancy in the selected features of the SVM-RFE and an unoptimised classifier in the AC. We conducted experiments on H&E histopathological prostate and colon cancer images with respect to three prostate classifications, namely benign vs. grade 3, benign vs. grade 4 and grade 3 vs. grade 4. The colon benchmark dataset requires a distinction between grades 1 and 2, which are the most difficult cases to distinguish in the colon domain. The results obtained by both the single and ensemble classification models (which uses the product rule as its merging method) confirm that the proposed SVM-RFE(AC) is superior to the other SVM and SVM-RFE-based methods. We developed an FS method based on SVM-RFE and AC and successfully showed that its use enabled the identification of the most crucial texture feature of each tissue component. Thus, it makes possible the distinction between multiple Gleason grades (e.g. grade 3 vs. grade 4) and its performance is far superior to other reported FS methods. Copyright © 2018 Elsevier B.V. All rights reserved.
HLPI-Ensemble: Prediction of human lncRNA-protein interactions based on ensemble strategy.
Hu, Huan; Zhang, Li; Ai, Haixin; Zhang, Hui; Fan, Yetian; Zhao, Qi; Liu, Hongsheng
2018-03-27
LncRNA plays an important role in many biological processes and in disease progression by binding to related proteins. However, the experimental methods for studying lncRNA-protein interactions are time-consuming and expensive. Although there are a few models designed to predict ncRNA-protein interactions, they all have some common drawbacks that limit their predictive performance. In this study, we present a model called HLPI-Ensemble designed specifically for human lncRNA-protein interactions. HLPI-Ensemble adopts the ensemble strategy based on three mainstream machine learning algorithms, Support Vector Machines (SVM), Random Forests (RF) and Extreme Gradient Boosting (XGB), to generate HLPI-SVM Ensemble, HLPI-RF Ensemble and HLPI-XGB Ensemble, respectively. The results of 10-fold cross-validation show that HLPI-SVM Ensemble, HLPI-RF Ensemble and HLPI-XGB Ensemble achieved AUCs of 0.95, 0.96 and 0.96, respectively, in the test dataset. Furthermore, we compared the performance of the HLPI-Ensemble models with the previous models through an external validation dataset. The results show that the false positives (FPs) of HLPI-Ensemble models are much lower than those of the previous models, and other evaluation indicators of HLPI-Ensemble models are also higher than those of the previous models. It is further shown that HLPI-Ensemble models are superior in predicting human lncRNA-protein interaction compared with previous models. The HLPI-Ensemble is publicly available at: http://ccsipb.lnu.edu.cn/hlpiensemble/ .
Choi, Joon Yul; Yoo, Tae Keun; Seo, Jeong Gi; Kwak, Jiyong; Um, Terry Taewoong; Rim, Tyler Hyungtaek
2017-01-01
Deep learning emerges as a powerful tool for analyzing medical images. Retinal disease detection by using computer-aided diagnosis from fundus image has emerged as a new method. We applied deep learning convolutional neural network by using MatConvNet for an automated detection of multiple retinal diseases with fundus photographs involved in STructured Analysis of the REtina (STARE) database. Dataset was built by expanding data on 10 categories, including normal retina and nine retinal diseases. The optimal outcomes were acquired by using a random forest transfer learning based on VGG-19 architecture. The classification results depended greatly on the number of categories. As the number of categories increased, the performance of deep learning models was diminished. When all 10 categories were included, we obtained results with an accuracy of 30.5%, relative classifier information (RCI) of 0.052, and Cohen's kappa of 0.224. Considering three integrated normal, background diabetic retinopathy, and dry age-related macular degeneration, the multi-categorical classifier showed accuracy of 72.8%, 0.283 RCI, and 0.577 kappa. In addition, several ensemble classifiers enhanced the multi-categorical classification performance. The transfer learning incorporated with ensemble classifier of clustering and voting approach presented the best performance with accuracy of 36.7%, 0.053 RCI, and 0.225 kappa in the 10 retinal diseases classification problem. First, due to the small size of datasets, the deep learning techniques in this study were ineffective to be applied in clinics where numerous patients suffering from various types of retinal disorders visit for diagnosis and treatment. Second, we found that the transfer learning incorporated with ensemble classifiers can improve the classification performance in order to detect multi-categorical retinal diseases. Further studies should confirm the effectiveness of algorithms with large datasets obtained from hospitals.
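The abstract above reports Cohen's kappa alongside accuracy, which matters because accuracy alone is hard to interpret across 10 categories. A minimal sketch of the standard kappa formula, kappa = (p_o - p_e) / (1 - p_e), follows; the labels in the example are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions beyond chance.

    Undefined (division by zero) when chance agreement equals 1, e.g. when
    both label lists contain a single identical class.
    """
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal class frequencies of both label lists.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented example: three retinal categories, six images.
labels = ["normal", "normal", "BDR", "BDR", "AMD", "AMD"]
guesses = ["normal", "BDR", "BDR", "BDR", "AMD", "normal"]
kappa = cohens_kappa(labels, guesses)
```

A kappa of 0 means agreement no better than chance and 1 means perfect agreement.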
Facial Expression Recognition using Multiclass Ensemble Least-Square Support Vector Machine
NASA Astrophysics Data System (ADS)
Lawi, Armin; Sya'Rani Machrizzandi, M.
2018-03-01
Facial expression is one of the behavioral characteristics of human beings. The use of a biometric technology system with facial expression characteristics makes it possible to recognize a person’s mood or emotion. The basic components of a facial expression analysis system are face detection, face image extraction, facial classification and facial expression recognition. This paper uses the Principal Component Analysis (PCA) algorithm to extract facial features with expression parameters, i.e., happy, sad, neutral, angry, fear, and disgusted. Then a Multiclass Ensemble Least-Squares Support Vector Machine (MELS-SVM) is used for the classification of facial expressions. The MELS-SVM model, evaluated on our 185 different expression images of 10 persons, showed a high accuracy level of 99.998% using an RBF kernel.
A unified classifier for robust face recognition based on combining multiple subspace algorithms
NASA Astrophysics Data System (ADS)
Ijaz Bajwa, Usama; Ahmad Taj, Imtiaz; Waqas Anwar, Muhammad
2012-10-01
Face recognition being the fastest growing biometric technology has expanded manifold in the last few years. Various new algorithms and commercial systems have been proposed and developed. However, none of the proposed or developed algorithm is a complete solution because it may work very well on one set of images with say illumination changes but may not work properly on another set of image variations like expression variations. This study is motivated by the fact that any single classifier cannot claim to show generally better performance against all facial image variations. To overcome this shortcoming and achieve generality, combining several classifiers using various strategies has been studied extensively also incorporating the question of suitability of any classifier for this task. The study is based on the outcome of a comprehensive comparative analysis conducted on a combination of six subspace extraction algorithms and four distance metrics on three facial databases. The analysis leads to the selection of the most suitable classifiers which performs better on one task or the other. These classifiers are then combined together onto an ensemble classifier by two different strategies of weighted sum and re-ranking. The results of the ensemble classifier show that these strategies can be effectively used to construct a single classifier that can successfully handle varying facial image conditions of illumination, aging and facial expressions.
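Of the two combination strategies named above, the weighted sum is the simpler to illustrate. The sketch below assumes each classifier outputs a similarity score per candidate identity (higher is better), min-max normalises each classifier's scores, and adds them with fixed weights. The normalisation, the weights and the identity names are all assumptions for illustration; the abstract does not specify them, and the re-ranking strategy is not shown.

```python
def weighted_sum_fusion(score_tables, weights):
    """Fuse per-classifier similarity scores and return the best identity.

    score_tables: list of dicts mapping identity -> similarity score
    weights: one non-negative weight per classifier
    """
    fused = {}
    for scores, weight in zip(score_tables, weights):
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # avoid dividing by zero on constant scores
        for identity, score in scores.items():
            normalised = (score - low) / span
            fused[identity] = fused.get(identity, 0.0) + weight * normalised
    return max(fused, key=fused.get)


# Hypothetical scores from two subspace classifiers for two enrolled identities.
classifier_a = {"person_1": 0.9, "person_2": 0.1}
classifier_b = {"person_1": 0.2, "person_2": 0.8}
winner = weighted_sum_fusion([classifier_a, classifier_b], [0.7, 0.3])
```

Changing the weights changes which classifier dominates, which is how such a scheme can favour the classifier best suited to a given image condition.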
Protein binding hot spots prediction from sequence only by a new ensemble learning method.
Hu, Shan-Shan; Chen, Peng; Wang, Bing; Li, Jinyan
2017-10-01
Hot spots are interfacial core areas of binding proteins, which have been applied as targets in drug design. Experimental methods for locating hot spot areas are costly in both time and expense. Recently, in silico computational methods have been widely used for hot spot prediction through sequence or structure characterization. As the structural information of proteins is not always solved, hot spot identification from amino acid sequences only is more useful for real-life applications. This work proposes a new sequence-based model that combines physicochemical features with the relative accessible surface area of amino acid sequences for hot spot prediction. The model consists of 83 classifiers involving the IBk (Instance-based k means) algorithm, where instances are encoded by important properties extracted from a total of 544 properties in the AAindex1 (Amino Acid Index) database. Then top-performance classifiers are selected to form an ensemble by a majority voting technique. The ensemble classifier outperforms the state-of-the-art computational methods, yielding an F1 score of 0.80 on the benchmark binding interface database (BID) test set. http://www2.ahu.edu.cn/pchen/web/HotspotEC.htm .
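As a rough illustration of the majority-voting step mentioned in this abstract, the sketch below combines binary predictions from several classifiers. The function name `majority_vote` and the toy predictions are hypothetical; the paper's IBk models and its classifier selection procedure are not reproduced here.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label list by majority vote.

    predictions: list of equal-length label lists, one per classifier.
    Ties go to the label encountered first (an assumption of this sketch).
    """
    combined = []
    for votes in zip(*predictions):
        # most_common orders equal counts by first occurrence
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Toy example: three hypothetical classifiers, four residues (1 = hot spot)
preds = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
ensemble = majority_vote(preds)
```

Using an odd number of voters, as here, avoids ties for binary labels.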
A Label Propagation Approach for Detecting Buried Objects in Handheld GPR Data
2016-04-17
regions of interest that correspond to locations with anomalous signatures. Second, a classifier (or an ensemble of classifiers ) is used to assign a...investigated for almost two decades and several classifiers have been developed. Most of these methods are based on the supervised learning paradigm where...labeled target and clutter signatures are needed to train a classifier to discriminate between the two classes. Typically, large and diverse labeled
Failure analysis of parameter-induced simulation crashes in climate models
NASA Astrophysics Data System (ADS)
Lucas, D. D.; Klein, R.; Tannahill, J.; Ivanova, D.; Brandon, S.; Domyancic, D.; Zhang, Y.
2013-01-01
Simulations using IPCC-class climate models are subject to fail or crash for a variety of reasons. Quantitative analysis of the failures can yield useful insights to better understand and improve the models. During the course of uncertainty quantification (UQ) ensemble simulations to assess the effects of ocean model parameter uncertainties on climate simulations, we experienced a series of simulation crashes within the Parallel Ocean Program (POP2) component of the Community Climate System Model (CCSM4). About 8.5% of our CCSM4 simulations failed for numerical reasons at combinations of POP2 parameter values. We apply support vector machine (SVM) classification from machine learning to quantify and predict the probability of failure as a function of the values of 18 POP2 parameters. A committee of SVM classifiers readily predicts model failures in an independent validation ensemble, as assessed by the area under the receiver operating characteristic (ROC) curve metric (AUC > 0.96). The causes of the simulation failures are determined through a global sensitivity analysis. Combinations of 8 parameters related to ocean mixing and viscosity from three different POP2 parameterizations are the major sources of the failures. This information can be used to improve POP2 and CCSM4 by incorporating correlations across the relevant parameters. Our method can also be used to quantify, predict, and understand simulation crashes in other complex geoscientific models.
Failure analysis of parameter-induced simulation crashes in climate models
NASA Astrophysics Data System (ADS)
Lucas, D. D.; Klein, R.; Tannahill, J.; Ivanova, D.; Brandon, S.; Domyancic, D.; Zhang, Y.
2013-08-01
Simulations using IPCC (Intergovernmental Panel on Climate Change)-class climate models are subject to fail or crash for a variety of reasons. Quantitative analysis of the failures can yield useful insights to better understand and improve the models. During the course of uncertainty quantification (UQ) ensemble simulations to assess the effects of ocean model parameter uncertainties on climate simulations, we experienced a series of simulation crashes within the Parallel Ocean Program (POP2) component of the Community Climate System Model (CCSM4). About 8.5% of our CCSM4 simulations failed for numerical reasons at combinations of POP2 parameter values. We applied support vector machine (SVM) classification from machine learning to quantify and predict the probability of failure as a function of the values of 18 POP2 parameters. A committee of SVM classifiers readily predicted model failures in an independent validation ensemble, as assessed by the area under the receiver operating characteristic (ROC) curve metric (AUC > 0.96). The causes of the simulation failures were determined through a global sensitivity analysis. Combinations of 8 parameters related to ocean mixing and viscosity from three different POP2 parameterizations were the major sources of the failures. This information can be used to improve POP2 and CCSM4 by incorporating correlations across the relevant parameters. Our method can also be used to quantify, predict, and understand simulation crashes in other complex geoscientific models.
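The committee idea and the AUC metric in this abstract can be sketched in a few lines. The sketch assumes that committee members are combined by averaging their predicted failure probabilities, which the abstract does not state; the function names and the toy numbers are hypothetical, and no SVM training is shown.

```python
def committee_probability(member_probs):
    """Average the failure probabilities predicted by each committee member."""
    n_members = len(member_probs)
    return [sum(p) / n_members for p in zip(*member_probs)]

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy validation set: 1 = simulation failed, 0 = simulation succeeded
labels = [1, 0, 0, 1, 0]
member_probs = [
    [0.9, 0.2, 0.4, 0.6, 0.1],  # hypothetical committee member 1
    [0.7, 0.1, 0.6, 0.8, 0.3],  # hypothetical committee member 2
]
avg = committee_probability(member_probs)
auc = roc_auc(labels, avg)
```

The pairwise definition of AUC used here is quadratic in the number of samples, which is acceptable for small validation ensembles; rank-based formulas scale better.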
Chemical entity recognition in patents by combining dictionary-based and statistical approaches.
Akhondi, Saber A; Pons, Ewoud; Afzal, Zubair; van Haagen, Herman; Becker, Benedikt F H; Hettne, Kristina M; van Mulligen, Erik M; Kors, Jan A
2016-01-01
We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents. © The Author(s) 2016. Published by Oxford University Press.
Big genomics and clinical data analytics strategies for precision cancer prognosis.
Ow, Ghim Siong; Kuznetsov, Vladimir A
2016-11-07
The field of personalized and precise medicine in the era of big data analytics is growing rapidly. Previously, we proposed our model of patient classification termed Prognostic Signature Vector Matching (PSVM) and identified a 37 variable signature comprising 36 let-7b associated prognostic significant mRNAs and the age risk factor that stratified large high-grade serous ovarian cancer patient cohorts into three survival-significant risk groups. Here, we investigated the predictive performance of PSVM via optimization of the prognostic variable weights, which represent the relative importance of one prognostic variable over the others. In addition, we compared several multivariate prognostic models based on PSVM with classical machine learning techniques such as K-nearest-neighbor, support vector machine, random forest, neural networks and logistic regression. Our results revealed that negative log-rank p-values provides more robust weight values as opposed to the use of other quantities such as hazard ratios, fold change, or a combination of those factors. PSVM, together with the classical machine learning classifiers were combined in an ensemble (multi-test) voting system, which collectively provides a more precise and reproducible patient stratification. The use of the multi-test system approach, rather than the search for the ideal classification/prediction method, might help to address limitations of the individual classification algorithm in specific situation.
Hierarchical ensemble of global and local classifiers for face recognition.
Su, Yu; Shan, Shiguang; Chen, Xilin; Gao, Wen
2009-08-01
In the literature of psychophysics and neurophysiology, many studies have shown that both global and local features are crucial for face representation and recognition. This paper proposes a novel face recognition method which exploits both global and local discriminative features. In this method, global features are extracted from the whole face images by keeping the low-frequency coefficients of Fourier transform, which we believe encodes the holistic facial information, such as facial contour. For local feature extraction, Gabor wavelets are exploited considering their biological relevance. After that, Fisher's linear discriminant (FLD) is separately applied to the global Fourier features and each local patch of Gabor features. Thus, multiple FLD classifiers are obtained, each embodying different facial evidences for face recognition. Finally, all these classifiers are combined to form a hierarchical ensemble classifier. We evaluate the proposed method using two large-scale face databases: FERET and FRGC version 2.0. Experiments show that the results of our method are impressively better than the best known results with the same evaluation protocol.
Computer Based Behavioral Biometric Authentication via Multi-Modal Fusion
2013-03-01
the decisions made by each individual modality. Fusion of features is the simple concatenation of feature vectors from multiple modalities to be...of Features BayesNet MDL 330 LibSVM PCA 80 J48 Wrapper Evaluator 11 3.5.3 Ensemble Based Decision Level Fusion. In ensemble learning multiple ...The high fusion percentages validate our hypothesis that by combining features from multiple modalities, classification accuracy can be improved. As
The cubic ternary complex receptor-occupancy model. III. resurrecting efficacy.
Weiss, J M; Morgan, P H; Lutz, M W; Kenakin, T P
1996-08-21
Early work in pharmacology characterized the interaction of receptors and ligands in terms of two parameters, affinity and efficacy, an approach we term the bipartite view. A precise formulation of efficacy only exists for very simple pharmacological models. Here we extend the notion of efficacy to models that incorporate receptor activation and G-protein coupling. Using the cubic ternary complex model, we show that efficacy is not purely a property of the ligand-receptor interaction; it also depends upon the distributional details of the receptor species in the native receptor ensemble. This suggests a distinction between what we call potential efficacy (a vector) and realized efficacy (a scalar). To each receptor species in the native receptor ensemble we assign a part-worth utility; taken together these utilities comprise the potential efficacy vector. Realized efficacy is the expectation of these part-worth utilities with respect to the frequency distribution of receptor species in the native receptor ensemble. In the parlance of statistical decision theory, the binding of a ligand to a receptor ensemble is a random prospect and realized efficacy is the utility of this prospect. We explore the implications that our definition of efficacy has for understanding agonism and in assessing the legitimacy of the bipartite view in pharmacology.
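The relation between potential and realized efficacy described in this abstract can be written compactly. The notation is assumed for illustration only and is not taken from the paper: $u_i$ denotes the part-worth utility assigned to receptor species $i$, $f_i$ its relative frequency in the native receptor ensemble, and $n$ the number of species.

```latex
\[
  E_{\text{realized}} \;=\; \sum_{i=1}^{n} f_i\, u_i \;=\; \mathbf{f}^{\top}\mathbf{u},
  \qquad f_i \ge 0,\quad \sum_{i=1}^{n} f_i = 1 .
\]
```

Here $\mathbf{u} = (u_1, \dots, u_n)$ plays the role of the potential efficacy vector, and the scalar $E_{\text{realized}}$ is its expectation under the species distribution $\mathbf{f}$, so the same ligand can show different realized efficacy when the distribution of receptor species changes.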
Simulating ensembles of source water quality using a K-nearest neighbor resampling approach.
Towler, Erin; Rajagopalan, Balaji; Seidel, Chad; Summers, R Scott
2009-03-01
Climatological, geological, and water management factors can cause significant variability in surface water quality. As drinking water quality standards become more stringent, the ability to quantify the variability of source water quality becomes more important for decision-making and planning in water treatment for regulatory compliance. However, paucity of long-term water quality data makes it challenging to apply traditional simulation techniques. To overcome this limitation, we have developed and applied a robust nonparametric K-nearest neighbor (K-nn) bootstrap approach utilizing the United States Environmental Protection Agency's Information Collection Rule (ICR) data. In this technique, first an appropriate "feature vector" is formed from the best available explanatory variables. The nearest neighbors to the feature vector are identified from the ICR data and are resampled using a weight function. Repetition of this results in water quality ensembles, and consequently the distribution and the quantification of the variability. The main strengths of the approach are its flexibility, simplicity, and the ability to use a large amount of spatial data with limited temporal extent to provide water quality ensembles for any given location. We demonstrate this approach by applying it to simulate monthly ensembles of total organic carbon for two utilities in the U.S. with very different watersheds and to alkalinity and bromide at two other U.S. utilities.
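A minimal sketch of the K-nn bootstrap steps outlined in this abstract is given below. The weight function (1/rank), the square-root heuristic for K, the function name `knn_bootstrap` and the toy data are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random

def knn_bootstrap(query, features, values, n_samples, k=None, rng=None):
    """Resample historical values whose feature vectors are nearest to `query`.

    Neighbours are weighted by 1/rank (an assumed weight function), so the
    closest historical record is the most likely to be drawn.
    """
    rng = rng or random.Random()
    k = k or max(1, round(math.sqrt(len(features))))  # assumed heuristic for K
    # Rank historical records by Euclidean distance to the query feature vector
    order = sorted(range(len(features)), key=lambda i: math.dist(query, features[i]))
    nearest = order[:k]
    weights = [1.0 / rank for rank in range(1, k + 1)]
    picks = rng.choices(nearest, weights=weights, k=n_samples)
    return [values[i] for i in picks]

# Toy data: hypothetical (explanatory variable 1, explanatory variable 2) -> water quality value
features = [(1.0, 10.0), (2.0, 12.0), (8.0, 30.0), (9.0, 31.0)]
toc = [2.1, 2.4, 5.0, 5.3]
ensemble = knn_bootstrap((1.5, 11.0), features, toc,
                         n_samples=100, k=2, rng=random.Random(0))
```

Repeating the draw many times yields an ensemble whose empirical distribution describes the variability conditional on the feature vector.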
Kim, Seongjung; Kim, Jongman; Ahn, Soonjae; Kim, Youngho
2018-04-18
Deaf people use sign or finger languages for communication, but these methods of communication are very specialized. For this reason, the deaf can suffer from social inequalities and financial losses due to their communication restrictions. In this study, we developed a finger language recognition algorithm based on an ensemble artificial neural network (E-ANN) using an armband system with 8-channel electromyography (EMG) sensors. The developed algorithm was composed of signal acquisition, filtering, segmentation, feature extraction and an E-ANN based classifier that was evaluated with the Korean finger language (14 consonants, 17 vowels and 7 numbers) in 17 subjects. E-ANN was categorized according to the number of classifiers (1 to 10) and size of training data (50 to 1500). The accuracy of the E-ANN-based classifier was obtained by 5-fold cross validation and compared with an artificial neural network (ANN)-based classifier. As the number of classifiers (1 to 8) and size of training data (50 to 300) increased, the average accuracy of the E-ANN-based classifier increased and the standard deviation decreased. The optimal E-ANN was composed with eight classifiers and 300 size of training data, and the accuracy of the E-ANN was significantly higher than that of the general ANN.
Quasi-most unstable modes: a window to 'À la carte' ensemble diversity?
NASA Astrophysics Data System (ADS)
Homar Santaner, Victor; Stensrud, David J.
2010-05-01
The atmospheric scientific community is nowadays facing the ambitious challenge of providing useful forecasts of atmospheric events that produce high societal impact. The low level of social resilience to false alarms creates tremendous pressure on forecasting offices to issue accurate, timely and reliable warnings. Currently, no operational numerical forecasting system is able to respond to the societal demand for high-resolution (in time and space) predictions in the 12-72h time span. The main reasons for such deficiencies are the lack of adequate observations and the high non-linearity of the numerical models that are currently used. The whole weather forecasting problem is intrinsically probabilistic and current methods aim at coping with the various sources of uncertainties and the error propagation throughout the forecasting system. This probabilistic perspective is often created by generating ensembles of deterministic predictions that are aimed at sampling the most important sources of uncertainty in the forecasting system. The ensemble generation/sampling strategy is a crucial aspect of their performance and various methods have been proposed. Although global forecasting offices have been using ensembles of perturbed initial conditions for medium-range operational forecasts since 1994, no consensus exists regarding the optimum sampling strategy for high resolution short-range ensemble forecasts. Bred vectors, however, have been hypothesized to better capture the growing modes in the highly nonlinear mesoscale dynamics of severe episodes than singular vectors or observation perturbations. Yet even this technique is not able to produce enough diversity in the ensembles to accurately and routinely predict extreme phenomena such as severe weather. Thus, we propose a new method to generate ensembles of initial conditions perturbations that is based on the breeding technique.
Given a standard bred mode, a set of customized perturbations is derived with specified amplitudes and horizontal scales. This allows the ensemble to excite growing modes across a wider range of scales. Results show that this approach produces significantly more spread in the ensemble prediction than standard bred modes alone. Several examples that illustrate the benefits from this approach for severe weather forecasts will be provided.
NASA Technical Reports Server (NTRS)
Garay, Michael J.; Mazzoni, Dominic; Davies, Roger; Wagstaff, Kiri
2004-01-01
Support Vector Machines (SVMs) are a type of supervised learning algorithm, other examples of which are Artificial Neural Networks (ANNs), Decision Trees, and Naive Bayesian Classifiers. Supervised learning algorithms are used to classify objects labeled by a 'supervisor' - typically a human 'expert'.
NASA Astrophysics Data System (ADS)
Wolf, Nils; Hof, Angela
2012-10-01
Urban sprawl driven by shifts in tourism development produces new suburban landscapes of water consumption on Mediterranean coasts. Golf courses, ornamental, 'Atlantic' gardens and swimming pools are the most striking artefacts of this transformation, threatening the local water supply systems and exacerbating water scarcity. In the face of climate change, urban landscape irrigation is becoming increasingly important from a resource management point of view. This paper adopts urban remote sensing towards a targeted mapping approach using machine learning techniques and high-resolution satellite imagery (WorldView-2) to generate GIS-ready information for urban water consumption studies. Swimming pools, vegetation and - as a subgroup of vegetation - turf grass are extracted as important determinants of water consumption. For image analysis, the complex nature of urban environments suggests spatial-spectral classification, i.e. the complementary use of the spectral signature and spatial descriptors. Multiscale image segmentation provides means to extract the spatial descriptors - namely object feature layers - which can be concatenated at pixel level to the spectral signature. This study assesses the value of object features using different machine learning techniques and amounts of labeled information for learning. The results indicate the benefit of the spatial-spectral approach if combined with appropriate classifiers like tree-based ensembles or support vector machines, which can handle high dimensionality. Finally, a Random Forest classifier was chosen to deliver the classified input data for the estimation of evaporative water loss and net landscape irrigation requirements.
Little, Eliza; Campbell, Scott R; Shaman, Jeffrey
2016-08-09
West Nile Virus (WNV) is an endemic public health concern in the United States that produces periodic seasonal epidemics. Underlying these outbreaks is the enzootic cycle of WNV between mosquito vectors and bird hosts. Identifying the key environmental conditions that facilitate and accelerate this cycle can be used to inform effective vector control. Here, we model and forecast WNV infection rates among mosquito vectors in Suffolk County, New York using readily available meteorological and hydrological conditions. We first validate a statistical model built with surveillance data between 2001 and 2009 (m09) and specify a set of new statistical models using surveillance data from 2001 to 2012 (m12). This ensemble of new models is then used to make predictions for 2013-2015, and multimodel inference is employed to provide a formal probabilistic interpretation across the disparate individual model predictions. The findings of the m09 and m12 models align; with the ensemble of m12 models indicating an association between warm, dry early spring (April) conditions and increased annual WNV infection rates in Culex mosquitoes. This study shows that real-time climate information can be used to predict WNV infection rates in Culex mosquitoes prior to its seasonal peak and before WNV spillover transmission risk to humans is greatest.
Ensemble Data Assimilation Without Ensembles: Methodology and Application to Ocean Data Assimilation
NASA Technical Reports Server (NTRS)
Keppenne, Christian L.; Rienecker, Michele M.; Kovach, Robin M.; Vernieres, Guillaume
2013-01-01
Two methods to estimate background error covariances for data assimilation are introduced. While both share properties with the ensemble Kalman filter (EnKF), they differ from it in that they do not require the integration of multiple model trajectories. Instead, all the necessary covariance information is obtained from a single model integration. The first method is referred-to as SAFE (Space Adaptive Forecast error Estimation) because it estimates error covariances from the spatial distribution of model variables within a single state vector. It can thus be thought of as sampling an ensemble in space. The second method, named FAST (Flow Adaptive error Statistics from a Time series), constructs an ensemble sampled from a moving window along a model trajectory. The underlying assumption in these methods is that forecast errors in data assimilation are primarily phase errors in space and/or time.
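The FAST idea of treating states from a moving window along one trajectory as an ensemble can be illustrated with a plain sample covariance. This is a sketch only: the function name `window_covariance`, the window length and the toy trajectory are hypothetical, and any filtering or localization used in the actual method is not represented.

```python
def window_covariance(trajectory, window):
    """Sample covariance from the last `window` states of one model trajectory.

    trajectory: list of state vectors (lists of floats), oldest first.
    Returns a nested list C with C[i][j] = cov(x_i, x_j).
    """
    members = trajectory[-window:]  # states inside the moving window act as the ensemble
    m, n = len(members), len(members[0])
    mean = [sum(x[i] for x in members) / m for i in range(n)]
    anomalies = [[x[i] - mean[i] for i in range(n)] for x in members]
    return [
        [sum(a[i] * a[j] for a in anomalies) / (m - 1) for j in range(n)]
        for i in range(n)
    ]

# Toy trajectory of a two-variable state, sampled at four times
trajectory = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov = window_covariance(trajectory, window=3)
```

Because only one model integration is needed, the cost is that of storing the recent states, in contrast to integrating multiple trajectories as in the EnKF.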
Tuning to optimize SVM approach for assisting ovarian cancer diagnosis with photoacoustic imaging.
Wang, Rui; Li, Rui; Lei, Yanyan; Zhu, Quing
2015-01-01
Support vector machine (SVM) is one of the most effective classification methods for cancer detection. The efficiency and quality of a SVM classifier depends strongly on several important features and a set of proper parameters. Here, a series of classification analyses, with one set of photoacoustic data from ovarian tissues ex vivo and a widely used breast cancer dataset- the Wisconsin Diagnostic Breast Cancer (WDBC), revealed the different accuracy of a SVM classification in terms of the number of features used and the parameters selected. A pattern recognition system is proposed by means of SVM-Recursive Feature Elimination (RFE) with the Radial Basis Function (RBF) kernel. To improve the effectiveness and robustness of the system, an optimized tuning ensemble algorithm called as SVM-RFE(C) with correlation filter was implemented to quantify feature and parameter information based on cross validation. The proposed algorithm is first demonstrated outperforming SVM-RFE on WDBC. Then the best accuracy of 94.643% and sensitivity of 94.595% were achieved when using SVM-RFE(C) to test 57 new PAT data from 19 patients. The experiment results show that the classifier constructed with SVM-RFE(C) algorithm is able to learn additional information from new data and has significant potential in ovarian cancer diagnosis.
NASA Astrophysics Data System (ADS)
Zhang, Chen; Yuan, Heng; Zhang, Ning; Xu, Lixia; Zhang, Jixing; Li, Bo; Fang, Jiancheng
2018-04-01
Negatively charged nitrogen vacancy (NV⁻) centers in diamond have been extensively studied as high-sensitivity magnetometers, showcasing a wide range of applications. This study experimentally demonstrates a vector magnetometry scheme based on synchronous manipulation of NV⁻ center ensembles in all crystal directions using double frequency microwaves (MWs) and a multi-coupled-strip-lines (mCSL) waveguide. The application of the mCSL waveguide ensures a high degree of synchrony (99%) for manipulating NV⁻ centers in multiple orientations in a large volume. Manipulation with double frequency MWs involves the NV⁻ centers of all four crystal directions and additionally leads to an enhancement of the manipulation field. In this work, by monitoring the changes in the slope of the resonance line consisting of multi-axes NV⁻ centers, measurement of the direction of the external field vector was demonstrated with a sensitivity of 10′/√Hz. Based on the scheme, the fluorescence signal contrast was improved fourfold and the sensitivity to the magnetic field strength was improved twofold. The method provides a more practical way of achieving vector sensors based on NV⁻ center ensembles in diamond.
A Bag of Concepts Approach for Biomedical Document Classification Using Wikipedia Knowledge.
Mouriño-García, Marcos A; Pérez-Rodríguez, Roberto; Anido-Rifón, Luis E
2017-01-01
The ability to efficiently review the existing literature is essential for the rapid progress of research. This paper describes a classifier of text documents, represented as vectors in spaces of Wikipedia concepts, and analyses its suitability for classification of Spanish biomedical documents when only English documents are available for training. We propose the cross-language concept matching (CLCM) technique, which relies on Wikipedia interlanguage links to convert concept vectors from the Spanish to the English space. The performance of the classifier is compared to several baselines: a classifier based on machine translation, a classifier that represents documents after performing Explicit Semantic Analysis (ESA), and a classifier that uses a domain-specific semantic annotator (MetaMap). The corpus used for the experiments (Cross-Language UVigoMED) was purpose-built for this study, and it is composed of 12,832 English and 2,184 Spanish MEDLINE abstracts. The performance of our approach is superior to any other state-of-the-art classifier in the benchmark, with performance increases up to: 124% over classical machine translation, 332% over MetaMap, and 60 times over the classifier based on ESA. The results have statistical significance, showing p-values < 0.0001. Using knowledge mined from Wikipedia to represent documents as vectors in a space of Wikipedia concepts and translating vectors between language-specific concept spaces, a cross-language classifier can be built, and it performs better than several state-of-the-art classifiers. Schattauer GmbH.
Mouriño-García, Marcos A; Pérez-Rodríguez, Roberto; Anido-Rifón, Luis E
2017-10-26
The ability to efficiently review the existing literature is essential for the rapid progress of research. This paper describes a classifier of text documents, represented as vectors in spaces of Wikipedia concepts, and analyses its suitability for classification of Spanish biomedical documents when only English documents are available for training. We propose the cross-language concept matching (CLCM) technique, which relies on Wikipedia interlanguage links to convert concept vectors from the Spanish to the English space. The performance of the classifier is compared to several baselines: a classifier based on machine translation, a classifier that represents documents after performing Explicit Semantic Analysis (ESA), and a classifier that uses a domain-specific semantic annotator (MetaMap). The corpus used for the experiments (Cross-Language UVigoMED) was purpose-built for this study, and it is composed of 12,832 English and 2,184 Spanish MEDLINE abstracts. The performance of our approach is superior to any other state-of-the art classifier in the benchmark, with performance increases up to: 124% over classical machine translation, 332% over MetaMap, and 60 times over the classifier based on ESA. The results have statistical significance, showing p-values < 0.0001. Using knowledge mined from Wikipedia to represent documents as vectors in a space of Wikipedia concepts and translating vectors between language-specific concept spaces, a cross-language classifier can be built, and it performs better than several state-of-the-art classifiers.
NASA Astrophysics Data System (ADS)
Krüger, Thomas
2006-05-01
The possibility of teleportation is surely the most interesting consequence of quantum non-separability. So far, however, teleportation schemes have been formulated using state vectors and considering individual entities only. In the present article the feasibility of teleportation is examined on the basis of the rigorous ensemble interpretation of quantum mechanics (not to be confused with a mere treatment of noisy EPR pairs), leading to results which are unexpected from the usual point of view.
Global Optimization Ensemble Model for Classification Methods
Anwar, Hina; Qamar, Usman; Muzaffar Qureshi, Abdul Wahab
2014-01-01
Supervised learning is the process of data mining for deducing rules from training datasets. A broad array of supervised learning algorithms exists, every one of them with its own advantages and drawbacks. There are some basic issues that affect the accuracy of classifier while solving a supervised learning problem, like bias-variance tradeoff, dimensionality of input space, and noise in the input data space. All these problems affect the accuracy of classifier and are the reason that there is no global optimal method for classification. There is not any generalized improvement method that can increase the accuracy of any classifier while addressing all the problems stated above. This paper proposes a global optimization ensemble model for classification methods (GMC) that can improve the overall accuracy for supervised learning problems. The experimental results on various public datasets showed that the proposed model improved the accuracy of the classification models from 1% to 30% depending upon the algorithm complexity. PMID:24883382
NASA Astrophysics Data System (ADS)
Niedzielski, Tomasz; Mizinski, Bartlomiej
2016-04-01
The HydroProg system has been developed within the framework of research project no. 2011/01/D/ST10/04171 of the National Science Centre of Poland and is steadily producing multimodel ensemble predictions of hydrographs in real time. Although there are six ensemble members available at present, the longest record of predictions and their statistics is available for two data-based models (uni- and multivariate autoregressive models). Thus, we consider 3-hour predictions of water levels, with lead times ranging from 15 to 180 minutes, computed every 15 minutes since August 2013 for the Nysa Klodzka basin (SW Poland) using the two approaches and their two-model ensemble. Since the launch of the HydroProg system there have been 12 high flow episodes, and the objective of this work is to present the performance of the two-model ensemble in the process of forecasting these events. For the sake of brevity, we limit our investigation to a single gauge located on the Nysa Klodzka river in the town of Klodzko, which is centrally located in the studied basin. We identified certain regular scenarios of how the models perform in predicting the high flows in Klodzko. At the initial phase of the high flow, well before the rising limb of the hydrograph, the two-model ensemble is found to provide the most skilful prognoses of water levels. However, while forecasting the rising limb of the hydrograph, either the two-model solution or the vector autoregressive model offers the best predictive performance. In addition, it is hypothesized that along with the development of the rising limb phase, the vector autoregression becomes the most skilful approach amongst the scrutinized ones. Our simple two-model exercise confirms that multimodel hydrologic ensemble predictions cannot be treated as universal solutions suitable for forecasting the entire high flow event, but their superior performance may hold only for certain phases of a high flow.
Recognition of emotions using multimodal physiological signals and an ensemble deep learning model.
Yin, Zhong; Zhao, Mengyuan; Wang, Yongxiong; Yang, Jingdong; Zhang, Jianhua
2017-03-01
Using deep-learning methodologies to analyze multimodal physiological signals becomes increasingly attractive for recognizing human emotions. However, the conventional deep emotion classifiers may suffer from the drawback of the lack of the expertise for determining model structure and the oversimplification of combining multimodal feature abstractions. In this study, a multiple-fusion-layer based ensemble classifier of stacked autoencoder (MESAE) is proposed for recognizing emotions, in which the deep structure is identified based on a physiological-data-driven approach. Each SAE consists of three hidden layers to filter the unwanted noise in the physiological features and derives the stable feature representations. An additional deep model is used to achieve the SAE ensembles. The physiological features are split into several subsets according to different feature extraction approaches with each subset separately encoded by a SAE. The derived SAE abstractions are combined according to the physiological modality to create six sets of encodings, which are then fed to a three-layer, adjacent-graph-based network for feature fusion. The fused features are used to recognize binary arousal or valence states. DEAP multimodal database was employed to validate the performance of the MESAE. By comparing with the best existing emotion classifier, the mean of classification rate and F-score improves by 5.26%. The superiority of the MESAE against the state-of-the-art shallow and deep emotion classifiers has been demonstrated under different sizes of the available physiological instances. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Abawajy, Jemal; Kelarev, Andrei; Chowdhury, Morshed U; Jelinek, Herbert F
2016-01-01
Blood biochemistry attributes form an important class of tests, routinely collected several times per year for many patients with diabetes. The objective of this study is to investigate the role of blood biochemistry for improving the predictive accuracy of the diagnosis of cardiac autonomic neuropathy (CAN) progression. Blood biochemistry contributes to CAN, and so it is a causative factor that can provide additional power for the diagnosis of CAN especially in the absence of a complete set of Ewing tests. We introduce automated iterative multitier ensembles (AIME) and investigate their performance in comparison to base classifiers and standard ensemble classifiers for blood biochemistry attributes. AIME incorporate diverse ensembles into several tiers simultaneously and combine them into one automatically generated integrated system so that one ensemble acts as an integral part of another ensemble. We carried out extensive experimental analysis using large datasets from the diabetes screening research initiative (DiScRi) project. The results of our experiments show that several blood biochemistry attributes can be used to supplement the Ewing battery for the detection of CAN in situations where one or more of the Ewing tests cannot be completed because of the individual difficulties faced by each patient in performing the tests. The results show that AIME provide higher accuracy as a multitier CAN classification paradigm. The best predictive accuracy of 99.57% has been obtained by the AIME combining decorate on top tier with bagging on middle tier based on random forest. Practitioners can use these findings to increase the accuracy of CAN diagnosis.
Prediction of plant lncRNA by ensemble machine learning classifiers.
Simopoulos, Caitlin M A; Weretilnyk, Elizabeth A; Golding, G Brian
2018-05-02
In plants, long non-protein coding RNAs are believed to have essential roles in development and stress responses. However, relative to advances on discerning biological roles for long non-protein coding RNAs in animal systems, this RNA class in plants is largely understudied. With comparatively few validated plant long non-coding RNAs, research on this potentially critical class of RNA is hindered by a lack of appropriate prediction tools and databases. Supervised learning models trained on data sets of mostly non-validated, non-coding transcripts have been previously used to identify this enigmatic RNA class with applications largely focused on animal systems. Our approach uses a training set comprised only of empirically validated long non-protein coding RNAs from plant, animal, and viral sources to predict and rank candidate long non-protein coding gene products for future functional validation. Individual stochastic gradient boosting and random forest classifiers trained on only empirically validated long non-protein coding RNAs were constructed. In order to use the strengths of multiple classifiers, we combined multiple models into a single stacking meta-learner. This ensemble approach benefits from the diversity of several learners to effectively identify putative plant long non-coding RNAs from transcript sequence features. When the predicted genes identified by the ensemble classifier were compared to those listed in GreeNC, an established plant long non-coding RNA database, overlap for predicted genes from Arabidopsis thaliana, Oryza sativa and Eutrema salsugineum ranged from 51 to 83% with the highest agreement in Eutrema salsugineum. Most of the highest ranking predictions from Arabidopsis thaliana were annotated as potential natural antisense genes, pseudogenes, transposable elements, or simply computationally predicted hypothetical protein. 
Due to the nature of this tool, the model can be updated as new long non-protein coding transcripts are identified and functionally verified. This ensemble classifier is an accurate tool that can be used to rank long non-protein coding RNA predictions for use in conjunction with gene expression studies. Selection of plant transcripts with a high potential for regulatory roles as long non-protein coding RNAs will advance research in the elucidation of long non-protein coding RNA function.
An ensemble deep learning based approach for red lesion detection in fundus images.
Orlando, José Ignacio; Prokofyeva, Elena; Del Fresno, Mariana; Blaschko, Matthew B
2018-01-01
Diabetic retinopathy (DR) is one of the leading causes of preventable blindness in the world. Its earliest signs are red lesions, a general term that groups both microaneurysms (MAs) and hemorrhages (HEs). In daily clinical practice, these lesions are manually detected by physicians using fundus photographs. However, this task is tedious and time consuming, and requires an intensive effort due to the small size of the lesions and their lack of contrast. Computer-assisted diagnosis of DR based on red lesion detection is being actively explored because it improves both the consistency and the accuracy of clinicians. Moreover, it provides comprehensive feedback that is easy for physicians to assess. Several methods for detecting red lesions have been proposed in the literature, most of them based on characterizing lesion candidates using hand-crafted features and classifying them into true or false positive detections. Deep learning based approaches, by contrast, are scarce in this domain due to the high expense of annotating the lesions manually. In this paper we propose a novel method for red lesion detection based on combining deep learned features with domain knowledge. Features learned by a convolutional neural network (CNN) are augmented by incorporating hand-crafted features. This ensemble vector of descriptors is then used to identify true lesion candidates using a Random Forest classifier. We empirically observed that combining both sources of information significantly improves results with respect to using each approach separately. Furthermore, our method reported the highest performance on a per-lesion basis on DIARETDB1 and e-ophtha, and for screening and need for referral on MESSIDOR compared to a second human expert. Results highlight the fact that integrating manually engineered approaches with deep learned features is relevant to improve results when the networks are trained from lesion-level annotated data.
An open source implementation of our system is publicly available at https://github.com/ignaciorlando/red-lesion-detection. Copyright © 2017 Elsevier B.V. All rights reserved.
Collell, Guillem; Prelec, Drazen; Patil, Kaustubh R
2018-01-31
Class imbalance presents a major hurdle in the application of classification methods. A commonly taken approach is to learn ensembles of classifiers using rebalanced data. Examples include bootstrap averaging (bagging) combined with either undersampling or oversampling of the minority class examples. However, rebalancing methods entail asymmetric changes to the examples of different classes, which in turn can introduce their own biases. Furthermore, these methods often require specifying the performance measure of interest a priori, i.e., before learning. An alternative is to employ the threshold-moving technique, which applies a threshold to the continuous output of a model, offering the possibility to adapt to a performance measure a posteriori, i.e., as a plug-in method. Surprisingly, little attention has been paid to the combination of a bagging ensemble and threshold-moving. In this paper, we study this combination and demonstrate its competitiveness. In contrast to resampling methods, we preserve the natural class distribution of the data, resulting in well-calibrated posterior probabilities. Additionally, we extend the proposed method to handle multiclass data. We validated our method on binary and multiclass benchmark data sets using both decision trees and neural networks as base classifiers. We perform analyses that provide insights into the proposed method.
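The threshold-moving step described in this abstract can be sketched as follows. The example averages the continuous outputs of already trained ensemble members and then picks, after learning, the threshold that maximizes a chosen measure. The function names, the use of F1 as the measure, and the toy scores are assumptions made for illustration; the paper's own procedure may differ.

```python
# Hypothetical sketch of threshold-moving on top of averaged ensemble scores.
# Member scores are assumed to come from classifiers trained on bootstrap samples.

def average_scores(member_scores):
    """Average the continuous outputs of the ensemble members per example."""
    n_members = len(member_scores)
    return [sum(col) / n_members for col in zip(*member_scores)]

def f1(labels, preds):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def move_threshold(scores, labels, metric=f1):
    """Try each observed score as a threshold; keep the first best one."""
    best_t, best_v = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        value = metric(labels, preds)
        if value > best_v:
            best_t, best_v = t, value
    return best_t, best_v

if __name__ == "__main__":
    members = [[0.0, 0.1, 0.2, 0.3, 0.3, 0.6],
               [0.1, 0.1, 0.2, 0.3, 0.4, 0.8]]
    labels = [0, 0, 0, 0, 1, 1]  # minority class is 1
    scores = average_scores(members)
    print(move_threshold(scores, labels))
```

Because the threshold is chosen after training, the same ensemble can be re-tuned for a different measure by passing another `metric` function, without resampling the data.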
Chen, Peng; Li, Jinyan
2010-05-17
Prediction of long-range inter-residue contacts is an important topic in bioinformatics research. It is helpful for determining protein structures, understanding protein foldings, and therefore advancing the annotation of protein functions. In this paper, we propose a novel ensemble of genetic algorithm classifiers (GaCs) to address the long-range contact prediction problem. Our method is based on the key idea called sequence profile centers (SPCs). Each SPC is the average sequence profiles of residue pairs belonging to the same contact class or non-contact class. GaCs train on multiple but different pairs of long-range contact data (positive data) and long-range non-contact data (negative data). The negative data sets, having roughly the same sizes as the positive ones, are constructed by random sampling over the original imbalanced negative data. As a result, about 21.5% long-range contacts are correctly predicted. We also found that the ensemble of GaCs indeed makes an accuracy improvement by around 5.6% over the single GaC. Classifiers with the use of sequence profile centers may advance the long-range contact prediction. In line with this approach, key structural features in proteins would be determined with high efficiency and accuracy.
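The two ideas named in this abstract, sequence profile centers and balanced sampling of the negative data, can be sketched as below. The sketch is only a nearest-center classifier with random undersampling; it does not reproduce the genetic algorithm classifiers (GaCs), and the Euclidean distance, function names, and toy vectors are assumptions made for illustration.

```python
# Hypothetical sketch: class centers of profile vectors plus balanced sampling.
# This is NOT the paper's genetic algorithm classifier; it shows only the
# "profile center" and "balanced negative set" ideas on toy data.
import math
import random

def profile_center(profiles):
    """Element-wise mean of equally long profile vectors."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def balanced_negatives(positives, negatives, rng):
    """Randomly draw as many negatives as there are positives."""
    return rng.sample(negatives, len(positives))

def classify(profile, contact_center, noncontact_center):
    """Assign the class whose center is nearer (Euclidean distance assumed)."""
    d_pos = math.dist(profile, contact_center)
    d_neg = math.dist(profile, noncontact_center)
    return "contact" if d_pos < d_neg else "non-contact"

if __name__ == "__main__":
    rng = random.Random(0)
    positives = [[1.0, 1.0], [3.0, 3.0]]
    negatives = [[0.0, 0.0], [0.0, 2.0], [0.5, 1.0], [0.2, 0.8]]
    sampled = balanced_negatives(positives, negatives, rng)
    pos_center = profile_center(positives)
    neg_center = profile_center(sampled)
    print(classify([2.5, 2.0], pos_center, neg_center))
```

Repeating the sampling with different random draws would give the multiple, different training pairs from which an ensemble can be built.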
Zhang, Ming; Xu, Yan; Li, Lei; Liu, Zi; Yang, Xibei; Yu, Dong-Jun
2018-06-01
RNA 5-methylcytosine (m5C) is an important post-transcriptional modification that plays an indispensable role in biological processes. The accurate identification of m5C sites from primary RNA sequences is especially useful for deeply understanding the mechanisms and functions of m5C. Due to the difficulty and expensive costs of identifying m5C sites with wet-lab techniques, developing fast and accurate machine-learning-based prediction methods is urgently needed. In this study, we proposed a new m5C site predictor, called M5C-HPCR, by introducing a novel heuristic nucleotide physicochemical property reduction (HPCR) algorithm and classifier ensemble. HPCR extracts multiple reducts of physical-chemical properties for encoding discriminative features, while the classifier ensemble is applied to integrate multiple base predictors, each of which is trained based on a separate reduct of the physical-chemical properties obtained from HPCR. Rigorous jackknife tests on two benchmark datasets demonstrate that M5C-HPCR outperforms state-of-the-art m5C site predictors, with the highest values of MCC (0.859) and AUC (0.962). We also implemented the webserver of M5C-HPCR, which is freely available at http://cslab.just.edu.cn:8080/M5C-HPCR/. Copyright © 2018 Elsevier Inc. All rights reserved.
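For readers unfamiliar with the MCC figure quoted in this abstract, the Matthews correlation coefficient can be computed from the four confusion-matrix counts as in the sketch below. The function name and the convention of returning 0.0 when the denominator vanishes are choices made here, not taken from the paper.

```python
# Minimal sketch of the Matthews correlation coefficient (MCC).
# Returning 0.0 for a zero denominator is a common convention, assumed here.
import math

def mcc(tp, tn, fp, fn):
    """MCC from true/false positive and negative counts; range is [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

if __name__ == "__main__":
    print(mcc(tp=45, tn=40, fp=10, fn=5))  # made-up counts
```

Unlike plain accuracy, the value stays near zero for a predictor that ignores the minority class, which is why it is often reported for site-prediction problems.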
Intelligent classifier for dynamic fault patterns based on hidden Markov model
NASA Astrophysics Data System (ADS)
Xu, Bo; Feng, Yuguang; Yu, Jinsong
2006-11-01
It is difficult to build precise mathematical models of complex engineering systems because of the complexity of their structure and dynamic characteristics. Intelligent fault diagnosis applies artificial intelligence and works without building an analytical mathematical model of the diagnostic object, so it is a practical approach to the diagnostic problems of complex systems. This paper presents an intelligent fault diagnosis method, an integrated fault-pattern classifier based on the Hidden Markov Model (HMM). The classifier consists of a dynamic time warping (DTW) algorithm, a self-organizing feature mapping (SOFM) network and a Hidden Markov Model. First, the dynamic observation vector in the measurement space is processed by DTW to obtain an error vector that contains the fault features of the system under test. Then a SOFM network is used as a feature extractor and vector quantization processor. Finally, fault diagnosis is performed by classifying fault patterns with the Hidden Markov Model classifier. Introducing dynamic time warping solves the problem of extracting features from the dynamic process vectors of a complex system such as an aeroengine, and makes it possible to diagnose complex systems using dynamic process information. Simulation experiments show that the diagnosis model is easy to extend, and that the fault pattern classifier is efficient and convenient for detecting and diagnosing new faults.
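The dynamic time warping step mentioned in this abstract can be illustrated with the classic dynamic-programming recurrence. The sketch below works on scalar sequences with an absolute-difference cost, which is an assumption; the paper processes dynamic observation vectors and does not give its exact cost function.

```python
# Minimal sketch of dynamic time warping (DTW) between two scalar sequences.
# Scalar inputs and the absolute-difference cost are assumptions for illustration.

def dtw_distance(a, b):
    """Cumulative cost of the cheapest monotone alignment of a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    measured = [1.0, 2.0, 2.0, 3.0]  # same shape, stretched in time
    print(dtw_distance(reference, measured))
```

A stretched copy of the reference aligns at zero cost, so the remaining cost reflects deviations in shape rather than in timing.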
A hybrid cost-sensitive ensemble for imbalanced breast thermogram classification.
Krawczyk, Bartosz; Schaefer, Gerald; Woźniak, Michał
2015-11-01
Early recognition of breast cancer, the most commonly diagnosed form of cancer in women, is of crucial importance, given that it leads to significantly improved chances of survival. Medical thermography, which uses an infrared camera for thermal imaging, has been demonstrated as a particularly useful technique for early diagnosis, because it detects smaller tumors than the standard modality of mammography. In this paper, we analyse breast thermograms by extracting features describing bilateral symmetries between the two breast areas, and present a classification system for decision making. Clearly, the costs associated with missing a cancer case are much higher than those for mislabelling a benign case. At the same time, datasets contain significantly fewer malignant cases than benign ones. Standard classification approaches fail to consider either of these aspects. In this paper, we introduce a hybrid cost-sensitive classifier ensemble to address this challenging problem. Our approach entails a pool of cost-sensitive decision trees which assign a higher misclassification cost to the malignant class, thereby boosting its recognition rate. A genetic algorithm is employed for simultaneous feature selection and classifier fusion. As an optimisation criterion, we use a combination of misclassification cost and diversity to achieve both a high sensitivity and a heterogeneous ensemble. Furthermore, we prune our ensemble by discarding classifiers that contribute minimally to the decision making. For a challenging dataset of about 150 thermograms, our approach achieves an excellent sensitivity of 83.10%, while maintaining a high specificity of 89.44%. This not only signifies improved recognition of malignant cases, it also statistically outperforms other state-of-the-art algorithms designed for imbalanced classification, and hence provides an effective approach for analysing breast thermograms. 
Our proposed hybrid cost-sensitive ensemble can facilitate a highly accurate early diagnostic of breast cancer based on thermogram features. It overcomes the difficulties posed by the imbalanced distribution of patients in the two analysed groups. Copyright © 2015 Elsevier B.V. All rights reserved.
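The cost asymmetry this abstract describes (missing a cancer case costs more than mislabelling a benign one) can be sketched as a minimum-expected-cost decision rule, together with the sensitivity and specificity used to report results. The cost values, function names, and toy labels are hypothetical; the paper embeds the costs in decision trees and fuses them with a genetic algorithm, which this sketch does not reproduce.

```python
# Hypothetical sketch of a cost-sensitive decision rule and its evaluation.
# Costs (5.0 for a missed malignant case, 1.0 for a false alarm) are assumed.

def min_cost_label(p_malignant, cost_fn=5.0, cost_fp=1.0):
    """Pick the label with the lower expected misclassification cost."""
    cost_if_benign = p_malignant * cost_fn             # risk of a missed cancer
    cost_if_malignant = (1.0 - p_malignant) * cost_fp  # risk of a false alarm
    return "malignant" if cost_if_malignant <= cost_if_benign else "benign"

def sensitivity_specificity(labels, preds, positive="malignant"):
    """Return (sensitivity, specificity) for the given positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    tn = sum(1 for y, p in zip(labels, preds) if y != positive and p != positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)

if __name__ == "__main__":
    for p in (0.1, 0.2, 0.6):
        print(p, min_cost_label(p))
```

With these assumed costs the decision boundary moves from 0.5 down to 1/6, so a case with a 20% estimated probability of malignancy is already flagged.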
NASA Technical Reports Server (NTRS)
Lin, Qian; Allebach, Jan P.
1990-01-01
An adaptive vector linear minimum mean-squared error (LMMSE) filter for multichannel images with multiplicative noise is presented. It is shown theoretically that the mean-squared error in the filter output is reduced by making use of the correlation between image bands. The vector and conventional scalar LMMSE filters are applied to a three-band SIR-B SAR image, and their performance is compared. Based on a multiplicative noise model, the per-pel maximum likelihood classifier was derived. The authors extend this to the design of sequential and robust classifiers. These classifiers are also applied to the three-band SIR-B SAR image.
Decision tree and ensemble learning algorithms with their applications in bioinformatics.
Che, Dongsheng; Liu, Qi; Rasheed, Khaled; Tao, Xiuping
2011-01-01
Machine learning approaches have wide applications in bioinformatics, and decision tree is one of the successful approaches applied in this field. In this chapter, we briefly review decision tree and related ensemble algorithms and show the successful applications of such approaches on solving biological problems. We hope that by learning the algorithms of decision trees and ensemble classifiers, biologists can get the basic ideas of how machine learning algorithms work. On the other hand, by being exposed to the applications of decision trees and ensemble algorithms in bioinformatics, computer scientists can get better ideas of which bioinformatics topics they may work on in their future research directions. We aim to provide a platform to bridge the gap between biologists and computer scientists.
An assessment of support vector machines for land cover classification
Huang, C.; Davis, L.S.; Townshend, J.R.G.
2002-01-01
The support vector machine (SVM) is a group of theoretically superior machine learning algorithms. It was found competitive with the best available machine learning algorithms in classifying high-dimensional data sets. This paper gives an introduction to the theoretical development of the SVM and an experimental evaluation of its accuracy, stability and training speed in deriving land cover classifications from satellite images. The SVM was compared to three other popular classifiers, including the maximum likelihood classifier (MLC), neural network classifiers (NNC) and decision tree classifiers (DTC). The impacts of kernel configuration on the performance of the SVM and of the selection of training data and input variables on the four classifiers were also evaluated in this experiment.
Seo, Jeong Gi; Kwak, Jiyong; Um, Terry Taewoong; Rim, Tyler Hyungtaek
2017-01-01
Deep learning emerges as a powerful tool for analyzing medical images. Retinal disease detection by using computer-aided diagnosis from fundus image has emerged as a new method. We applied deep learning convolutional neural network by using MatConvNet for an automated detection of multiple retinal diseases with fundus photographs involved in STructured Analysis of the REtina (STARE) database. Dataset was built by expanding data on 10 categories, including normal retina and nine retinal diseases. The optimal outcomes were acquired by using a random forest transfer learning based on VGG-19 architecture. The classification results depended greatly on the number of categories. As the number of categories increased, the performance of deep learning models was diminished. When all 10 categories were included, we obtained results with an accuracy of 30.5%, relative classifier information (RCI) of 0.052, and Cohen’s kappa of 0.224. Considering three integrated normal, background diabetic retinopathy, and dry age-related macular degeneration, the multi-categorical classifier showed accuracy of 72.8%, 0.283 RCI, and 0.577 kappa. In addition, several ensemble classifiers enhanced the multi-categorical classification performance. The transfer learning incorporated with ensemble classifier of clustering and voting approach presented the best performance with accuracy of 36.7%, 0.053 RCI, and 0.225 kappa in the 10 retinal diseases classification problem. First, due to the small size of datasets, the deep learning techniques in this study were ineffective to be applied in clinics where numerous patients suffering from various types of retinal disorders visit for diagnosis and treatment. Second, we found that the transfer learning incorporated with ensemble classifiers can improve the classification performance in order to detect multi-categorical retinal diseases. Further studies should confirm the effectiveness of algorithms with large datasets obtained from hospitals. 
PMID:29095872
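The Cohen's kappa values reported in the abstract above correct raw accuracy for agreement expected by chance, which matters when there are many categories. A minimal sketch of the computation follows; the function name, the handling of the degenerate single-class case, and the toy labels are assumptions made for illustration.

```python
# Minimal sketch of Cohen's kappa for two label sequences of equal length.
from collections import Counter

def cohen_kappa(labels, preds):
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels)
    observed = sum(1 for y, p in zip(labels, preds) if y == p) / n
    true_counts = Counter(labels)
    pred_counts = Counter(preds)
    # chance agreement from the marginal class frequencies
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: one class only, assumed to count as agreement
    return (observed - expected) / (1.0 - expected)

if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    predicted = [0, 0, 1, 0]
    print(cohen_kappa(truth, predicted))
```

In the toy example, accuracy is 0.75 but half of that agreement is expected by chance, so kappa is lower than the raw accuracy.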
Fixed points, stable manifolds, weather regimes, and their predictability.
Deremble, Bruno; D'Andrea, Fabio; Ghil, Michael
2009-12-01
In a simple, one-layer atmospheric model, we study the links between low-frequency variability and the model's fixed points in phase space. The model dynamics is characterized by the coexistence of multiple "weather regimes." To investigate the transitions from one regime to another, we focus on the identification of stable manifolds associated with fixed points. We show that these manifolds act as separatrices between regimes. We track each manifold by making use of two local predictability measures arising from the meteorological applications of nonlinear dynamics, namely, "bred vectors" and singular vectors. These results are then verified in the framework of ensemble forecasts issued from "clouds" (ensembles) of initial states. The divergence of the trajectories allows us to establish the connections between zones of low predictability, the geometry of the stable manifolds, and transitions between regimes.
Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection
2015-02-01
multi-resolution stacking (MRS), which is a stack of ensemble classifiers. Each classifier in a building block inputs the concatenation of the predictions ...a base classifier in MRS, named boosted deep neural network (bDNN). bDNN first generates multiple base predictions from different contexts of a single...frame by only one DNN and then aggregates the base predictions for a better prediction of the frame, and it is different from computationally
Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke
2008-05-01
Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. 
The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods.
Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke
2008-01-01
Background Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. Results SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. Conclusion The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. 
The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods. PMID:18452616
NASA Astrophysics Data System (ADS)
Niu, Mingfei; Wang, Yufang; Sun, Shaolong; Li, Yongwu
2016-06-01
To enhance prediction reliability and accuracy, a hybrid model based on the promising principle of "decomposition and ensemble" and a recently proposed meta-heuristic called grey wolf optimizer (GWO) is introduced for daily PM2.5 concentration forecasting. Compared with existing PM2.5 forecasting methods, this proposed model has improved the prediction accuracy and hit rates of directional prediction. The proposed model involves three main steps, i.e., decomposing the original PM2.5 series into several intrinsic mode functions (IMFs) via complementary ensemble empirical mode decomposition (CEEMD) for simplifying the complex data; individually predicting each IMF with support vector regression (SVR) optimized by GWO; integrating all predicted IMFs for the ensemble result as the final prediction by another SVR optimized by GWO. Seven benchmark models, including single artificial intelligence (AI) models, other decomposition-ensemble models with different decomposition methods and models with the same decomposition-ensemble method but optimized by different algorithms, are considered to verify the superiority of the proposed hybrid model. The empirical study indicates that the proposed hybrid decomposition-ensemble model is remarkably superior to all considered benchmark models for its higher prediction accuracy and hit rates of directional prediction.
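The three steps named in this abstract (decompose, predict each component, integrate) can be sketched with deliberately simple stand-ins. Here a trailing moving average replaces CEEMD, a linear extrapolation and a short mean replace the GWO-tuned SVR predictors, and a plain sum replaces the second SVR used for integration. All of these substitutions, the function names, and the sample series are assumptions; only the pipeline structure follows the abstract.

```python
# Hypothetical sketch of a "decomposition and ensemble" forecast pipeline.
# Stand-ins: moving average (not CEEMD), naive predictors (not GWO-SVR),
# and summation (not a second SVR) for the integration step.

def decompose(series, window=2):
    """Split a series into a trailing moving-average trend and a residual."""
    trend = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        trend.append(sum(chunk) / len(chunk))
    residual = [x - t for x, t in zip(series, trend)]
    return trend, residual

def forecast_trend(trend):
    """Extrapolate the last step of the trend linearly."""
    return trend[-1] + (trend[-1] - trend[-2])

def forecast_residual(residual, k=3):
    """Predict the residual as the mean of its last k values."""
    tail = residual[-k:]
    return sum(tail) / len(tail)

def forecast_next(series, window=2):
    """Decompose, predict each component, then integrate by summing."""
    trend, residual = decompose(series, window)
    return forecast_trend(trend) + forecast_residual(residual)

if __name__ == "__main__":
    pm25 = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up daily values
    print(forecast_next(pm25))
```

The point of the structure is that each component is simpler to predict than the raw series; the quality of the final forecast then depends on the decomposition and on the per-component predictors, which is where the paper's CEEMD and GWO-tuned SVR come in.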
Tighe, Patrick J; Lucas, Stephen D; Edwards, David A; Boezaart, André P; Aytug, Haldun; Bihorac, Azra
2012-10-01
The purpose of this project was to determine whether machine-learning classifiers could predict which patients would require a preoperative acute pain service (APS) consultation. Retrospective cohort. University teaching hospital. The records of 9,860 surgical patients posted between January 1 and June 30, 2010 were reviewed. Request for APS consultation. A cohort of machine-learning classifiers was compared according to its ability or inability to classify surgical cases as requiring a request for a preoperative APS consultation. Classifiers were then optimized utilizing ensemble techniques. Computational efficiency was measured with the central processing unit processing times required for model training. Classifiers were tested using the full feature set, as well as the reduced feature set that was optimized using a merit-based dimensional reduction strategy. Machine-learning classifiers correctly predicted preoperative requests for APS consultations in 92.3% (95% confidence intervals [CI], 91.8-92.8) of all surgical cases. Bayesian methods yielded the highest area under the receiver operating curve (0.87, 95% CI 0.84-0.89) and lowest training times (0.0018 seconds, 95% CI, 0.0017-0.0019 for the NaiveBayesUpdateable algorithm). An ensemble of high-performing machine-learning classifiers did not yield a higher area under the receiver operating curve than its component classifiers. Dimensional reduction decreased the computational requirements for multiple classifiers, but did not adversely affect classification performance. Using historical data, machine-learning classifiers can predict which surgical cases should prompt a preoperative request for an APS consultation. Dimensional reduction improved computational efficiency and preserved predictive performance. Wiley Periodicals, Inc.
Zaĭtsev, A A; Khodashinskiĭ, I A; Plotnikov, O O
2011-01-01
The need for the most efficacious tools and methods for the prevention and treatment of various diseases and for patient rehabilitation necessitates a search for new means of optimally correcting the individual reserves of the organism. One approach to this problem is to model the prognosis of the curative effects of non-drug therapy. It is proposed to choose the therapeutic program using an ensemble of classifiers. Two types are considered, one based on decision trees, the other on a fuzzy rule base. Software was developed that ensures high accuracy of the prognosis of the efficiency of two programs of spa and resort treatment.
Evolutionary Wavelet Neural Network ensembles for breast cancer and Parkinson's disease prediction.
Khan, Maryam Mahsal; Mendes, Alexandre; Chalup, Stephan K
2018-01-01
Wavelet Neural Networks are a combination of neural networks and wavelets and have been mostly used in the area of time-series prediction and control. Recently, Evolutionary Wavelet Neural Networks have been employed to develop cancer prediction models. The present study proposes to use ensembles of Evolutionary Wavelet Neural Networks. The search for a high quality ensemble is directed by a fitness function that incorporates the accuracy of the classifiers both independently and as part of the ensemble itself. The ensemble approach is tested on three publicly available biomedical benchmark datasets, one on Breast Cancer and two on Parkinson's disease, using a 10-fold cross-validation strategy. Our experimental results show that, for the first dataset, the performance was similar to previous studies reported in literature. On the second dataset, the Evolutionary Wavelet Neural Network ensembles performed better than all previous methods. The third dataset is relatively new and this study is the first to report benchmark results.
Evolutionary Wavelet Neural Network ensembles for breast cancer and Parkinson’s disease prediction
Mendes, Alexandre; Chalup, Stephan K.
2018-01-01
Wavelet Neural Networks are a combination of neural networks and wavelets and have been mostly used in the area of time-series prediction and control. Recently, Evolutionary Wavelet Neural Networks have been employed to develop cancer prediction models. The present study proposes to use ensembles of Evolutionary Wavelet Neural Networks. The search for a high quality ensemble is directed by a fitness function that incorporates the accuracy of the classifiers both independently and as part of the ensemble itself. The ensemble approach is tested on three publicly available biomedical benchmark datasets, one on Breast Cancer and two on Parkinson’s disease, using a 10-fold cross-validation strategy. Our experimental results show that, for the first dataset, the performance was similar to previous studies reported in literature. On the second dataset, the Evolutionary Wavelet Neural Network ensembles performed better than all previous methods. The third dataset is relatively new and this study is the first to report benchmark results. PMID:29420578
Optimization of Support Vector Machine (SVM) for Object Classification
NASA Technical Reports Server (NTRS)
Scholten, Matthew; Dhingra, Neil; Lu, Thomas T.; Chao, Tien-Hsin
2012-01-01
The Support Vector Machine (SVM) is a powerful algorithm, useful in classifying data into species. The SVMs implemented in this research were used as classifiers for the final stage in a Multistage Automatic Target Recognition (ATR) system. A single kernel SVM known as SVMlight, and a modified version known as a SVM with K-Means Clustering were used. These SVM algorithms were tested as classifiers under varying conditions. Image noise levels varied, and the orientation of the targets changed. The classifiers were then optimized to demonstrate their maximum potential as classifiers. Results demonstrate the reliability of SVM as a method for classification. From trial to trial, SVM produces consistent results.
Support vector machines classifiers of physical activities in preschoolers
USDA-ARS's Scientific Manuscript database
The goal of this study is to develop, test, and compare multinomial logistic regression (MLR) and support vector machines (SVM) in classifying preschool-aged children physical activity data acquired from an accelerometer. In this study, 69 children aged 3-5 years old were asked to participate in a s...
USDA-ARS's Scientific Manuscript database
This paper presents a novel wrinkle evaluation method that uses modified wavelet coefficients and an optimized support-vector-machine (SVM) classification scheme to characterize and classify wrinkle appearance of fabric. Fabric images were decomposed with the wavelet transform (WT), and five parame...
Pai, Priyadarshini P; Dattatreya, Rohit Kadam; Mondal, Sukanta
2017-11-01
Enzyme interactions with ligands are crucial for various biochemical reactions governing life. Over many years attempts to identify these residues for biotechnological manipulations have been made using experimental and computational techniques. The computational approaches have gathered impetus with the accruing availability of sequence and structure information, broadly classified into template-based and de novo methods. One of the predominant de novo methods using sequence information involves application of biological properties for supervised machine learning. Here, we propose a support vector machines-based ensemble for prediction of protein-ligand interacting residues using one of the most important discriminative contributing properties in the interacting residue neighbourhood, i.e., evolutionary information in the form of a position-specific scoring matrix (PSSM). The study has been performed on a non-redundant dataset comprising 9269 interacting and 91773 non-interacting residues for prediction model generation and further evaluation. Of the various PSSM-based models explored, the proposed method named ROBBY (pRediction Of Biologically relevant small molecule Binding residues on enzYmes) shows an accuracy of 84.0 %, Matthews Correlation Coefficient of 0.343 and F-measure of 39.0 % on 78 test enzymes. Further, the scope of adding domain knowledge such as pocket information has also been investigated; results showed significant enhancement in method precision. Findings are hoped to boost the reliability of small-molecule ligand interaction prediction for enzyme applications and drug design. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
AUC-Maximizing Ensembles through Metalearning.
LeDell, Erin; van der Laan, Mark J; Petersen, Maya
2016-05-01
Area Under the ROC Curve (AUC) is often used to measure the performance of an estimator in binary classification problems. An AUC-maximizing classifier can have significant advantages in cases where ranking correctness is valued or if the outcome is rare. In a Super Learner ensemble, maximization of the AUC can be achieved by the use of an AUC-maximizing metalearning algorithm. We discuss an implementation of an AUC-maximization technique that is formulated as a nonlinear optimization problem. We also evaluate the effectiveness of a large number of different nonlinear optimization algorithms to maximize the cross-validated AUC of the ensemble fit. The results provide evidence that AUC-maximizing metalearners can, and often do, out-perform non-AUC-maximizing metalearning methods, with respect to ensemble AUC. The results also demonstrate that as the level of imbalance in the training data increases, the Super Learner ensemble outperforms the top base algorithm by a larger degree.
AUC-Maximizing Ensembles through Metalearning
LeDell, Erin; van der Laan, Mark J.; Peterson, Maya
2016-01-01
Area Under the ROC Curve (AUC) is often used to measure the performance of an estimator in binary classification problems. An AUC-maximizing classifier can have significant advantages in cases where ranking correctness is valued or if the outcome is rare. In a Super Learner ensemble, maximization of the AUC can be achieved by the use of an AUC-maximizing metalearning algorithm. We discuss an implementation of an AUC-maximization technique that is formulated as a nonlinear optimization problem. We also evaluate the effectiveness of a large number of different nonlinear optimization algorithms to maximize the cross-validated AUC of the ensemble fit. The results provide evidence that AUC-maximizing metalearners can, and often do, out-perform non-AUC-maximizing metalearning methods, with respect to ensemble AUC. The results also demonstrate that as the level of imbalance in the training data increases, the Super Learner ensemble outperforms the top base algorithm by a larger degree. PMID:27227721
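The metalearning step described in this abstract can be illustrated with a small sketch. It assumes the base learners' cross-validated predictions are already available and combines them with non-negative weights that sum to 1. A coarse grid search stands in for the nonlinear optimizers the abstract evaluates, and the function names (`auc`, `combine`, `best_convex_weights`) are hypothetical, not taken from the paper or from any Super Learner package.

```python
from itertools import product


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def combine(weights, base_preds):
    """Weighted sum of base-learner predictions, giving one combined score per example."""
    n_examples = len(base_preds[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, base_preds))
        for i in range(n_examples)
    ]


def best_convex_weights(base_preds, labels, steps=10):
    """Grid-search weights on the simplex (assumed stand-in for a nonlinear optimizer)."""
    best_w, best_auc = None, -1.0
    for grid in product(range(steps + 1), repeat=len(base_preds)):
        if sum(grid) != steps:
            continue  # keep only weight vectors that sum to 1
        w = [g / steps for g in grid]
        a = auc(combine(w, base_preds), labels)
        if a > best_auc:
            best_w, best_auc = w, a
    return best_w, best_auc
```

The grid search is exhaustive and only practical for a handful of base learners; it is meant to show the objective (ensemble AUC as a function of the combination weights), not an efficient way to optimize it.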
Drug-target interaction prediction using ensemble learning and dimensionality reduction.
Ezzat, Ali; Wu, Min; Li, Xiao-Li; Kwoh, Chee-Keong
2017-10-01
Experimental prediction of drug-target interactions is expensive, time-consuming and tedious. Fortunately, computational methods help narrow down the search space for interaction candidates to be further examined via wet-lab techniques. Nowadays, the number of attributes/features for drugs and targets, as well as the amount of their interactions, are increasing, making these computational methods inefficient or occasionally prohibitive. This motivates us to derive a reduced feature set for prediction. In addition, since ensemble learning techniques are widely used to improve the classification performance, it is also worthwhile to design an ensemble learning framework to enhance the performance for drug-target interaction prediction. In this paper, we propose a framework for drug-target interaction prediction leveraging both feature dimensionality reduction and ensemble learning. First, we conducted feature subspacing to inject diversity into the classifier ensemble. Second, we applied three different dimensionality reduction methods to the subspaced features. Third, we trained homogeneous base learners with the reduced features and then aggregated their scores to derive the final predictions. For base learners, we selected two classifiers, namely Decision Tree and Kernel Ridge Regression, resulting in two variants of ensemble models, EnsemDT and EnsemKRR, respectively. In our experiments, we utilized AUC (Area under ROC Curve) as an evaluation metric. We compared our proposed methods with various state-of-the-art methods under 5-fold cross validation. Experimental results showed EnsemKRR achieving the highest AUC (94.3%) for predicting drug-target interactions. In addition, dimensionality reduction helped improve the performance of EnsemDT. In conclusion, our proposed methods produced significant improvements for drug-target interaction prediction. Copyright © 2017 Elsevier Inc. All rights reserved.
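The three-stage framework in this abstract (feature subspacing, reduction, aggregation of homogeneous base learners) can be sketched roughly as follows. This is an illustration under stated assumptions: the dimensionality reduction stage is omitted, and a nearest-centroid scorer is swapped in for the Decision Tree and Kernel Ridge Regression base learners named in the abstract. All function names are hypothetical.

```python
import random


def sample_subspaces(n_features, n_learners, subspace_size, seed=0):
    """Draw one random feature subset per base learner to inject diversity."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_features), subspace_size))
        for _ in range(n_learners)
    ]


def project(rows, columns):
    """Keep only the selected feature columns of each row."""
    return [[row[c] for c in columns] for row in rows]


def fit_centroid_scorer(X, y):
    """Stand-in base learner: higher score means closer to the positive-class centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    pos = centroid([x for x, t in zip(X, y) if t == 1])
    neg = centroid([x for x, t in zip(X, y) if t == 0])
    return lambda x: dist2(x, neg) - dist2(x, pos)


def fit_ensemble(X, y, n_learners=5, subspace_size=2, seed=0):
    """Train one base learner per feature subspace."""
    subspaces = sample_subspaces(len(X[0]), n_learners, subspace_size, seed)
    return [(cols, fit_centroid_scorer(project(X, cols), y)) for cols in subspaces]


def ensemble_score(model, x):
    """Aggregate by averaging the base learners' scores."""
    return sum(scorer([x[c] for c in cols]) for cols, scorer in model) / len(model)
```

Averaging is only one possible aggregation rule; the abstract says the scores were aggregated but does not say how.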
Kim, Yoon Jae; Heo, Jeong; Park, Kwang Suk; Kim, Sungwan
2016-08-01
Arrhythmia refers to a group of conditions in which the heartbeat is irregular, fast, or slow due to abnormal electrical activity in the heart. Some types of arrhythmia such as ventricular fibrillation may result in cardiac arrest or death. Thus, arrhythmia detection becomes an important issue, and various studies have been conducted. Additionally, an arrhythmia detection algorithm for portable devices such as mobile phones has recently been developed because of increasing interest in e-health care. This paper proposes a novel classification approach and features, which are validated for improved real-time arrhythmia monitoring. The classification approach that was employed for arrhythmia detection is based on the concept of ensemble learning and the Taguchi method and has the advantage of being accurate and computationally efficient. The electrocardiography (ECG) data for arrhythmia detection was obtained from the MIT-BIH Arrhythmia Database (n=48). A novel feature, namely the heart rate variability calculated from 5s segments of ECG, which was not considered previously, was used. The novel classification approach and feature demonstrated arrhythmia detection accuracy of 89.13%. When the same data was classified using the conventional support vector machine (SVM), the obtained accuracy was 91.69%, 88.14%, and 88.74% for Gaussian, linear, and polynomial kernels, respectively. In terms of computation time, the proposed classifier was 5821.7 times faster than conventional SVM. In conclusion, the proposed classifier and feature showed performance comparable to those of previous studies, while the computational complexity and update interval were highly reduced. Copyright © 2016 Elsevier Ltd. All rights reserved.
An implementation of support vector machine on sentiment classification of movie reviews
NASA Astrophysics Data System (ADS)
Yulietha, I. M.; Faraby, S. A.; Adiwijaya; Widyaningtyas, W. C.
2018-03-01
With technological advances, a great deal of information about movies is available on the internet. Processed properly, this information can yield useful insight. This research classifies sentiment in movie review documents. It uses the Support Vector Machine (SVM) method because SVM can classify high-dimensional data, which matches the text data used in this research. SVM is a popular machine learning technique for text classification because it learns from a collection of previously classified documents and can provide good results. In terms of dataset composition, the 90-10 composition gave the best result, 85.6%. In terms of SVM kernel, the linear kernel with constant 1 gave the best result, 84.9%.
Feature detection in satellite images using neural network technology
NASA Technical Reports Server (NTRS)
Augusteijn, Marijke F.; Dimalanta, Arturo S.
1992-01-01
A feasibility study of automated classification of satellite images is described. Satellite images were characterized by the textures they contain. In particular, the detection of cloud textures was investigated. The method of second-order gray level statistics, using co-occurrence matrices, was applied to extract feature vectors from image segments. Neural network technology was employed to classify these feature vectors. The cascade-correlation architecture was successfully used as a classifier. The use of a Kohonen network was also investigated but this architecture could not reliably classify the feature vectors due to the complicated structure of the classification problem. The best results were obtained when data from different spectral bands were fused.
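The second-order gray-level statistics mentioned in this abstract are computed from a co-occurrence matrix, which counts how often a pixel of gray level i has a neighbour of gray level j at a fixed offset. The sketch below is a generic illustration: the offset, the number of gray levels, and the two derived features (contrast and energy) are assumptions, since the abstract does not list which statistics the authors used.

```python
def cooccurrence_matrix(image, levels, dx=1, dy=0):
    """Count pairs (i, j) where a pixel of level i has a neighbour of level j at offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


def texture_features(counts):
    """Normalize the counts to probabilities and derive two common texture statistics."""
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("co-occurrence matrix is empty")
    n = len(counts)
    p = [[v / total for v in row] for row in counts]
    contrast = sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))
    energy = sum(p[i][j] ** 2 for i in range(n) for j in range(n))
    return {"contrast": contrast, "energy": energy}
```

The resulting dictionary of statistics plays the role of the feature vector that would then be passed to a classifier.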
2012-03-01
with each SVM discriminating between a pair of the N total speakers in the data set. The (( + 1))/2 classifiers then vote on the final...classification of a test sample. The Random Forest classifier is an ensemble classifier that votes amongst decision trees generated with each node using...Forest vote, and the effects of overtraining will be mitigated by the fact that each decision tree is overtrained differently (due to the random
Ensemble Feature Learning of Genomic Data Using Support Vector Machine
Anaissi, Ali; Goyal, Madhu; Catchpoole, Daniel R.; Braytee, Ali; Kennedy, Paul J.
2016-01-01
The identification of a subset of genes having the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention but mostly on classification not gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy which is the rationale of RFE algorithm. The rationale behind this is, building ensemble SVM models using randomly drawn bootstrap samples from the training set, will produce different feature rankings which will be subsequently aggregated as one feature ranking. As a result, the decision for elimination of features is based upon the ranking of multiple SVM models instead of choosing one particular model. Moreover, this approach will address the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that an average 9% better accuracy is achieved by ESVM-RFE over SVM-RFE, and 5% over random forest based approach. The selected genes by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD) which reveals significant clusters with the selected data. PMID:27304923
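The ensemble elimination loop described in this abstract can be sketched as below. It is a simplified illustration, not the authors' ESVM-RFE: a class-mean difference is swapped in for the SVM weight magnitudes used to rank features, ranks are aggregated by summation, and one feature is removed per round. All names and defaults are hypothetical.

```python
import random


def feature_scores(X, y, features):
    """Stand-in for SVM weight magnitudes: |mean(class 1) - mean(class 0)| per feature."""
    scores = {}
    for f in features:
        pos = [x[f] for x, t in zip(X, y) if t == 1]
        neg = [x[f] for x, t in zip(X, y) if t == 0]
        scores[f] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return scores


def balanced_bootstrap(X, y, rng):
    """Sample with replacement, drawing the same number of examples from each class."""
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    m = min(len(pos), len(neg))
    idx = [rng.choice(pos) for _ in range(m)] + [rng.choice(neg) for _ in range(m)]
    return [X[i] for i in idx], [y[i] for i in idx]


def ensemble_rfe(X, y, n_keep, n_models=25, seed=0):
    """Backward elimination driven by feature ranks aggregated over bootstrap models."""
    rng = random.Random(seed)
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        rank_sum = {f: 0 for f in features}
        for _ in range(n_models):
            Xb, yb = balanced_bootstrap(X, y, rng)
            scores = feature_scores(Xb, yb, features)
            ordered = sorted(features, key=lambda f: scores[f], reverse=True)
            for rank, f in enumerate(ordered):
                rank_sum[f] += rank  # rank 0 is the most important feature
        # Drop the feature with the worst aggregated rank.
        features.remove(max(features, key=lambda f: rank_sum[f]))
    return features
```

Because the elimination decision uses the summed ranks, a feature is dropped only if it ranks poorly across many bootstrap samples, which is the point the abstract makes about not relying on one particular model.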
A Hybrid Classification System for Heart Disease Diagnosis Based on the RFRS Method.
Liu, Xiao; Wang, Xiaoli; Su, Qiang; Zhang, Mo; Zhu, Yanhong; Wang, Qiugen; Wang, Qian
2017-01-01
Heart disease is one of the most common diseases in the world. The objective of this study is to aid the diagnosis of heart disease using a hybrid classification system based on the ReliefF and Rough Set (RFRS) method. The proposed system contains two subsystems: the RFRS feature selection system and a classification system with an ensemble classifier. The first system includes three stages: (i) data discretization, (ii) feature extraction using the ReliefF algorithm, and (iii) feature reduction using the heuristic Rough Set reduction algorithm that we developed. In the second system, an ensemble classifier is proposed based on the C4.5 classifier. The Statlog (Heart) dataset, obtained from the UCI database, was used for experiments. A maximum classification accuracy of 92.59% was achieved according to a jackknife cross-validation scheme. The results demonstrate that the performance of the proposed system is superior to the performances of previously reported classification techniques.
Modality-Driven Classification and Visualization of Ensemble Variance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bensema, Kevin; Gosink, Luke; Obermaier, Harald
Advances in computational power now enable domain scientists to address conceptual and parametric uncertainty by running simulations multiple times in order to sufficiently sample the uncertain input space. While this approach helps address conceptual and parametric uncertainties, the ensemble datasets produced by this technique present a special challenge to visualization researchers as the ensemble dataset records a distribution of possible values for each location in the domain. Contemporary visualization approaches that rely solely on summary statistics (e.g., mean and variance) cannot convey the detailed information encoded in ensemble distributions that are paramount to ensemble analysis; summary statistics provide no information about modality classification and modality persistence. To address this problem, we propose a novel technique that classifies high-variance locations based on the modality of the distribution of ensemble predictions. Additionally, we develop a set of confidence metrics to inform the end-user of the quality of fit between the distribution at a given location and its assigned class. We apply a similar method to time-varying ensembles to illustrate the relationship between peak variance and bimodal or multimodal behavior. These classification schemes enable a deeper understanding of the behavior of the ensemble members by distinguishing between distributions that can be described by a single tendency and distributions which reflect divergent trends in the ensemble.
Conformational and functional analysis of molecular dynamics trajectories by Self-Organising Maps
2011-01-01
Background Molecular dynamics (MD) simulations are powerful tools to investigate the conformational dynamics of proteins that is often a critical element of their function. Identification of functionally relevant conformations is generally done clustering the large ensemble of structures that are generated. Recently, Self-Organising Maps (SOMs) were reported performing more accurately and providing more consistent results than traditional clustering algorithms in various data mining problems. We present a novel strategy to analyse and compare conformational ensembles of protein domains using a two-level approach that combines SOMs and hierarchical clustering. Results The conformational dynamics of the α-spectrin SH3 protein domain and six single mutants were analysed by MD simulations. The Cα's Cartesian coordinates of conformations sampled in the essential space were used as input data vectors for SOM training, then complete linkage clustering was performed on the SOM prototype vectors. A specific protocol to optimize a SOM for structural ensembles was proposed: the optimal SOM was selected by means of a Taguchi experimental design plan applied to different data sets, and the optimal sampling rate of the MD trajectory was selected. The proposed two-level approach was applied to single trajectories of the SH3 domain independently as well as to groups of them at the same time. The results demonstrated the potential of this approach in the analysis of large ensembles of molecular structures: the possibility of producing a topological mapping of the conformational space in a simple 2D visualisation, as well as of effectively highlighting differences in the conformational dynamics directly related to biological functions. Conclusions The use of a two-level approach combining SOMs and hierarchical clustering for conformational analysis of structural ensembles of proteins was proposed. 
It can easily be extended to other study cases and to conformational ensembles from other sources. PMID:21569575
Confident Surgical Decision Making in Temporal Lobe Epilepsy by Heterogeneous Classifier Ensembles
Fakhraei, Shobeir; Soltanian-Zadeh, Hamid; Jafari-Khouzani, Kourosh; Elisevich, Kost; Fotouhi, Farshad
2015-01-01
In medical domains with low tolerance for invalid predictions, classification confidence is highly important and traditional performance measures such as overall accuracy cannot provide adequate insight into classification reliability. In this paper, a confident-prediction rate (CPR) which measures the upper limit of confident predictions has been proposed based on receiver operating characteristic (ROC) curves. It has been shown that a heterogeneous ensemble of classifiers improves this measure. This ensemble approach has been applied to lateralization of focal epileptogenicity in temporal lobe epilepsy (TLE) and prediction of surgical outcomes. A goal of this study is to reduce the requirement for extraoperative electrocorticography (eECoG), which is the practice of using electrodes placed directly on the exposed surface of the brain. We have shown that such a goal is achievable with application of data mining techniques. Furthermore, not all TLE surgical operations result in complete relief from seizures and it is not always possible for human experts to identify such unsuccessful cases prior to surgery. This study demonstrates the capability of data mining techniques in prediction of undesirable outcomes for a portion of such cases. PMID:26609547
Evaluation of handwriting kinematics and pressure for differential diagnosis of Parkinson's disease.
Drotár, Peter; Mekyska, Jiří; Rektorová, Irena; Masarová, Lucia; Smékal, Zdeněk; Faundez-Zanuy, Marcos
2016-02-01
We present the PaHaW Parkinson's disease handwriting database, consisting of handwriting samples from Parkinson's disease (PD) patients and healthy controls. Our goal is to show that kinematic features and pressure features in handwriting can be used for the differential diagnosis of PD. The database contains records from 37 PD patients and 38 healthy controls performing eight different handwriting tasks. The tasks include drawing an Archimedean spiral, repetitively writing orthographically simple syllables and words, and writing of a sentence. In addition to the conventional kinematic features related to the dynamics of handwriting, we investigated new pressure features based on the pressure exerted on the writing surface. To discriminate between PD patients and healthy subjects, three different classifiers were compared: K-nearest neighbors (K-NN), ensemble AdaBoost classifier, and support vector machines (SVM). For predicting PD based on kinematic and pressure features of handwriting, the best performing model was SVM with classification accuracy of Pacc=81.3% (sensitivity Psen=87.4% and specificity of Pspe=80.9%). When evaluated separately, pressure features proved to be relevant for PD diagnosis, yielding Pacc=82.5% compared to Pacc=75.4% using kinematic features. Experimental results showed that an analysis of kinematic and pressure features during handwriting can help assess subtle characteristics of handwriting and discriminate between PD patients and healthy controls. Copyright © 2016 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Elsberry, Russell L.; Jordan, Mary S.; Vitart, Frederic
2010-05-01
The objective of this study is to provide evidence of predictability on intraseasonal time scales (10-30 days) for western North Pacific tropical cyclone formation and subsequent tracks using the 51-member ECMWF 32-day forecasts made once a week from 5 June through 25 December 2008. Ensemble storms are defined by grouping ensemble member vortices whose positions are within a specified separation distance that is equal to 180 n mi at the initial forecast time t and increases linearly to 420 n mi at Day 14 and then is constant. The 12-h track segments are calculated with a Weighted-Mean Vector Motion technique in which the weighting factor is inversely proportional to the distance from the endpoint of the previous 12-h motion vector. Seventy-six percent of the ensemble storms had five or fewer member vortices. On average, the ensemble storms begin 2.5 days before the first entry of the Joint Typhoon Warning Center (JTWC) best-track file, tend to translate too slowly in the deep tropics, and persist for longer periods over land. A strict objective matching technique with the JTWC storms is combined with a second subjective procedure that is then applied to identify nearby ensemble storms that would indicate a greater likelihood of a tropical cyclone developing in that region with that track orientation. The ensemble storms identified in the ECMWF 32-day forecasts provided guidance on intraseasonal timescales of the formations and tracks of the three strongest typhoons and two other typhoons, but not for two early season typhoons and the late season Dolphin. Four strong tropical storms were predicted consistently over Week-1 through Week-4, as was one weak tropical storm. Two other weak tropical storms, three tropical cyclones that developed from precursor baroclinic systems, and three other tropical depressions were not predicted on intraseasonal timescales. 
At least for the strongest tropical cyclones during the peak season, the ECMWF 32-day ensemble provides guidance of formation and tracks on 10-30 day timescales.
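The time-dependent separation distance used to group member vortices into ensemble storms is stated explicitly in this abstract (180 n mi at the initial time, rising linearly to 420 n mi at Day 14, then constant), so it can be written down directly. The pairwise grouping test below is an assumption added for illustration; the abstract does not say whether distances are measured between pairs of vortices or to a group centre, and the haversine distance and Earth radius value are standard choices rather than details from the study.

```python
import math

EARTH_RADIUS_NMI = 3440.065  # mean Earth radius in nautical miles


def separation_threshold_nmi(forecast_day):
    """Grouping distance: 180 n mi at day 0, linear to 420 n mi at day 14, then constant."""
    if forecast_day < 0:
        raise ValueError("forecast_day must be non-negative")
    if forecast_day >= 14:
        return 420.0
    return 180.0 + (420.0 - 180.0) * forecast_day / 14.0


def great_circle_nmi(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_NMI * math.asin(math.sqrt(a))


def same_ensemble_storm(vortex_a, vortex_b, forecast_day):
    """Assumed pairwise test: two (lat, lon) vortices group if within the threshold."""
    distance = great_circle_nmi(vortex_a[0], vortex_a[1], vortex_b[0], vortex_b[1])
    return distance <= separation_threshold_nmi(forecast_day)
```

For example, two vortices about 240 n mi apart would not be grouped at the initial time but would be grouped at Day 7, when the threshold has grown to 300 n mi.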
Hu, Wenjun; Chung, Fu-Lai; Wang, Shitong
2012-03-01
Although pattern classification has been extensively studied in the past decades, how to effectively solve the corresponding training on large datasets is a problem that still requires particular attention. Many kernelized classification methods, such as SVM and SVDD, can be formulated as the corresponding quadratic programming (QP) problems, but computing the associated kernel matrices requires O(n^2) (or even up to O(n^3)) computational complexity, where n is the size of the training patterns, which heavily limits the applicability of these methods for large datasets. In this paper, a new classification method called the maximum vector-angular margin classifier (MAMC) is first proposed based on the vector-angular margin to find an optimal vector c in the pattern feature space, and all the testing patterns can be classified in terms of the maximum vector-angular margin ρ, between the vector c and all the training data points. Accordingly, it is proved that the kernelized MAMC can be equivalently formulated as the kernelized Minimum Enclosing Ball (MEB), which leads to a distinctive merit of MAMC, i.e., it has the flexibility of controlling the sum of support vectors like ν-SVC and may be extended to a maximum vector-angular margin core vector machine (MAMCVM) by connecting the core vector machine (CVM) method with MAMC such that the corresponding fast training on large datasets can be effectively achieved. Experimental results on artificial and real datasets are provided to validate the power of the proposed methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
Progressive Classification Using Support Vector Machines
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri; Kocurek, Michael
2009-01-01
An algorithm for progressive classification of data, analogous to progressive rendering of images, makes it possible to compromise between speed and accuracy. This algorithm uses support vector machines (SVMs) to classify data. An SVM is a machine learning algorithm that builds a mathematical model of the desired classification concept by identifying the critical data points, called support vectors. Coarse approximations to the concept require only a few support vectors, while precise, highly accurate models require far more support vectors. Once the model has been constructed, the SVM can be applied to new observations. The cost of classifying a new observation is proportional to the number of support vectors in the model. When computational resources are limited, an SVM of the appropriate complexity can be produced. However, if the constraints are not known when the model is constructed, or if they can change over time, a method for adaptively responding to the current resource constraints is required. This capability is particularly relevant for spacecraft (or any other real-time systems) that perform onboard data analysis. The new algorithm enables the fast, interactive application of an SVM classifier to a new set of data. The classification process achieved by this algorithm is characterized as progressive because a coarse approximation to the true classification is generated rapidly and thereafter iteratively refined. The algorithm uses two SVMs: (1) a fast, approximate one and (2) slow, highly accurate one. New data are initially classified by the fast SVM, producing a baseline approximate classification. For each classified data point, the algorithm calculates a confidence index that indicates the likelihood that it was classified correctly in the first pass. Next, the data points are sorted by their confidence indices and progressively reclassified by the slower, more accurate SVM, starting with the items most likely to be incorrectly classified. 
The user can halt this reclassification process at any point, thereby obtaining the best possible result for a given amount of computation time. Alternatively, the results can be displayed as they are generated, providing the user with real-time feedback about the current accuracy of classification.
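The progressive scheme described in this abstract (fast first pass, confidence index, reclassification in order of least confidence) can be sketched as a generator that yields the current labelling after every refinement, so the caller may stop at any point. The models and the confidence function are passed in as plain callables; their names and the generator interface are assumptions made for this illustration, not the NASA implementation.

```python
def progressive_classify(points, fast_model, slow_model, confidence):
    """Yield successive labelings: the fast baseline first, then one refinement per step.

    fast_model and slow_model map a point to a label; confidence maps a point to a
    number that is higher when the fast label is more likely to be correct.
    """
    labels = [fast_model(p) for p in points]
    yield list(labels)  # baseline approximate classification

    # Revisit points starting with those most likely to be misclassified.
    order = sorted(range(len(points)), key=lambda i: confidence(points[i]))
    for i in order:
        labels[i] = slow_model(points[i])
        yield list(labels)
```

A caller with a fixed time budget can take only the first few yielded labelings, for instance with `itertools.islice`, and keep the last one received.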
Hajiloo, Mohsen; Sapkota, Yadav; Mackey, John R; Robson, Paula; Greiner, Russell; Damaraju, Sambasivarao
2013-02-22
Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case-control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification but each has limitations. We provide an alternative technique to address population stratification. We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual's continental and sub-continental ancestry. To predict an individual's continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control's λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). 
In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver. ETHNOPRED is a novel technique for producing classifiers that can identify an individual's continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large scale GWASs, and robust to missing values.
Helms Tillery, S I; Taylor, D M; Schwartz, A B
2003-01-01
We have recently developed a closed-loop environment in which we can test the ability of primates to control the motion of a virtual device using ensembles of simultaneously recorded neurons /29/. Here we use a maximum likelihood method to assess the information about task performance contained in the neuronal ensemble. We trained two animals to control the motion of a computer cursor in three dimensions. Initially the animals controlled cursor motion using arm movements, but eventually they learned to drive the cursor directly from cortical activity. Using a population vector (PV) based upon the relation between cortical activity and arm motion, the animals were able to control the cursor directly from the brain in a closed-loop environment, but with difficulty. We added a supervised learning method that modified the parameters of the PV according to task performance (adaptive PV), and found that animals were able to exert much finer control over the cursor motion from brain signals. Here we describe a maximum likelihood method (ML) to assess the information about target contained in neuronal ensemble activity. Using this method, we compared the information about target contained in the ensemble during arm control, during brain control early in the adaptive PV, and during brain control after the adaptive PV had settled and the animal could drive the cursor reliably and with fine gradations. During the arm-control task, the ML was able to determine the target of the movement in as few as 10% of the trials, and as many as 75% of the trials, with an average of 65%. This average dropped when the animals used a population vector to control motion of the cursor. On average we could determine the target in around 35% of the trials. This low percentage was also reflected in poor control of the cursor, so that the animal was unable to reach the target in a large percentage of trials. 
Supervised adjustment of the population vector parameters produced new weighting coefficients and directional tuning parameters for many neurons. This produced a much better performance of the brain-controlled cursor motion. It was also reflected in the maximum likelihood measure of cell activity, producing the correct target based only on neuronal activity in over 80% of the trials on average. The changes in maximum likelihood estimates of target location based on ensemble firing show that an animal's ability to regulate the motion of a cortically controlled device is not crucially dependent on the experimenter's ability to estimate intention from neuronal activity.
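As a rough illustration of the population vector idea described in this abstract, the sketch below sums each cell's preferred direction weighted by its normalised change in firing rate. It is a minimal sketch, not the authors' implementation: the function name `population_vector`, the cosine-tuning assumption, and all numeric values are hypothetical.

```python
import math


def population_vector(rates, baselines, modulations, preferred_directions):
    """Sum each cell's preferred direction weighted by its normalised rate change."""
    vector = [0.0, 0.0, 0.0]
    for rate, base, mod, direction in zip(rates, baselines, modulations, preferred_directions):
        weight = (rate - base) / mod  # how strongly this cell "votes" for its direction
        for axis in range(3):
            vector[axis] += weight * direction[axis]
    length = math.sqrt(sum(c * c for c in vector))
    # Return a unit vector when possible; a zero vector carries no direction.
    return [c / length for c in vector] if length > 0 else vector


# Six hypothetical cosine-tuned cells with preferred directions along +/- x, y, z.
preferred = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
baselines = [20.0] * 6      # baseline firing rates in spikes/s (made-up values)
modulations = [10.0] * 6    # tuning depths in spikes/s (made-up values)
intended = (0.6, 0.0, 0.8)  # unit vector toward the target
rates = [b + m * sum(i * p for i, p in zip(intended, d))
         for b, m, d in zip(baselines, modulations, preferred)]
decoded = population_vector(rates, baselines, modulations, preferred)
print([round(c, 3) for c in decoded])
```

In an adaptive scheme such as the one the abstract mentions, the baselines, modulations, and preferred directions would be the parameters adjusted according to task performance; that update rule is not shown here.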
Comparison of different deep learning approaches for parotid gland segmentation from CT images
NASA Astrophysics Data System (ADS)
Hänsch, Annika; Schwier, Michael; Gass, Tobias; Morgas, Tomasz; Haas, Benjamin; Klein, Jan; Hahn, Horst K.
2018-02-01
The segmentation of target structures and organs at risk is a crucial and very time-consuming step in radiotherapy planning. Good automatic methods can significantly reduce the time clinicians have to spend on this task. Due to its variability in shape and often low contrast to surrounding structures, segmentation of the parotid gland is especially challenging. Motivated by the recent success of deep learning, we study different deep learning approaches for parotid gland segmentation. Particularly, we compare 2D, 2D ensemble and 3D U-Net approaches and find that the 2D U-Net ensemble yields the best results with a mean Dice score of 0.817 on our test data. The ensemble approach reduces false positives without the need for an automatic region of interest detection. We also apply our trained 2D U-Net ensemble to segment the test data of the 2015 MICCAI head and neck auto-segmentation challenge. With a mean Dice score of 0.861, our classifier exceeds the highest mean score in the challenge. This shows that the method generalizes well onto data from independent sites. Since appropriate reference annotations are essential for training but often difficult and expensive to obtain, it is important to know how many samples are needed to properly train a neural network. We evaluate the classifier performance after training with differently sized training sets (50-450) and find that 250 cases (without using extensive data augmentation) are sufficient to obtain good results with the 2D ensemble. Adding more samples does not significantly improve the Dice score of the segmentations.
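The Dice score used to report these results measures the overlap between a predicted and a reference segmentation. A minimal sketch of the computation on flattened binary masks follows; the function names and the toy masks are assumptions for illustration only.

```python
def dice_score(pred, ref):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total


def mean_dice(cases):
    """Average the Dice score over (prediction, reference) pairs."""
    return sum(dice_score(p, r) for p, r in cases) / len(cases)


# Two toy cases: one partial overlap, one perfect match.
cases = [
    ([0, 1, 1, 0, 1, 0], [0, 1, 1, 1, 0, 0]),
    ([0, 0, 1, 1], [0, 0, 1, 1]),
]
print(mean_dice(cases))
```

A score of 1 means identical masks and 0 means no overlap, so a mean Dice of 0.817 corresponds to substantial but imperfect agreement with the reference annotations.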
ENSO Bred Vectors in Coupled Ocean-Atmosphere General Circulation Models
NASA Technical Reports Server (NTRS)
Yang, S. C.; Cai, Ming; Kalnay, E.; Rienecker, M.; Yuan, G.; Toth, ZA.
2004-01-01
The breeding method has been implemented in the NASA Seasonal-to-Interannual Prediction Project (NSIPP) Coupled General Circulation Model (CGCM) with the goal of improving operational seasonal to interannual climate predictions through ensemble forecasting and data assimilation. The coupled instability as captured by the breeding method is the first attempt to isolate the evolving ENSO instability and its corresponding global atmospheric response in a fully coupled ocean-atmosphere GCM. Our results show that the growth rate of the coupled bred vectors (BV) peaks at about 3 months before a background ENSO event. The dominant growing BV modes are reminiscent of the background ENSO anomalies and show a strong tropical response with wind/SST/thermocline interrelated in a manner similar to the background ENSO mode. They exhibit larger amplitudes in the eastern tropical Pacific, reflecting the natural dynamical sensitivity associated with the presence of the shallow thermocline. Moreover, the extratropical perturbations associated with these coupled BV modes reveal the variations related to the atmospheric teleconnection patterns associated with background ENSO variability, e.g. over the North Pacific and North America. A similar experiment was carried out with the NCEP/CFS03 CGCM. Comparisons between bred vectors from the NSIPP CGCM and NCEP/CFS03 CGCM demonstrate the robustness of the results. Our results strongly suggest that the breeding method can serve as a natural filter to identify the slowly varying, coupled instabilities in a coupled GCM, which can be used to construct ensemble perturbations for ensemble forecasts and to estimate the coupled background error covariance for coupled data assimilation.
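The breeding cycle itself can be sketched on a toy system. The example below is only an illustration under stated assumptions: it substitutes the three-variable Lorenz (1963) equations, integrated with a simple forward Euler step, for the coupled GCM, and the rescaling amplitude, breeding interval, and function names are all hypothetical choices. A control run and a perturbed run are advanced together, and at the end of each interval their difference is rescaled to a fixed amplitude and added back to the control.

```python
import math


def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward Euler step of the Lorenz (1963) equations (toy stand-in for a GCM)."""
    x, y, z = state
    return (
        x + dt * sigma * (y - x),
        y + dt * (x * (rho - z) - y),
        z + dt * (x * y - beta * z),
    )


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def breed(control, perturbation, cycles=50, steps_per_cycle=8, amplitude=1e-3, dt=0.01):
    """Run breeding cycles; return the final control state, bred vector, and growth rates."""
    size = norm(perturbation)
    bred = tuple(amplitude * c / size for c in perturbation)
    growth_rates = []
    for _ in range(cycles):
        perturbed = tuple(c + b for c, b in zip(control, bred))
        for _ in range(steps_per_cycle):
            control = lorenz_step(control, dt)
            perturbed = lorenz_step(perturbed, dt)
        diff = tuple(p - c for p, c in zip(perturbed, control))
        size = norm(diff)
        # Growth rate over the interval, in units of 1/time.
        growth_rates.append(math.log(size / amplitude) / (steps_per_cycle * dt))
        # Rescale the difference to the fixed amplitude: this is the bred vector.
        bred = tuple(amplitude * d / size for d in diff)
    return control, bred, growth_rates


final_control, bred_vector, rates = breed(control=(1.0, 1.0, 1.0), perturbation=(1.0, 0.0, 0.0))
print(norm(bred_vector), sum(rates) / len(rates))
```

The choice of rescaling amplitude and interval determines which instabilities survive the cycle, which is the sense in which breeding can act as a filter for slowly growing modes.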
Moore, Sean M.; Monaghan, Andrew; Griffith, Kevin S.; Apangu, Titus; Mead, Paul S.; Eisen, Rebecca J.
2012-01-01
Climate and weather influence the occurrence, distribution, and incidence of infectious diseases, particularly those caused by vector-borne or zoonotic pathogens. Thus, models based on meteorological data have helped predict when and where human cases are most likely to occur. Such knowledge aids in targeting limited prevention and control resources and may ultimately reduce the burden of diseases. Paradoxically, localities where such models could yield the greatest benefits, such as tropical regions where morbidity and mortality caused by vector-borne diseases are greatest, often lack high-quality in situ local meteorological data. Satellite- and model-based gridded climate datasets can be used to approximate local meteorological conditions in data-sparse regions; however, their accuracy varies. Here we investigate how the selection of a particular dataset can influence the outcomes of disease forecasting models. Our model system focuses on plague (Yersinia pestis infection) in the West Nile region of Uganda. The majority of recent human cases have been reported from East Africa and Madagascar, where meteorological observations are sparse and topography yields complex weather patterns. Using an ensemble of meteorological datasets and model-averaging techniques, we find that the number of suspected cases in the West Nile region was negatively associated with dry season rainfall (December-February) and positively associated with rainfall prior to the plague season. We demonstrate that ensembles of available meteorological datasets can be used to quantify climatic uncertainty and minimize its impacts on infectious disease models. These methods are particularly valuable in regions with sparse observational networks and high morbidity and mortality from vector-borne diseases. PMID:23024750
Hussain, Shaista; Basu, Arindam
2016-01-01
The development of power-efficient neuromorphic devices presents the challenge of designing spike pattern classification algorithms which can be implemented on low-precision hardware and can also achieve state-of-the-art performance. In our pursuit of meeting this challenge, we present a pattern classification model which uses a sparse connection matrix and exploits the mechanism of nonlinear dendritic processing to achieve high classification accuracy. A rate-based structural learning rule for multiclass classification is proposed which modifies a connectivity matrix of binary synaptic connections by choosing the best “k” out of “d” inputs to make connections on every dendritic branch (k << d). Because learning only modifies connectivity, the model is well suited for implementation in neuromorphic systems using address-event representation (AER). We develop an ensemble method which combines several dendritic classifiers to achieve enhanced generalization over individual classifiers. We have two major findings: (1) Our results demonstrate that an ensemble created with classifiers comprising a moderate number of dendrites performs better than both ensembles of perceptrons and of complex dendritic trees. (2) In order to determine the moderate number of dendrites required for a specific classification problem, a two-step solution is proposed. First, an adaptive approach is proposed which scales the relative size of the dendritic trees of neurons for each class. It works by progressively adding dendrites with a fixed number of synapses to the network, thereby allocating synaptic resources as per the complexity of the given problem. As a second step, theoretical capacity calculations are used to convert each neuronal dendritic tree to its optimal topology where dendrites of each class are assigned a different number of synapses.
The performance of the model is evaluated on classification of handwritten digits from the benchmark MNIST dataset and compared with other spike classifiers. We show that our system can achieve classification accuracy within 1-2% of other reported spike-based classifiers while using far fewer synaptic resources (only 7%) than other methods. Further, an ensemble classifier created with adaptively learned sizes can attain an accuracy of 96.4%, which is on par with the best reported performance of spike-based classifiers. Moreover, the proposed method achieves this by using about 20% of the synapses used by other spike algorithms. We also present results of applying our algorithm to classify the MNIST-DVS dataset collected from a real spike-based image sensor and show results comparable to the best reported ones (88.1% accuracy). For VLSI implementations, we show that the reduced synaptic memory can save up to 4X area compared to conventional crossbar topologies. Finally, we also present a biologically realistic spike-based version for calculating the correlations required by the structural learning rule and demonstrate the correspondence between the rate-based and spike-based methods of learning. PMID:27065782
NASA Astrophysics Data System (ADS)
Leena, N.; Saju, K. K.
2018-04-01
Nutritional deficiencies in plants are a major concern for farmers as they affect productivity and thus profit. The work aims to classify nutritional deficiencies in maize plants in a non-destructive manner using image processing and machine learning techniques. The colored images of the leaves are analyzed and classified with a multi-class support vector machine (SVM) method. Several images of maize leaves with known deficiencies such as nitrogen, phosphorus and potassium (NPK) are used to train the SVM classifier prior to the classification of test images. The results show that the method was able to classify and identify nutritional deficiencies.
Multi-view L2-SVM and its multi-view core vector machine.
Huang, Chengquan; Chung, Fu-lai; Wang, Shitong
2016-03-01
In this paper, a novel L2-SVM based classifier Multi-view L2-SVM is proposed to address multi-view classification tasks. The proposed Multi-view L2-SVM classifier does not have any bias in its objective function and hence has the flexibility like μ-SVC in the sense that the number of the yielded support vectors can be controlled by a pre-specified parameter. The proposed Multi-view L2-SVM classifier can make full use of the coherence and the difference of different views through imposing the consensus among multiple views to improve the overall classification performance. Besides, based on the generalized core vector machine GCVM, the proposed Multi-view L2-SVM classifier is extended into its GCVM version MvCVM which can realize its fast training on large scale multi-view datasets, with its asymptotic linear time complexity with the sample size and its space complexity independent of the sample size. Our experimental results demonstrated the effectiveness of the proposed Multi-view L2-SVM classifier for small scale multi-view datasets and the proposed MvCVM classifier for large scale multi-view datasets. Copyright © 2015 Elsevier Ltd. All rights reserved.
Liu, Bin; Wang, Shanyi; Dong, Qiwen; Li, Shumin; Liu, Xuan
2016-04-20
DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. With the rapid development of next-generation sequencing techniques, the number of protein sequences is increasing at an unprecedented rate. Thus it is necessary to develop computational methods to identify DNA-binding proteins based only on protein sequence information. In this study, a novel method called iDNA-KACC is presented, which combines the Support Vector Machine (SVM) and the auto-cross covariance transformation. The protein sequences are first converted into a profile-based protein representation, and then converted into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. The sequence order effect can be effectively captured by this scheme. These vectors are then fed into the SVM to discriminate the DNA-binding proteins from the non-DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and a Matthews correlation coefficient of 0.5 in a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach, and the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset show that iDNA-KACC-EL outperforms all the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA-binding protein identification.
Amozegar, M; Khorasani, K
2016-04-01
In this paper, a new approach for Fault Detection and Isolation (FDI) of gas turbine engines is proposed by developing an ensemble of dynamic neural network identifiers. For health monitoring of the gas turbine engine, its dynamics is first identified by constructing three separate or individual dynamic neural network architectures. Specifically, a dynamic multi-layer perceptron (MLP), a dynamic radial-basis function (RBF) neural network, and a dynamic support vector machine (SVM) are trained to individually identify and represent the gas turbine engine dynamics. Next, three ensemble-based techniques are developed to represent the gas turbine engine dynamics, namely, two heterogeneous ensemble models and one homogeneous ensemble model. It is first shown that all ensemble approaches do significantly improve the overall performance and accuracy of the developed system identification scheme when compared to each of the stand-alone solutions. The best selected stand-alone model (i.e., the dynamic RBF network) and the best selected ensemble architecture (i.e., the heterogeneous ensemble) in terms of their performances in achieving an accurate system identification are then selected for solving the FDI task. The required residual signals are generated by using both a single model-based solution and an ensemble-based solution under various gas turbine engine health conditions. Our extensive simulation studies demonstrate that the fault detection and isolation task achieved by using the residuals that are obtained from the dynamic ensemble scheme results in a significantly more accurate and reliable performance as illustrated through detailed quantitative confusion matrix analysis and comparative studies. Copyright © 2016 Elsevier Ltd. All rights reserved.
Benefits of an ultra large and multiresolution ensemble for estimating available wind power
NASA Astrophysics Data System (ADS)
Berndt, Jonas; Hoppe, Charlotte; Elbern, Hendrik
2016-04-01
In this study we investigate the benefits of an ultra large ensemble with up to 1000 members including multiple nesting with a target horizontal resolution of 1 km. The ensemble shall be used as a basis to detect events of extreme errors in wind power forecasting. The forecast value is the wind vector at wind turbine hub height (~100 m) in the short range (1 to 24 hours). Current wind power forecast systems already rest on NWP ensemble models. However, only calibrated ensembles from meteorological institutions serve as input so far, with limited spatial resolution (~10-80 km) and member number (~50). Perturbations related to the specific merits of wind power production are still missing. Thus, single extreme error events which are not detected by such ensemble power forecasts occur infrequently. The numerical forecast model used in this study is the Weather Research and Forecasting Model (WRF). Model uncertainties are represented by stochastic parametrization of sub-grid processes via stochastically perturbed parametrization tendencies, in conjunction with the complementary stochastic kinetic-energy backscatter scheme already provided by WRF. We perform continuous ensemble updates by comparing each ensemble member with available observations using a sequential importance resampling filter to improve the model accuracy while maintaining ensemble spread. Additionally, we use different ensemble systems from global models (ECMWF and GFS) as input and boundary conditions to capture different synoptic conditions. Critical weather situations which are connected to extreme error events are located and corresponding perturbation techniques are applied. The demanding computational effort is overcome by utilising the supercomputer JUQUEEN at the Forschungszentrum Juelich.
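A single update of a sequential importance resampling filter, as named in this abstract, can be sketched for scalar ensemble members. This is a minimal sketch under assumptions: the Gaussian observation likelihood, the function name `sir_update`, and the wind-speed numbers are illustrative and not taken from the study. Each member is weighted by how well it matches the observation, and the ensemble is then resampled with replacement in proportion to those weights.

```python
import math
import random


def sir_update(members, observation, obs_error_std, rng):
    """One sequential importance resampling step for scalar ensemble members."""
    # Importance weights from a Gaussian observation likelihood (an assumption).
    log_w = [-0.5 * ((m - observation) / obs_error_std) ** 2 for m in members]
    peak = max(log_w)  # subtract the maximum to avoid underflow in exp()
    weights = [math.exp(lw - peak) for lw in log_w]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Resample members with replacement in proportion to their weights.
    resampled = rng.choices(members, weights=weights, k=len(members))
    return resampled, weights


rng = random.Random(0)
# Hypothetical hub-height wind speeds in m/s for a 200-member ensemble.
wind_speeds = [rng.gauss(8.0, 2.0) for _ in range(200)]
updated, weights = sir_update(wind_speeds, observation=9.5, obs_error_std=0.5, rng=rng)
print(sum(updated) / len(updated))
```

Resampling alone duplicates well-fitting members and so shrinks the spread; in a forecasting system the stochastic model perturbations applied during the next forecast step would be what restores it, and that part is not shown here.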
Quantum Support Vector Machine for Big Data Classification
NASA Astrophysics Data System (ADS)
Rebentrost, Patrick; Mohseni, Masoud; Lloyd, Seth
2014-09-01
Supervised machine learning is the classification of new data based on already classified training examples. In this work, we show that the support vector machine, an optimized binary classifier, can be implemented on a quantum computer, with complexity logarithmic in the size of the vectors and the number of training examples. In cases where classical sampling algorithms require polynomial time, an exponential speedup is obtained. At the core of this quantum big data algorithm is a nonsparse matrix exponentiation technique for efficiently performing a matrix inversion of the training data inner-product (kernel) matrix.
Extraction and classification of 3D objects from volumetric CT data
NASA Astrophysics Data System (ADS)
Song, Samuel M.; Kwon, Junghyun; Ely, Austin; Enyeart, John; Johnson, Chad; Lee, Jongkyu; Kim, Namho; Boyd, Douglas P.
2016-05-01
We propose an Automatic Threat Detection (ATD) algorithm for Explosive Detection Systems (EDS) using our multi-stage Segmentation and Carving (SC) followed by a Support Vector Machine (SVM) classifier. The multi-stage SC step extracts all suspect 3-D objects. A feature vector is then constructed for each extracted object and classified by the SVM, previously learned using a set of ground truth threat and benign objects. The learned SVM classifier has been shown to be effective in classification of different types of threat materials. The proposed ATD algorithm robustly deals with CT data that are prone to artifacts due to scatter, beam hardening as well as other systematic idiosyncrasies of the CT data. Furthermore, the proposed ATD algorithm is amenable to including newly emerging threat materials as well as to accommodating data from newly developing sensor technologies. Efficacy of the proposed ATD algorithm with the SVM classifier is demonstrated by the Receiver Operating Characteristics (ROC) curve, which gives Probability of Detection (PD) as a function of Probability of False Alarm (PFA). The tests performed using CT data of passenger bags show excellent performance characteristics.
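An ROC curve of the kind mentioned here can be traced by sweeping a decision threshold over classifier scores and recording PD against PFA at each threshold. The sketch below is illustrative only: the function names, scores, and labels are made up, with label 1 standing for a threat object and 0 for a benign one.

```python
def roc_points(scores, labels):
    """Return (PFA, PD) pairs obtained by sweeping a threshold over the scores."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one threat and one benign object")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        detections = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((false_alarms / negatives, detections / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # hypothetical SVM decision scores
labels = [1, 1, 0, 1, 0, 0]              # 1 = threat, 0 = benign
curve = roc_points(scores, labels)
print(curve, area_under_curve(curve))
```

For this toy data the area under the curve is 8/9, which equals the fraction of threat/benign pairs in which the threat object receives the higher score.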
Statistical properties of a cloud ensemble - A numerical study
NASA Technical Reports Server (NTRS)
Tao, Wei-Kuo; Simpson, Joanne; Soong, Su-Tzai
1987-01-01
The statistical properties of cloud ensembles under a specified large-scale environment, such as mass flux by cloud drafts and vertical velocity as well as the condensation and evaporation associated with these cloud drafts, are examined using a three-dimensional numerical cloud ensemble model described by Soong and Ogura (1980) and Tao and Soong (1986). The cloud drafts are classified as active and inactive, and separate contributions to cloud statistics in areas of different cloud activity are then evaluated. The model results compare well with results obtained from aircraft measurements of a well-organized ITCZ rainband that occurred on August 12, 1974, during the Global Atmospheric Research Program's Atlantic Tropical Experiment.
A Proposed Methodology to Classify Frontier Capital Markets
2011-07-31
out of charity, but because it is the surest route to our common good.” -Inaugural Speech by President Barack Obama, Jan 2009 This project...identification, and machine learning. The algorithm consists of a unique binary classifier mechanism that combines three methods: k-Nearest Neighbors ( kNN ...Support Through kNN Ensemble Classification Techniques E. Capital Market Classification Based on Capital Flows and Trading Architecture F
Human action classification using procrustes shape theory
NASA Astrophysics Data System (ADS)
Cho, Wanhyun; Kim, Sangkyoon; Park, Soonyoung; Lee, Myungeun
2015-02-01
In this paper, we propose a new method that classifies human actions using Procrustes shape theory. First, we extract a pre-shape configuration vector of landmarks from each frame of an image sequence representing an arbitrary human action, and then derive the Procrustes fit vector for the pre-shape configuration vector. Second, we extract a set of pre-shape vectors from training samples stored in a database, and we compute a Procrustes mean shape vector for these pre-shape vectors. Third, we extract a sequence of pre-shape vectors from the input video, and we project this sequence onto the tangent space with respect to the pole, taken as the sequence of mean shape vectors corresponding to a target video. We then calculate the Procrustes distance between the sequence of projected pre-shape vectors on the tangent space and the sequence of mean shape vectors. Finally, we classify the input video into the human action class with the minimum Procrustes distance. We assess the performance of the proposed method using one public dataset, the Weizmann human action dataset. Experimental results reveal that the proposed method performs very well on this dataset.
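The core quantities in this pipeline can be sketched for planar landmarks represented as complex numbers. The example below is a simplified illustration, not the authors' method: it computes the pre-shape (centred, unit-size configuration) and the full Procrustes distance for a single frame and omits the tangent-space projection and the sequence handling. The function names and the toy shapes are assumptions.

```python
import cmath
import math


def pre_shape(landmarks):
    """Centre 2-D landmarks (as complex numbers) and scale them to unit size."""
    centroid = sum(landmarks) / len(landmarks)
    centred = [z - centroid for z in landmarks]
    size = math.sqrt(sum(abs(z) ** 2 for z in centred))
    if size == 0:
        raise ValueError("all landmarks coincide")
    return [z / size for z in centred]


def procrustes_distance(shape_a, shape_b):
    """Full Procrustes distance between two planar landmark configurations."""
    a, b = pre_shape(shape_a), pre_shape(shape_b)
    inner = sum(x.conjugate() * y for x, y in zip(a, b))
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - abs(inner) ** 2))


def classify(query, class_means):
    """Assign the query to the class whose mean shape is nearest (hypothetical helper)."""
    return min(class_means, key=lambda name: procrustes_distance(query, class_means[name]))


triangle = [0 + 0j, 1 + 0j, 0 + 1j]
flat = [0 + 0j, 1 + 0j, 2 + 0.1j]
# The same triangle after scaling, rotation, and translation.
moved = [2 * cmath.exp(0.7j) * z + (3 - 1j) for z in triangle]
print(procrustes_distance(triangle, moved), classify(moved, {"triangle": triangle, "flat": flat}))
```

Because the distance ignores translation, scale, and rotation, the moved triangle is at (numerically) zero distance from the original, which is the invariance that makes the minimum-distance rule usable across differently framed videos.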
Encoding of Spatial Attention by Primate Prefrontal Cortex Neuronal Ensembles
Treue, Stefan
2018-01-01
Single neurons in the primate lateral prefrontal cortex (LPFC) encode information about the allocation of visual attention and the features of visual stimuli. However, how this compares to the performance of neuronal ensembles at encoding the same information is poorly understood. Here, we recorded the responses of neuronal ensembles in the LPFC of two macaque monkeys while they performed a task that required attending to one of two moving random dot patterns positioned in different hemifields and ignoring the other pattern. We found single units selective for the location of the attended stimulus as well as for its motion direction. To determine the coding of both variables in the population of recorded units, we used a linear classifier and progressively built neuronal ensembles by iteratively adding units according to their individual performance (best single units), or by iteratively adding units based on their contribution to the ensemble performance (best ensemble). For both methods, ensembles of relatively small sizes (n < 60) yielded substantially higher decoding performance relative to individual single units. However, the decoder reached similar performance using fewer neurons with the best ensemble building method compared with the best single units method. Our results indicate that neuronal ensembles within the LPFC encode more information about the attended spatial and nonspatial features of visual stimuli than individual neurons. They further suggest that efficient coding of attention can be achieved by relatively small neuronal ensembles characterized by a certain relationship between signal and noise correlation structures. PMID:29568798
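The two ensemble-building procedures contrasted in this abstract can be sketched as two selection loops over a scoring function. This is a toy illustration under assumptions: the `score` function below counts the trials covered by at least one unit and merely stands in for the cross-validated accuracy of a linear classifier, and the unit names and trial sets are invented.

```python
def build_by_single_rank(units, score, size):
    """'Best single units': rank units by individual score and take the top ones."""
    ranked = sorted(units, key=lambda u: score([u]), reverse=True)
    return ranked[:size]


def build_by_ensemble_gain(units, score, size):
    """'Best ensemble': greedily add the unit that most improves the ensemble score."""
    chosen = []
    remaining = list(units)
    while remaining and len(chosen) < size:
        best = max(remaining, key=lambda u: score(chosen + [u]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy data: the trials (out of 10) that each hypothetical unit decodes correctly.
unit_hits = {"u1": {0, 1, 2, 3, 4, 5}, "u2": {0, 1, 2, 3, 4}, "u3": {6, 7, 8}, "u4": {9}}


def score(subset):
    """Stand-in for decoder accuracy: fraction of trials covered by any unit."""
    return len(set().union(*(unit_hits[u] for u in subset))) / 10


by_rank = build_by_single_rank(list(unit_hits), score, 2)
by_gain = build_by_ensemble_gain(list(unit_hits), score, 2)
print(by_rank, score(by_rank), by_gain, score(by_gain))
```

In the toy data the two individually best units are largely redundant, so the greedy ensemble reaches a higher score with the same number of units, mirroring the qualitative finding that the best ensemble method needs fewer neurons.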
Gait recognition based on Gabor wavelets and modified gait energy image for human identification
NASA Astrophysics Data System (ADS)
Huang, Deng-Yuan; Lin, Ta-Wei; Hu, Wu-Chih; Cheng, Chih-Hsiang
2013-10-01
This paper proposes a method for recognizing human identity using gait features based on Gabor wavelets and modified gait energy images (GEIs). Identity recognition by gait generally involves gait representation, extraction, and classification. In this work, a modified GEI convolved with an ensemble of Gabor wavelets is proposed as a gait feature. Principal component analysis is then used to project the Gabor-wavelet-based gait features into a lower-dimension feature space for subsequent classification. Finally, support vector machine classifiers based on a radial basis function kernel are trained and utilized to recognize human identity. The major contributions of this paper are as follows: (1) the consideration of the shadow effect to yield a more complete segmentation of gait silhouettes; (2) the utilization of motion estimation to track people when walkers overlap; and (3) the derivation of modified GEIs to extract more useful gait information. Extensive performance evaluation shows a great improvement of recognition accuracy due to the use of shadow removal, motion estimation, and gait representation using the modified GEIs and Gabor wavelets.
Ma, Tao; Wang, Fen; Cheng, Jianjun; Yu, Yang; Chen, Xiaoyun
2016-01-01
The development of intrusion detection systems (IDS) that are adapted to allow routers and network defence systems to detect malicious network traffic disguised as network protocols or normal access is a critical challenge. This paper proposes a novel approach called SCDNN, which combines spectral clustering (SC) and deep neural network (DNN) algorithms. First, the dataset is divided into k subsets based on sample similarity using cluster centres, as in SC. Next, the distance between data points in a testing set and the training set is measured based on similarity features and is fed into the deep neural network algorithm for intrusion detection. Six KDD-Cup99 and NSL-KDD datasets and a sensor network dataset were employed to test the performance of the model. These experimental results indicate that the SCDNN classifier not only performs better than backpropagation neural network (BPNN), support vector machine (SVM), random forest (RF) and Bayes tree models in detection accuracy and the types of abnormal attacks found, but also provides an effective tool for the study and analysis of intrusion detection in large networks. PMID:27754380
Classifying Imbalanced Data Streams via Dynamic Feature Group Weighting with Importance Sampling.
Wu, Ke; Edwards, Andrea; Fan, Wei; Gao, Jing; Zhang, Kun
2014-04-01
Data stream classification and imbalanced data learning are two important areas of data mining research. Each has been well studied to date with many interesting algorithms developed. However, only a few approaches reported in the literature address the intersection of these two fields due to their complex interplay. In this work, we propose an importance sampling driven, dynamic feature group weighting framework (DFGW-IS) for classifying data streams of imbalanced distribution. Two components are tightly incorporated into the proposed approach to address the intrinsic characteristics of concept-drifting, imbalanced streaming data. Specifically, the ever-evolving concepts are tackled by a weighted ensemble trained on a set of feature groups with each sub-classifier (i.e. a single classifier or an ensemble) weighted by its discriminative power and stable level. The uneven class distribution, on the other hand, is typically handled by the sub-classifier built in a specific feature group with the underlying distribution rebalanced by the importance sampling technique. We derive the theoretical upper bound for the generalization error of the proposed algorithm. We also study the empirical performance of our method on a set of benchmark synthetic and real world data, and significant improvement has been achieved over the competing algorithms in terms of standard evaluation metrics and parallel running time. Algorithm implementations and datasets are available upon request.
A hybrid MLP-CNN classifier for very fine resolution remotely sensed image classification
NASA Astrophysics Data System (ADS)
Zhang, Ce; Pan, Xin; Li, Huapeng; Gardiner, Andy; Sargent, Isabel; Hare, Jonathon; Atkinson, Peter M.
2018-06-01
The contextual-based convolutional neural network (CNN) with deep architecture and pixel-based multilayer perceptron (MLP) with shallow structure are well-recognized neural network algorithms, representing the state-of-the-art deep learning method and the classical non-parametric machine learning approach, respectively. The two algorithms, which have very different behaviours, were integrated in a concise and effective way using a rule-based decision fusion approach for the classification of very fine spatial resolution (VFSR) remotely sensed imagery. The decision fusion rules, designed primarily based on the classification confidence of the CNN, reflect the generally complementary patterns of the individual classifiers. In consequence, the proposed ensemble classifier MLP-CNN harvests the complementary results acquired from the CNN based on deep spatial feature representation and from the MLP based on spectral discrimination. Meanwhile, limitations of the CNN due to the adoption of convolutional filters such as the uncertainty in object boundary partition and loss of useful fine spatial resolution detail were compensated. The effectiveness of the ensemble MLP-CNN classifier was tested in both urban and rural areas using aerial photography together with an additional satellite sensor dataset. The MLP-CNN classifier achieved promising performance, consistently outperforming the pixel-based MLP, spectral and textural-based MLP, and the contextual-based CNN in terms of classification accuracy. This research paves the way to effectively address the complicated problem of VFSR image classification.
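A confidence-driven fusion rule of the general kind described here can be sketched in a few lines. This is a deliberately simplified illustration and not the paper's rule set: the single threshold, its value of 0.8, the function name `fuse`, and the class probabilities are all assumptions. The CNN label is trusted where the CNN is confident, and the MLP label is used elsewhere.

```python
def fuse(cnn_probs, mlp_probs, confidence_threshold=0.8):
    """Pick the CNN label when the CNN is confident, otherwise the MLP label."""
    cnn_label = max(cnn_probs, key=cnn_probs.get)
    if cnn_probs[cnn_label] >= confidence_threshold:
        return cnn_label
    # Low CNN confidence (for example near object boundaries): defer to the MLP.
    return max(mlp_probs, key=mlp_probs.get)


# Hypothetical per-pixel class probabilities from the two classifiers.
interior_pixel = fuse({"building": 0.95, "road": 0.05}, {"building": 0.40, "road": 0.60})
boundary_pixel = fuse({"building": 0.55, "road": 0.45}, {"building": 0.30, "road": 0.70})
print(interior_pixel, boundary_pixel)
```

The design intent is that each classifier is used where it tends to be reliable, which is what "complementary" means in this context.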
NASA Astrophysics Data System (ADS)
Re, Matteo; Valentini, Giorgio
2012-03-01
Ensemble methods are statistical and computational learning procedures reminiscent of the human social learning behavior of seeking several opinions before making any crucial decision. The idea of combining the opinions of different “experts” to obtain an overall “ensemble” decision is rooted in our culture at least from the classical age of ancient Greece, and it was formalized during the Enlightenment with the Condorcet Jury Theorem [45], which proved that the judgment of a committee is superior to those of individuals, provided the individuals have reasonable competence. Ensembles are sets of learning machines that combine in some way their decisions, or their learning algorithms, or different views of data, or other specific characteristics to obtain more reliable and more accurate predictions in supervised and unsupervised learning problems [48,116]. A simple example is represented by the majority vote ensemble, by which the decisions of different learning machines are combined, and the class that receives the majority of “votes” (i.e., the class predicted by the majority of the learning machines) is the class predicted by the overall ensemble [158]. In the literature, a plethora of terms other than ensembles has been used, such as fusion, combination, aggregation, and committee, to indicate sets of learning machines that work together to solve a machine learning problem [19,40,56,66,99,108,123], but in this chapter we maintain the term ensemble in its widest meaning, in order to include the whole range of combination methods. Nowadays, ensemble methods represent one of the main current research lines in machine learning [48,116], and the interest of the research community in ensemble methods is witnessed by conferences and workshops specifically devoted to ensembles, first of all the multiple classifier systems (MCS) conference organized by Roli, Kittler, Windeatt, and other researchers of this area [14,62,85,149,173].
Several theories have been proposed to explain the characteristics and the successful application of ensembles to different application domains. For instance, Allwein, Schapire, and Singer interpreted the improved generalization capabilities of ensembles of learning machines in the framework of large margin classifiers [4,177], Kleinberg in the context of stochastic discrimination theory [112], and Breiman and Friedman in the light of the bias-variance analysis borrowed from classical statistics [21,70]. Empirical studies showed that both in classification and regression problems, ensembles improve on single learning machines, and moreover large experimental studies compared the effectiveness of different ensemble methods on benchmark data sets [10,11,49,188]. The interest in this research area is motivated also by the availability of very fast computers and networks of workstations at a relatively low cost that allow the implementation and the experimentation of complex ensemble methods using off-the-shelf computer platforms. However, as explained in Section 26.2 there are deeper reasons to use ensembles of learning machines, motivated by the intrinsic characteristics of the ensemble methods. The main aim of this chapter is to introduce ensemble methods and to provide an overview and a bibliography of the main areas of research, without pretending to be exhaustive or to explain the detailed characteristics of each ensemble method. The paper is organized as follows. In the next section, the main theoretical and practical reasons for combining multiple learners are introduced. Section 26.3 depicts the main taxonomies on ensemble methods proposed in the literature. In Section 26.4 and 26.5, we present an overview of the main supervised ensemble methods reported in the literature, adopting a simple taxonomy, originally proposed in Ref. [201]. 
Applications of ensemble methods are only marginally considered, but a specific section on some relevant applications of ensemble methods in astronomy and astrophysics has been added (Section 26.6). The conclusion (Section 26.7) ends this paper and lists some issues not covered in this work.
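The majority vote ensemble mentioned in the abstract above can be sketched in a few lines. This is an illustrative sketch only: the function name `majority_vote` is hypothetical, the rule implemented is plurality voting (the most frequent label wins even without an absolute majority), and the tie-breaking rule (first label in input order) is an assumption rather than something the chapter prescribes.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the class labels predicted by several learners for one sample.

    predictions: list of labels, one per base learner.
    Ties are broken in favour of the label that appears first in the input
    (an assumption made for this sketch).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label
```

For example, three base learners voting `["crisis", "calm", "crisis"]` yield `"crisis"` as the ensemble decision.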
Vidotti, Vanessa G; Costa, Vital P; Silva, Fabrício R; Resende, Graziela M; Cremasco, Fernanda; Dias, Marcelo; Gomi, Edson S
2012-06-15
Purpose. To investigate the sensitivity and specificity of machine learning classifiers (MLC) and spectral domain optical coherence tomography (SD-OCT) for the diagnosis of glaucoma. Methods. Sixty-two patients with early to moderate glaucomatous visual field damage and 48 healthy individuals were included. All subjects underwent a complete ophthalmologic examination, achromatic standard automated perimetry, and RNFL imaging with SD-OCT (Cirrus HD-OCT; Carl Zeiss Meditec, Inc., Dublin, California, USA). Receiver operating characteristic (ROC) curves were obtained for all SD-OCT parameters. Subsequently, the following MLCs were tested: Classification Tree (CTREE), Random Forest (RAN), Bagging (BAG), AdaBoost M1 (ADA), Ensemble Selection (ENS), Multilayer Perceptron (MLP), Radial Basis Function (RBF), Naive-Bayes (NB), and Support Vector Machine (SVM). Areas under the ROC curves (aROCs) obtained for each parameter and each MLC were compared. Results. The mean age was 57.0±9.2 years for healthy individuals and 59.9±9.0 years for glaucoma patients (p=0.103). Mean deviation values were -4.1±2.4 dB for glaucoma patients and -1.5±1.6 dB for healthy individuals (p<0.001). The SD-OCT parameters with the greater aROCs were inferior quadrant (0.813), average thickness (0.807), 7 o'clock position (0.765), and 6 o'clock position (0.754). The aROCs from classifiers varied from 0.785 (ADA) to 0.818 (BAG). The aROC obtained with BAG was not significantly different from the aROC obtained with the best single SD-OCT parameter (p=0.93). Conclusions. The SD-OCT showed good diagnostic accuracy in a group of patients with early glaucoma. In this series, MLCs did not improve the sensitivity and specificity of SD-OCT for the diagnosis of glaucoma.
A time-frequency classifier for human gait recognition
NASA Astrophysics Data System (ADS)
Mobasseri, Bijan G.; Amin, Moeness G.
2009-05-01
Radar has established itself as an effective all-weather, day or night sensor. Radar signals can penetrate walls and provide information on moving targets. Recently, radar has been used as an effective biometric sensor for classification of gait. The return from a coherent radar system contains a frequency offset in the carrier frequency, known as the Doppler effect. The movements of arms and legs give rise to micro-Doppler, which can be clearly detailed in the time-frequency domain using traditional or modern time-frequency signal representations. In this paper we propose a gait classifier based on subspace learning using principal component analysis (PCA). The training set consists of feature vectors defined as either time or frequency snapshots taken from the spectrogram of radar backscatter. We show that the gait signature is captured effectively in the feature vectors. Feature vectors are then used in training a minimum distance classifier based on the Mahalanobis distance metric. Results show that gait classification with high accuracy and a short observation window is achievable using the proposed classifier.
Skoura, Angeliki; Bakic, Predrag R; Megalooikonomou, Vasilis
2013-01-01
The analysis of anatomical tree-shape structures visualized in medical images provides insight into the relationship between tree topology and pathology of the corresponding organs. In this paper, we propose three methods to extract descriptive features of the branching topology; the asymmetry index, the encoding of branching patterns using a node labeling scheme and an extension of the Sholl analysis. Based on these descriptors, we present classification schemes for tree topologies with respect to the underlying pathology. Moreover, we present a classifier ensemble approach which combines the predictions of the individual classifiers to optimize the classification accuracy. We applied the proposed methodology to a dataset of x-ray galactograms, medical images which visualize the breast ductal tree, in order to recognize images with radiological findings regarding breast cancer. The experimental results demonstrate the effectiveness of the proposed framework compared to state-of-the-art techniques suggesting that the proposed descriptors provide more valuable information regarding the topological patterns of ductal trees and indicating the potential of facilitating early breast cancer diagnosis.
Skoura, Angeliki; Bakic, Predrag R.; Megalooikonomou, Vasilis
2014-01-01
The analysis of anatomical tree-shape structures visualized in medical images provides insight into the relationship between tree topology and pathology of the corresponding organs. In this paper, we propose three methods to extract descriptive features of the branching topology; the asymmetry index, the encoding of branching patterns using a node labeling scheme and an extension of the Sholl analysis. Based on these descriptors, we present classification schemes for tree topologies with respect to the underlying pathology. Moreover, we present a classifier ensemble approach which combines the predictions of the individual classifiers to optimize the classification accuracy. We applied the proposed methodology to a dataset of x-ray galactograms, medical images which visualize the breast ductal tree, in order to recognize images with radiological findings regarding breast cancer. The experimental results demonstrate the effectiveness of the proposed framework compared to state-of-the-art techniques suggesting that the proposed descriptors provide more valuable information regarding the topological patterns of ductal trees and indicating the potential of facilitating early breast cancer diagnosis. PMID:25414850
Akbar, Shahid; Hayat, Maqsood; Iqbal, Muhammad; Jan, Mian Ahmad
2017-06-01
Cancer is a fatal disease, responsible for one-quarter of all deaths in developed countries. Traditional anticancer therapies, such as chemotherapy and radiation, are highly expensive, susceptible to errors and ineffective. These conventional techniques induce severe side-effects on human cells. Owing to the perilous impact of cancer, the development of an accurate and highly efficient intelligent computational model is desirable for the identification of anticancer peptides. In this paper, an evolutionary intelligent genetic algorithm-based ensemble model, 'iACP-GAEnsC', is proposed for the identification of anticancer peptides. In this model, the protein sequences are formulated using three different discrete feature representation methods, i.e., amphiphilic pseudo amino acid composition, g-gap dipeptide composition, and reduced amino acid alphabet composition. The performance of the extracted feature spaces is investigated separately, and the feature spaces are then merged to exhibit the significance of hybridization. In addition, the predicted results of individual classifiers are combined using an optimized genetic algorithm and a simple majority technique in order to enhance the true classification rate. It is observed that genetic algorithm-based ensemble classification outperforms individual classifiers as well as the simple majority voting-based ensemble. The best performance of genetic algorithm-based ensemble classification is obtained on the hybrid feature space, with an accuracy of 96.45%. In comparison to the existing techniques, the 'iACP-GAEnsC' model has achieved remarkable improvement in terms of various performance metrics. Based on the simulation results, it is observed that the 'iACP-GAEnsC' model might be a leading tool in the field of drug design and proteomics for researchers. Copyright © 2017 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Lo, Joseph Y.; Gavrielides, Marios A.; Markey, Mia K.; Jesneck, Jonathan L.
2003-05-01
We developed an ensemble classifier for the task of computer-aided diagnosis of breast microcalcification clusters, which are very challenging to characterize for radiologists and computer models alike. The purpose of this study is to help radiologists identify whether suspicious calcification clusters are benign or malignant, such that they may potentially recommend fewer unnecessary biopsies for actually benign lesions. The data consist of mammographic features extracted by automated image processing algorithms as well as features manually interpreted by radiologists according to a standardized lexicon. We used 292 cases from a publicly available mammography database. From each case, we extracted 22 image processing features pertaining to lesion morphology, 5 radiologist features also pertaining to morphology, and the patient age. Linear discriminant analysis (LDA) models were designed using each of the three data types. Each local model performed poorly; the best was one based upon image processing features, which yielded an ROC area index AZ of 0.59 +/- 0.03 and a partial AZ above 90% sensitivity of 0.08 +/- 0.03. We then developed ensemble models using different combinations of those data types, and these models all improved performance compared to the local models. The final ensemble model was based upon 5 features selected by stepwise LDA from all 28 available features. This ensemble performed with an AZ of 0.69 +/- 0.03 and a partial AZ of 0.21 +/- 0.04, which was statistically significantly better than the model based on the image processing features alone (p<0.001 and p=0.01 for full and partial AZ, respectively). This demonstrated the value of the radiologist-extracted features as a source of information for this task. It also suggested there is potential for improved performance using this ensemble classifier approach to combine different sources of currently available data.
Generalized multiple kernel learning with data-dependent priors.
Mao, Qi; Tsang, Ivor W; Gao, Shenghua; Wang, Li
2015-06-01
Multiple kernel learning (MKL) and classifier ensemble are two mainstream methods for solving learning problems in which some sets of features/views are more informative than others, or the features/views within a given set are inconsistent. In this paper, we first present a novel probabilistic interpretation of MKL such that maximum entropy discrimination with a noninformative prior over multiple views is equivalent to the formulation of MKL. Instead of using the noninformative prior, we introduce a novel data-dependent prior based on an ensemble of kernel predictors, which enhances the prediction performance of MKL by leveraging the merits of the classifier ensemble. With the proposed probabilistic framework of MKL, we propose a hierarchical Bayesian model to learn the proposed data-dependent prior and classification model simultaneously. The resultant problem is convex and other information (e.g., instances with either missing views or missing labels) can be seamlessly incorporated into the data-dependent priors. Furthermore, a variety of existing MKL models can be recovered under the proposed MKL framework and can be readily extended to incorporate these priors. Extensive experiments demonstrate the benefits of our proposed framework in supervised and semisupervised settings, as well as in tasks with partial correspondence among multiple views.
Filgueiras, Paulo R; Terra, Luciana A; Castro, Eustáquio V R; Oliveira, Lize M S L; Dias, Júlio C M; Poppi, Ronei J
2015-09-01
This paper aims to estimate the temperature equivalent to 10% (T10%), 50% (T50%) and 90% (T90%) of distilled volume in crude oils using ¹H NMR and support vector regression (SVR). Confidence intervals for the predicted values were calculated using a boosting-type ensemble method in a procedure called ensemble support vector regression (eSVR). The estimated confidence intervals obtained by eSVR were compared with previously accepted calculations from partial least squares (PLS) models and a boosting-type ensemble applied in the PLS method (ePLS). By using the proposed boosting strategy, it was possible to identify outliers in the T10% property dataset. The eSVR procedure improved the accuracy of the distillation temperature predictions in relation to standard PLS, ePLS and SVR. For T10%, a root mean square error of prediction (RMSEP) of 11.6°C was obtained in comparison with 15.6°C for PLS, 15.1°C for ePLS and 28.4°C for SVR. The RMSEPs for T50% were 24.2°C, 23.4°C, 22.8°C and 14.4°C for PLS, ePLS, SVR and eSVR, respectively. For T90%, the values of RMSEP were 39.0°C, 39.9°C and 39.9°C for PLS, ePLS, SVR and eSVR, respectively. The confidence intervals calculated by the proposed boosting methodology presented acceptable values for the three properties analyzed; however, they were lower than those calculated by the standard methodology for PLS. Copyright © 2015 Elsevier B.V. All rights reserved.
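The RMSEP figures quoted in the abstract above follow the usual definition: the square root of the mean squared difference between reference and predicted values over the prediction set. A minimal sketch of that computation is given below. The function name `rmsep` is hypothetical, and the temperatures in the usage note are made-up illustrative values, not data from the paper.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over a prediction (test) set.

    y_true: reference values (e.g. measured distillation temperatures, in °C).
    y_pred: model predictions for the same samples, in the same units.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / len(y_true))
```

As an illustration, `rmsep([100.0, 200.0, 300.0], [101.0, 201.0, 301.0])` gives 1.0, since every prediction is off by exactly one degree.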
NASA Astrophysics Data System (ADS)
Hollenberg, Sebastian; Päs, Heinrich
2012-01-01
The standard wave function approach for the treatment of neutrino oscillations fails in situations where quantum ensembles at a finite temperature with or without an interacting background plasma are encountered. As a first step to treat such phenomena in a novel way, we propose a unified approach to both adiabatic and nonadiabatic two-flavor oscillations in neutrino ensembles with finite temperature and generic (e.g., matter) potentials. Neglecting effects of ensemble decoherence for now, we study the evolution of a neutrino ensemble governed by the associated quantum kinetic equations, which apply to systems with finite temperature. The quantum kinetic equations are solved formally using the Magnus expansion and it is shown that a convenient choice of the quantum mechanical picture (e.g., the interaction picture) reveals suitable parameters to characterize the physics of the underlying system (e.g., an effective oscillation length). It is understood that this method also provides a promising starting point for the treatment of the more general case in which decoherence is taken into account.
2012-01-01
Background Existing methods for predicting protein solubility on overexpression in Escherichia coli advance performance by using ensemble classifiers such as two-stage support vector machine (SVM) based classifiers and a number of feature types such as physicochemical properties, amino acid and dipeptide composition, accompanied with feature selection. It is desirable to develop a simple and easily interpretable method for predicting protein solubility, compared to existing complex SVM-based methods. Results This study proposes a novel scoring card method (SCM) by using dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistic discrimination between soluble and insoluble proteins of a training data set. Consequently, the propensity scores of all dipeptides are further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons, four data sets with different sizes and variation degrees of experimental conditions were used. The results show that the simple method SCM with interpretable propensities of dipeptides has promising performance, compared with existing SVM-based ensemble methods with a number of feature types. Furthermore, the propensities of dipeptides and solubility scores of sequences can provide insights to protein solubility. For example, the analysis of dipeptide scores shows high propensity of α-helix structure and thermophilic proteins to be soluble. Conclusions The propensities of individual dipeptides to be soluble are varied for proteins under altered experimental conditions. 
To predict protein solubility accurately using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role. Availability The datasets used, source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/. PMID:23282103
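The abstract describes the solubility score of a sequence as a weighted sum of dipeptide propensity scores and dipeptide composition. The sketch below illustrates that idea only. It is not the authors' implementation: the function names are hypothetical, the propensity values in the usage note are invented, and giving unseen dipeptides a score of 0 is an assumption. The genetic-algorithm optimization of the scores and the decision threshold are not shown.

```python
from collections import Counter


def dipeptide_composition(sequence):
    """Fraction of each overlapping dipeptide in an amino acid sequence."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    if not pairs:
        raise ValueError("sequence must contain at least two residues")
    counts = Counter(pairs)
    return {dipeptide: n / len(pairs) for dipeptide, n in counts.items()}


def solubility_score(sequence, propensity):
    """Weighted sum of propensity scores, weighted by dipeptide composition.

    propensity: dict mapping dipeptide -> propensity score (hypothetical
    values here; missing dipeptides default to 0.0, an assumption).
    """
    composition = dipeptide_composition(sequence)
    return sum(propensity.get(dp, 0.0) * frac for dp, frac in composition.items())
```

For instance, with the invented score card `{"AA": 1.0, "AK": 0.5}`, the sequence `"AAK"` contains the dipeptides AA and AK in equal proportion, so its score is 0.5 × 1.0 + 0.5 × 0.5 = 0.75.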
Wahba, Maram A; Ashour, Amira S; Napoleon, Sameh A; Abd Elnaby, Mustafa M; Guo, Yanhui
2017-12-01
Basal cell carcinoma is one of the most common malignant skin lesions. Automated lesion identification and classification using image processing techniques is highly required to reduce the diagnosis errors. In this study, a novel technique is applied to classify skin lesion images into two classes, namely the malignant Basal cell carcinoma and the benign nevus. A hybrid combination of bi-dimensional empirical mode decomposition and gray-level difference method features is proposed after hair removal. The combined features are further classified using quadratic support vector machine (Q-SVM). The proposed system has achieved outstanding performance of 100% accuracy, sensitivity and specificity compared to other support vector machine procedures as well as with different extracted features. Basal Cell Carcinoma is effectively classified using Q-SVM with the proposed combined features.
Simulating Open Quantum Systems with Hamiltonian Ensembles and the Nonclassicality of the Dynamics
NASA Astrophysics Data System (ADS)
Chen, Hong-Bin; Gneiting, Clemens; Lo, Ping-Yuan; Chen, Yueh-Nan; Nori, Franco
2018-01-01
The incoherent dynamical properties of open quantum systems are generically attributed to an ongoing correlation between the system and its environment. Here, we propose a novel way to assess the nature of these system-environment correlations by examining the system dynamics alone. Our approach is based on the possibility or impossibility to simulate open-system dynamics with Hamiltonian ensembles. As we show, such (im)possibility to simulate is closely linked to the system-environment correlations. We thus define the nonclassicality of open-system dynamics in terms of the nonexistence of a Hamiltonian-ensemble simulation. This classifies any nonunital open-system dynamics as nonclassical. We give examples for open-system dynamics that are unital and classical, as well as unital and nonclassical.
NASA Astrophysics Data System (ADS)
Dheeba, J.; Jaya, T.; Singh, N. Albert
2017-09-01
Classification of cancerous masses is a challenging task in many computerised detection systems. Cancerous masses are difficult to detect because these masses are obscured and subtle in mammograms. This paper investigates an intelligent classifier, the fuzzy support vector machine (FSVM), applied to classify the tissues containing masses on mammograms for breast cancer diagnosis. The algorithm utilises texture features extracted using Laws texture energy measures and an FSVM to classify the suspicious masses. The new FSVM treats every feature as both normal and abnormal samples, but with different memberships. In this way, the new FSVM has more generalisation ability to classify the masses in mammograms. The classifier analysed 219 clinical mammograms collected from a breast cancer screening laboratory. The tests made on the real clinical mammograms show that the proposed detection system has better discriminating power than the conventional support vector machine. With the best combination of FSVM and Laws texture features, the area under the receiver operating characteristic curve reached 0.95, which corresponds to a sensitivity of 93.27% with a specificity of 87.17%. The results suggest that detecting masses using FSVM contributes to computer-aided detection of breast cancer and can serve as a decision support system for radiologists.
NASA Astrophysics Data System (ADS)
Lazri, Mourad; Ameur, Soltane
2018-05-01
A model combining three classifiers, namely support vector machine, artificial neural network and random forest (SAR), is designed to improve the classification of convective and stratiform rain. This model (the SAR model) has been trained and then tested on a dataset derived from MSG-SEVIRI (Meteosat Second Generation-Spinning Enhanced Visible and Infrared Imager). Well-classified, mid-classified and misclassified pixels are determined from the combination of the three classifiers. Mid-classified and misclassified pixels, which are considered unreliable, are reclassified by retraining the developed scheme. In this retraining, only the input data corresponding to the pixels in question are used. This whole process is repeated a second time and applied to mid-classified and misclassified pixels separately. Learning and validation of the developed scheme are carried out against co-located data observed by ground radar. The developed scheme outperformed the individual classifiers used separately and reached an overall classification accuracy of 97.40%.
2013-01-01
Background Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case–control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification but each has limitations. We provide an alternative technique to address population stratification. Results We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual’s continental and sub-continental ancestry. To predict an individual’s continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control’s λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). 
In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver. Conclusions ETHNOPRED is a novel technique for producing classifiers that can identify an individual’s continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large scale GWASs, and robust to missing values. PMID:23432980
Machine Learning Techniques for Global Sensitivity Analysis in Climate Models
NASA Astrophysics Data System (ADS)
Safta, C.; Sargsyan, K.; Ricciuto, D. M.
2017-12-01
Climate model studies are challenged not only by the compute-intensive nature of these models but also by the high dimensionality of the input parameter space. In our previous work with the land model components (Sargsyan et al., 2014) we identified subsets of 10 to 20 parameters relevant for each QoI via Bayesian compressive sensing and variance-based decomposition. Nevertheless, the algorithms were challenged by the nonlinear input-output dependencies for some of the relevant QoIs. In this work we will explore a combination of techniques to extract relevant parameters for each QoI and subsequently construct surrogate models with quantified uncertainty necessary for future developments, e.g. model calibration and prediction studies. In the first step, we will compare the skill of machine-learning models (e.g. neural networks, support vector machines) to identify the optimal number of classes in selected QoIs and construct robust multi-class classifiers that will partition the parameter space into regions with smooth input-output dependencies. These classifiers will be coupled with techniques aimed at building sparse and/or low-rank surrogate models tailored to each class. Specifically, we will explore and compare sparse learning techniques with low-rank tensor decompositions. These models will be used to identify parameters that are important for each QoI. Surrogate accuracy requirements are higher for subsequent model calibration studies, and we will ascertain the performance of this workflow for multi-site ALM simulation ensembles.
Local Subspace Classifier with Transform-Invariance for Image Classification
NASA Astrophysics Data System (ADS)
Hotta, Seiji
A family of linear subspace classifiers called the local subspace classifier (LSC) outperforms the k-nearest neighbor rule (kNN) and conventional subspace classifiers in handwritten digit classification. However, LSC suffers from very high sensitivity to image transformations because it uses projection and Euclidean distances for classification. In this paper, I present a combination of a local subspace classifier (LSC) and a tangent distance (TD) for improving the accuracy of handwritten digit recognition. With this classification rule, transform-invariance can be handled easily because tangent vectors can be used to approximate transformations. However, tangent vectors cannot be used for other types of images, such as color images. Hence, kernel LSC (KLSC) is proposed for incorporating transform-invariance into LSC via kernel mapping. The performance of the proposed methods is verified with experiments on handwritten digit and color image classification.
NASA Astrophysics Data System (ADS)
Zhang, Yanjiao; Lai, Xiaoping; Zeng, Qiuyao; Li, Linfang; Lin, Lin; Li, Shaoxin; Liu, Zhiming; Su, Chengkang; Qi, Minni; Guo, Zhouyi
2018-03-01
This study aims to classify low-grade and high-grade bladder cancer (BC) patients using serum surface-enhanced Raman scattering (SERS) spectra and support vector machine (SVM) algorithms. Serum SERS spectra are acquired from 88 serum samples with silver nanoparticles as the SERS-active substrate. Diagnostic accuracies of 96.4% and 95.4% are obtained when differentiating the serum SERS spectra of all BC patients versus normal subjects and low-grade versus high-grade BC patients, respectively, with optimal SVM classifier models. This study demonstrates that the serum SERS technique combined with SVM has great potential to noninvasively detect and classify high-grade and low-grade BC patients.
A support vector machine approach for classification of welding defects from ultrasonic signals
NASA Astrophysics Data System (ADS)
Chen, Yuan; Ma, Hong-Wei; Zhang, Guang-Ming
2014-07-01
Defect classification is an important issue in ultrasonic non-destructive evaluation. A layered multi-class support vector machine (LMSVM) classification system, which combines multiple SVM classifiers through a layered architecture, is proposed in this paper. The proposed LMSVM classification system is applied to the classification of welding defects from ultrasonic test signals. The measured ultrasonic defect echo signals are first decomposed into wavelet coefficients by the wavelet packet transform. The energies of the wavelet coefficients in different frequency channels are used to construct the feature vectors. The bees algorithm (BA) is then used for feature selection and SVM parameter optimisation for the LMSVM classification system. The BA-based feature selection optimises the energy feature vectors. The optimised feature vectors are input to the LMSVM classification system for training and testing. Experimental results of classifying welding defects demonstrate that the proposed technique is highly robust, precise and reliable for ultrasonic defect classification.
Fuzzy support vector machine: an efficient rule-based classification technique for microarrays.
Hajiloo, Mohsen; Rabiee, Hamid R; Anooshahpour, Mahdi
2013-01-01
The abundance of gene expression microarray data has led to the development of machine learning algorithms applicable for tackling disease diagnosis, disease prognosis, and treatment selection problems. However, these algorithms often produce classifiers with weaknesses in terms of accuracy, robustness, and interpretability. This paper introduces the fuzzy support vector machine, a learning algorithm based on a combination of fuzzy classifiers and kernel machines for microarray classification. Experimental results on public leukemia, prostate, and colon cancer datasets show that the fuzzy support vector machine, applied in combination with filter or wrapper feature selection methods, develops a robust model with higher accuracy than conventional microarray classification models such as support vector machine, artificial neural network, decision trees, k nearest neighbors, and diagonal linear discriminant analysis. Furthermore, the interpretable rule-base inferred from the fuzzy support vector machine helps extract biological knowledge from microarray data. The fuzzy support vector machine, as a new classification model with high generalization power, robustness, and good interpretability, seems to be a promising tool for gene expression microarray classification.
Bulashevska, Alla; Eils, Roland
2006-06-14
The subcellular location of a protein is closely related to its function. It would be worthwhile to develop a method to predict the subcellular location for a given protein when only the amino acid sequence of the protein is known. Although many efforts have been made to predict subcellular location from sequence information only, there is a need for further research to improve the accuracy of prediction. A novel method called HensBC is introduced to predict protein subcellular location. HensBC is a recursive algorithm which constructs a hierarchical ensemble of classifiers. The classifiers used are Bayesian classifiers based on Markov chain models. We tested our method on six different datasets, among them a Gram-negative bacteria dataset, data for discriminating outer membrane proteins and an apoptosis proteins dataset. We observed that our method can predict the subcellular location with high accuracy. Another advantage of the proposed method is that it can improve the accuracy of the prediction of some classes with few sequences in training and is therefore useful for datasets with imbalanced distribution of classes. This study introduces an algorithm which uses only the primary sequence of a protein to predict its subcellular location. The proposed recursive scheme represents an interesting methodology for learning and combining classifiers. The method is computationally efficient and competitive with the previously reported approaches in terms of prediction accuracies as empirical results indicate. The code for the software is available upon request.
Feature selection for the classification of traced neurons.
López-Cabrera, José D; Lorenzo-Ginori, Juan V
2018-06-01
The great availability of computational tools to calculate the properties of traced neurons leads to the existence of many descriptors which allow the automated classification of neurons from these reconstructions. This situation makes it necessary to eliminate irrelevant features and to select the most appropriate among them, in order to improve the quality of the classification obtained. The dataset used contains a total of 318 traced neurons, classified by human experts into 192 GABAergic interneurons and 126 pyramidal cells. The features were extracted by means of the L-measure software, which is one of the most used computational tools in neuroinformatics to quantify traced neurons. We review some current feature selection techniques such as filter, wrapper, embedded and ensemble methods. The stability of the feature selection methods was measured. For the ensemble methods, several aggregation methods based on different metrics were applied to combine the subsets obtained during the feature selection process. The subsets obtained applying feature selection methods were evaluated using supervised classifiers, among which Random Forest, C4.5, SVM, Naïve Bayes, Knn, Decision Table and the Logistic classifier were used as classification algorithms. Feature selection methods of types filter, embedded, wrappers and ensembles were compared and the subsets returned were tested in classification tasks for different classification algorithms. L-measure features EucDistanceSD, PathDistanceSD, Branch_pathlengthAve, Branch_pathlengthSD and EucDistanceAve were present in more than 60% of the selected subsets, which provides evidence about their importance in the classification of these neurons. Copyright © 2018 Elsevier B.V. All rights reserved.
Ensemble Kalman filter inference of spatially-varying Manning's n coefficients in the coastal ocean
NASA Astrophysics Data System (ADS)
Siripatana, Adil; Mayo, Talea; Knio, Omar; Dawson, Clint; Maître, Olivier Le; Hoteit, Ibrahim
2018-07-01
Ensemble Kalman filtering (EnKF) is an established framework for large scale state estimation problems. EnKFs can also be used for state-parameter estimation, using the so-called "Joint-EnKF" approach. The idea is simply to augment the state vector with the parameters to be estimated and assign invariant dynamics for the time evolution of the parameters. In this contribution, we investigate the efficiency of the Joint-EnKF for estimating spatially-varying Manning's n coefficients used to define the bottom roughness in the Shallow Water Equations (SWEs) of a coastal ocean model. Observation System Simulation Experiments (OSSEs) are conducted using the ADvanced CIRCulation (ADCIRC) model, which solves a modified form of the Shallow Water Equations. A deterministic EnKF, the Singular Evolutive Interpolated Kalman (SEIK) filter, is used to estimate a vector of Manning's n coefficients defined at the model nodal points by assimilating synthetic water elevation data. It is found that with a reasonable ensemble size (O(10)), the filter's estimate converges to the reference Manning's field. To enhance performance, we have further reduced the dimension of the parameter search space through a Karhunen-Loève (KL) expansion. We have also iterated on the filter update step to better account for the nonlinearity of the parameter estimation problem. We study the sensitivity of the system to the ensemble size, localization scale, dimension of retained KL modes, and number of iterations. The performance of the proposed framework in terms of estimation accuracy suggests that a well-tuned Joint-EnKF provides a promising robust approach to infer spatially varying seabed roughness parameters in the context of coastal ocean modeling.
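The state-augmentation idea behind the Joint-EnKF can be sketched for a toy problem with one scalar state and one scalar parameter. This is only an illustration under stated assumptions: it uses the perturbed-observation form of the EnKF update rather than the SEIK filter of the study, it observes the state directly, and all function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Unbiased sample covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def forecast(ensemble, model):
    """Advance each (state, parameter) member by one model step.

    The parameter is carried forward unchanged, i.e. it is given
    invariant dynamics as in the Joint-EnKF formulation.
    """
    return [(model(state, param), param) for state, param in ensemble]

def joint_enkf_update(ensemble, member_obs, obs_var):
    """Update the augmented vector (state, parameter) from state observations.

    member_obs holds one (possibly perturbed) observation per member.
    The parameter is corrected only through its sample covariance
    with the observed state.
    """
    states = [s for s, _ in ensemble]
    params = [p for _, p in ensemble]
    denom = covariance(states, states) + obs_var
    gain_state = covariance(states, states) / denom
    gain_param = covariance(params, states) / denom
    return [
        (s + gain_state * (y - s), p + gain_param * (y - s))
        for (s, p), y in zip(ensemble, member_obs)
    ]

# Toy ensemble in which the state is exactly twice the parameter.
prior = [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0)]
posterior = joint_enkf_update(prior, [5.0, 5.0, 5.0], obs_var=0.0)
```

In this toy case the observation is treated as exact, so every member's state is pulled onto the observed value and the parameter follows through the state-parameter covariance. A realistic application would add observation noise, a vector state, and localization.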
Discrimination of malignant lymphomas and leukemia using Radon transform based-higher order spectra
NASA Astrophysics Data System (ADS)
Luo, Yi; Celenk, Mehmet; Bejai, Prashanth
2006-03-01
A new algorithm that can be used to automatically recognize and classify malignant lymphomas and leukemia is proposed in this paper. The algorithm utilizes morphological watersheds to obtain boundaries of cells from cell images and isolate them from the surrounding background. The areas of cells are extracted from cell images after background subtraction. The Radon transform and higher-order spectra (HOS) analysis are utilized as an image processing tool to generate class feature vectors of different cell types and to extract the feature vectors of testing cells. The testing cells' feature vectors are then compared with the known class feature vectors for a possible match by computing the Euclidean distances. The cell in question is classified as belonging to the existing cell class with the least Euclidean distance.
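The final matching step, assigning a test vector to the class whose reference vector is nearest in Euclidean distance, can be sketched as follows. The class labels and feature values are made up for illustration, and the Radon/HOS feature extraction is not shown.

```python
import math

def nearest_class(test_vector, class_vectors):
    """Return the label whose reference vector is closest to test_vector.

    class_vectors maps a class label to one reference feature vector.
    """
    return min(
        class_vectors,
        key=lambda label: math.dist(test_vector, class_vectors[label]),
    )

# Hypothetical two-dimensional reference vectors for two cell classes.
references = {"class_a": [0.0, 0.0], "class_b": [10.0, 10.0]}
label = nearest_class([1.0, 2.0], references)
```

`math.dist` requires Python 3.8 or later; with several reference vectors per class, the same rule can be applied to class means or to the nearest individual template.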
NIMEFI: gene regulatory network inference using multiple ensemble feature importance algorithms.
Ruyssinck, Joeri; Huynh-Thu, Vân Anh; Geurts, Pierre; Dhaene, Tom; Demeester, Piet; Saeys, Yvan
2014-01-01
One of the long-standing open challenges in computational systems biology is the topology inference of gene regulatory networks from high-throughput omics data. Recently, two community-wide efforts, DREAM4 and DREAM5, have been established to benchmark network inference techniques using gene expression measurements. In these challenges the overall top performer was the GENIE3 algorithm. This method decomposes the network inference task into separate regression problems for each gene in the network in which the expression values of a particular target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. The contribution of this work is twofold. First, we generalize the regression decomposition strategy of GENIE3 to other feature importance methods. We compare the performance of support vector regression, the elastic net, random forest regression, symbolic regression and their ensemble variants in this setting to the original GENIE3 algorithm. To create the ensemble variants, we propose a subsampling approach which allows us to cast any feature selection algorithm that produces a feature ranking into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance. As a second contribution, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better. An implementation of NIMEFI has been made publicly available.
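Rank-wise averaging of several methods' predictions can be sketched as below. This is a minimal illustration, not the published NIMEFI implementation: the gene names, scores and function names are hypothetical, and ties are broken arbitrarily.

```python
def ranks(scores):
    """Map each candidate link to its rank (1 = highest importance score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {link: position for position, link in enumerate(ordered, start=1)}

def rank_average(score_tables):
    """Combine several score tables by averaging the rank of each link.

    Every table must score the same set of candidate links. Links are
    returned from most to least supported (lowest average rank first).
    """
    rank_tables = [ranks(table) for table in score_tables]
    links = list(rank_tables[0])
    averaged = {
        link: sum(table[link] for table in rank_tables) / len(rank_tables)
        for link in links
    }
    return sorted(averaged, key=averaged.get)

# Hypothetical importance scores for three regulator -> target links.
method_a = {("g1", "g2"): 0.9, ("g1", "g3"): 0.2, ("g2", "g3"): 0.5}
method_b = {("g1", "g2"): 0.4, ("g1", "g3"): 0.1, ("g2", "g3"): 0.8}
method_c = {("g1", "g2"): 0.7, ("g1", "g3"): 0.6, ("g2", "g3"): 0.3}
combined = rank_average([method_a, method_b, method_c])
```

Working on ranks rather than raw scores means the methods do not need to produce importance values on a common scale.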
Klement, William; Wilk, Szymon; Michalowski, Wojtek; Farion, Ken J; Osmond, Martin H; Verter, Vedat
2012-03-01
Using an automatic data-driven approach, this paper develops a prediction model that achieves more balanced performance (in terms of sensitivity and specificity) than the Canadian Assessment of Tomography for Childhood Head Injury (CATCH) rule, when predicting the need for computed tomography (CT) imaging of children after a minor head injury. CT is widely considered an effective tool for evaluating patients with minor head trauma who have potentially suffered serious intracranial injury. However, its use poses possible harmful effects, particularly for children, due to exposure to radiation. Safety concerns, along with issues of cost and practice variability, have led to calls for the development of effective methods to decide when CT imaging is needed. Clinical decision rules represent such methods and are normally derived from the analysis of large prospectively collected patient data sets. The CATCH rule was created by a group of Canadian pediatric emergency physicians to support the decision of referring children with minor head injury to CT imaging. The goal of the CATCH rule was to maximize the sensitivity of predictions of potential intracranial lesion while keeping specificity at a reasonable level. After extensive analysis of the CATCH data set, characterized by severe class imbalance, and after a thorough evaluation of several data mining methods, we derived an ensemble of multiple Naive Bayes classifiers as the prediction model for CT imaging decisions. In the first phase of the experiment we compared the proposed ensemble model to other ensemble models employing rule-, tree- and instance-based member classifiers. Our prediction model demonstrated the best performance in terms of AUC, G-mean and sensitivity measures. In the second phase, using a bootstrapping experiment similar to that reported by the CATCH investigators, we showed that the proposed ensemble model achieved a more balanced predictive performance than the CATCH rule with an average sensitivity of 82.8% and an average specificity of 74.4% (vs. 98.1% and 50.0% for the CATCH rule respectively). Automatically derived prediction models cannot replace a physician's acumen. However, they help establish reference performance indicators for the purpose of developing clinical decision rules so the trade-off between prediction sensitivity and specificity is better understood. Copyright © 2011 Elsevier B.V. All rights reserved.
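The notion of balanced performance can be made concrete with the G-mean, commonly defined as the geometric mean of sensitivity and specificity. The sketch below applies that definition to the average figures quoted in the abstract; this is an illustration only, since a G-mean computed from averaged rates need not equal the average of per-sample G-means reported in the study.

```python
import math

def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (both in [0, 1])."""
    return math.sqrt(sensitivity * specificity)

def rates_from_counts(tp, fn, tn, fp):
    """Sensitivity and specificity from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)

# Average rates quoted in the abstract above.
ensemble_score = g_mean(0.828, 0.744)   # roughly 0.78
catch_score = g_mean(0.981, 0.500)      # roughly 0.70
```

The G-mean penalizes a large gap between the two rates, which is why a rule with very high sensitivity but low specificity can score lower than a more balanced model.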
Amin, Morteza Moradi; Kermani, Saeed; Talebi, Ardeshir; Oghli, Mostafa Ghelich
2015-01-01
Acute lymphoblastic leukemia (ALL) is the most common form of pediatric cancer. It is categorized into three sub-types, L1, L2, and L3, and can be detected by pathologists through screening of blood and bone marrow smears. Because this procedure is time-consuming and tedious, a computer-based system is required for convenient detection of ALL. Microscopic images are acquired from blood and bone marrow smears of patients with ALL and of normal cases. After image preprocessing, cell nuclei are segmented by the k-means algorithm. Geometric and statistical features are then extracted from the nuclei, and the cells are classified as cancerous or noncancerous by a support vector machine classifier with 10-fold cross validation. The cells are also classified into their sub-types by a multi-support vector machine classifier. The classifier is evaluated by sensitivity, specificity, and accuracy, whose values for cancerous versus noncancerous cells are 98%, 95%, and 97%, respectively. The same parameters are used to evaluate sub-type classification, with mean values of 84.3%, 97.3%, and 95.6%, respectively. The results show that the proposed algorithm achieves acceptable performance for the diagnosis of ALL and its sub-types and can be used as an assistant diagnostic tool for pathologists.
The NRL relocatable ocean/acoustic ensemble forecast system
NASA Astrophysics Data System (ADS)
Rowley, C.; Martin, P.; Cummings, J.; Jacobs, G.; Coelho, E.; Bishop, C.; Hong, X.; Peggion, G.; Fabre, J.
2009-04-01
A globally relocatable regional ocean nowcast/forecast system has been developed to support rapid implementation of new regional forecast domains. The system is in operational use at the Naval Oceanographic Office for a growing number of regional and coastal implementations. The new system is the basis for an ocean acoustic ensemble forecast and adaptive sampling capability. We present an overview of the forecast system and the ocean ensemble and adaptive sampling methods. The forecast system consists of core ocean data analysis and forecast modules, software for domain configuration, surface and boundary condition forcing processing, and job control, and global databases for ocean climatology, bathymetry, tides, and river locations and transports. The analysis component is the Navy Coupled Ocean Data Assimilation (NCODA) system, a 3D multivariate optimum interpolation system that produces simultaneous analyses of temperature, salinity, geopotential, and vector velocity using remotely-sensed SST, SSH, and sea ice concentration, plus in situ observations of temperature, salinity, and currents from ships, buoys, XBTs, CTDs, profiling floats, and autonomous gliders. The forecast component is the Navy Coastal Ocean Model (NCOM). The system supports one-way nesting and multiple assimilation methods. The ensemble system uses the ensemble transform technique with error variance estimates from the NCODA analysis to represent initial condition error. Perturbed surface forcing or an atmospheric ensemble is used to represent errors in surface forcing. The ensemble transform Kalman filter is used to assess the impact of adaptive observations on future analysis and forecast uncertainty for both ocean and acoustic properties.
Xia, Wenjun; Mita, Yoshio; Shibata, Tadashi
2016-05-01
Aiming at efficient data condensation and improving accuracy, this paper presents a hardware-friendly template reduction (TR) method for the nearest neighbor (NN) classifiers by introducing the concept of critical boundary vectors. A hardware system is also implemented to demonstrate the feasibility of using a field-programmable gate array (FPGA) to accelerate the proposed method. Initially, k-means centers are used as substitutes for the entire template set. Then, to enhance the classification performance, critical boundary vectors are selected by a novel learning algorithm, which is completed within a single iteration. Moreover, to remove noisy boundary vectors that can mislead the classification in a generalized manner, a global categorization scheme has been explored and applied to the algorithm. The global characterization automatically categorizes each classification problem and rapidly selects the boundary vectors according to the nature of the problem. Finally, only critical boundary vectors and k-means centers are used as the new template set for classification. Experimental results for 24 data sets show that the proposed algorithm can effectively reduce the number of template vectors for classification with a high learning speed. At the same time, it improves the accuracy by an average of 2.17% compared with the traditional NN classifiers and also shows greater accuracy than seven other TR methods. We have shown the feasibility of using a proof-of-concept FPGA system of 256 64-D vectors to accelerate the proposed method on hardware. At a 50-MHz clock frequency, the proposed system achieves a 3.86 times higher learning speed than on a 3.4-GHz PC, while consuming only 1% of the power of that used by the PC.
Pourhoseingholi, Mohamad Amin; Kheirian, Sedigheh; Zali, Mohammad Reza
2017-12-01
Colorectal cancer (CRC) is one of the most common malignancies and causes of cancer mortality worldwide. Given the importance of predicting the survival of CRC patients and the growing use of data mining methods, this study aims to compare the performance of models for predicting 5-year survival of CRC patients using a variety of basic and ensemble data mining methods. The CRC dataset from The Shahid Beheshti University of Medical Sciences Research Center for Gastroenterology and Liver Diseases was used for prediction and for a comparative study of the base and ensemble data mining techniques. Feature selection methods were used to select predictor attributes for classification. The WEKA toolkit and MedCalc software were utilized for creating and comparing the models, respectively. The results showed that the predictive performance of the developed models was high overall (all greater than 90%). The performance of the ensemble models was higher than that of the basic classifiers, and the best result was achieved by the ensemble voting model in terms of area under the ROC curve (AUC = 0.96). Comparison of AUCs showed that the ensemble voting method significantly outperformed all models except Random Forest (RF) and Bayesian Network (BN), whose 95% confidence intervals overlapped with it. This result may indicate the high predictive power of these two methods, along with ensemble voting, for predicting 5-year survival of CRC patients.
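An ensemble voting model combines the outputs of several base classifiers into one prediction. The sketch below shows plain majority voting, which is one common combination rule and an assumption here, since the abstract does not state which rule was configured in WEKA. The base classifiers, thresholds and patient fields are hypothetical stand-ins.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among the base predictions."""
    return Counter(labels).most_common(1)[0][0]

def ensemble_predict(classifiers, sample):
    """Ask every base classifier for a label and combine them by voting."""
    return majority_vote([classify(sample) for classify in classifiers])

# Three hypothetical rule-like base classifiers (thresholds are made up).
base_models = [
    lambda patient: "survived" if patient["stage"] <= 2 else "died",
    lambda patient: "survived" if patient["age"] < 70 else "died",
    lambda patient: "survived" if patient["grade"] == "low" else "died",
]
prediction = ensemble_predict(
    base_models, {"stage": 2, "age": 75, "grade": "low"}
)
```

Using an odd number of base classifiers for a two-class problem avoids tied votes; averaging predicted probabilities is a frequent alternative when the base models output scores.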
Modeling task-specific neuronal ensembles improves decoding of grasp
NASA Astrophysics Data System (ADS)
Smith, Ryan J.; Soares, Alcimar B.; Rouse, Adam G.; Schieber, Marc H.; Thakor, Nitish V.
2018-06-01
Objective. Dexterous movement involves the activation and coordination of networks of neuronal populations across multiple cortical regions. Attempts to model firing of individual neurons commonly treat the firing rate as directly modulating with motor behavior. However, motor behavior may additionally be associated with modulations in the activity and functional connectivity of neurons in a broader ensemble. Accounting for variations in neural ensemble connectivity may provide additional information about the behavior being performed. Approach. In this study, we examined neural ensemble activity in primary motor cortex (M1) and premotor cortex (PM) of two male rhesus monkeys during performance of a center-out reach, grasp and manipulate task. We constructed point process encoding models of neuronal firing that incorporated task-specific variations in the baseline firing rate as well as variations in functional connectivity with the neural ensemble. Models were evaluated both in terms of their encoding capabilities and their ability to properly classify the grasp being performed. Main results. Task-specific ensemble models correctly predicted the performed grasp with over 95% accuracy and were shown to outperform models of neuronal activity that assume only a variable baseline firing rate. Task-specific ensemble models exhibited superior decoding performance in 82% of units in both monkeys (p < 0.01). Inclusion of ensemble activity also broadly improved the ability of models to describe observed spiking. Encoding performance of task-specific ensemble models, measured by spike timing predictability, improved upon baseline models in 62% of units. Significance. These results suggest that additional discriminative information about motor behavior found in the variations in functional connectivity of neuronal ensembles located in motor-related cortical regions is relevant to decode complex tasks such as grasping objects, and may serve as the basis for more reliable and accurate neural prostheses.
High Stimulus-Related Information in Barrel Cortex Inhibitory Interneurons
Reyes-Puerta, Vicente; Kim, Suam; Sun, Jyh-Jang; Imbrosci, Barbara; Kilb, Werner; Luhmann, Heiko J.
2015-01-01
The manner in which populations of inhibitory (INH) and excitatory (EXC) neocortical neurons collectively encode stimulus-related information is a fundamental, yet still unresolved question. Here we address this question by simultaneously recording with large-scale multi-electrode arrays (of up to 128 channels) the activity of cell ensembles (of up to 74 neurons) distributed along all layers of 3–4 neighboring cortical columns in the anesthetized adult rat somatosensory barrel cortex in vivo. Using two different whisker stimulus modalities (location and frequency) we show that individual INH neurons – classified as such according to their distinct extracellular spike waveforms – discriminate better between restricted sets of stimuli (≤6 stimulus classes) than EXC neurons in granular and infra-granular layers. We also demonstrate that ensembles of INH cells jointly provide as much information about such stimuli as comparable ensembles containing the ~20% most informative EXC neurons, however presenting less information redundancy – a result which was consistent when applying both theoretical information measurements and linear discriminant analysis classifiers. These results suggest that a consortium of INH neurons dominates the information conveyed to the neocortical network, thereby efficiently processing incoming sensory activity. This conclusion extends our view on the role of the inhibitory system to orchestrate cortical activity. PMID:26098109
A Novel Multi-Class Ensemble Model for Classifying Imbalanced Biomedical Datasets
NASA Astrophysics Data System (ADS)
Bikku, Thulasi; Sambasiva Rao, N., Dr; Rao, Akepogu Ananda, Dr
2017-08-01
This paper mainly focuses on developing a Hadoop-based framework for feature selection and classification models to classify high-dimensional data in heterogeneous biomedical databases. Extensive research has been performed in the fields of machine learning, big data and data mining for identifying patterns. The main challenge is extracting useful features generated from diverse biological systems. The proposed model can be used for predicting diseases in various applications and identifying the features relevant to particular diseases. With the exponential growth of biomedical repositories such as PubMed and Medline, an accurate predictive model is essential for knowledge discovery in a Hadoop environment. Extracting key features from unstructured documents often leads to uncertain results due to outliers and missing values. In this paper, we propose a two-phase map-reduce framework with a text preprocessor and a classification model. In the first phase, a mapper-based preprocessing method was designed to eliminate irrelevant features, missing values and outliers from the biomedical data. In the second phase, a Map-Reduce-based multi-class ensemble decision tree model was designed and implemented on the preprocessed mapper data to improve the true positive rate and computational time. The experimental results on complex biomedical datasets show that our proposed Hadoop-based multi-class ensemble model significantly outperforms state-of-the-art baselines.
Random-Forest Classification of High-Resolution Remote Sensing Images and nDSM Over Urban Areas
NASA Astrophysics Data System (ADS)
Sun, X. F.; Lin, X. G.
2017-09-01
As an intermediate step between raw remote sensing data and digital urban maps, remote sensing data classification has been a challenging and long-standing research problem in the community of remote sensing. In this work, an effective classification method is proposed for classifying high-resolution remote sensing data over urban areas. Starting from high resolution multi-spectral images and 3D geometry data, our method proceeds in three main stages: feature extraction, classification, and classified result refinement. First, we extract color, vegetation index and texture features from the multi-spectral image and compute the height, elevation texture and differential morphological profile (DMP) features from the 3D geometry data. Then in the classification stage, multiple random forest (RF) classifiers are trained separately, then combined to form a RF ensemble to estimate each sample's category probabilities. Finally the probabilities along with the feature importance indicator outputted by RF ensemble are used to construct a fully connected conditional random field (FCCRF) graph model, by which the classification results are refined through mean-field based statistical inference. Experiments on the ISPRS Semantic Labeling Contest dataset show that our proposed 3-stage method achieves 86.9% overall accuracy on the test data.
CATCh, an Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing Studies
Mysara, Mohamed; Saeys, Yvan; Leys, Natalie; Raes, Jeroen
2014-01-01
In ecological studies, microbial diversity is nowadays mostly assessed via the detection of phylogenetic marker genes, such as 16S rRNA. However, PCR amplification of these marker genes produces a significant amount of artificial sequences, often referred to as chimeras. Different algorithms have been developed to remove these chimeras, but efforts to combine different methodologies are limited. Therefore, two machine learning classifiers (reference-based and de novo CATCh) were developed by integrating the output of existing chimera detection tools into a new, more powerful method. When comparing our classifiers with existing tools in either the reference-based or de novo mode, a higher performance of our ensemble method was observed on a wide range of sequencing data, including simulated, 454 pyrosequencing, and Illumina MiSeq data sets. Since our algorithm combines the advantages of different individual chimera detection tools, our approach produces more robust results when challenged with chimeric sequences having a low parent divergence, short length of the chimeric range, and various numbers of parents. Additionally, it could be shown that integrating CATCh in the preprocessing pipeline has a beneficial effect on the quality of the clustering in operational taxonomic units. PMID:25527546
NASA Astrophysics Data System (ADS)
Saito, Kazuo; Hara, Masahiro; Kunii, Masaru; Seko, Hiromu; Yamaguchi, Munehiko
2011-05-01
Different initial perturbation methods for the mesoscale ensemble prediction were compared by the Meteorological Research Institute (MRI) as a part of the intercomparison of mesoscale ensemble prediction systems (EPSs) of the World Weather Research Programme (WWRP) Beijing 2008 Olympics Research and Development Project (B08RDP). Five initial perturbation methods for mesoscale ensemble prediction were developed for B08RDP and compared at MRI: (1) a downscaling method of the Japan Meteorological Agency (JMA)'s operational one-week EPS (WEP), (2) a targeted global model singular vector (GSV) method, (3) a mesoscale model singular vector (MSV) method based on the adjoint model of the JMA non-hydrostatic model (NHM), (4) a mesoscale breeding growing mode (MBD) method based on the NHM forecast and (5) a local ensemble transform (LET) method based on the local ensemble transform Kalman filter (LETKF) using NHM. These perturbation methods were applied to the preliminary experiments of the B08RDP Tier-1 mesoscale ensemble prediction with a horizontal resolution of 15 km. To make the comparison easier, the same horizontal resolution (40 km) was employed for the three mesoscale model-based initial perturbation methods (MSV, MBD and LET). The GSV method completely outperformed the WEP method, confirming the advantage of targeting in mesoscale EPS. The GSV method generally performed well with regard to root mean square errors of the ensemble mean, large growth rates of ensemble spreads throughout the 36-h forecast period, and high detection rates and high Brier skill scores (BSSs) for weak rains. On the other hand, the mesoscale model-based initial perturbation methods showed good detection rates and BSSs for intense rains. The MSV method showed a rapid growth in the ensemble spread of precipitation up to a forecast time of 6 h, which suggests suitability of the mesoscale SV for short-range EPSs, but the initial large growth of the perturbation did not last long. The performance of the MBD method was good for ensemble prediction of intense rain with a relatively small computing cost. The LET method showed similar characteristics to the MBD method, but the spread and growth rate were slightly smaller and the relative operating characteristic area skill score and BSS did not surpass those of MBD. These characteristic features of the five methods were confirmed by checking the evolution of the total energy norms and their growth rates. Characteristics of the initial perturbations obtained by four methods (GSV, MSV, MBD and LET) were examined for the case of a synoptic low-pressure system passing over eastern China. With GSV and MSV, the regions of large spread were near the low-pressure system, but with MSV, the distribution was more concentrated on the mesoscale disturbance. On the other hand, large-spread areas were observed southwest of the disturbance in MBD and LET. The horizontal pattern of LET perturbation was similar to that of MBD, but the amplitude of the LET perturbation reflected the observation density.
NASA Astrophysics Data System (ADS)
Monten, Ruben; Toldo, Chiara
2018-02-01
We present new AdS4 black hole solutions in N = 2 gauged supergravity coupled to vector and hypermultiplets. We focus on a particular consistent truncation of M-theory on the homogeneous Sasaki–Einstein seven-manifold M^{111}, characterized by the presence of one Betti vector multiplet. We numerically construct static and spherically symmetric black holes with electric and magnetic charges, corresponding to M2 and M5 branes wrapping non-contractible cycles of the internal manifold. The novel feature characterizing these nonzero temperature configurations is the presence of a massive vector field halo. Moreover, we verify the first law of black hole mechanics and we study the thermodynamics in the canonical ensemble. We analyze the behavior of the massive vector field condensate across the small-large black hole phase transition and we interpret the process in the dual field theory.
2002-01-01
their expression profile and for classification of cells into tumorous and non-tumorous classes. Then we will present a parallel tree method for... cancerous cells. We will use the same dataset and use tree structured classifiers with multi-resolution analysis for classifying cancerous from non-cancerous ...cells. We have the expressions of 4096 genes from 98 different cell types. Of these 98, 72 are cancerous while 26 are non-cancerous. We are interested
Invariant object recognition based on the generalized discrete radon transform
NASA Astrophysics Data System (ADS)
Easley, Glenn R.; Colonna, Flavia
2004-04-01
We introduce a method for classifying objects based on special cases of the generalized discrete Radon transform. We adjust the transform and the corresponding ridgelet transform by means of circular shifting and a singular value decomposition (SVD) to obtain a translation, rotation and scaling invariant set of feature vectors. We then use a back-propagation neural network to classify the input feature vectors. We conclude with experimental results and compare these with other invariant recognition methods.
Comprehensive benchmarking and ensemble approaches for metagenomic classifiers.
McIntyre, Alexa B R; Ounit, Rachid; Afshinnekoo, Ebrahim; Prill, Robert J; Hénaff, Elizabeth; Alexander, Noah; Minot, Samuel S; Danko, David; Foox, Jonathan; Ahsanuddin, Sofia; Tighe, Scott; Hasan, Nur A; Subramanian, Poorani; Moffat, Kelly; Levy, Shawn; Lonardi, Stefano; Greenfield, Nick; Colwell, Rita R; Rosen, Gail L; Mason, Christopher E
2017-09-21
One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited. In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages. This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
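The three mitigation strategies named in this abstract (abundance filtering, ensemble approaches, and tool intersection) can be illustrated with a minimal sketch. The function names, the threshold values, and the representation of a tool's output as a dictionary of read counts per taxon are assumptions made for illustration; they are not taken from the study.

```python
from collections import Counter


def abundance_filter(profile, min_fraction):
    """Keep taxa whose relative abundance is at least min_fraction.

    profile maps a taxon name to a read count (hypothetical format).
    """
    total = sum(profile.values())
    if total == 0:
        return set()
    return {taxon for taxon, n in profile.items() if n / total >= min_fraction}


def vote(calls, min_tools):
    """Keep taxa reported by at least min_tools tools.

    calls is a list of sets of taxa, one set per tool. Setting
    min_tools == len(calls) gives a strict intersection of all tools.
    """
    counts = Counter(taxon for tool_calls in calls for taxon in set(tool_calls))
    return {taxon for taxon, n in counts.items() if n >= min_tools}


def precision_recall(predicted, truth):
    """Precision and recall of a predicted taxon set against a known truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Raising `min_tools` or `min_fraction` trades recall for precision, which is consistent with the abstract's observation that these strategies reduce false positives but do not always eliminate them.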
NASA Astrophysics Data System (ADS)
Dementev, A. O.; Dmitriev, E. V.; Kozoderov, V. V.; Egorov, V. D.
2017-10-01
Hyperspectral imaging is a promising up-to-date technology widely applied for accurate thematic mapping. The presence of a large number of narrow survey channels allows us to use subtle differences in the spectral characteristics of objects and to make a more detailed classification than is possible with standard multispectral data. The difficulties encountered in the processing of hyperspectral images are usually associated with the redundancy of spectral information, which leads to the curse of dimensionality. Methods currently used for recognizing objects in multispectral and hyperspectral images are usually based on standard supervised classification algorithms of various complexity. The accuracy of these algorithms can differ significantly depending on the classification task considered. In this paper we study the performance of ensemble classification methods for the problem of classifying forest vegetation. Error correcting output codes and boosting are tested on artificial data and real hyperspectral images. It is demonstrated that boosting gives a more significant improvement when used with simple base classifiers. The accuracy in this case is comparable to that of the error correcting output code (ECOC) classifier with a Gaussian kernel SVM base algorithm. However, the necessity of boosting ECOC with Gaussian kernel SVM is questionable. It is demonstrated that the selected ensemble classifiers allow us to recognize forest species with sufficiently high accuracy, which can be compared with ground-based forest inventory data.
Alghamdi, Manal; Al-Mallah, Mouaz; Keteyian, Steven; Brawner, Clinton; Ehrman, Jonathan; Sakr, Sherif
2017-01-01
Machine learning is becoming a popular and important approach in the field of medical research. In this study, we investigate the relative performance of various machine learning methods such as Decision Tree, Naïve Bayes, Logistic Regression, Logistic Model Tree and Random Forests for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT project study used data of 32,555 patients who are free of any known coronary artery disease or heart failure who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009 and had a complete 5-year follow-up. At the completion of the fifth year, 5,099 of those patients have developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an Ensembling-based predictive model using 13 attributes that were selected based on their clinical importance, Multiple Linear Regression, and Information Gain Ranking methods. The negative effect of the imbalance class of the constructed model was handled by Synthetic Minority Oversampling Technique (SMOTE). The overall performance of the predictive model classifier was improved by the Ensemble machine learning approach using the Vote method with three Decision Trees (Naïve Bayes Tree, Random Forest, and Logistic Model Tree) and achieved high accuracy of prediction (AUC = 0.92). The study shows the potential of ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data.
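As a rough illustration of combining three base classifiers by voting, the sketch below applies a plain majority vote to per-model label predictions. The abstract does not state which combination rule the Vote method used, so majority voting is an assumption here, and the function name and input layout are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label predictions from several models by majority vote.

    predictions is a list with one entry per model; each entry is a list
    of predicted labels, and all entries have the same length. With an
    odd number of models and two classes there are no ties; otherwise a
    tie is resolved in favour of the label seen first for that sample.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined
```

For example, `majority_vote([[1, 0, 0], [1, 1, 0], [0, 1, 0]])` yields one combined label per patient, taking the label that at least two of the three models agree on.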
Ensemble transcript interaction networks: a case study on Alzheimer's disease.
Armañanzas, Rubén; Larrañaga, Pedro; Bielza, Concha
2012-10-01
Systems biology techniques are a topic of recent interest within the neurological field. Computational intelligence (CI) addresses this holistic perspective by means of consensus or ensemble techniques ultimately capable of uncovering new and relevant findings. In this paper, we propose the application of a CI approach based on ensemble Bayesian network classifiers and multivariate feature subset selection to induce probabilistic dependences that could match or unveil biological relationships. The research focuses on the analysis of high-throughput Alzheimer's disease (AD) transcript profiling. The analysis is conducted from two perspectives. First, we compare the expression profiles of hippocampus subregion entorhinal cortex (EC) samples of AD patients and controls. Second, we use the ensemble approach to study four types of samples: EC and dentate gyrus (DG) samples from both patients and controls. Results disclose transcript interaction networks with remarkable structures and genes not directly related to AD by previous studies. The ensemble is able to identify a variety of transcripts that play key roles in other neurological pathologies. Classical statistical assessment by means of non-parametric tests confirms the relevance of the majority of the transcripts. The ensemble approach pinpoints key metabolic mechanisms that could lead to new findings in the pathogenesis and development of AD. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
Coherence rephasing combined with spin-wave storage using chirped control pulses
NASA Astrophysics Data System (ADS)
Demeter, Gabor
2014-06-01
Photon-echo based optical quantum memory schemes often employ intermediate steps to transform optical coherences to spin coherences for longer storage times. We analyze a scheme that uses three identical chirped control pulses for coherence rephasing in an inhomogeneously broadened ensemble of three-level Λ systems. The pulses induce a cyclic permutation of the atomic populations in the adiabatic regime. Optical coherences created by a signal pulse are stored as spin coherences at an intermediate time interval, and are rephased for echo emission when the ensemble is returned to the initial state. Echo emission during a possible partial rephasing when the medium is inverted can be suppressed with an appropriate choice of control pulse wave vectors. We demonstrate that the scheme works in an optically dense ensemble, despite control pulse distortions during propagation. It integrates conveniently the spin-wave storage step into memory schemes based on a second rephasing of the atomic coherences.
Hsieh, Nan-Chen; Hung, Lun-Ping; Shih, Chun-Che; Keh, Huan-Chao; Chan, Chien-Hui
2012-06-01
Endovascular aneurysm repair (EVAR) is an advanced minimally invasive surgical technology that is helpful for reducing patients' recovery time, postoperative morbidity and mortality. This study proposes an ensemble model to predict postoperative morbidity after EVAR. The ensemble model was developed using a training set of consecutive patients who underwent EVAR between 2000 and 2009. All data required for prediction modeling, including patient demographics, preoperative, co-morbidities, and complication as outcome variables, was collected prospectively and entered into a clinical database. A discretization approach was used to categorize numerical values into informative feature space. Then, the Bayesian network (BN), artificial neural network (ANN), and support vector machine (SVM) were adopted as base models, and stacking combined multiple models. The research outcomes consisted of an ensemble model to predict postoperative morbidity after EVAR, the occurrence of postoperative complications prospectively recorded, and the causal effect knowledge by BNs with Markov blanket concept.
Ligare, Martin
2016-05-01
Multiple-pulse NMR experiments are a powerful tool for the investigation of molecules with coupled nuclear spins. The product operator formalism provides a way to understand the quantum evolution of an ensemble of weakly coupled spins in such experiments using some of the more intuitive concepts of classical physics and semi-classical vector representations. In this paper I present a new way in which to interpret the quantum evolution of an ensemble of spins. I recast the quantum problem in terms of mixtures of pure states of two spins whose expectation values evolve identically to those of classical moments. Pictorial representations of these classically evolving states provide a way to calculate the time evolution of ensembles of weakly coupled spins without the full machinery of quantum mechanics, offering insight to anyone who understands precession of magnetic moments in magnetic fields.
Proper projective symmetry in LRS Bianchi type V spacetimes
NASA Astrophysics Data System (ADS)
Shabbir, Ghulam; Mahomed, K. S.; Mahomed, F. M.; Moitsheki, R. J.
2018-04-01
In this paper, we investigate proper projective vector fields of locally rotationally symmetric (LRS) Bianchi type V spacetimes using direct integration and algebraic techniques. Despite the non-degeneracy in the Riemann tensor eigenvalues, we classify proper Bianchi type V spacetimes and show that the above spacetimes do not admit proper projective vector fields. Here, in all the cases projective vector fields are Killing vector fields.
Bayes Error Rate Estimation Using Classifier Ensembles
NASA Technical Reports Server (NTRS)
Tumer, Kagan; Ghosh, Joydeep
2003-01-01
The Bayes error rate gives a statistical lower bound on the error achievable for a given classification problem and the associated choice of features. By reliably estimating this rate, one can assess the usefulness of the feature set that is being used for classification. Moreover, by comparing the accuracy achieved by a given classifier with the Bayes rate, one can quantify how effective that classifier is. Classical approaches for estimating or finding bounds for the Bayes error, in general, yield rather weak results for small sample sizes, unless the problem has some simple characteristics, such as Gaussian class-conditional likelihoods. This article shows how the outputs of a classifier ensemble can be used to provide reliable and easily obtainable estimates of the Bayes error with negligible extra computation. Three methods of varying sophistication are described. First, we present a framework that estimates the Bayes error when multiple classifiers, each providing an estimate of the a posteriori class probabilities, are combined through averaging. Second, we bolster this approach by adding an information theoretic measure of output correlation to the estimate. Finally, we discuss a more general method that just looks at the class labels indicated by ensemble members and provides error estimates based on the disagreements among classifiers. The methods are illustrated for artificial data, a difficult four-class problem involving underwater acoustic data, and two problems from the Proben1 benchmarks. For data sets with known Bayes error, the combiner-based methods introduced in this article outperform existing methods. The estimates obtained by the proposed methods also seem quite reliable for the real-life data sets for which the true Bayes rates are unknown.
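To make the idea of averaging a posteriori class probabilities concrete, the sketch below averages the posterior vectors of several classifiers and then applies a generic plug-in estimate of the error, namely the mean of one minus the largest averaged posterior. This is a textbook-style illustration under the assumption that the averaged posteriors are well calibrated; it is not one of the three estimators described in the article, and all names are hypothetical.

```python
def average_posteriors(member_outputs):
    """Average class-posterior vectors across ensemble members.

    member_outputs[m][i] is classifier m's posterior vector for sample i.
    Returns one averaged posterior vector per sample.
    """
    n_members = len(member_outputs)
    averaged = []
    for per_sample in zip(*member_outputs):
        averaged.append([sum(p) / n_members for p in zip(*per_sample)])
    return averaged


def plug_in_error(posteriors):
    """Plug-in error estimate: mean over samples of 1 - max posterior.

    If the posteriors were exact, this would be the expected error of
    the rule that always picks the most probable class.
    """
    return sum(1.0 - max(p) for p in posteriors) / len(posteriors)
```

The estimate is only as good as the averaged posteriors; poorly calibrated members bias it, which is one reason a more careful treatment of member correlation, as in the article, is needed.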
Bouktif, Salah; Hanna, Eileen Marie; Zaki, Nazar; Abu Khousa, Eman
2014-01-01
Prediction and classification techniques have been well studied by machine learning researchers and developed for several real-world problems. However, the level of acceptance and success of prediction models is still below expectation due to some difficulties, such as the low performance of prediction models when they are applied in different environments. Such a problem has been addressed by many researchers, mainly from the machine learning community. A second problem, principally raised by model users in different communities, such as managers, economists, engineers, biologists, and medical practitioners, is the interpretability of prediction models. The latter is the ability of a model to explain its predictions and exhibit the causal relationships between the inputs and the outputs. In the case of classification, a successful way to alleviate the low performance is to use ensemble classifiers. It is an intuitive strategy to activate collaboration between different classifiers towards a better performance than that of an individual classifier. Unfortunately, ensemble classifier methods do not take into account the interpretability of the final classification outcome. They even worsen the original interpretability of the individual classifiers. In this paper we propose a novel implementation of a classifier combination approach that not only promotes the overall performance but also preserves the interpretability of the resulting model. We propose a solution based on Ant Colony Optimization and tailored for the case of Bayesian classifiers. We validate our proposed solution with case studies from the medical domain, namely heart disease and cardiotocography-based predictions, problems where interpretability is critical to make appropriate clinical decisions. The datasets, prediction models and software tool, together with supplementary materials, are available at http://faculty.uaeu.ac.ae/salahb/ACO4BC.htm.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Teramoto, Atsushi, E-mail: teramoto@fujita-hu.ac.jp; Fujita, Hiroshi; Yamamuro, Osamu
Purpose: Automated detection of solitary pulmonary nodules using positron emission tomography (PET) and computed tomography (CT) images shows good sensitivity; however, it is difficult to detect nodules in contact with normal organs, and additional efforts are needed so that the number of false positives (FPs) can be further reduced. In this paper, the authors propose an improved FP-reduction method for the detection of pulmonary nodules in PET/CT images by means of convolutional neural networks (CNNs). Methods: The overall scheme detects pulmonary nodules using both CT and PET images. In the CT images, a massive region is first detected using an active contour filter, which is a type of contrast enhancement filter that has a deformable kernel shape. Subsequently, high-uptake regions detected by the PET images are merged with the regions detected by the CT images. FP candidates are eliminated using an ensemble method; it consists of two feature extractions, one by shape/metabolic feature analysis and the other by a CNN, followed by a two-step classifier, one step being rule based and the other being based on support vector machines. Results: The authors evaluated the detection performance using 104 PET/CT images collected by a cancer-screening program. The sensitivity in detecting candidates at an initial stage was 97.2%, with 72.8 FPs/case. After performing the proposed FP-reduction method, the sensitivity of detection was 90.1%, with 4.9 FPs/case; the proposed method eliminated approximately half the FPs existing in the previous study. Conclusions: An improved FP-reduction scheme using CNN technique has been developed for the detection of pulmonary nodules in PET/CT images. The authors’ ensemble FP-reduction method eliminated 93% of the FPs; their proposed method using CNN technique eliminates approximately half the FPs existing in the previous study. These results indicate that their method may be useful in the computer-aided detection of pulmonary nodules using PET/CT images.
Lynch, Chip M; Abdollahi, Behnaz; Fuqua, Joshua D; de Carlo, Alexandra R; Bartholomai, James A; Balgemann, Rayeanne N; van Berkel, Victor H; Frieboes, Hermann B
2017-12-01
Outcomes for cancer patients have been previously estimated by applying various machine learning techniques to large datasets such as the Surveillance, Epidemiology, and End Results (SEER) program database. In particular for lung cancer, it is not well understood which types of techniques would yield more predictive information, and which data attributes should be used in order to determine this information. In this study, a number of supervised learning techniques are applied to the SEER database to classify lung cancer patients in terms of survival, including linear regression, Decision Trees, Gradient Boosting Machines (GBM), Support Vector Machines (SVM), and a custom ensemble. Key data attributes in applying these methods include tumor grade, tumor size, gender, age, stage, and number of primaries, with the goal to enable comparison of predictive power between the various methods. The prediction is treated like a continuous target, rather than a classification into categories, as a first step towards improving survival prediction. The results show that the predicted values agree with actual values for low to moderate survival times, which constitute the majority of the data. The best performing technique was the custom ensemble with a Root Mean Square Error (RMSE) value of 15.05. The most influential model within the custom ensemble was GBM, while Decision Trees may be inapplicable as it had too few discrete outputs. The results further show that among the five individual models generated, the most accurate was GBM with an RMSE value of 15.32. Although SVM underperformed with an RMSE value of 15.82, statistical analysis singles the SVM as the only model that generated a distinctive output. The results of the models are consistent with a classical Cox proportional hazards model used as a reference technique. 
We conclude that application of these supervised learning techniques to lung cancer data in the SEER database may be of use to estimate patient survival time with the ultimate goal to inform patient care decisions, and that the performance of these techniques with this particular dataset may be on par with that of classical methods. Copyright © 2017 Elsevier B.V. All rights reserved.
Recognizing human activities using appearance metric feature and kinematics feature
NASA Astrophysics Data System (ADS)
Qian, Huimin; Zhou, Jun; Lu, Xinbiao; Wu, Xinye
2017-05-01
The problem of automatically recognizing human activities from videos through the fusion of the two most important cues, appearance metric feature and kinematics feature, is considered. A system of two-dimensional (2-D) Poisson equations is introduced to extract a more discriminative appearance metric feature. Specifically, the moving human blobs are first detected from the video by a background subtraction technique to form a binary image sequence, from which the appearance feature designated as the motion accumulation image and the kinematics feature termed the centroid instantaneous velocity are extracted. Second, 2-D discrete Poisson equations are employed to reinterpret the motion accumulation image to produce a more differentiated Poisson silhouette image, from which the appearance feature vector is created through the dimension reduction technique called bidirectional 2-D principal component analysis, considering the balance between classification accuracy and time consumption. Finally, a cascaded classifier based on the nearest neighbor classifier and two directed acyclic graph support vector machine classifiers, integrated with the fusion of the appearance feature vector and centroid instantaneous velocity vector, is applied to recognize the human activities. Experimental results on the open databases and a homemade one confirm the recognition performance of the proposed algorithm.
Early sinkhole detection using a drone-based thermal camera and image processing
NASA Astrophysics Data System (ADS)
Lee, Eun Ju; Shin, Sang Young; Ko, Byoung Chul; Chang, Chunho
2016-09-01
Accurate advance detection of the sinkholes that are occurring more frequently now is an important way of preventing human fatalities and property damage. Unlike naturally occurring sinkholes, human-induced ones in urban areas are typically due to groundwater disturbances and leaks of water and sewage caused by large-scale construction. Although many sinkhole detection methods have been developed, it is still difficult to predict sinkholes that occur in depth areas. In addition, conventional methods are inappropriate for scanning a large area because of their high cost. Therefore, this paper uses a drone combined with a thermal far-infrared (FIR) camera to detect potential sinkholes over a large area based on computer vision and pattern classification techniques. To make a standard dataset, we dug eight holes of depths 0.5-2 m in increments of 0.5 m and with a maximum width of 1 m. We filmed these using the drone-based FIR camera at a height of 50 m. We first detect candidate regions by analysing cold spots in the thermal images based on the fact that a sinkhole typically has a lower thermal energy than its background. Then, these regions are classified into sinkhole and non-sinkhole classes using a pattern classifier. In this study, we ensemble the classification results based on a light convolutional neural network (CNN) and those based on a Boosted Random Forest (BRF) with handcrafted features. We apply the proposed ensemble method successfully to sinkhole data for various sizes and depths in different environments, and prove that the CNN ensemble and the BRF one with handcrafted features are better at detecting sinkholes than other classifiers or standalone CNN.
LiCABEDS II. Modeling of ligand selectivity for G-protein-coupled cannabinoid receptors.
Ma, Chao; Wang, Lirong; Yang, Peng; Myint, Kyaw Z; Xie, Xiang-Qun
2013-01-28
The cannabinoid receptor subtype 2 (CB2) is a promising therapeutic target for blood cancer, pain relief, osteoporosis, and immune system disease. The recent withdrawal of Rimonabant, which targets another closely related cannabinoid receptor (CB1), accentuates the importance of selectivity for the development of CB2 ligands in order to minimize their effects on the CB1 receptor. In our previous study, LiCABEDS (Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps) was reported as a generic ligand classification algorithm for the prediction of categorical molecular properties. Here, we report extension of the application of LiCABEDS to the modeling of cannabinoid ligand selectivity with molecular fingerprints as descriptors. The performance of LiCABEDS was systematically compared with another popular classification algorithm, support vector machine (SVM), according to prediction precision and recall rate. In addition, the examination of LiCABEDS models revealed the difference in structure diversity of CB1 and CB2 selective ligands. The structure determination from data mining could be useful for the design of novel cannabinoid lead compounds. More importantly, the potential of LiCABEDS was demonstrated through successful identification of newly synthesized CB2 selective compounds.
Dropout Prediction in E-Learning Courses through the Combination of Machine Learning Techniques
ERIC Educational Resources Information Center
Lykourentzou, Ioanna; Giannoukos, Ioannis; Nikolopoulos, Vassilis; Mpardis, George; Loumos, Vassili
2009-01-01
In this paper, a dropout prediction method for e-learning courses, based on three popular machine learning techniques and detailed student data, is proposed. The machine learning techniques used are feed-forward neural networks, support vector machines and probabilistic ensemble simplified fuzzy ARTMAP. Since a single technique may fail to…
Lahiri, A; Roy, Abhijit Guha; Sheet, Debdoot; Biswas, Prabir Kumar
2016-08-01
Automated segmentation of retinal blood vessels in label-free fundus images plays a pivotal role in computer-aided diagnosis of ophthalmic pathologies, viz., diabetic retinopathy, hypertensive disorders and cardiovascular diseases. The challenge remains active in medical image analysis research due to the varied distribution of blood vessels, which manifest variations in their dimensions of physical appearance against a noisy background. In this paper we formulate the segmentation challenge as a classification task. Specifically, we employ unsupervised hierarchical feature learning using an ensemble of two levels of sparsely trained denoised stacked autoencoders. First level training with bootstrap samples ensures decoupling, and the second level ensemble formed by different network architectures ensures architectural revision. We show that ensemble training of auto-encoders fosters diversity in learning a dictionary of visual kernels for vessel segmentation. A SoftMax classifier is used for fine tuning each member autoencoder, and multiple strategies are explored for 2-level fusion of ensemble members. On the DRIVE dataset, we achieve a maximum average accuracy of 95.33% with an impressively low standard deviation of 0.003 and a Kappa agreement coefficient of 0.708. Comparison with other major algorithms substantiates the high efficacy of our model.
Olesen, Alexander Neergaard; Christensen, Julie A E; Sorensen, Helge B D; Jennum, Poul J
2016-08-01
Reducing the number of recording modalities for sleep staging research can benefit both researchers and patients, under the condition that they provide as accurate results as conventional systems. This paper investigates the possibility of exploiting the multisource nature of the electrooculography (EOG) signals by presenting a method for automatic sleep staging using the complete ensemble empirical mode decomposition with adaptive noise algorithm, and a random forest classifier. It achieves a high overall accuracy of 82% and a Cohen's kappa of 0.74 indicating substantial agreement between automatic and manual scoring.
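Cohen's kappa, reported here alongside overall accuracy, corrects the observed agreement between two scorers for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from two label sequences; the function name and the stage labels in the usage note are illustrative assumptions, not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Degenerate case (both raters used one identical label throughout);
        # kappa is undefined, and returning 1.0 is a convention chosen here.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

With manual scores `["W", "W", "N1", "N2", "N2", "REM"]` and automatic scores `["W", "N1", "N1", "N2", "N2", "REM"]`, the observed agreement is 5/6, the chance agreement is 9/36, and kappa is 7/9.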
Diamond-Based Magnetic Imaging with Fourier Optical Processing
NASA Astrophysics Data System (ADS)
Backlund, Mikael P.; Kehayias, Pauli; Walsworth, Ronald L.
2017-11-01
Diamond-based magnetic field sensors have attracted great interest in recent years. In particular, wide-field magnetic imaging using nitrogen-vacancy (NV) centers in diamond has been previously demonstrated in condensed matter, biological, and paleomagnetic applications. Vector magnetic imaging with NV ensembles typically requires a significant applied field (>10 G) to resolve the contributions from four crystallographic orientations, hindering studies of magnetic samples that require measurement in low or independently specified bias fields. Here we model and measure the complex amplitude distribution of NV emission at the microscope's Fourier plane and show that by modulating this collected light at the Fourier plane, one can decompose the NV ensemble magnetic resonance spectrum into its constituent orientations by purely optical means. This decomposition effectively extends the dynamic range at a given bias field and enables wide-field vector magnetic imaging at arbitrarily low bias fields, thus broadening potential applications of NV imaging and sensing. Our results demonstrate that NV-based microscopy stands to benefit greatly from Fourier optical approaches, which have already found widespread utility in other branches of microscopy.
NASA Astrophysics Data System (ADS)
Oh, Seok-Geun; Suh, Myoung-Seok
2017-07-01
The projection skills of five ensemble methods were analyzed according to simulation skills, training period, and ensemble members, using 198 sets of pseudo-simulation data (PSD) produced by random number generation assuming the simulated temperature of regional climate models. The PSD sets were classified into 18 categories according to the relative magnitude of bias, variance ratio, and correlation coefficient, where each category had 11 sets (including 1 truth set) with 50 samples. The ensemble methods used were as follows: equal weighted averaging without bias correction (EWA_NBC), EWA with bias correction (EWA_WBC), weighted ensemble averaging based on root mean square errors and correlation (WEA_RAC), WEA based on the Taylor score (WEA_Tay), and multivariate linear regression (Mul_Reg). The projection skills of the ensemble methods improved generally as compared with the best member for each category. However, their projection skills are significantly affected by the simulation skills of the ensemble member. The weighted ensemble methods showed better projection skills than non-weighted methods, in particular, for the PSD categories having systematic biases and various correlation coefficients. The EWA_NBC showed considerably lower projection skills than the other methods, in particular, for the PSD categories with systematic biases. Although Mul_Reg showed relatively good skills, it showed strong sensitivity to the PSD categories, training periods, and number of members. On the other hand, the WEA_Tay and WEA_RAC showed relatively superior skills in both the accuracy and reliability for all the sensitivity experiments. This indicates that WEA_Tay and WEA_RAC are applicable even for simulation data with systematic biases, a short training period, and a small number of ensemble members.
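The difference between equal weighted averaging with and without bias correction, and a performance-weighted average, can be sketched as follows. The abstract does not give the weighting formulas, so the inverse-RMSE weights below are an assumption standing in for WEA_RAC and WEA_Tay, and every function name is hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def ewa_nbc(members):
    """Equal weighted average of member series, no bias correction."""
    return [mean(step) for step in zip(*members)]


def bias(train_sim, train_obs):
    """Mean difference between a member and observations over the training period."""
    return mean([s - o for s, o in zip(train_sim, train_obs)])


def ewa_wbc(members, train_members, train_obs):
    """Equal weighted average after removing each member's training-period bias."""
    corrected = [
        [value - bias(train, train_obs) for value in member]
        for member, train in zip(members, train_members)
    ]
    return ewa_nbc(corrected)


def rmse(sim, obs):
    return math.sqrt(mean([(s - o) ** 2 for s, o in zip(sim, obs)]))


def inverse_rmse_weights(train_members, train_obs):
    """Weights proportional to 1/RMSE (assumed scheme); fails if a member has zero RMSE."""
    inverse = [1.0 / rmse(train, train_obs) for train in train_members]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_average(members, weights):
    """Weighted average of member series with weights that sum to one."""
    return [sum(w * v for w, v in zip(weights, step)) for step in zip(*members)]
```

A member with a systematic offset pulls `ewa_nbc` away from the truth, while `ewa_wbc` and the weighted average both reduce its influence, which matches the abstract's finding that EWA_NBC performs worst for categories with systematic biases.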
Pirooznia, Mehdi; Deng, Youping
2006-12-12
Graphical user interface (GUI) software promotes novelty by allowing users to extend the functionality. SVM Classifier is a cross-platform graphical application that handles very large datasets well. The purpose of this study is to create a GUI application that allows SVM users to perform SVM training, classification and prediction. The GUI provides user-friendly access to state-of-the-art SVM methods embodied in the LIBSVM implementation of Support Vector Machine. We implemented the Java interface using standard Swing libraries. We used sample data from a breast cancer study for testing classification accuracy. We achieved 100% accuracy in classification among the BRCA1-BRCA2 samples with the RBF kernel of SVM. We have developed a Java GUI application that allows SVM users to perform SVM training, classification and prediction. We have demonstrated that support vector machines can accurately classify genes into functional categories based upon expression data from DNA microarray hybridization experiments. Among the different kernel functions that we examined, the SVM that uses a radial basis kernel function provides the best performance. The SVM Classifier is available at http://mfgn.usm.edu/ebl/svm/.
Fuzzy Nonlinear Proximal Support Vector Machine for Land Extraction Based on Remote Sensing Image
Zhong, Xiaomei; Li, Jianping; Dou, Huacheng; Deng, Shijun; Wang, Guofei; Jiang, Yu; Wang, Yongjie; Zhou, Zebing; Wang, Li; Yan, Fei
2013-01-01
Remote sensing technologies are now widely employed in the dynamic monitoring of land. This paper presents an algorithm named fuzzy nonlinear proximal support vector machine (FNPSVM), based on ETM+ remote sensing images. The algorithm is applied to extract various land types in the city of Da’an in northern China. Two multi-category strategies for this algorithm, namely “one-against-one” and “one-against-rest”, are described in detail and then compared. A fuzzy membership function is presented to reduce the effects of noise or outliers on the data samples. The approaches to feature extraction, feature selection, and several key parameter settings are also given. Numerous experiments were carried out to evaluate its performance, including accuracy (overall accuracy and kappa coefficient), stability, training speed, and classification speed. The FNPSVM classifier was compared with three other classifiers, namely the maximum likelihood classifier (MLC), back propagation neural network (BPN), and the proximal support vector machine (PSVM), under different training conditions. The impacts of the selection of training samples, testing samples and features on the four classifiers were also evaluated in these experiments. PMID:23936016
Faradji, Farhad; Ward, Rabab K; Birch, Gary E
2009-06-15
The feasibility of having a self-paced brain-computer interface (BCI) based on mental tasks is investigated. The EEG signals of four subjects performing five mental tasks each are used in the design of a 2-state self-paced BCI. The output of the BCI should only be activated when the subject performs a specific mental task and should remain inactive otherwise. For each subject and each task, the feature coefficient and the classifier that yield the best performance are selected, using the autoregressive coefficients as the features. The classifier with a zero false positive rate and the highest true positive rate is selected as the best classifier. The classifiers tested include: linear discriminant analysis, quadratic discriminant analysis, Mahalanobis discriminant analysis, support vector machine, and radial basis function neural network. The results show that: (1) some classifiers obtained the desired zero false positive rate; (2) the linear discriminant analysis classifier does not yield acceptable performance; (3) the quadratic discriminant analysis classifier outperforms the Mahalanobis discriminant analysis classifier and performs almost as well as the radial basis function neural network; and (4) the support vector machine classifier has the highest true positive rates but unfortunately has nonzero false positive rates in most cases.
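The selection rule described in the abstract (zero false positive rate first, then the highest true positive rate) can be sketched as follows. This is a minimal illustration only; the function name and the rates in the example are hypothetical and are not results from the study.

```python
def select_best(candidates):
    """Pick the zero-false-positive candidate with the highest true positive rate.

    candidates -- list of (name, true_positive_rate, false_positive_rate)
    Returns None when no candidate reaches a zero false positive rate.
    """
    eligible = [c for c in candidates if c[2] == 0.0]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c[1])


# Hypothetical rates for one subject and one mental task.
candidates = [
    ("LDA", 0.10, 0.00),
    ("QDA", 0.55, 0.00),
    ("SVM", 0.80, 0.02),
    ("RBF-NN", 0.58, 0.00),
]
best = select_best(candidates)
```

Under this rule a classifier with a high true positive rate is still rejected if it produces any false activations, which matches the self-paced requirement that the output stay inactive outside the target task.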
Boosting with Averaged Weight Vectors
NASA Technical Reports Server (NTRS)
Oza, Nikunj C.; Clancy, Daniel (Technical Monitor)
2002-01-01
AdaBoost is a well-known ensemble learning algorithm that constructs its constituent or base models in sequence. A key step in AdaBoost is constructing a distribution over the training examples to create each base model. This distribution, represented as a vector, is constructed to be orthogonal to the vector of mistakes made by the previous base model in the sequence. The idea is to make the next base model's errors uncorrelated with those of the previous model. Some researchers have pointed out the intuition that it is probably better to construct a distribution that is orthogonal to the mistake vectors of all the previous base models, but that this is not always possible. We present an algorithm that attempts to come as close as possible to this goal in an efficient manner. We present experimental results demonstrating significant improvement over AdaBoost and the Totally Corrective boosting algorithm, which also attempts to satisfy this goal.
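The orthogonality property mentioned in the abstract can be checked numerically. The sketch below assumes the standard AdaBoost reweighting, in which misclassified examples are scaled by 1/(2ε) and correctly classified ones by 1/(2(1-ε)), where ε is the weighted error, and it assumes a ±1 encoding of the mistake vector. The function names and the toy data are hypothetical.

```python
def adaboost_reweight(dist, mistakes):
    """Return the next AdaBoost distribution.

    dist     -- current weights over training examples (sums to 1)
    mistakes -- +1 where the previous base model erred, -1 where it was correct
    """
    eps = sum(w for w, m in zip(dist, mistakes) if m == 1)  # weighted error
    if not 0.0 < eps < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    return [
        w / (2.0 * eps) if m == 1 else w / (2.0 * (1.0 - eps))
        for w, m in zip(dist, mistakes)
    ]


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


dist = [0.25, 0.25, 0.25, 0.25]
mistakes = [1, -1, -1, -1]  # hypothetical: the previous model missed example 0
new_dist = adaboost_reweight(dist, mistakes)
```

After the update, half of the total weight sits on the previously misclassified examples, so the dot product of the new distribution with the ±1 mistake vector is zero up to rounding.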
Learning disordered topological phases by statistical recovery of symmetry
NASA Astrophysics Data System (ADS)
Yoshioka, Nobuyuki; Akagi, Yutaka; Katsura, Hosho
2018-05-01
We apply the artificial neural network in a supervised manner to map out the quantum phase diagram of disordered topological superconductors in class DIII. Given the disorder that keeps the discrete symmetries of the ensemble as a whole, translational symmetry which is broken in the quasiparticle distribution individually is recovered statistically by taking an ensemble average. By using this, we classify the phases by the artificial neural network that learned the quasiparticle distribution in the clean limit and show that the result is totally consistent with the calculation by the transfer matrix method or noncommutative geometry approach. If all three phases, namely the Z2, trivial, and thermal metal phases, appear in the clean limit, the machine can classify them with high confidence over the entire phase diagram. If only the former two phases are present, we find that the machine remains confused in a certain region, leading us to conclude the detection of the unknown phase which is eventually identified as the thermal metal phase.
Classifying features in CT imagery: accuracy for some single- and multiple-species classifiers
Daniel L. Schmoldt; Jing He; A. Lynn Abbott
1998-01-01
Our current approach to automatically label features in CT images of hardwood logs classifies each pixel of an image individually. These feature classifiers use a back-propagation artificial neural network (ANN) and feature vectors that include a small, local neighborhood of pixels and the distance of the target pixel to the center of the log. Initially, this type of...
NASA Astrophysics Data System (ADS)
Kaltenboeck, Rudolf; Kerschbaum, Markus; Hennermann, Karin; Mayer, Stefan
2013-04-01
Nowcasting of precipitation events, especially thunderstorm events or winter storms, has high impact on flight safety and efficiency for air traffic management. Future strategic planning by air traffic control will result in circumnavigation of potential hazardous areas, reduction of load around efficiency hot spots by offering alternatives, increase of handling capacity, anticipation of avoidance manoeuvres and increase of awareness before dangerous areas are entered by aircraft. To facilitate this, rapid-update forecasts of location, intensity, size, movement and development of local storms are necessary. Weather radar data deliver precipitation analysis of high temporal and spatial resolution close to real time by using clever scanning strategies. These data are the basis to generate rapid-update forecasts in a time frame up to 2 hours and more for applications in aviation meteorological service provision, such as optimizing safety and economic impact in the context of sub-scale phenomena. On the basis of tracking radar echoes by correlation, the movement vectors of successive weather radar images are calculated. For every new successive radar image a set of ensemble precipitation fields is collected by using different parameter sets like pattern match size, different time steps, filter methods and an implementation of history of tracking vectors and plausibility checks. This method considers the uncertainty in rain field displacement and different scales in time and space. By validating manually a set of case studies, the best verification method and skill score is defined and implemented into an online-verification scheme which calculates the optimized forecasts for different time steps and different areas by using different extrapolation ensemble members. To get information about the quality and reliability of the extrapolation process, additional information on data quality (e.g. shielding in Alpine areas) is extrapolated and combined with an extrapolation-quality-index. Subsequently the probability and quality information of the forecast ensemble is available and flexible blending to numerical prediction model for each subarea is possible. Simultaneously with automatic processing the ensemble nowcasting product is visualized in a new innovative way which combines the intensity, probability and quality information for different subareas in one forecast image.
Adaptive Road Crack Detection System by Pavement Classification
Gavilán, Miguel; Balcones, David; Marcos, Oscar; Llorca, David F.; Sotelo, Miguel A.; Parra, Ignacio; Ocaña, Manuel; Aliseda, Pedro; Yarza, Pedro; Amírola, Alejandro
2011-01-01
This paper presents a road distress detection system involving the phases needed to properly deal with fully automatic road distress assessment. A vehicle equipped with line scan cameras, laser illumination and acquisition HW-SW is used to store the digital images that will be further processed to identify road cracks. Pre-processing is first carried out to both smooth the texture and enhance the linear features. Non-crack features detection is then applied to mask areas of the images with joints, sealed cracks and white painting, which usually generate false positive cracking. A seed-based approach is proposed to deal with road crack detection, combining Multiple Directional Non-Minimum Suppression (MDNMS) with a symmetry check. Seeds are linked by computing the paths with the lowest cost that meet the symmetry restrictions. The whole detection process involves the use of several parameters. A correct setting becomes essential to get optimal results without manual intervention. A fully automatic approach by means of a linear SVM-based classifier ensemble able to distinguish between up to 10 different types of pavement that appear in the Spanish roads is proposed. The optimal feature vector includes different texture-based features. The parameters are then tuned depending on the output provided by the classifier. Regarding non-crack features detection, results show that the introduction of such a module reduces the impact of false positives due to non-crack features by up to a factor of 2. In addition, the observed performance of the crack detection system is significantly boosted by adapting the parameters to the type of pavement. PMID:22163717
An ultra low power feature extraction and classification system for wearable seizure detection.
Page, Adam; Pramod Tim Oates, Siddharth; Mohsenin, Tinoosh
2015-01-01
In this paper we explore the use of a variety of machine learning algorithms for designing a reliable and low-power, multi-channel EEG feature extractor and classifier for predicting seizures from electroencephalographic data (scalp EEG). Different machine learning classifiers including k-nearest neighbor, support vector machines, naïve Bayes, logistic regression, and neural networks are explored with the goal of maximizing detection accuracy while minimizing power, area, and latency. The input to each machine learning classifier is a 198 feature vector containing 9 features for each of the 22 EEG channels obtained over 1-second windows. All classifiers were able to obtain F1 scores over 80% and onset sensitivity of 100% when tested on 10 patients. Among five different classifiers that were explored, logistic regression (LR) proved to have minimum hardware complexity while providing average F-1 score of 91%. Both ASIC and FPGA implementations of logistic regression are presented and show the smallest area, power consumption, and the lowest latency when compared to the previous work.
Cao, Qi; Leung, K M
2014-09-22
Reliable computer models for the prediction of chemical biodegradability from molecular descriptors and fingerprints are very important for making health and environmental decisions. Coupling of the differential evolution (DE) algorithm with the support vector classifier (SVC) in order to optimize the main parameters of the classifier resulted in an improved classifier called the DE-SVC, which is introduced in this paper for use in chemical biodegradability studies. The DE-SVC was applied to predict the biodegradation of chemicals on the basis of extensive sample data sets and known structural features of molecules. Our optimization experiments showed that DE can efficiently find the proper parameters of the SVC. The resulting classifier possesses strong robustness and reliability compared with grid search, genetic algorithm, and particle swarm optimization methods. The classification experiments conducted here showed that the DE-SVC exhibits better classification performance than models previously used for such studies. It is a more effective and efficient prediction model for chemical biodegradability.
Power line identification of millimeter wave radar based on PCA-GS-SVM
NASA Astrophysics Data System (ADS)
Fang, Fang; Zhang, Guifeng; Cheng, Yansheng
2017-12-01
Existing detection methods cannot effectively address the safety risk that power lines pose to UAVs flying at ultra-low altitude. To address this problem, a power line recognition method based on grid search (GS), principal component analysis and support vector machine (PCA-SVM) is proposed. First, the dimensionality of the candidate lines from the Hough transform is reduced by PCA, and the main features of each candidate line are extracted. Then, the support vector machine (SVM) is optimized by the grid search method (GS). Finally, the support vector machine classifier with the optimized parameters is used to classify the candidate lines. MATLAB simulation results show that this method can effectively distinguish power lines from noise, and has high recognition accuracy and algorithm efficiency.
NASA Astrophysics Data System (ADS)
Khuluse-Makhanya, Sibusisiwe; Stein, Alfred; Breytenbach, André; Gxumisa, Athi; Dudeni-Tlhone, Nontembeko; Debba, Pravesh
2017-10-01
In urban areas the deterioration of air quality as a result of fugitive dust receives less attention than the more prominent traffic and industrial emissions. We assessed whether fugitive dust emission sources in the neighbourhood of an air quality monitor are predictors of ambient PM10 concentrations on days characterized by strong local winds. An ensemble maximum likelihood method is developed for land cover mapping in the vicinity of an air quality station using SPOT 6 multi-spectral images. The ensemble maximum likelihood classifier is developed through multiple training iterations for improved accuracy of the bare soil class. Five primary land cover classes are considered, namely built-up areas, vegetation, bare soil, water and 'mixed bare soil' which denotes areas where soil is mixed with either vegetation or synthetic materials. Preliminary validation of the ensemble classifier for the bare soil class results in an accuracy range of 65-98%. Final validation of all classes results in an overall accuracy of 78%. Next, cluster analysis and a varying intercepts regression model are used to assess the statistical association between land cover, a fugitive dust emissions proxy and observed PM10. We found that land cover patterns in the neighbourhood of an air quality station are significant predictors of observed average PM10 concentrations on days when wind speeds are conducive for dust emissions. This study concludes that in the absence of an emissions inventory for ambient particulate matter, PM10 emitted from dust reservoirs can be statistically accounted for by land cover characteristics. This supports the use of land cover data for improved prediction of PM10 at locations without air quality monitoring stations.
NASA Astrophysics Data System (ADS)
Hashemi, H.; Tax, D. M. J.; Duin, R. P. W.; Javaherian, A.; de Groot, P.
2008-11-01
Seismic object detection is a relatively new field in which 3-D bodies are visualized and spatial relationships between objects of different origins are studied in order to extract geologic information. In this paper, we propose a method for finding an optimal classifier with the help of a statistical feature ranking technique and combining different classifiers. The method, which has general applicability, is demonstrated here on a gas chimney detection problem. First, we evaluate a set of input seismic attributes extracted at locations labeled by a human expert using regularized discriminant analysis (RDA). In order to find the RDA score for each seismic attribute, forward and backward search strategies are used. Subsequently, two non-linear classifiers: multilayer perceptron (MLP) and support vector classifier (SVC) are run on the ranked seismic attributes. Finally, to capitalize on the intrinsic differences between both classifiers, the MLP and SVC results are combined using logical rules of maximum, minimum and mean. The proposed method optimizes the ranked feature space size and yields the lowest classification error in the final combined result. We will show that the logical minimum reveals gas chimneys that exhibit both the softness of MLP and the resolution of SVC classifiers.
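The maximum, minimum and mean combination rules named in the abstract can be sketched on per-sample classifier outputs. This is a minimal illustration, assuming both classifiers emit scores on a common 0 to 1 scale; the function name and the scores are hypothetical and are not data from the study.

```python
from statistics import mean


def combine(scores_a, scores_b, rule):
    """Combine two classifiers' per-sample scores with a fixed rule."""
    rules = {"max": max, "min": min, "mean": lambda a, b: mean((a, b))}
    fuse = rules[rule]
    return [fuse(a, b) for a, b in zip(scores_a, scores_b)]


# Hypothetical chimney-probability outputs for four seismic samples.
mlp_scores = [0.9, 0.6, 0.2, 0.7]
svc_scores = [0.8, 0.3, 0.4, 0.9]

combined_min = combine(mlp_scores, svc_scores, "min")
combined_max = combine(mlp_scores, svc_scores, "max")
combined_mean = combine(mlp_scores, svc_scores, "mean")
```

With the minimum rule a sample scores high only when both classifiers score it high, so the combined map keeps the detections the two models agree on.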
Leveraging Wikipedia knowledge to classify multilingual biomedical documents.
Antonio Mouriño García, Marcos; Pérez Rodríguez, Roberto; Anido Rifón, Luis
2018-05-02
This article presents a classifier that leverages Wikipedia knowledge to represent documents as vectors of concept weights, and analyses its suitability for classifying biomedical documents written in any language when it is trained only with English documents. We propose the cross-language concept matching technique, which relies on Wikipedia interlanguage links to convert concept vectors between languages. The performance of the classifier is compared to a classifier based on machine translation, and two classifiers based on MetaMap. To perform the experiments, we created two multilingual corpora. The first one, Multi-Lingual UVigoMED (ML-UVigoMED), is composed of 23,647 Wikipedia documents about biomedical topics written in English, German, French, Spanish, Italian, Galician, Romanian, and Icelandic. The second one, English-French-Spanish-German UVigoMED (EFSG-UVigoMED), is composed of 19,210 biomedical abstracts extracted from MEDLINE written in English, French, Spanish, and German. The performance of the proposed approach is superior to that of any of the state-of-the-art classifiers in the benchmark. We conclude that leveraging Wikipedia knowledge is of great advantage in tasks of multilingual classification of biomedical documents. Copyright © 2018 Elsevier B.V. All rights reserved.
Ghose, Soumya; Mitra, Jhimli; Karunanithi, Mohan; Dowling, Jason
2015-01-01
Home monitoring of chronically ill or elderly patients can reduce frequent hospitalisations and hence provide improved quality of care at a reduced cost to the community, therefore reducing the burden on the healthcare system. Activity recognition of such patients is of high importance in such a design. In this work, a system for automatic human physical activity recognition from smart-phone inertial sensor data is proposed. An ensemble of decision trees framework is adopted to train and predict the multi-class human activity system. A comparison of our proposed method with a traditional multi-class support vector machine shows significant improvement in activity recognition accuracies.
Ren, Fulong; Cao, Peng; Li, Wei; Zhao, Dazhe; Zaiane, Osmar
2017-01-01
Diabetic retinopathy (DR) is a progressive disease, and its detection at an early stage is crucial for saving a patient's vision. An automated screening system for DR can help reduce the chances of complete blindness due to DR while lowering the workload on ophthalmologists. Among the earliest signs of DR are microaneurysms (MAs). However, current schemes for MA detection appear to report many false positives because detection algorithms have high sensitivity. Inevitably some non-MA structures are labeled as MAs in the initial MA identification step. This is a typical "class imbalance problem". Class imbalanced data has detrimental effects on the performance of conventional classifiers. In this work, we propose an ensemble-based adaptive over-sampling algorithm for overcoming the class imbalance problem in false positive reduction, and we use Boosting, Bagging, and Random subspace as the ensemble framework to improve microaneurysm detection. The ensemble-based over-sampling methods we propose combine the strengths of adaptive over-sampling and ensemble learning. The objective of combining ensemble and adaptive over-sampling is to reduce the induction biases introduced by imbalanced data and to enhance the generalization classification performance of extreme learning machines (ELM). Experimental results show that our ASOBoost method has higher area under the ROC curve (AUC) and G-mean values than many existing class imbalance learning methods. Copyright © 2016 Elsevier Ltd. All rights reserved.
Lung Cancer Survival Prediction using Ensemble Data Mining on Seer Data
Agrawal, Ankit; Misra, Sanchit; Narayanan, Ramanathan; ...
2012-01-01
We analyze the lung cancer data available from the SEER program with the aim of developing accurate survival prediction models for lung cancer. Carefully designed preprocessing steps resulted in removal/modification/splitting of several attributes, and 2 of the 11 derived attributes were found to have significant predictive power. Several supervised classification methods were used on the preprocessed data along with various data mining optimizations and validations. In our experiments, ensemble voting of five decision tree based classifiers and meta-classifiers was found to result in the best prediction performance in terms of accuracy and area under the ROC curve. We have developed an on-line lung cancer outcome calculator for estimating the risk of mortality after 6 months, 9 months, 1 year, 2 years and 5 years of diagnosis, for which a smaller non-redundant subset of 13 attributes was carefully selected using attribute selection techniques, while trying to retain the predictive power of the original set of attributes. Further, ensemble voting models were also created for predicting conditional survival outcome for lung cancer (estimating risk of mortality after 5 years of diagnosis, given that the patient has already survived for a period of time), and included in the calculator. The on-line lung cancer outcome calculator developed as a result of this study is available at http://info.eecs.northwestern.edu:8080/LungCancerOutcomeCalculator/.
Applying machine-learning techniques to Twitter data for automatic hazard-event classification.
NASA Astrophysics Data System (ADS)
Filgueira, R.; Bee, E. J.; Diaz-Doce, D.; Poole, J., Sr.; Singh, A.
2017-12-01
The constant flow of information offered by tweets provides valuable information about all sorts of events at a high temporal and spatial resolution. Over the past year we have been analyzing geological hazards and phenomena in real time, such as earthquakes, volcanic eruptions, landslides, floods or the aurora, as part of the GeoSocial project, by geo-locating tweets filtered by keywords in a web-map. However, not all the filtered tweets are related to hazard or phenomenon events. This work explores two classification techniques for automatic hazard-event categorization based on tweets about the "Aurora". First, tweets were filtered using aurora-related keywords, removing stop words and selecting the ones written in English. For classifying the remaining tweets into "aurora-event" or "no-aurora-event" categories, we compared two state-of-the-art techniques: Support Vector Machine (SVM) and Deep Convolutional Neural Network (CNN) algorithms. Both approaches belong to the family of supervised learning algorithms, which make predictions based on a labelled training dataset. Therefore, we created a training dataset by tagging 1200 tweets with one of the two categories. The general form of SVM is used to separate two classes by a function (kernel). We compared the performance of four different kernels (Linear Regression, Logistic Regression, Multinomial Naïve Bayesian and Stochastic Gradient Descent) provided by the Scikit-Learn library using our training dataset to build the SVM classifier. The results showed that Logistic Regression (LR) gets the best accuracy (87%). So, we selected the SVM-LR classifier to categorise a large collection of tweets using the "dispel4py" framework. Later, we developed a CNN classifier, where the first layer embeds words into low-dimensional vectors. The next layer performs convolutions over the embedded word vectors. Results from the convolutional layer are max-pooled into a long feature vector, which is classified using a softmax layer. The CNN's accuracy is lower (83%) than that of the SVM-LR, since the algorithm needs a bigger training dataset to increase its accuracy. We used the TensorFlow framework for applying the CNN classifier to the same collection of tweets. In the future we will modify both classifiers to work with other geo-hazards, use larger training datasets and apply them in real time.
NIMEFI: Gene Regulatory Network Inference using Multiple Ensemble Feature Importance Algorithms
Ruyssinck, Joeri; Huynh-Thu, Vân Anh; Geurts, Pierre; Dhaene, Tom; Demeester, Piet; Saeys, Yvan
2014-01-01
One of the long-standing open challenges in computational systems biology is the topology inference of gene regulatory networks from high-throughput omics data. Recently, two community-wide efforts, DREAM4 and DREAM5, have been established to benchmark network inference techniques using gene expression measurements. In these challenges the overall top performer was the GENIE3 algorithm. This method decomposes the network inference task into separate regression problems for each gene in the network in which the expression values of a particular target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. The contribution of this work is twofold. First, we generalize the regression decomposition strategy of GENIE3 to other feature importance methods. We compare the performance of support vector regression, the elastic net, random forest regression, symbolic regression and their ensemble variants in this setting to the original GENIE3 algorithm. To create the ensemble variants, we propose a subsampling approach which allows us to cast any feature selection algorithm that produces a feature ranking into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance. As second contribution, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better. An implementation of NIMEFI has been made publicly available. 
PMID:24667482
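The rankwise averaging of several feature importance methods described in the NIMEFI abstract can be sketched as below. This is a minimal illustration, assuming that each method scores the same set of candidate links and that ranks are averaged with equal weight; the function names, link labels and scores are hypothetical, and the published implementation may differ.

```python
def ranks(scores):
    """Map each candidate link to its rank (1 = highest importance score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {link: position for position, link in enumerate(ordered, start=1)}


def rankwise_average(score_tables):
    """Average the ranks each method assigns to every candidate link."""
    rank_tables = [ranks(table) for table in score_tables]
    links = rank_tables[0].keys()
    return {
        link: sum(table[link] for table in rank_tables) / len(rank_tables)
        for link in links
    }


# Hypothetical importance scores for three regulator -> target links.
method_a = {"g1->g2": 0.9, "g3->g2": 0.5, "g4->g2": 0.1}
method_b = {"g1->g2": 0.4, "g3->g2": 0.7, "g4->g2": 0.2}
method_c = {"g1->g2": 0.8, "g3->g2": 0.1, "g4->g2": 0.3}

avg_rank = rankwise_average([method_a, method_b, method_c])
best_first = sorted(avg_rank, key=avg_rank.get)  # lowest average rank first
```

Averaging ranks rather than raw scores sidesteps the fact that importance scores from different methods are not on a common scale.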
Breast Cancer Recognition Using a Novel Hybrid Intelligent Method
Addeh, Jalil; Ebrahimzadeh, Ata
2012-01-01
Breast cancer is the second largest cause of cancer deaths among women. At the same time, it is also among the most curable cancer types if it can be diagnosed early. This paper presents a novel hybrid intelligent method for recognition of breast cancer tumors. The proposed method includes three main modules: the feature extraction module, the classifier module, and the optimization module. In the feature extraction module, fuzzy features are proposed as the efficient characteristic of the patterns. In the classifier module, because of the promising generalization capability of support vector machines (SVM), a SVM-based classifier is proposed. In support vector machine training, the hyperparameters have very important roles for its recognition accuracy. Therefore, in the optimization module, the bees algorithm (BA) is proposed for selecting appropriate parameters of the classifier. The proposed system is tested on Wisconsin Breast Cancer database and simulation results show that the recommended system has a high accuracy. PMID:23626945
Cost-sensitive AdaBoost algorithm for ordinal regression based on extreme learning machine.
Riccardi, Annalisa; Fernández-Navarro, Francisco; Carloni, Sante
2014-10-01
In this paper, the well known stagewise additive modeling using a multiclass exponential (SAMME) boosting algorithm is extended to address problems where there exists a natural order in the targets using a cost-sensitive approach. The proposed ensemble model uses an extreme learning machine (ELM) model as a base classifier (with the Gaussian kernel and the additional regularization parameter). The closed form of the derived weighted least squares problem is provided, and it is employed to estimate analytically the parameters connecting the hidden layer to the output layer at each iteration of the boosting algorithm. Compared to the state-of-the-art boosting algorithms, in particular those using ELM as base classifier, the suggested technique does not require the generation of a new training dataset at each iteration. The adoption of the weighted least squares formulation of the problem has been presented as an unbiased and alternative approach to the already existing ELM boosting techniques. Moreover, the addition of a cost model for weighting the patterns, according to the order of the targets, enables the classifier to tackle ordinal regression problems further. The proposed method has been validated by an experimental study by comparing it with already existing ensemble methods and ELM techniques for ordinal regression, showing competitive results.
Displacement data assimilation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rosenthal, W. Steven; Venkataramani, Shankar; Mariano, Arthur J.
We show that modifying a Bayesian data assimilation scheme by incorporating kinematically-consistent displacement corrections produces a scheme that is demonstrably better at estimating partially observed state vectors in a setting where feature information is important. While the displacement transformation is generic, here we implement it within an ensemble Kalman Filter framework and demonstrate its effectiveness in tracking stochastically perturbed vortices.
Advanced Methods for Passive Acoustic Detection, Classification, and Localization of Marine Mammals
2012-09-30
floor 1176 Howell St Newport RI 02842 phone: (401) 832-5749 fax: (401) 832-4441 email: David.Moretti@navy.mil Steve W. Martin SPAWAR...multiclass support vector machine (SVM) classifier was previously developed ( Jarvis et al. 2008). This classifier both detects and classifies echolocation...whales. Here Moretti’s group, especially S. Jarvis , will improve the SVM classifier by resolving confusion between species whose clicks overlap in
Clustering-Based Ensemble Learning for Activity Recognition in Smart Homes
Jurek, Anna; Nugent, Chris; Bi, Yaxin; Wu, Shengli
2014-01-01
Application of sensor-based technology within activity monitoring systems is becoming a popular technique within the smart environment paradigm. Nevertheless, the use of such an approach generates complex constructs of data, which subsequently requires the use of intricate activity recognition techniques to automatically infer the underlying activity. This paper explores a cluster-based ensemble method as a new solution for the purposes of activity recognition within smart environments. With this approach activities are modelled as collections of clusters built on different subsets of features. A classification process is performed by assigning a new instance to its closest cluster from each collection. Two different sensor data representations have been investigated, namely numeric and binary. Following the evaluation of the proposed methodology it has been demonstrated that the cluster-based ensemble method can be successfully applied as a viable option for activity recognition. Results following exposure to data collected from a range of activities indicated that the ensemble method had the ability to perform with accuracies of 94.2% and 97.5% for numeric and binary data, respectively. These results outperformed a range of single classifiers considered as benchmarks. PMID:25014095
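The idea of assigning an instance to its closest cluster in several collections, each built on a different feature subset, can be sketched as follows. This is a deliberately simplified illustration: it assumes one centroid per activity label in each collection and a plain majority vote, whereas the paper's clustering and combination rules may differ. The data, labels and feature subsets are hypothetical.

```python
from collections import Counter

def centroids_by_label(X, y, features):
    """One centroid per label, computed only over the given feature indices."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def closest_label(centroids, x, features):
    """Label of the centroid nearest to x (squared Euclidean distance)."""
    def dist(c):
        return sum((x[f] - c[j]) ** 2 for j, f in enumerate(features))
    return min(centroids, key=lambda label: dist(centroids[label]))

def ensemble_predict(collections, x):
    """Majority vote over the closest cluster from each collection."""
    votes = Counter(closest_label(c, x, feats) for feats, c in collections)
    return votes.most_common(1)[0][0]

# Hypothetical binary sensor readings (three sensors) with activity labels.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
y = ["sleep", "sleep", "cook", "cook"]
subsets = [(0, 1), (0, 2), (1, 2)]
collections = [(s, centroids_by_label(X, y, s)) for s in subsets]
print(ensemble_predict(collections, [1, 0, 0]))
```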
Clustering-based ensemble learning for activity recognition in smart homes.
Jurek, Anna; Nugent, Chris; Bi, Yaxin; Wu, Shengli
2014-07-10
Application of sensor-based technology within activity monitoring systems is becoming a popular technique within the smart environment paradigm. Nevertheless, the use of such an approach generates complex constructs of data, which subsequently requires the use of intricate activity recognition techniques to automatically infer the underlying activity. This paper explores a cluster-based ensemble method as a new solution for the purposes of activity recognition within smart environments. With this approach activities are modelled as collections of clusters built on different subsets of features. A classification process is performed by assigning a new instance to its closest cluster from each collection. Two different sensor data representations have been investigated, namely numeric and binary. Following the evaluation of the proposed methodology it has been demonstrated that the cluster-based ensemble method can be successfully applied as a viable option for activity recognition. Results following exposure to data collected from a range of activities indicated that the ensemble method had the ability to perform with accuracies of 94.2% and 97.5% for numeric and binary data, respectively. These results outperformed a range of single classifiers considered as benchmarks.
NASA Astrophysics Data System (ADS)
Ouyang, Qi; Lu, Wenxi; Lin, Jin; Deng, Wenbing; Cheng, Weiguo
2017-08-01
The surrogate-based simulation-optimization techniques are frequently used for optimal groundwater remediation design. When this technique is used, surrogate errors caused by surrogate-modeling uncertainty may lead to generation of infeasible designs. In this paper, a conservative strategy that pushes the optimal design into the feasible region was used to address surrogate-modeling uncertainty. In addition, chance-constrained programming (CCP) was adopted to compare with the conservative strategy in addressing this uncertainty. Three methods, multi-gene genetic programming (MGGP), Kriging (KRG) and support vector regression (SVR), were used to construct surrogate models for a time-consuming multi-phase flow model. To improve the performance of the surrogate model, ensemble surrogates were constructed based on combinations of different stand-alone surrogate models. The results show that: (1) the surrogate-modeling uncertainty was successfully addressed by the conservative strategy, which means that this method is promising for addressing surrogate-modeling uncertainty. (2) The ensemble surrogate model that combines MGGP with KRG showed the most favorable performance, which indicates that this ensemble surrogate can utilize both stand-alone surrogate models to improve the performance of the surrogate model.
NASA Astrophysics Data System (ADS)
Khade, Vikram; Kurian, Jaison; Chang, Ping; Szunyogh, Istvan; Thyng, Kristen; Montuoro, Raffaele
2017-05-01
This paper demonstrates the potential of ocean ensemble forecasting in the Gulf of Mexico (GoM). The Bred Vector (BV) technique with one week rescaling frequency is implemented on a 9 km resolution version of the Regional Ocean Modelling System (ROMS). Numerical experiments are carried out by using the HYCOM analysis products to define the initial conditions and the lateral boundary conditions. The growth rates of the forecast uncertainty are estimated to be about 10% of initial amplitude per week. By carrying out ensemble forecast experiments with and without perturbed surface forcing, it is demonstrated that in the coastal regions accounting for uncertainties in the atmospheric forcing is more important than accounting for uncertainties in the ocean initial conditions. In the Loop Current region, the initial condition uncertainties are the dominant source of the forecast uncertainty. The root-mean-square error of the Lagrangian track forecasts at the 15-day forecast lead time can be reduced by about 10-50 km using the ensemble mean Eulerian forecast of the oceanic flow for the computation of the tracks, instead of the single-initial-condition Eulerian forecast.
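A breeding cycle of the kind referred to here can be sketched on a toy system. The example below assumes the Lorenz-63 equations with forward-Euler stepping as a stand-in for the ocean model; the amplitude, cycle length and initial states are hypothetical and unrelated to the ROMS configuration. Each cycle runs a control and a perturbed forecast, takes their difference, and rescales it to the initial amplitude.

```python
import math

def lorenz63_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz-63 toy model."""
    x, y, z = state
    return (x + dt * sigma * (y - x),
            y + dt * (x * (rho - z) - y),
            z + dt * (x * y - beta * z))

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def breed(control, perturbation, amplitude, cycles, steps_per_cycle):
    """Run breeding cycles; return the final bred vector and per-cycle growth factors."""
    growth = []
    for _ in range(cycles):
        perturbed = tuple(c + p for c, p in zip(control, perturbation))
        for _ in range(steps_per_cycle):
            control = lorenz63_step(control)
            perturbed = lorenz63_step(perturbed)
        diff = tuple(p - c for p, c in zip(perturbed, control))
        size = norm(diff)
        growth.append(size / amplitude)                           # growth over this cycle
        perturbation = tuple(amplitude * d / size for d in diff)  # rescale to initial size
    return perturbation, growth

bred_vector, growth = breed((1.0, 1.0, 1.0), (1e-3, 0.0, 0.0), 1e-3,
                            cycles=20, steps_per_cycle=8)
```

The rescaling interval (`steps_per_cycle` here) plays the role of the one-week rescaling frequency in the abstract.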
Men, Zhongxian; Yee, Eugene; Lien, Fue-Sang; Yang, Zhiling; Liu, Yongqian
2014-01-01
Short-term wind speed and wind power forecasts (for a 72 h period) are obtained using a nonlinear autoregressive exogenous artificial neural network (ANN) methodology which incorporates either numerical weather prediction or high-resolution computational fluid dynamics wind field information as an exogenous input. An ensemble approach is used to combine the predictions from many candidate ANNs in order to provide improved forecasts for wind speed and power, along with the associated uncertainties in these forecasts. More specifically, the ensemble ANN is used to quantify the uncertainties arising from the network weight initialization and from the unknown structure of the ANN. All members forming the ensemble of neural networks were trained using an efficient particle swarm optimization algorithm. The results of the proposed methodology are validated using wind speed and wind power data obtained from an operational wind farm located in Northern China. The assessment demonstrates that this methodology for wind speed and power forecasting generally provides an improvement in predictive skills when compared to the practice of using an "optimal" weight vector from a single ANN while providing additional information in the form of prediction uncertainty bounds.
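Combining member forecasts into a single forecast with uncertainty bounds can be sketched in a few lines. The sketch assumes the ensemble mean as the combined forecast and a band of k sample standard deviations as the bounds; the paper may construct its bounds differently, and the wind speeds below are made-up values.

```python
from statistics import mean, stdev

def ensemble_forecast(member_forecasts, k=2.0):
    """Combine member forecasts per lead time into (mean, lower, upper)."""
    combined = []
    for values in zip(*member_forecasts):   # one tuple of member values per lead time
        m, s = mean(values), stdev(values)
        combined.append((m, m - k * s, m + k * s))
    return combined

# Hypothetical wind speed forecasts (m/s) from three networks at three lead times.
members = [[5.0, 6.0, 7.5], [5.4, 6.2, 7.0], [4.6, 5.8, 8.0]]
for m, lower, upper in ensemble_forecast(members):
    print(f"{m:.2f} m/s  [{lower:.2f}, {upper:.2f}]")
```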
Lien, Fue-Sang; Yang, Zhiling; Liu, Yongqian
2014-01-01
Short-term wind speed and wind power forecasts (for a 72 h period) are obtained using a nonlinear autoregressive exogenous artificial neural network (ANN) methodology which incorporates either numerical weather prediction or high-resolution computational fluid dynamics wind field information as an exogenous input. An ensemble approach is used to combine the predictions from many candidate ANNs in order to provide improved forecasts for wind speed and power, along with the associated uncertainties in these forecasts. More specifically, the ensemble ANN is used to quantify the uncertainties arising from the network weight initialization and from the unknown structure of the ANN. All members forming the ensemble of neural networks were trained using an efficient particle swarm optimization algorithm. The results of the proposed methodology are validated using wind speed and wind power data obtained from an operational wind farm located in Northern China. The assessment demonstrates that this methodology for wind speed and power forecasting generally provides an improvement in predictive skills when compared to the practice of using an “optimal” weight vector from a single ANN while providing additional information in the form of prediction uncertainty bounds. PMID:27382627
NASA Technical Reports Server (NTRS)
Keppenne, Christian L.; Rienecker, Michele M.; Koblinsky, Chester (Technical Monitor)
2001-01-01
A multivariate ensemble Kalman filter (MvEnKF) has been implemented on a massively parallel computer architecture for the Poseidon ocean circulation model and tested with a Pacific Basin model configuration. There are about two million prognostic state-vector variables. Parallelism for the data assimilation step is achieved by regionalization of the background-error covariances that are calculated from the phase-space distribution of the ensemble. Each processing element (PE) collects elements of a matrix measurement functional from nearby PEs. To avoid the introduction of spurious long-range covariances associated with finite ensemble sizes, the background-error covariances are given compact support by means of a Hadamard (element by element) product with a three-dimensional canonical correlation function. The methodology and the MvEnKF configuration are discussed. It is shown that the regionalization of the background covariances has a negligible impact on the quality of the analyses. The parallel algorithm is very efficient for large numbers of observations but does not scale well beyond 100 PEs at the current model resolution. On a platform with distributed memory, memory rather than speed is the limiting factor.
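Giving a sample covariance compact support through a Hadamard product can be sketched as follows. The example assumes a one-dimensional grid and the Gaspari-Cohn fifth-order piecewise rational function as the compactly supported correlation; the abstract describes a three-dimensional correlation function and does not name the one used, so both choices are assumptions, as are the ensemble size and length scale.

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Compactly supported correlation function; exactly zero for dist >= 2c."""
    r = np.abs(np.asarray(dist, dtype=float)) / c
    out = np.zeros_like(r)
    near = r <= 1.0
    far = (r > 1.0) & (r < 2.0)
    rn, rf = r[near], r[far]
    out[near] = -0.25 * rn**5 + 0.5 * rn**4 + 0.625 * rn**3 - (5.0 / 3.0) * rn**2 + 1.0
    out[far] = (rf**5 / 12.0 - 0.5 * rf**4 + 0.625 * rf**3 + (5.0 / 3.0) * rf**2
                - 5.0 * rf + 4.0 - 2.0 / (3.0 * rf))
    return out

def localized_covariance(ensemble, positions, c):
    """Sample covariance (members in rows) tapered by a Hadamard product."""
    cov = np.cov(ensemble, rowvar=False)
    dist = np.abs(positions[:, None] - positions[None, :])
    return cov * gaspari_cohn(dist, c)          # element-by-element product

rng = np.random.default_rng(1)
ensemble = rng.normal(size=(10, 30))            # 10 members, 30 grid points
positions = np.arange(30, dtype=float)          # 1-D grid for simplicity
P_loc = localized_covariance(ensemble, positions, c=3.0)
```

With only 10 members the raw sample covariance contains spurious correlations between distant points; the taper sets every entry beyond a separation of 2c to zero while leaving the variances on the diagonal unchanged.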
Non-coaxial superposition of vector vortex beams.
Aadhi, A; Vaity, Pravin; Chithrabhanu, P; Reddy, Salla Gangi; Prabakar, Shashi; Singh, R P
2016-02-10
Vector vortex beams are classified into four types depending upon spatial variation in their polarization vector. We have generated all four of these types of vector vortex beams by using a modified polarization Sagnac interferometer with a vortex lens. Further, we have studied the non-coaxial superposition of two vector vortex beams. It is observed that the superposition of two vector vortex beams with same polarization singularity leads to a beam with another kind of polarization singularity in their interaction region. The results may be of importance in ultrahigh security of the polarization-encrypted data that utilizes vector vortex beams and multiple optical trapping with non-coaxial superposition of vector vortex beams. We verified our experimental results with theory.
NASA Astrophysics Data System (ADS)
Mohan, Dhanya; Kumar, C. Santhosh
2016-03-01
Predicting the physiological condition (normal/abnormal) of a patient is highly desirable to enhance the quality of health care. Multi-parameter patient monitors (MPMs) using heart rate, arterial blood pressure, respiration rate and oxygen saturation (SpO2) as input parameters were developed to monitor the condition of patients, with minimum human resource utilization. The support vector machine (SVM), an advanced machine learning approach popularly used for classification and regression, is used for the realization of MPMs. To make MPMs cost effective, we experiment with a hardware implementation of the MPM using a support vector machine classifier. The training of the system is done in the MATLAB environment and the detection of the alarm/no-alarm condition is implemented in hardware. We used different kernels for SVM classification and note that the best performance was obtained using the intersection kernel SVM (IKSVM). The intersection kernel support vector machine classifier MPM outperformed the best known MPM using a radial basis function kernel by an absolute improvement of 2.74% in accuracy, 1.86% in sensitivity and 3.01% in specificity. The hardware model was developed based on the improved performance system using Verilog Hardware Description Language and was implemented on an Altera Cyclone II development board.
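The intersection kernel named in this abstract is commonly defined, for non-negative feature vectors, as the sum of element-wise minima. The sketch below shows that definition and the resulting Gram matrix; the feature values are hypothetical, and scaling the four vital signs to [0, 1] is an assumption made only for the example.

```python
def intersection_kernel(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))

def gram_matrix(samples):
    """Kernel matrix that a precomputed-kernel SVM trainer could consume."""
    return [[intersection_kernel(a, b) for b in samples] for a in samples]

# Hypothetical samples: heart rate, blood pressure, respiration rate, SpO2, scaled to [0, 1].
samples = [
    [0.50, 0.60, 0.30, 0.98],
    [0.90, 0.40, 0.70, 0.85],
    [0.55, 0.65, 0.35, 0.97],
]
K = gram_matrix(samples)
```

The kernel needs only comparisons and additions, with no exponentials or multiplications, which may be part of why it suits a hardware implementation.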
Neural network ensemble based CAD system for focal liver lesions from B-mode ultrasound.
Virmani, Jitendra; Kumar, Vinod; Kalra, Naveen; Khandelwal, Niranjan
2014-08-01
A neural network ensemble (NNE) based computer-aided diagnostic (CAD) system to assist radiologists in differential diagnosis between focal liver lesions (FLLs), including (1) typical and atypical cases of Cyst, hemangioma (HEM) and metastatic carcinoma (MET) lesions, (2) small and large hepatocellular carcinoma (HCC) lesions, along with (3) normal (NOR) liver tissue is proposed in the present work. Expert radiologists visualize the textural characteristics of regions inside and outside the lesions to differentiate between different FLLs; accordingly, texture features computed from inside lesion regions of interest (IROIs) and texture ratio features computed from IROIs and surrounding lesion regions of interest (SROIs) are taken as input. Principal component analysis (PCA) is used for reducing the dimensionality of the feature space before classifier design. The first step of the classification module consists of a five class PCA-NN based primary classifier which yields probability outputs for five liver image classes. The second step of the classification module consists of ten binary PCA-NN based secondary classifiers for NOR/Cyst, NOR/HEM, NOR/HCC, NOR/MET, Cyst/HEM, Cyst/HCC, Cyst/MET, HEM/HCC, HEM/MET and HCC/MET classes. The probability outputs of the five class PCA-NN based primary classifier are used to determine the two most probable classes for a test instance, based on which it is directed to the corresponding binary PCA-NN based secondary classifier for crisp classification between the two classes. By including the second step of the classification module, classification accuracy increases from 88.7 % to 95 %. The promising results obtained by the proposed system indicate its usefulness to assist radiologists in differential diagnosis of FLLs.
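The two-step routing described in this abstract (primary probabilities, then a binary classifier chosen by the two most probable classes) can be sketched as follows. The `primary` function and the `secondary` dictionary are hypothetical stand-ins for the trained PCA-NN models, and the probabilities are made up; only the routing logic is illustrated.

```python
from itertools import combinations

CLASSES = ["NOR", "Cyst", "HEM", "HCC", "MET"]

def two_stage_classify(features, primary, secondary):
    """primary(features) -> {class: probability}; secondary[(a, b)](features) -> a or b."""
    probs = primary(features)
    top_two = sorted(probs, key=probs.get, reverse=True)[:2]
    pair = tuple(sorted(top_two, key=CLASSES.index))   # canonical order, e.g. ("HEM", "HCC")
    return secondary[pair](features)

# Hypothetical stand-in for the five class primary classifier.
def primary(features):
    return {"NOR": 0.05, "Cyst": 0.10, "HEM": 0.40, "HCC": 0.35, "MET": 0.10}

# Hypothetical stand-ins for the ten binary classifiers; each returns the second class of its pair.
secondary = {pair: (lambda f, pair=pair: pair[1]) for pair in combinations(CLASSES, 2)}

print(two_stage_classify([0.2, 0.7], primary, secondary))
```

Five classes give C(5, 2) = 10 unordered pairs, which matches the ten secondary classifiers listed in the abstract.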
Issues in Engineering Self-Aware and Self-Expressive Ensembles
NASA Astrophysics Data System (ADS)
Hölzl, Matthias; Wirsing, Martin
The following sections are included: * Introduction * Classifying Awareness and Self-Expression * Elements of the awareness mechanism * A white-box definition of awareness * Self-expression * A black-box definition of awareness * Awareness Engineering * Socio-Technical Issues * Privacy, ownership and control of data * Discrimination and responsibility * Failure modes * Summary and Conclusions
Improving precision of glomerular filtration rate estimating model by ensemble learning.
Liu, Xun; Li, Ningshan; Lv, Linsheng; Fu, Yongmei; Cheng, Cailian; Wang, Caixia; Ye, Yuqiu; Li, Shaomin; Lou, Tanqi
2017-11-09
Accurate assessment of kidney function is clinically important, but estimates of glomerular filtration rate (GFR) by regression are imprecise. We hypothesized that ensemble learning could improve precision. A total of 1419 participants were enrolled, with 1002 in the development dataset and 417 in the external validation dataset. GFR was independently estimated from age, sex and serum creatinine using an artificial neural network (ANN), support vector machine (SVM), regression, and ensemble learning. GFR was measured by 99mTc-DTPA renal dynamic imaging calibrated with dual plasma sample 99mTc-DTPA GFR. Mean measured GFRs were 70.0 ml/min/1.73 m² in the developmental and 53.4 ml/min/1.73 m² in the external validation cohorts. In the external validation cohort, precision was better in the ensemble model of the ANN, SVM and regression equation (IQR = 13.5 ml/min/1.73 m²) than in the new regression model (IQR = 14.0 ml/min/1.73 m², P < 0.001). The precision of ensemble learning was the best of the three models, but the models had similar bias and accuracy. The median difference ranged from 2.3 to 3.7 ml/min/1.73 m², 30% accuracy ranged from 73.1 to 76.0%, and P was > 0.05 for all comparisons of the new regression equation and the other new models. An ensemble learning model including three variables, the average ANN, SVM, and regression equation values, was more precise than the new regression model. A more complex ensemble learning strategy may further improve GFR estimates.
Estimation of the uncertainty of a climate model using an ensemble simulation
NASA Astrophysics Data System (ADS)
Barth, A.; Mathiot, P.; Goosse, H.
2012-04-01
The atmospheric forcings play an important role in the study of the ocean and sea-ice dynamics of the Southern Ocean. Error in the atmospheric forcings will inevitably result in uncertain model results. The sensitivity of the model results to errors in the atmospheric forcings is studied with ensemble simulations using multivariate perturbations of the atmospheric forcing fields. The numerical ocean model used is the NEMO-LIM in a global configuration with a horizontal resolution of 2°. NCEP reanalyses are used to provide air temperature and wind data to force the ocean model over the last 50 years. A climatological mean is used to prescribe relative humidity, cloud cover and precipitation. In a first step, the model results are compared with OSTIA SST and OSI SAF sea ice concentration of the southern hemisphere. The seasonal behavior of the RMS difference and bias in SST and ice concentration is highlighted as well as the regions with relatively high RMS errors and biases such as the Antarctic Circumpolar Current and near the ice-edge. Ensemble simulations are performed to statistically characterize the model error due to uncertainties in the atmospheric forcings. Such information is a crucial element for future data assimilation experiments. Ensemble simulations are performed with perturbed air temperature and wind forcings. A Fourier decomposition of the NCEP wind vectors and air temperature for 2007 is used to generate ensemble perturbations. The perturbations are scaled such that the resulting ensemble spread matches approximately the RMS differences between the satellite SST and sea ice concentration. The ensemble spread and covariance are analyzed for the minimum and maximum sea ice extent. It is shown that errors in the atmospheric forcings can extend to several hundred meters in depth near the Antarctic Circumpolar Current.
Representation of photon limited data in emission tomography using origin ensembles
NASA Astrophysics Data System (ADS)
Sitek, A.
2008-06-01
Representation and reconstruction of data obtained by emission tomography scanners are challenging due to high noise levels in the data. Typically, images obtained using tomographic measurements are represented using grids. In this work, we define images as sets of origins of events detected during tomographic measurements; we call these origin ensembles (OEs). A state in the ensemble is characterized by a vector of 3N parameters Y, where the parameters are the coordinates of origins of detected events in a three-dimensional space and N is the number of detected events. The 3N-dimensional probability density function (PDF) for that ensemble is derived, and we present an algorithm for OE image estimation from tomographic measurements. A displayable image (e.g. grid based image) is derived from the OE formulation by calculating ensemble expectations based on the PDF using the Markov chain Monte Carlo method. The approach was applied to computer-simulated 3D list-mode positron emission tomography data. The reconstruction errors for a 10 000 000 event acquisition for simulated data ranged from 0.1 to 34.8%, depending on object size and sampling density. The method was also applied to experimental data and the results of the OE method were consistent with those obtained by a standard maximum-likelihood approach. The method is a new approach to representation and reconstruction of data obtained by photon-limited emission tomography measurements.
NASA Technical Reports Server (NTRS)
Fichtl, G. H.; Holland, R. L.
1978-01-01
A stochastic model of spacecraft motion was developed based on the assumption that the net torque vector due to crew activity and rocket thruster firings is a statistically stationary Gaussian vector process. The process had zero ensemble mean value, and the components of the torque vector were mutually stochastically independent. The linearized rigid-body equations of motion were used to derive the autospectral density functions of the components of the spacecraft rotation vector. The cross-spectral density functions of the components of the rotation vector vanish for all frequencies so that the components of rotation were mutually stochastically independent. The autospectral and cross-spectral density functions of the induced gravity environment imparted to scientific apparatus rigidly attached to the spacecraft were calculated from the rotation rate spectral density functions via linearized inertial frame to body-fixed principal axis frame transformation formulae. The induced gravity process was a Gaussian one with zero mean value. Transformation formulae were used to rotate the principal axis body-fixed frame to which the rotation rate and induced gravity vector were referred to a body-fixed frame in which the components of the induced gravity vector were stochastically independent. Rice's theory of exceedances was used to calculate expected exceedance rates of the components of the rotation and induced gravity vector processes.
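For reference, the standard form of Rice's result for a zero-mean, stationary Gaussian process is given below. The notation is an assumption introduced here, not taken from the report: $S(\omega)$ is the one-sided autospectral density of the component of interest, $\lambda_0$ is its variance, $\lambda_2$ is the variance of its time derivative, and $\nu^{+}(a)$ is the expected rate at which the process crosses the level $a$ from below.

```latex
\lambda_n = \int_0^{\infty} \omega^{n} S(\omega)\,\mathrm{d}\omega ,
\qquad
\nu^{+}(a) = \frac{1}{2\pi}\sqrt{\frac{\lambda_2}{\lambda_0}}\,
             \exp\!\left(-\frac{a^{2}}{2\lambda_0}\right)
```

Setting $a = 0$ gives the expected rate of zero up-crossings, $\nu^{+}(0) = \frac{1}{2\pi}\sqrt{\lambda_2/\lambda_0}$, so the exceedance rate at any level follows from the first two even spectral moments alone.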
Advanced Methods for Passive Acoustic Detection, Classification, and Localization of Marine Mammals
2013-09-30
N0001411WX21394 Steve W. Martin SPAWAR Systems Center Pacific 53366 Front St. San Diego, CA 92152-6551 phone: (619) 553-9882 email: Steve.W.Martin...multiclass support vector machine (SVM) classifier was previously developed ( Jarvis et al. 2008). This classifier both detects and classifies echolocation...whales. Here Moretti’s group, particularly S. Jarvis , will improve the SVM classifier by resolving confusion between species whose clicks overlap in
Application of Bred Vectors To Data Assimilation
NASA Astrophysics Data System (ADS)
Corazza, M.; Kalnay, E.; Patil, Dj
We introduced a statistic, the BV-dimension, to measure the effective local finite-time dimensionality of the atmosphere. We show that this dimension is often quite low, and suggest that this finding has important implications for data assimilation and the accuracy of weather forecasting (Patil et al, 2001). The original database for this study was the forecasts of the NCEP global ensemble forecasting system. The initial differences between the control forecast and the perturbed forecasts are called bred vectors. The control and perturbed initial conditions valid at time t = nΔt are evolved using the forecast model until time t = (n+1)Δt. The differences between the perturbed and the control forecasts are scaled down to their initial amplitude, and constitute the bred vectors valid at (n+1)Δt. Their growth rate is typically about 1.5/day. The bred vectors are similar by construction to leading Lyapunov vectors except that they have small but finite amplitude, and they are valid at finite times. The original NCEP ensemble data set has 5 independent bred vectors. We define a local bred vector at each grid point by choosing the 5 by 5 grid points centered at the grid point (a region of about 1100km by 1100km), and using the north-south and east-west velocity components at 500mb pressure level to form a 50 dimensional column vector. Since we have k=5 global bred vectors, we also have k local bred vectors at each grid point. We estimate the effective dimensionality of the subspace spanned by the local bred vectors by performing a singular value decomposition (EOF analysis). The k local bred vector columns form a 50xk matrix M. The singular values s(i) of M measure the extent to which the k column unit vectors making up the matrix M point in the direction of v(i).
We define the bred vector dimension as BVDIM = {Sum[s(i)]}^2 / Sum[s(i)^2]. For example, if 4 out of the 5 vectors lie along v(1), and one lies along v(2), the BV-dimension would be BVDIM[sqrt(4), 1, 0, 0, 0] = 1.8, less than 2 because one direction is more dominant than the other in representing the original data. The results (Patil et al, 2001) show that there are large regions where the bred vectors span a subspace of substantially lower dimension than that of the full space. These low dimensionality regions are dominant in the baroclinic extratropics, typically have a lifetime of 3-7 days, have a well-defined horizontal and vertical structure that spans most of the atmosphere, and tend to move eastward. New results with a large number of ensemble members confirm these results and indicate that the low dimensionality regions are quite robust, and depend only on the verification time (i.e., the underlying flow). Corazza et al (2001) have performed experiments with a data assimilation system based on a quasi-geostrophic model and simulated observations (Morss, 1999, Hamill et al, 2000). A 3D-variational data assimilation scheme for a quasi-geostrophic channel model is used to study the structure of the background error and its relationship to the corresponding bred vectors. The "true" evolution of the model atmosphere is defined by an integration of the model and "rawinsonde observations" are simulated by randomly perturbing the true state at fixed locations. It is found that after 3-5 days the bred vectors develop well organized structures which are very similar for the two different norms considered in this paper (potential vorticity norm and streamfunction norm). The results show that the bred vectors do indeed represent well the characteristics of the data assimilation forecast errors, and that the subspace of bred vectors contains most of the forecast error, except in areas where the forecast errors are small.
For example, the angle between the 6hr forecast error and the subspace spanned by 10 bred vectors is less than 10° over 90% of the domain, indicating a pattern correlation of more than 98.5% between the forecast error and its projection onto the bred vector subspace. The presence of low-dimensional regions in the perturbations of the basic flow has important implications for data assimilation. At any given time, there is a difference between the true atmospheric state and the model forecast. Assuming that model errors are not the dominant source of errors, in a region of low BV-dimensionality the difference between the true state and the forecast should lie substantially in the low dimensional unstable subspace of the few bred vectors that contribute most strongly to the low BV-dimension. This information should yield a substantial improvement in the forecast: the data assimilation algorithm should correct the model state by moving it closer to the observations along the unstable subspace, since this is where the true state most likely lies. Preliminary experiments have been conducted with the quasi-geostrophic data assimilation system testing whether it is possible to add "errors of the day" based on bred vectors to the standard (constant) 3D-Var background error covariance in order to capture these important errors. The results are extremely encouraging, indicating a significant reduction (about 40%) in the analysis errors at a very low computational cost.
References:
Corazza, M., E. Kalnay, D. J. Patil, R. Morss, M. Cai, I. Szunyogh, B. R. Hunt, E. Ott, and J. A. Yorke, 2001: Use of the breeding technique to estimate the structure of the analysis "errors of the day". Submitted to Nonlinear Processes in Geophysics.
Hamill, T. M., C. Snyder, and R. E. Morss, 2000: A Comparison of Probabilistic Forecasts from Bred, Singular-Vector and Perturbed Observation Ensembles. Mon. Wea. Rev., 128, 1835-1851.
Kalnay, E., and Z. Toth, 1994: Removing growing errors in the analysis cycle. Preprints of the Tenth Conference on Numerical Weather Prediction, Amer. Meteor. Soc., 1994, 212-215.
Morss, R. E., 1999: Adaptive observations: Idealized sampling strategies for improving numerical weather prediction. Ph.D. thesis, Massachusetts Institute of Technology, 225 pp.
Patil, D. J. S., B. R. Hunt, E. Kalnay, J. A. Yorke, and E. Ott, 2001: Local Low Dimensionality of Atmospheric Dynamics. Phys. Rev. Lett., 86, 5878.
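The BV-dimension defined in the abstract above, the squared sum of the singular values divided by the sum of their squares, can be checked numerically. The sketch below assumes NumPy and rebuilds the abstract's own example of five unit vectors in a 50-dimensional space, four along one direction and one along an orthogonal direction; the function and variable names are illustrative.

```python
import numpy as np

def bv_dimension(M):
    """BV-dimension of the columns of M: (sum of s_i)^2 / (sum of s_i^2)."""
    M = np.asarray(M, dtype=float)
    M = M / np.linalg.norm(M, axis=0)           # make each local bred vector a unit column
    s = np.linalg.svd(M, compute_uv=False)      # singular values of the 50 x k matrix
    return s.sum() ** 2 / (s ** 2).sum()

v1, v2 = np.eye(50)[0], np.eye(50)[1]           # two orthogonal directions
M = np.column_stack([v1, v1, v1, v1, v2])       # 4 vectors along v1, one along v2
print(round(bv_dimension(M), 6))                # 1.8
```

The singular values here are (2, 1, 0, 0, 0), so the statistic is 9/5 = 1.8. It ranges from 1, when all k vectors are parallel, to k, when they are mutually orthogonal.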
Incremental classification learning for anomaly detection in medical images
NASA Astrophysics Data System (ADS)
Giritharan, Balathasan; Yuan, Xiaohui; Liu, Jianguo
2009-02-01
Computer-aided diagnosis usually screens thousands of instances to find only a few positive cases that indicate probable presence of disease. The amount of patient data increases continuously. In diagnosis of new instances, disagreement occurs between a CAD system and physicians, which suggests inaccurate classifiers. Intuitively, misclassified instances and the previously acquired data should be used to retrain the classifier. This, however, is very time consuming and, in some cases where the dataset is too large, becomes infeasible. In addition, among the patient data, only a small percentage shows positive signs, which is known as imbalanced data. We present an incremental Support Vector Machine (SVM) as a solution for the class imbalance problem in classification of anomalies in medical images. The support vectors provide a concise representation of the distribution of the training data. Here we use bootstrapping to identify potential candidate support vectors for future iterations. Experiments were conducted using images from endoscopy videos, and the sensitivity and specificity were close to those of an SVM trained using all samples available at a given incremental step, with significantly improved efficiency in training the classifier.
Scalar and vector Keldysh models in the time domain
NASA Astrophysics Data System (ADS)
Kiselev, M. N.; Kikoin, K. A.
2009-04-01
The exactly solvable Keldysh model of a disordered electron system in a random scattering field with extremely long correlation length is converted to the time-dependent model with extremely long relaxation. The dynamical problem is solved for the ensemble of two-level systems (TLS) with fluctuating well depths having the discrete Z2 symmetry. It is shown also that the symmetric TLS with fluctuating barrier transparency may be described in terms of the vector Keldysh model with time-dependent random planar rotations in the xy plane having continuous SO(2) symmetry. Application of this model to description of dynamic fluctuations in quantum dots and optical lattices is discussed.
NASA Astrophysics Data System (ADS)
Maier, Oskar; Wilms, Matthias; von der Gablentz, Janina; Krämer, Ulrike; Handels, Heinz
2014-03-01
Automatic segmentation of ischemic stroke lesions in magnetic resonance (MR) images is important in clinical practice and for neuroscientific trials. The key problem is to detect largely inhomogeneous regions of varying sizes, shapes and locations. We present a stroke lesion segmentation method based on local features extracted from multi-spectral MR data that are selected to model a human observer's discrimination criteria. A support vector machine classifier is trained on expert-segmented examples and then used to classify formerly unseen images. Leave-one-out cross validation on eight datasets with lesions of varying appearances is performed, showing our method to compare favourably with other published approaches in terms of accuracy and robustness. Furthermore, we compare a number of feature selectors and closely examine each feature's and MR sequence's contribution.
Kambhampati, Satya Samyukta; Singh, Vishal; Manikandan, M Sabarimalai; Ramkumar, Barathram
2015-08-01
In this Letter, the authors present a unified framework for fall event detection and classification using the cumulants extracted from the acceleration (ACC) signals acquired using a single waist-mounted triaxial accelerometer. The main objective of this Letter is to find suitable representative cumulants and classifiers in effectively detecting and classifying different types of fall and non-fall events. It was discovered that the first level of the proposed hierarchical decision tree algorithm implements fall detection using fifth-order cumulants and support vector machine (SVM) classifier. In the second level, the fall event classification algorithm uses the fifth-order cumulants and SVM. Finally, human activity classification is performed using the second-order cumulants and SVM. The detection and classification results are compared with those of the decision tree, naive Bayes, multilayer perceptron and SVM classifiers with different types of time-domain features including the second-, third-, fourth- and fifth-order cumulants and the signal magnitude vector and signal magnitude area. The experimental results demonstrate that the second- and fifth-order cumulant features and SVM classifier can achieve optimal detection and classification rates of above 95%, as well as the lowest false alarm rate of 1.03%.
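The time-domain features named in this abstract can be sketched in a few lines. This is only an illustration under assumptions: the Letter does not state here how the signal is windowed or whether cumulants are taken per axis, so the function names, the per-window treatment, and the normalisation of the signal magnitude area by window length are all hypothetical choices. The cumulant formulas are the standard ones expressed through central moments.

```python
import math


def central_moment(xs, order):
    """Central moment of the given order for one window of samples."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** order for x in xs) / len(xs)


def cumulants(xs):
    """Second- to fifth-order cumulants of one window (assumed per axis)."""
    m2, m3, m4, m5 = (central_moment(xs, k) for k in (2, 3, 4, 5))
    return {
        2: m2,                   # variance
        3: m3,                   # third cumulant equals third central moment
        4: m4 - 3 * m2 ** 2,     # excess fourth-order term
        5: m5 - 10 * m3 * m2,    # fifth-order cumulant
    }


def signal_magnitude_vector(ax, ay, az):
    """Per-sample Euclidean norm of the triaxial acceleration."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]


def signal_magnitude_area(ax, ay, az):
    """Mean of summed absolute axis values (one common normalisation)."""
    n = len(ax)
    return sum(abs(x) + abs(y) + abs(z) for x, y, z in zip(ax, ay, az)) / n
```

Each window would yield a small feature vector (for example the second- and fifth-order cumulants) that is then passed to a classifier such as an SVM.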
A hybrid approach to select features and classify diseases based on medical data
NASA Astrophysics Data System (ADS)
AbdelLatif, Hisham; Luo, Jiawei
2018-03-01
Feature selection is a popular problem in the classification of diseases in clinical medicine. Here, we develop a hybrid methodology to classify diseases, based on three medical datasets: Arrhythmia, Breast cancer, and Hepatitis. This methodology, called k-means ANOVA Support Vector Machine (K-ANOVA-SVM), uses K-means clustering with ANOVA statistics to preprocess the data and select the significant features, and Support Vector Machines in the classification process. To compare and evaluate the performance, we chose three classification algorithms (decision tree, Naïve Bayes, and Support Vector Machines) and applied the medical datasets directly to these algorithms. Our methodology gave much better classification accuracy, 98% on the Arrhythmia dataset, 92% on the Breast cancer dataset and 88% on the Hepatitis dataset, compared with using the medical data directly with decision tree, Naïve Bayes, and Support Vector Machines. The ROC curve and precision of K-ANOVA-SVM were also better than those of the other algorithms.
NASA Astrophysics Data System (ADS)
Watanabe, Tatsuhito; Katsura, Seiichiro
A person operating a mobile robot in a remote environment receives realistic visual feedback about the condition of the road on which the robot is moving. The categorization of the road condition is necessary to evaluate the conditions for safe and comfortable driving. For this purpose, the mobile robot should be capable of recognizing and classifying the condition of the road surfaces. This paper proposes a method for recognizing the type of road surfaces on the basis of the friction between the mobile robot and the road surfaces. This friction is estimated by a disturbance observer, and a support vector machine is used to classify the surfaces. The support vector machine identifies the type of the road surface using feature vector, which is determined using the arithmetic average and variance derived from the torque values. Further, these feature vectors are mapped onto a higher dimensional space by using a kernel function. The validity of the proposed method is confirmed by experimental results.
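The feature construction described here (mean and variance of torque values, followed by a kernel mapping) can be sketched as follows. The abstract does not name the kernel or the variance convention, so the Gaussian (RBF) kernel, the `gamma` parameter, the population variance, and the function names are assumptions made purely for illustration.

```python
import math


def torque_features(torques):
    """Feature vector (mean, variance) from a window of torque estimates."""
    n = len(torques)
    mean = sum(torques) / n
    # Population variance; the original work may use a different convention.
    variance = sum((t - mean) ** 2 for t in torques) / n
    return (mean, variance)


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel, an assumed choice for the implicit feature mapping."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

A support vector machine never forms the higher dimensional vectors explicitly; it only needs kernel values such as `rbf_kernel(f1, f2)` between pairs of feature vectors.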
Lenselink, Eelke B; Ten Dijke, Niels; Bongers, Brandon; Papadatos, George; van Vlijmen, Herman W T; Kowalczyk, Wojtek; IJzerman, Adriaan P; van Westen, Gerard J P
2017-08-14
The increase of publicly available bioactivity data in recent years has fueled and catalyzed research in chemogenomics, data mining, and modeling approaches. As a direct result, over the past few years a multitude of different methods have been reported and evaluated, such as target fishing, nearest neighbor similarity-based methods, and Quantitative Structure Activity Relationship (QSAR)-based protocols. However, such studies are typically conducted on different datasets, using different validation strategies, and different metrics. In this study, different methods were compared using one single standardized dataset obtained from ChEMBL, which is made available to the public, using standardized metrics (BEDROC and Matthews Correlation Coefficient). Specifically, the performance of Naïve Bayes, Random Forests, Support Vector Machines, Logistic Regression, and Deep Neural Networks was assessed using QSAR and proteochemometric (PCM) methods. All methods were validated using both a random split validation and a temporal validation, with the latter being a more realistic benchmark of expected prospective execution. Deep Neural Networks are the top performing classifiers, highlighting the added value of Deep Neural Networks over other more conventional methods. Moreover, the best method ('DNN_PCM') performed significantly better at almost one standard deviation higher than the mean performance. Furthermore, Multi-task and PCM implementations were shown to improve performance over single task Deep Neural Networks. Conversely, target prediction performed almost two standard deviations under the mean performance. Random Forests, Support Vector Machines, and Logistic Regression performed around mean performance. Finally, using an ensemble of DNNs, alongside additional tuning, enhanced the relative performance by another 27% (compared with unoptimized 'DNN_PCM'). 
Here, a standardized set to test and evaluate different machine learning algorithms in the context of multi-task learning is offered by providing the data and the protocols.
Applying six classifiers to airborne hyperspectral imagery for detecting giant reed
USDA-ARS's Scientific Manuscript database
This study evaluated and compared six different image classifiers, including minimum distance (MD), Mahalanobis distance (MAHD), maximum likelihood (ML), spectral angle mapper (SAM), mixture tuned matched filtering (MTMF) and support vector machine (SVM), for detecting and mapping giant reed (Arundo...
Recognition Using Hybrid Classifiers.
Osadchy, Margarita; Keren, Daniel; Raviv, Dolev
2016-04-01
A canonical problem in computer vision is category recognition (e.g., find all instances of human faces, cars etc., in an image). Typically, the input for training a binary classifier is a relatively small sample of positive examples, and a huge sample of negative examples, which can be very diverse, consisting of images from a large number of categories. The difficulty of the problem sharply increases with the dimension and size of the negative example set. We propose to alleviate this problem by applying a "hybrid" classifier, which replaces the negative samples by a prior, and then finds a hyperplane which separates the positive samples from this prior. The method is extended to kernel space and to an ensemble-based approach. The resulting binary classifiers achieve an identical or better classification rate than SVM, while requiring far smaller memory and lower computational complexity to train and apply.
Ensemble Pruning for Glaucoma Detection in an Unbalanced Data Set.
Adler, Werner; Gefeller, Olaf; Gul, Asma; Horn, Folkert K; Khan, Zardad; Lausen, Berthold
2016-12-07
Random forests are successful classifier ensemble methods consisting of typically 100 to 1000 classification trees. Ensemble pruning techniques reduce the computational cost, especially the memory demand, of random forests by reducing the number of trees without relevant loss of performance or even with increased performance of the sub-ensemble. The application to the problem of an early detection of glaucoma, a severe eye disease with low prevalence, based on topographical measurements of the eye background faces specific challenges. We examine the performance of ensemble pruning strategies for glaucoma detection in an unbalanced data situation. The data set consists of 102 topographical features of the eye background of 254 healthy controls and 55 glaucoma patients. We compare the area under the receiver operating characteristic curve (AUC), and the Brier score on the total data set, in the majority class, and in the minority class of pruned random forest ensembles obtained with strategies based on the prediction accuracy of greedily grown sub-ensembles, the uncertainty weighted accuracy, and the similarity between single trees. To validate the findings and to examine the influence of the prevalence of glaucoma in the data set, we additionally perform a simulation study with lower prevalences of glaucoma. In glaucoma classification all three pruning strategies lead to improved AUC and smaller Brier scores on the total data set with sub-ensembles as small as 30 to 80 trees compared to the classification results obtained with the full ensemble consisting of 1000 trees. In the simulation study, we were able to show that the prevalence of glaucoma is a critical factor and lower prevalence decreases the performance of our pruning strategies. 
The memory demand for glaucoma classification in an unbalanced data situation based on random forests could effectively be reduced by the application of pruning strategies without loss of performance in a population with increased risk of glaucoma.
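The two evaluation measures used in this study, the AUC and the Brier score, have simple definitions that a short sketch makes concrete. The code below is a generic illustration, not the authors' evaluation pipeline; the function names and the 0/1 label encoding are assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    case receives the higher score; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def brier_score(labels, probabilities):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probabilities)) / len(labels)
```

Restricting `labels` and `probabilities` to one class before calling `brier_score` gives a per-class value of the kind reported for the majority and minority classes.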
Alcaide-Leon, P; Dufort, P; Geraldo, A F; Alshafai, L; Maralani, P J; Spears, J; Bharatha, A
2017-06-01
Accurate preoperative differentiation of primary central nervous system lymphoma and enhancing glioma is essential to avoid unnecessary neurosurgical resection in patients with primary central nervous system lymphoma. The purpose of the study was to evaluate the diagnostic performance of a machine-learning algorithm by using texture analysis of contrast-enhanced T1-weighted images for differentiation of primary central nervous system lymphoma and enhancing glioma. Seventy-one adult patients with enhancing gliomas and 35 adult patients with primary central nervous system lymphomas were included. The tumors were manually contoured on contrast-enhanced T1WI, and the resulting volumes of interest were mined for textural features and subjected to a support vector machine-based machine-learning protocol. Three readers classified the tumors independently on contrast-enhanced T1WI. Areas under the receiver operating characteristic curves were estimated for each reader and for the support vector machine classifier. A noninferiority test for diagnostic accuracy based on paired areas under the receiver operating characteristic curve was performed with a noninferiority margin of 0.15. The mean areas under the receiver operating characteristic curve were 0.877 (95% CI, 0.798-0.955) for the support vector machine classifier; 0.878 (95% CI, 0.807-0.949) for reader 1; 0.899 (95% CI, 0.833-0.966) for reader 2; and 0.845 (95% CI, 0.757-0.933) for reader 3. The mean area under the receiver operating characteristic curve of the support vector machine classifier was significantly noninferior to the mean area under the curve of reader 1 ( P = .021), reader 2 ( P = .035), and reader 3 ( P = .007). Support vector machine classification based on textural features of contrast-enhanced T1WI is noninferior to expert human evaluation in the differentiation of primary central nervous system lymphoma and enhancing glioma. © 2017 by American Journal of Neuroradiology.
On the use of feature selection to improve the detection of sea oil spills in SAR images
NASA Astrophysics Data System (ADS)
Mera, David; Bolon-Canedo, Veronica; Cotos, J. M.; Alonso-Betanzos, Amparo
2017-03-01
Fast and effective oil spill detection systems are crucial to ensure a proper response to environmental emergencies caused by hydrocarbon pollution on the ocean's surface. Typically, these systems uncover not only oil spills, but also a high number of look-alikes. Feature extraction is a critical and computationally intensive phase where each detected dark spot is independently examined. Traditionally, detection systems use an arbitrary set of features to discriminate between oil spills and look-alike phenomena. However, Feature Selection (FS) methods based on Machine Learning (ML) have proved to be very useful in real domains for enhancing the generalization capabilities of the classifiers, while discarding the existing irrelevant features. In this work, we present a generic and systematic approach, based on FS methods, for choosing a concise and relevant set of features to improve oil spill detection systems. We have compared five FS methods: Correlation-based feature selection (CFS), Consistency-based filter, Information Gain, ReliefF and Recursive Feature Elimination for Support Vector Machine (SVM-RFE). They were applied to a 141-input vector composed of features from a collection of outstanding studies. Selected features were validated via a Support Vector Machine (SVM) classifier and the results were compared with previous works. Test experiments revealed that the classifier trained with the 6-input feature vector proposed by SVM-RFE achieved the best accuracy and Cohen's kappa coefficient (87.1% and 74.06% respectively). This is a smaller feature combination with similar or even better classification accuracy than previous works. This finding makes it possible to speed up the feature extraction phase without reducing the classifier accuracy.
Experiments also confirmed the significance of the geometrical features since 75.0% of the different features selected by the applied FS methods as well as 66.67% of the proposed 6-input feature vector belong to this category.
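Cohen's kappa, reported above alongside accuracy, corrects the observed agreement for the agreement expected by chance. A minimal sketch follows; the function name and the list-based inputs are illustrative assumptions, and the value is returned as a fraction rather than a percentage.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal class frequencies of both labelings.
    expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes
    )
    # Note: undefined when expected == 1 (both labelings use a single class).
    return (observed - expected) / (1 - expected)
```

For a balanced two-class problem, an accuracy of 85% with chance agreement of 50% gives a kappa of 0.7, which shows why kappa is usually lower than raw accuracy.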
Xie, Hong-Bo; Huang, Hu; Wu, Jianhua; Liu, Lei
2015-02-01
We present a multiclass fuzzy relevance vector machine (FRVM) learning mechanism and evaluate its performance to classify multiple hand motions using surface electromyographic (sEMG) signals. The relevance vector machine (RVM) is a sparse Bayesian kernel method which avoids some limitations of the support vector machine (SVM). However, RVM still suffers from the difficulty of possible unclassifiable regions in multiclass problems. We propose two fuzzy membership function-based FRVM algorithms to solve such problems, based on experiments conducted on seven healthy subjects and two amputees with six hand motions. Two feature sets, namely, AR model coefficients and root mean square value (AR-RMS), and wavelet transform (WT) features, are extracted from the recorded sEMG signals. Fuzzy support vector machine (FSVM) analysis was also conducted for wide comparison in terms of accuracy, sparsity, training and testing time, as well as the effect of training sample sizes. FRVM yielded comparable classification accuracy with dramatically fewer support vectors in comparison with FSVM. Furthermore, the processing delay of FRVM was much less than that of FSVM, whilst the training time of FSVM was much shorter than that of FRVM. The results indicate that an FRVM classifier trained using sufficient samples can achieve generalization capability comparable to FSVM with significant sparsity in multi-channel sEMG classification, which makes it more suitable for sEMG-based real-time control applications.
Ensemble LUT classification for degraded document enhancement
NASA Astrophysics Data System (ADS)
Obafemi-Ajayi, Tayo; Agam, Gady; Frieder, Ophir
2008-01-01
The fast evolution of scanning and computing technologies has led to the creation of large collections of scanned paper documents. Examples of such collections include historical collections, legal depositories, medical archives, and business archives. Moreover, in many situations such as legal litigation and security investigations scanned collections are being used to facilitate systematic exploration of the data. It is almost always the case that scanned documents suffer from some form of degradation. Large degradations make documents hard to read and substantially deteriorate the performance of automated document processing systems. Enhancement of degraded document images is normally performed assuming global degradation models. When the degradation is large, global degradation models do not perform well. In contrast, we propose to estimate local degradation models and use them in enhancing degraded document images. Using a semi-automated enhancement system we have labeled a subset of the Frieder diaries collection. This labeled subset was then used to train an ensemble classifier. The component classifiers are based on lookup tables (LUT) in conjunction with the approximated nearest neighbor algorithm. The resulting algorithm is highly efficient. Experimental evaluation results are provided using the Frieder diaries collection.
Palaniappan, Rajkumar; Sundaraj, Kenneth; Sundaraj, Sebastian; Huliraj, N; Revadi, S S
2017-06-08
Auscultation is a medical procedure used for the initial diagnosis and assessment of lung and heart diseases. From this perspective, we propose assessing the performance of the extreme learning machine (ELM) classifiers for the diagnosis of pulmonary pathology using breath sounds. Energy and entropy features were extracted from the breath sound using the wavelet packet transform. The statistical significance of the extracted features was evaluated by one-way analysis of variance (ANOVA). The extracted features were inputted into the ELM classifier. The maximum classification accuracies obtained for the conventional validation (CV) of the energy and entropy features were 97.36% and 98.37%, respectively, whereas the accuracies obtained for the cross validation (CRV) of the energy and entropy features were 96.80% and 97.91%, respectively. In addition, maximum classification accuracies of 98.25% and 99.25% were obtained for the CV and CRV of the ensemble features, respectively. The results indicate that the classification accuracy obtained with the ensemble features was higher than those obtained with the energy and entropy features.
Wan, Shixiang; Duan, Yucong; Zou, Quan
2017-09-01
Predicting the subcellular localization of proteins is an important and challenging problem. Traditional experimental approaches are often expensive and time-consuming. Consequently, a growing number of research efforts employ a series of machine learning approaches to predict the subcellular location of proteins. There are two main challenges among the state-of-the-art prediction methods. First, most of the existing techniques are designed to deal with multi-class rather than multi-label classification, which ignores connections between multiple labels. In reality, multiple locations of particular proteins imply that there are vital and unique biological significances that deserve special focus and cannot be ignored. Second, techniques for handling imbalanced data in multi-label classification problems are necessary, but never employed. For solving these two issues, we have developed an ensemble multi-label classifier called HPSLPred, which can be applied for multi-label classification with an imbalanced protein source. For convenience, a user-friendly webserver has been established at http://server.malab.cn/HPSLPred. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Bowd, Christopher; Medeiros, Felipe A.; Zhang, Zuohua; Zangwill, Linda M.; Hao, Jiucang; Lee, Te-Won; Sejnowski, Terrence J.; Weinreb, Robert N.; Goldbaum, Michael H.
2010-01-01
Purpose To classify healthy and glaucomatous eyes using relevance vector machine (RVM) and support vector machine (SVM) learning classifiers trained on retinal nerve fiber layer (RNFL) thickness measurements obtained by scanning laser polarimetry (SLP). Methods Seventy-two eyes of 72 healthy control subjects (average age = 64.3 ± 8.8 years, visual field mean deviation =−0.71 ± 1.2 dB) and 92 eyes of 92 patients with glaucoma (average age = 66.9 ± 8.9 years, visual field mean deviation =−5.32 ± 4.0 dB) were imaged with SLP with variable corneal compensation (GDx VCC; Laser Diagnostic Technologies, San Diego, CA). RVM and SVM learning classifiers were trained and tested on SLP-determined RNFL thickness measurements from 14 standard parameters and 64 sectors (approximately 5.6° each) obtained in the circumpapillary area under the instrument-defined measurement ellipse (total 78 parameters). Tenfold cross-validation was used to train and test RVM and SVM classifiers on unique subsets of the full 164-eye data set and areas under the receiver operating characteristic (AUROC) curve for the classification of eyes in the test set were generated. AUROC curve results from RVM and SVM were compared to those for 14 SLP software-generated global and regional RNFL thickness parameters. Also reported was the AUROC curve for the GDx VCC software-generated nerve fiber indicator (NFI). Results The AUROC curves for RVM and SVM were 0.90 and 0.91, respectively, and increased to 0.93 and 0.94 when the training sets were optimized with sequential forward and backward selection (resulting in reduced dimensional data sets). AUROC curves for optimized RVM and SVM were significantly larger than those for all individual SLP parameters. The AUROC curve for the NFI was 0.87. Conclusions Results from RVM and SVM trained on SLP RNFL thickness measurements are similar and provide accurate classification of glaucomatous and healthy eyes. 
RVM may be preferable to SVM, because it provides a Bayesian-derived probability of glaucoma as an output. These results suggest that these machine learning classifiers show good potential for glaucoma diagnosis. PMID:15790898
Identifying saltcedar with hyperspectral data and support vector machines
USDA-ARS's Scientific Manuscript database
Saltcedar (Tamarix spp.) are a group of dense phreatophytic shrubs and trees that are invasive to riparian areas throughout the United States. This study determined the feasibility of using hyperspectral data and a support vector machine (SVM) classifier to discriminate saltcedar from other cover t...
Classification of a set of vectors using self-organizing map- and rule-based technique
NASA Astrophysics Data System (ADS)
Ae, Tadashi; Okaniwa, Kaishirou; Nosaka, Kenzaburou
2005-02-01
There exist various objects, such as pictures, music, texts, etc., around our environment. We have a view for these objects by looking, reading or listening. Our view is concerned with our behaviors deeply, and is very important to understand our behaviors. We have a view for an object, and decide the next action (data selection, etc.) with our view. Such a series of actions constructs a sequence. Therefore, we propose a method which acquires a view as a vector from several words for a view, and apply the vector to sequence generation. We focus on sequences of the data of which a user selects from a multimedia database containing pictures, music, movie, etc... These data cannot be stereotyped because user's view for them changes by each user. Therefore, we represent the structure of the multimedia database as the vector representing user's view and the stereotyped vector, and acquire sequences containing the structure as elements. Such a vector can be classified by SOM (Self-Organizing Map). Hidden Markov Model (HMM) is a method to generate sequences. Therefore, we use HMM of which a state corresponds to the representative vector of user's view, and acquire sequences containing the change of user's view. We call it Vector-state Markov Model (VMM). We introduce the rough set theory as a rule-base technique, which plays a role of classifying the sets of data such as the sets of "Tour".
Advanced Techniques for Scene Analysis
2010-06-01
robustness prefers a bigger integration window to handle larger motions. The advantage of pyramidal implementation is that, while each motion vector dL...labeled SAR images. Now the previous algorithm leads to a more dedicated classifier for the particular target; however, our algorithm trades generality for...accuracy is traded for generality. 7.3.2 I-RELIEF Feature weighting transforms the original feature vector x into a new feature vector x′ by assigning each
The Probability of Exceedance as a Nonparametric Person-Fit Statistic for Tests of Moderate Length
ERIC Educational Resources Information Center
Tendeiro, Jorge N.; Meijer, Rob R.
2013-01-01
To classify an item score pattern as not fitting a nonparametric item response theory (NIRT) model, the probability of exceedance (PE) of an observed response vector x can be determined as the sum of the probabilities of all response vectors that are, at most, as likely as x, conditional on the test's total score. Vector x is to be considered…
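The definition of the probability of exceedance given in this abstract can be made concrete by brute-force enumeration for a short test. The sketch below assumes that items are answered independently with known success probabilities `p`; how those probabilities are obtained under a nonparametric IRT model is not stated in the truncated abstract, so that part, along with all names, is hypothetical.

```python
from itertools import combinations


def pattern_probability(pattern, p):
    """Probability of a 0/1 response pattern under independent items."""
    prob = 1.0
    for x_i, p_i in zip(pattern, p):
        prob *= p_i if x_i else (1.0 - p_i)
    return prob


def probability_of_exceedance(x, p, rel_tol=1e-9):
    """Sum of conditional probabilities of all patterns with the same total
    score that are at most as likely as the observed pattern x."""
    n, score = len(x), sum(x)
    # Enumerate every pattern that shares the observed total score.
    patterns = []
    for ones in combinations(range(n), score):
        chosen = set(ones)
        patterns.append(tuple(1 if i in chosen else 0 for i in range(n)))
    probs = [pattern_probability(y, p) for y in patterns]
    total = sum(probs)  # normalising constant for conditioning on the score
    p_x = pattern_probability(x, p)
    # Small relative tolerance so x itself is always included.
    return sum(q for q in probs if q <= p_x * (1.0 + rel_tol)) / total
```

With `p = [0.9, 0.5, 0.1]`, answering only the easiest item correctly gives a PE of 1, while answering only the hardest item correctly gives a PE of about 0.011, flagging that pattern as unusual. Enumeration grows combinatorially with test length, so this approach only suits short to moderate tests.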
Activity Recognition in Egocentric video using SVM, kNN and Combined SVMkNN Classifiers
NASA Astrophysics Data System (ADS)
Sanal Kumar, K. P.; Bhavani, R., Dr.
2017-08-01
Egocentric vision is a unique, human-centric perspective in computer vision. The recognition of egocentric actions is a challenging task which helps in assisting elderly people, disabled patients and so on. In this work, life logging activity videos are taken as input. There are two categories of activity: the first is the top level and the second is the second level. Here, recognition is done using features such as Histogram of Oriented Gradients (HOG), Motion Boundary Histogram (MBH) and Trajectory. The features are fused together to act as a single feature. The extracted features are reduced using Principal Component Analysis (PCA). The reduced features are provided as input to classifiers such as Support Vector Machine (SVM), k nearest neighbor (kNN) and a combination of the two (combined SVMkNN). These classifiers are evaluated, and the combined SVMkNN provided better results than other classifiers in the literature.
A novel method for predicting kidney stone type using ensemble learning.
Kazemi, Yassaman; Mirroshandel, Seyed Abolghasem
2018-01-01
The high morbidity rate associated with kidney stone disease, which is a silent killer, is one of the main concerns in healthcare systems all over the world. Advanced data mining techniques such as classification can help in the early prediction of this disease and reduce its incidence and associated costs. The objective of the present study is to derive a model for the early detection of the type of kidney stone and the most influential parameters with the aim of providing a decision-support system. Information was collected from 936 patients with nephrolithiasis at the kidney center of the Razi Hospital in Rasht from 2012 through 2016. The prepared dataset included 42 features. Data pre-processing was the first step toward extracting the relevant features. The collected data was analyzed with Weka software, and various data mining models were used to prepare a predictive model. Various data mining algorithms such as the Bayesian model, different types of Decision Trees, Artificial Neural Networks, and Rule-based classifiers were used in these models. We also proposed four models based on ensemble learning to improve the accuracy of each learning algorithm. In addition, a novel technique for combining individual classifiers in ensemble learning was proposed. In this technique, for each individual classifier, a weight is assigned based on our proposed genetic algorithm based method. The generated knowledge was evaluated using a 10-fold cross-validation technique based on standard measures. However, the assessment of each feature for building a predictive model was another significant challenge. The predictive strength of each feature for creating a reproducible outcome was also investigated. Regarding the applied models, parameters such as sex, acid uric condition, calcium level, hypertension, diabetes, nausea and vomiting, flank pain, and urinary tract infection (UTI) were the most vital parameters for predicting the chance of nephrolithiasis. 
The final ensemble-based model (with an accuracy of 97.1%) was a robust one and could be safely applied to future studies to predict the chances of developing nephrolithiasis. This model provides a novel way to study stone disease by deciphering the complex interaction among different biological variables, thus helping in an early identification and reduction in diagnosis time. Copyright © 2017 Elsevier B.V. All rights reserved.
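The combination step described in this abstract, where each individual classifier carries a weight, can be illustrated with a weighted vote. The sketch assumes the weights are already available; the genetic algorithm the authors use to find them is not reproduced, and the function name and example labels are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight.

    predictions: one predicted label per classifier
    weights:     one non-negative weight per classifier (assumed given)
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)
```

With equal weights this reduces to majority voting; unequal weights let one reliable classifier outvote several weaker ones, for example `weighted_vote(["A", "B", "B"], [0.6, 0.2, 0.3])` selects `"A"`.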
Tuyisenge, Viateur; Trebaul, Lena; Bhattacharjee, Manik; Chanteloup-Forêt, Blandine; Saubat-Guigui, Carole; Mîndruţă, Ioana; Rheims, Sylvain; Maillard, Louis; Kahane, Philippe; Taussig, Delphine; David, Olivier
2018-03-01
Intracranial electroencephalographic (iEEG) recordings contain "bad channels", which show non-neuronal signals. Here, we developed a new method that automatically detects iEEG bad channels using machine learning of seven signal features. The features quantified signals' variance, spatial-temporal correlation and nonlinear properties. Because the number of bad channels is usually much lower than the number of good channels, we implemented an ensemble bagging classifier known to be optimal in terms of stability and predictive accuracy for datasets with imbalanced class distributions. This method was applied to stereo-electroencephalographic (SEEG) signals recorded during low frequency stimulations performed in 206 patients from 5 clinical centers. We found that the classification accuracy was extremely good: it increased with the number of subjects used to train the classifier and reached a plateau at 99.77% for 110 subjects. The classification performance was thus not impacted by the multicentric nature of the data. The proposed method to automatically detect bad channels demonstrated convincing results and could be used on larger datasets for automatic quality control of iEEG data. This is the first method proposed to classify bad channels in iEEG and should improve data selection when reviewing iEEG signals. Copyright © 2017 International Federation of Clinical Neurophysiology. Published by Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Zhongqin, G.; Chen, Y.
2017-12-01
Quickly and automatically identifying the spatial distribution of landslides is essential for the prevention, mitigation and assessment of landslide hazards. It remains a challenging task owing to the complicated characteristics and vague boundaries of landslide areas in imagery. High resolution remote sensing images have multiple scales, complex spatial distributions and abundant features; object-oriented image classification methods can make full use of this information and thus effectively detect landslides after a hazard event. In this research we present a new semi-supervised workflow that takes advantage of recent object-oriented image analysis and machine learning algorithms to quickly locate landslides of different origins in some areas of southwest China. Besides a sequence of image segmentation, feature selection, object classification and error testing, this workflow combines feature selection and classifier selection in an ensemble. The features used in this study were normalized difference vegetation index (NDVI) change, textural features derived from gray level co-occurrence matrices (GLCM), spectral features, etc. The results show that this algorithm removes redundant features and that the classifiers are fully used. These improvements lead to higher accuracy in determining the shape of landslides in high resolution remote sensing images, and in particular to flexibility across different kinds of landslides.
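One of the features listed in this abstract, NDVI change, follows from the standard NDVI definition, (NIR - Red) / (NIR + Red). The sketch below works on single reflectance values for clarity; applying it per image object, and the function names themselves, are assumptions rather than details taken from the study.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel or object mean."""
    denominator = nir + red
    # Guard against division by zero for pixels with no signal.
    return (nir - red) / denominator if denominator else 0.0


def ndvi_change(before, after):
    """NDVI difference between two dates; inputs are (nir, red) pairs."""
    return ndvi(*after) - ndvi(*before)
```

A strongly negative change, such as from dense vegetation to bare soil, is the kind of signal that can indicate a fresh landslide scar.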
Jia, Cangzhi; Zuo, Yun; Zou, Quan; Hancock, John
2018-02-06
Protein O-GlcNAcylation (O-GlcNAc) is an important post-translational modification of serine (S)/threonine (T) residues that involves multiple molecular and cellular processes. Recent studies have suggested that abnormal O-GlcNAcylation causes many diseases, such as cancer and various neurodegenerative diseases. With the available protein O-GlcNAcylation sites experimentally verified, it is highly desired to develop automated methods to rapidly and effectively identify O-GlcNAcylation sites. Although some computational methods have been proposed, their performance has been unsatisfactory, particularly in terms of prediction sensitivity. In this study, we developed an ensemble model O-GlcNAcPRED-II to identify potential O-GlcNAcylation sites. A K-means principal component analysis oversampling technique (KPCA) and fuzzy undersampling method (FUS) were first proposed and incorporated to reduce the proportion of the original positive and negative training samples. Then, rotation forest, a type of classifier-integrated system, was adopted to divide the eight types of feature space into several subsets using four sub-classifiers: random forest, k-nearest neighbour, naive Bayesian and support vector machine. We observed that O-GlcNAcPRED-II achieved a sensitivity of 81.05%, specificity of 95.91%, accuracy of 91.43% and Matthews correlation coefficient of 0.7928 for five-fold cross-validation run 10 times. Additionally, the results obtained by O-GlcNAcPRED-II on two independent datasets also indicated that the proposed predictor outperformed five published prediction tools. http://121.42.167.206/OGlcPred/. cangzhijia@dlmu.edu.cn or zouquan@nclab.net. © The Author (2018). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
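The four measures reported in this abstract all derive from the binary confusion matrix. The following sketch shows the standard formulas; the function name, argument order and the example counts in the note are illustrative and are not the study's data.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For hypothetical counts of 8 true positives, 2 false negatives, 9 true negatives and 1 false positive, this gives a sensitivity of 0.8, a specificity of 0.9 and an MCC of about 0.70. MCC is often preferred for imbalanced site-prediction data because it uses all four cells of the matrix.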
NASA Astrophysics Data System (ADS)
Ruske, S. T.; Topping, D. O.; Foot, V. E.; Kaye, P. H.; Stanley, W. R.; Morse, A. P.; Crawford, I.; Gallagher, M. W.
2016-12-01
Characterisation of bio-aerosols has important implications within the Environment and Public Health sectors. Recent developments in Ultra-Violet Light Induced Fluorescence (UV-LIF) detectors such as the Wideband Integrated bio-aerosol Spectrometer (WIBS) and the newly introduced Multiparameter bio-aerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen. This new generation of instruments has enabled ever-larger data sets to be compiled with the aim of studying more complex environments, yet the algorithms used for species classification remain largely unvalidated. It is therefore imperative that we validate the performance of different algorithms that can be used for the task of classification, which is the focus of this study. For unsupervised learning we test Hierarchical Agglomerative Clustering with various linkages. For supervised learning, ten methods were tested, including decision trees; ensemble methods (Random Forests, Gradient Boosting and AdaBoost); two implementations of support vector machines (libsvm and liblinear); Gaussian methods (Gaussian naïve Bayesian, quadratic and linear discriminant analysis); and finally the k-nearest neighbours algorithm. The methods were applied to two different data sets measured using a new Multiparameter bio-aerosol Spectrometer. We find that clustering, in general, performs slightly worse than the supervised learning methods, correctly classifying, at best, only 72.7 and 91.1 percent for the two data sets. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 88.1 and 97.8 percent of the testing data respectively across the two data sets. We discuss the wider relevance of these results with regards to challenging existing classification in real-world environments.
Effective Feature Selection for Classification of Promoter Sequences.
K, Kouser; P G, Lavanya; Rangarajan, Lalitha; K, Acharya Kshitish
2016-01-01
Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.
Vector coding of wavelet-transformed images
NASA Astrophysics Data System (ADS)
Zhou, Jun; Zhi, Cheng; Zhou, Yuanhua
1998-09-01
Wavelets, as a new tool in signal processing, have gained broad recognition. Using the wavelet transform, we can obtain octave-divided frequency bands with specific orientations, which match the properties of the Human Visual System well. In this paper, we discuss a classified vector quantization method for multiresolution image representations.
NASA Astrophysics Data System (ADS)
Pham, Binh Thai; Prakash, Indra; Tien Bui, Dieu
2018-02-01
A hybrid machine learning approach of Random Subspace (RSS) and Classification And Regression Trees (CART) is proposed to develop a model named RSSCART for spatial prediction of landslides. This model combines the RSS method, which is known as an efficient ensemble technique, with CART, which is a state-of-the-art classifier. The Luc Yen district of Yen Bai province, a prominent landslide-prone area of Viet Nam, was selected for the model development. Performance of the RSSCART model was evaluated through the Receiver Operating Characteristic (ROC) curve, statistical analysis methods, and the Chi Square test. Results were compared with other benchmark landslide models, namely Support Vector Machines (SVM), single CART, Naïve Bayes Trees (NBT), and Logistic Regression (LR). In developing the model, ten important landslide-affecting factors related to geomorphology, geology and geo-environment were considered, including slope angles, elevation, slope aspect, curvature, lithology, distance to faults, distance to rivers, distance to roads, and rainfall. The performance of the RSSCART model (AUC = 0.841) is the best compared with the other popular landslide models, namely SVM (0.835), single CART (0.822), NBT (0.821), and LR (0.723). These results indicate that RSSCART is a promising method for spatial landslide prediction.
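The AUC values quoted above summarise the ROC curve as a single number. One standard way to compute it is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise definition; the function name `roc_auc` and the toy scores are illustrative assumptions, not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels (1 = positive, e.g. landslide).
    scores: iterable of predicted scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The double loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation would be preferable for large data sets.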
Predicting human liver microsomal stability with machine learning techniques.
Sakiyama, Yojiro; Yuki, Hitomi; Moriya, Takashi; Hattori, Kazunari; Suzuki, Misaki; Shimada, Kaoru; Honma, Teruki
2008-02-01
To ensure a continuing pipeline in pharmaceutical research, lead candidates must possess appropriate metabolic stability in the drug discovery process. In vitro ADMET (absorption, distribution, metabolism, elimination, and toxicity) screening provides us with useful information regarding the metabolic stability of compounds. However, before the synthesis stage, an efficient process is required in order to deal with the vast quantity of data from large compound libraries and high-throughput screening. Here we have derived a relationship between the chemical structure and its metabolic stability for a data set of in-house compounds by means of various in silico machine learning such as random forest, support vector machine (SVM), logistic regression, and recursive partitioning. For model building, 1952 proprietary compounds comprising two classes (stable/unstable) were used with 193 descriptors calculated by Molecular Operating Environment. The results using test compounds have demonstrated that all classifiers yielded satisfactory results (accuracy > 0.8, sensitivity > 0.9, specificity > 0.6, and precision > 0.8). Above all, classification by random forest as well as SVM yielded kappa values of approximately 0.7 in an independent validation set, slightly higher than other classification tools. These results suggest that nonlinear/ensemble-based classification methods might prove useful in the area of in silico ADME modeling.
NASA Astrophysics Data System (ADS)
Park, Bosoon; Windham, William R.; Ladely, Scott R.; Gurram, Prudhvi; Kwon, Heesung; Yoon, Seung-Chul; Lawrence, Kurt C.; Narang, Neelam; Cray, William C.
2012-05-01
Non-O157:H7 Shiga toxin-producing Escherichia coli (STEC) strains such as O26, O45, O103, O111, O121 and O145 are recognized as serious causes of outbreaks of human illness due to their toxicity. The conventional microbiological method of cell counting is laborious and takes a long time to produce results. Since optical detection methods are promising for real-time, in-situ foodborne pathogen detection, an acousto-optical tunable filter (AOTF)-based hyperspectral microscopic imaging (HMI) method has been developed for identifying pathogenic bacteria because of its capability to differentiate both spatial and spectral characteristics of each bacterial cell from microcolony samples. Using the AOTF-based HMI method, 89 contiguous spectral images could be acquired within approximately 30 seconds with 250 ms exposure time. In this study, we have successfully developed a protocol for live-cell immobilization on glass slides to acquire quality spectral images from STEC bacterial cells using the modified dry method. Among the contiguous spectral imagery between 450 and 800 nm, the intensities of spectral images at 458, 498, 522, 546, 570, 586, 670 and 690 nm were distinctive for STEC bacteria. With two different classification algorithms, Support Vector Machine (SVM) and Sparse Kernel-based Ensemble Learning (SKEL), STEC serotype O45 could be classified with 92% detection accuracy.
A Prototype SSVEP Based Real Time BCI Gaming System
Martišius, Ignas
2016-01-01
Although brain-computer interface technology is mainly designed with disabled people in mind, it can also be beneficial to healthy subjects, for example, in gaming or virtual reality systems. In this paper we discuss the typical architecture, paradigms, requirements, and limitations of electroencephalogram-based gaming systems. We have developed a prototype three-class brain-computer interface system, based on the steady state visually evoked potentials paradigm and the Emotiv EPOC headset. An online target shooting game, implemented in the OpenViBE environment, has been used for user feedback. The system utilizes wave atom transform for feature extraction, achieving an average accuracy of 78.2% using linear discriminant analysis classifier, 79.3% using support vector machine classifier with a linear kernel, and 80.5% using a support vector machine classifier with a radial basis function kernel. PMID:27051414
A Prototype SSVEP Based Real Time BCI Gaming System.
Martišius, Ignas; Damaševičius, Robertas
2016-01-01
Although brain-computer interface technology is mainly designed with disabled people in mind, it can also be beneficial to healthy subjects, for example, in gaming or virtual reality systems. In this paper we discuss the typical architecture, paradigms, requirements, and limitations of electroencephalogram-based gaming systems. We have developed a prototype three-class brain-computer interface system, based on the steady state visually evoked potentials paradigm and the Emotiv EPOC headset. An online target shooting game, implemented in the OpenViBE environment, has been used for user feedback. The system utilizes wave atom transform for feature extraction, achieving an average accuracy of 78.2% using linear discriminant analysis classifier, 79.3% using support vector machine classifier with a linear kernel, and 80.5% using a support vector machine classifier with a radial basis function kernel.
Ensemble Simulation of the Atmospheric Radionuclides Discharged by the Fukushima Nuclear Accident
NASA Astrophysics Data System (ADS)
Sekiyama, Thomas; Kajino, Mizuo; Kunii, Masaru
2013-04-01
Enormous amounts of radionuclides were discharged into the atmosphere by a nuclear accident at the Fukushima Daiichi nuclear power plant (FDNPP) after the earthquake and tsunami on 11 March 2011. The radionuclides were dispersed from the power plant and deposited mainly over eastern Japan and the North Pacific Ocean. Many numerical simulations of the radionuclide dispersion and deposition have been attempted since the nuclear accident. However, none of them were able to perfectly simulate the distribution of dose rates observed after the accident over eastern Japan. This was partly due to the error of the wind vectors and precipitation used in the numerical simulations; unfortunately, their deterministic simulations could not deal with the probability distribution of the simulation results and errors. Therefore, an ensemble simulation of the atmospheric radionuclides was performed using the ensemble Kalman filter (EnKF) data assimilation system coupled with the Japan Meteorological Agency (JMA) non-hydrostatic mesoscale model (NHM); this mesoscale model has been used operationally for daily weather forecasts by JMA. Meteorological observations were provided to the EnKF data assimilation system from the JMA operational-weather-forecast dataset. Through this ensemble data assimilation, twenty members of the meteorological analysis over eastern Japan from 11 to 31 March 2011 were successfully obtained. Using these meteorological ensemble analysis members, the radionuclide behavior in the atmosphere such as advection, convection, diffusion, dry deposition, and wet deposition was simulated. This ensemble simulation provided multiple results of the radionuclide dispersion and distribution. Because a large ensemble deviation indicates low accuracy of the numerical simulation, probabilistic information is obtainable from the ensemble simulation results.
For example, the uncertainty of precipitation triggered the uncertainty of wet deposition; the uncertainty of wet deposition triggered the uncertainty of atmospheric radionuclide amounts. Then the remaining radionuclides were transported downwind; consequently the uncertainty signal of the radionuclide amounts was propagated downwind. The signal propagation was seen in the ensemble simulation by tracking the large deviation areas of radionuclide concentration and deposition. These statistics are able to provide information useful for the probabilistic prediction of radionuclides.
Marafino, Ben J; Davies, Jason M; Bardach, Naomi S; Dean, Mitzi L; Dudley, R Adams
2014-01-01
Existing risk adjustment models for intensive care unit (ICU) outcomes rely on manual abstraction of patient-level predictors from medical charts. Developing an automated method for abstracting these data from free text might reduce cost and data collection times. To develop a support vector machine (SVM) classifier capable of identifying a range of procedures and diagnoses in ICU clinical notes for use in risk adjustment. We selected notes from 2001-2008 for 4191 neonatal ICU (NICU) and 2198 adult ICU patients from the MIMIC-II database from the Beth Israel Deaconess Medical Center. Using these notes, we developed an implementation of the SVM classifier to identify procedures (mechanical ventilation and phototherapy in NICU notes) and diagnoses (jaundice in NICU and intracranial hemorrhage (ICH) in adult ICU). On the jaundice classification task, we also compared classifier performance using n-gram features to unigrams with application of a negation algorithm (NegEx). Our classifier accurately identified mechanical ventilation (accuracy=0.982, F1=0.954) and phototherapy use (accuracy=0.940, F1=0.912), as well as jaundice (accuracy=0.898, F1=0.884) and ICH diagnoses (accuracy=0.938, F1=0.943). Including bigram features improved performance on the jaundice (accuracy=0.898 vs 0.865) and ICH (0.938 vs 0.927) tasks, and outperformed NegEx-derived unigram features (accuracy=0.898 vs 0.863) on the jaundice task. Overall, a classifier using n-gram support vectors displayed excellent performance characteristics. The classifier generalizes to diverse patient populations, diagnoses, and procedures. SVM-based classifiers can accurately identify procedure status and diagnoses among ICU patients, and including n-gram features improves performance, compared to existing methods. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
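The gain from bigram features reported above is easiest to see on a negated phrase: a unigram model sees "jaundice" whether or not it is preceded by "no", whereas a bigram keeps the two words together. The following is a minimal sketch of unigram plus bigram counting; the tokenisation rule, the function name `ngram_counts` and the sample note are illustrative assumptions and not the authors' pipeline.

```python
import re
from collections import Counter


def ngram_counts(text, n_max=2):
    """Count all word n-grams of length 1..n_max in a clinical note.

    Tokenisation is deliberately simple (lower-cased alphanumeric runs);
    a real system would need a more careful tokenizer.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = ngram_counts("No jaundice noted")
print(sorted(features))
# The bigram "no jaundice" is a separate feature from the unigram "jaundice".
```

Each distinct n-gram would then become one dimension of the feature vector passed to the SVM.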
Protein Sequence Classification with Improved Extreme Learning Machine Algorithms
2014-01-01
Precisely classifying a protein sequence from a large biological protein sequence database plays an important role in developing competitive pharmacological products. Conventional methods, which compare the unseen sequence with all identified protein sequences and return the category index of the protein with the highest similarity score, are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single hidden layer feedforward neural networks (SLFNs). The recent efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimal pruned ELM (OP-ELM) is first employed for protein sequence classification in this paper. To further enhance the performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms. PMID:24795876
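The majority voting step described above can be sketched independently of how the individual networks are trained. In the example below each ensemble member is represented only by its list of predicted category indices; the function name `majority_vote` and the toy predictions are illustrative assumptions, and the tie-breaking rule is a choice made here, not one stated in the abstract.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier predictions by majority voting.

    predictions: list with one entry per ensemble member, each entry being
    a list of predicted labels of the same length (one label per sample).
    Ties are broken in favour of the label that was voted for first.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all members must predict the same number of samples")
    voted = []
    for i in range(n_samples):
        votes = Counter(member[i] for member in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Three hypothetical ensemble members predicting three samples.
members = [
    [0, 1, 2],
    [0, 1, 1],
    [1, 1, 2],
]
print(majority_vote(members))  # [0, 1, 2]
```

Using an odd number of members reduces the chance of ties in two-class problems.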
Real-data comparison of data mining methods in prediction of diabetes in Iran.
Tapak, Lily; Mahjub, Hossein; Hamidi, Omid; Poorolajal, Jalal
2013-09-01
Diabetes is one of the most common non-communicable diseases in developing countries. Early screening and diagnosis play an important role in effective prevention strategies. This study compared two traditional classification methods (logistic regression and Fisher linear discriminant analysis) and four machine-learning classifiers (neural networks, support vector machines, fuzzy c-mean, and random forests) to classify persons with and without diabetes. The data set used in this study included 6,500 subjects from the Iranian national non-communicable diseases risk factors surveillance obtained through a cross-sectional survey. The obtained sample was based on cluster sampling of the Iran population which was conducted in 2005-2009 to assess the prevalence of major non-communicable disease risk factors. Ten risk factors that are commonly associated with diabetes were selected to compare the performance of six classifiers in terms of sensitivity, specificity, total accuracy, and area under the receiver operating characteristic (ROC) curve criteria. Support vector machines showed the highest total accuracy (0.986) as well as area under the ROC (0.979). Also, this method showed high specificity (1.000) and sensitivity (0.820). All other methods produced total accuracy of more than 85%, but for all methods, the sensitivity values were very low (less than 0.350). The results of this study indicate that, in terms of sensitivity, specificity, and overall classification accuracy, the support vector machine model ranks first among all the classifiers tested in the prediction of diabetes. Therefore, this approach is a promising classifier for predicting diabetes, and it should be further investigated for the prediction of other diseases.
NASA Technical Reports Server (NTRS)
Fronzek, Stefan; Pirttioja, Nina; Carter, Timothy R.; Bindi, Marco; Hoffmann, Holger; Palosuo, Taru; Ruiz-Ramos, Margarita; Tao, Fulu; Trnka, Miroslav; Acutis, Marco;
2017-01-01
Crop growth simulation models can differ greatly in their treatment of key processes and hence in their response to environmental conditions. Here, we used an ensemble of 26 process-based wheat models applied at sites across a European transect to compare their sensitivity to changes in temperature (minus 2 to plus 9 degrees Centigrade) and precipitation (minus 50 to plus 50 percent). Model results were analysed by plotting them as impact response surfaces (IRSs), classifying the IRS patterns of individual model simulations, describing these classes and analysing factors that may explain the major differences in model responses. The model ensemble was used to simulate yields of winter and spring wheat at four sites in Finland, Germany and Spain. Results were plotted as IRSs that show changes in yields relative to the baseline with respect to temperature and precipitation. IRSs of 30-year means and selected extreme years were classified using two approaches describing their pattern. The expert diagnostic approach (EDA) combines two aspects of IRS patterns: location of the maximum yield (nine classes) and strength of the yield response with respect to climate (four classes), resulting in a total of 36 combined classes defined using criteria pre-specified by experts. The statistical diagnostic approach (SDA) groups IRSs by comparing their pattern and magnitude, without attempting to interpret these features. It applies a hierarchical clustering method, grouping response patterns using a distance metric that combines the spatial correlation and Euclidian distance between IRS pairs. The two approaches were used to investigate whether different patterns of yield response could be related to different properties of the crop models, specifically their genealogy, calibration and process description. 
Although no single model property across a large model ensemble was found to explain the integrated yield response to temperature and precipitation perturbations, the application of the EDA and SDA approaches revealed their capability to distinguish: (i) stronger yield responses to precipitation for winter wheat than spring wheat; (ii) differing strengths of response to climate changes for years with anomalous weather conditions compared to period-average conditions; (iii) the influence of site conditions on yield patterns; (iv) similarities in IRS patterns among models with related genealogy; (v) similarities in IRS patterns for models with simpler process descriptions of root growth and water uptake compared to those with more complex descriptions; and (vi) a closer correspondence of IRS patterns in models using partitioning schemes to represent yield formation than in those using a harvest index. Such results can inform future crop modelling studies that seek to exploit the diversity of multi-model ensembles, by distinguishing ensemble members that span a wide range of responses as well as those that display implausible behaviour or strong mutual similarities.
Papanikolaou, Yannis; Tsoumakas, Grigorios; Laliotis, Manos; Markantonatos, Nikos; Vlahavas, Ioannis
2017-09-22
In this paper we present the approach that we employed to deal with large-scale multi-label semantic indexing of biomedical papers. This work was mainly implemented within the context of the BioASQ challenge (2013-2017), a challenge concerned with biomedical semantic indexing and question answering. Our main contribution is a MUlti-Label Ensemble method (MULE) that incorporates a McNemar statistical significance test in order to validate the combination of the constituent machine learning algorithms. Some secondary contributions include a study on the temporal aspects of the BioASQ corpus (observations apply also to BioASQ's super-set, the PubMed articles collection) and the proper parametrization of the algorithms used to deal with this challenging classification task. The ensemble method that we developed is compared to other approaches in experimental scenarios with subsets of the BioASQ corpus, giving positive results. In our participation in the BioASQ challenge we obtained first place in 2013 and second place in the four following years, steadily outperforming MTI, the indexing system of the National Library of Medicine (NLM). The results of our experimental comparisons suggest that employing a statistical significance test to validate the ensemble method's choices is the optimal approach for ensembling multi-label classifiers, especially in contexts with many rare labels.
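A McNemar test compares two classifiers on the same samples by looking only at the samples on which they disagree. The sketch below shows the common chi-square form with a continuity correction for a single binary label; how MULE applies the test per label is not described in the abstract, so the function name `mcnemar`, the clamping of the corrected difference at zero and the toy data are all illustrative assumptions.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """McNemar test with continuity correction for two classifiers.

    Returns (statistic, p_value). b counts samples that only classifier A
    gets right, c counts samples that only classifier B gets right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0                      # the classifiers never disagree
    # Corrected difference, clamped at zero (an implementation choice).
    diff = max(abs(b - c) - 1, 0)
    stat = diff ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data: A alone is right on 10 samples, B alone on 5, both on 5.
y = [1] * 20
a = [1] * 15 + [0] * 5
b = [0] * 10 + [1] * 10
stat, p = mcnemar(y, a, b)
print(f"statistic={stat:.4f} p={p:.4f}")
```

A large p-value means the observed difference between the two classifiers could plausibly be chance, which argues against preferring one over the other for that label.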
Adaptive Bayes classifiers for remotely sensed data
NASA Technical Reports Server (NTRS)
Raulston, H. S.; Pace, M. O.; Gonzalez, R. C.
1975-01-01
An algorithm is developed for a learning, adaptive, statistical pattern classifier for remotely sensed data. The estimation procedure consists of two steps: (1) an optimal stochastic approximation of the parameters of interest, and (2) a projection of the parameters in time and space. The results reported are for Gaussian data in which the mean vector of each class may vary with time or position after the classifier is trained.
Generative Models for Similarity-based Classification
2007-01-01
NC), local nearest centroid (local NC), k-nearest neighbors (kNN), and condensed nearest neighbors (CNN) are all similarity-based classifiers which...vector machine to the k nearest neighbors of the test sample [80]. The SVM-KNN method was developed to address the robustness and dimensionality...concerns that afflict nearest neighbors and SVMs. Similarly to the nearest-means classifier, the SVM-KNN is a hybrid local and global classifier developed
SQUEEZE-E: The Optimal Solution for Molecular Simulations with Periodic Boundary Conditions.
Wassenaar, Tsjerk A; de Vries, Sjoerd; Bonvin, Alexandre M J J; Bekker, Henk
2012-10-09
In molecular simulations of macromolecules, it is desirable to limit the amount of solvent in the system to avoid spending computational resources on uninteresting solvent-solvent interactions. As a consequence, periodic boundary conditions are commonly used, with a simulation box chosen as small as possible, for a given minimal distance between images. Here, we describe how such a simulation cell can be set up for ensembles, taking into account a priori available or estimable information regarding conformational flexibility. Doing so ensures that any conformation present in the input ensemble will satisfy the distance criterion during the simulation. This helps avoid periodicity artifacts due to conformational changes. The method introduces three new approaches in computational geometry: (1) The first is the derivation of an optimal packing of ensembles, for which the mathematical framework is described. (2) A new method for approximating the α-hull and the contact body for single bodies and ensembles is presented, which is orders of magnitude faster than existing routines, allowing the calculation of packings of large ensembles and/or large bodies. (3) A routine is described for searching a combination of three vectors on a discretized contact body forming a reduced base for a lattice with minimal cell volume. The new algorithms reduce the time required to calculate packings of single bodies from minutes or hours to seconds. The use and efficacy of the method is demonstrated for ensembles obtained from NMR, MD simulations, and elastic network modeling. An implementation of the method has been made available online at http://haddock.chem.uu.nl/services/SQUEEZE/ and has been made available as an option for running simulations through the weNMR GRID MD server at http://haddock.science.uu.nl/enmr/services/GROMACS/main.php .
Minimum distance classification in remote sensing
NASA Technical Reports Server (NTRS)
Wacker, A. G.; Landgrebe, D. A.
1972-01-01
The utilization of minimum distance classification methods in remote sensing problems, such as crop species identification, is considered. Literature concerning both minimum distance classification problems and distance measures is reviewed. Experimental results are presented for several examples. The objective of these examples is to: (a) compare the sample classification accuracy of a minimum distance classifier, with the vector classification accuracy of a maximum likelihood classifier, and (b) compare the accuracy of a parametric minimum distance classifier with that of a nonparametric one. Results show the minimum distance classifier performance is 5% to 10% better than that of the maximum likelihood classifier. The nonparametric classifier is only slightly better than the parametric version.
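A parametric minimum distance classifier of the kind discussed above can be sketched as a nearest-mean rule: each class is summarised by the mean of its training vectors, and a new vector is assigned to the class whose mean is closest. The example below uses Euclidean distance, which is only one of the distance measures such a review would cover; the function names and the toy two-band samples are illustrative assumptions.

```python
import math


def fit_class_means(samples, labels):
    """Compute the mean feature vector of every class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def classify(x, means):
    """Assign x to the class with the nearest mean (Euclidean distance)."""
    return min(means, key=lambda y: math.dist(x, means[y]))


# Hypothetical two-band reflectance samples for two crop classes.
train_x = [(0, 0), (0, 2), (10, 10), (12, 10)]
train_y = ["wheat", "wheat", "corn", "corn"]
means = fit_class_means(train_x, train_y)
print(classify((1, 1), means), classify((9, 9), means))  # wheat corn
```

Unlike a maximum likelihood classifier, this rule ignores class covariances, which is what makes it cheap to train and evaluate.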
Research on Classification of Chinese Text Data Based on SVM
NASA Astrophysics Data System (ADS)
Lin, Yuan; Yu, Hongzhi; Wan, Fucheng; Xu, Tao
2017-09-01
Data mining has important application value in today’s industry and academia. Text classification is a very important technology in data mining. At present, there are many mature algorithms for text classification: KNN, NB, AB, SVM, decision trees and other classification methods all show good classification performance. The Support Vector Machine (SVM) is a well-performing classifier in machine learning research. This paper studies the classification performance of the SVM method on Chinese text data, applying the support vector machine to classify Chinese text and to combine academic research with practical application.
Human action recognition with group lasso regularized-support vector machine
NASA Astrophysics Data System (ADS)
Luo, Huiwu; Lu, Huanzhang; Wu, Yabei; Zhao, Fei
2016-05-01
The bag-of-visual-words (BOVW) and Fisher kernel are two popular models in human action recognition, and support vector machine (SVM) is the most commonly used classifier for the two models. We show two kinds of group structures in the feature representation constructed by BOVW and Fisher kernel, respectively, since the structural information of feature representation can be seen as a prior for the classifier and can improve the performance of the classifier, which has been verified in several areas. However, the standard SVM employs L2-norm regularization in its learning procedure, which penalizes each variable individually and cannot express the structural information of feature representation. We replace the L2-norm regularization with group lasso regularization in standard SVM, and a group lasso regularized-support vector machine (GLRSVM) is proposed. Then, we embed the group structural information of feature representation into GLRSVM. Finally, we introduce an algorithm to solve the optimization problem of GLRSVM by alternating directions method of multipliers. The experiments evaluated on KTH, YouTube, and Hollywood2 datasets show that our method achieves promising results and improves the state-of-the-art methods on KTH and YouTube datasets.
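The change of regulariser described above can be written out explicitly. The following is a sketch under stated assumptions: the hinge loss, the trade-off parameter \lambda and the unweighted sum over groups are the textbook forms and are not taken from the paper, which may use group weights or a different loss. Here w_g denotes the sub-vector of w belonging to feature group g, out of G groups.

```latex
% Standard soft-margin SVM: squared L2 penalty on the whole weight vector
\min_{w,\,b}\ \frac{\lambda}{2}\,\lVert w \rVert_2^2
  + \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i\,(w^\top x_i + b)\bigr)

% Group lasso regularized SVM: sum of unsquared L2 norms, one per group
\min_{w,\,b}\ \lambda \sum_{g=1}^{G} \lVert w_g \rVert_2
  + \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i\,(w^\top x_i + b)\bigr)
```

Because the group norms are not squared, the penalty is non-smooth at w_g = 0 and can drive entire groups to zero together, which is how the group structure of the feature representation enters the classifier. The non-smoothness is also why a splitting method such as the alternating directions method of multipliers is a natural solver.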
A hybrid method for classifying cognitive states from fMRI data.
Parida, S; Dehuri, S; Cho, S-B; Cacha, L A; Poznanski, R R
2015-09-01
Functional magnetic resonance imaging (fMRI) makes it possible to detect brain activities in order to elucidate cognitive-states. The complex nature of fMRI data requires understanding of the analyses applied to produce possible avenues for developing models of cognitive state classification and improving brain activity prediction. While many models of classification task of fMRI data analysis have been developed, in this paper, we present a novel hybrid technique through combining the best attributes of genetic algorithms (GAs) and ensemble decision tree technique that consistently outperforms all other methods which are being used for cognitive-state classification. Specifically, this paper illustrates the combined effort of decision-trees ensemble and GAs for feature selection through an extensive simulation study and discusses the classification performance with respect to fMRI data. We have shown that our proposed method exhibits significant reduction of the number of features with clear edge classification accuracy over ensemble of decision-trees.
Thanh Noi, Phan; Kappas, Martin
2017-01-01
In previous classification studies, three non-parametric classifiers, Random Forest (RF), k-Nearest Neighbor (kNN), and Support Vector Machine (SVM), were reported as the foremost classifiers at producing high accuracies. However, only a few studies have compared the performances of these classifiers with different training sample sizes for the same remote sensing images, particularly the Sentinel-2 Multispectral Imager (MSI). In this study, we examined and compared the performances of the RF, kNN, and SVM classifiers for land use/cover classification using Sentinel-2 image data. An area of 30 × 30 km² within the Red River Delta of Vietnam with six land use/cover types was classified using 14 different training sample sizes, including balanced and imbalanced, from 50 to over 1250 pixels/class. All classification results showed a high overall accuracy (OA) ranging from 90% to 95%. Among the three classifiers and 14 sub-datasets, SVM produced the highest OA with the least sensitivity to the training sample sizes, followed consecutively by RF and kNN. In relation to the sample size, all three classifiers showed a similar and high OA (over 93.85%) when the training sample size was large enough, i.e., greater than 750 pixels/class or representing an area of approximately 0.25% of the total study area. The high accuracy was achieved with both imbalanced and balanced datasets. PMID:29271909
Thanh Noi, Phan; Kappas, Martin
2017-12-22
In previous classification studies, three non-parametric classifiers, Random Forest (RF), k-Nearest Neighbor (kNN), and Support Vector Machine (SVM), were reported as the foremost classifiers at producing high accuracies. However, only a few studies have compared the performances of these classifiers with different training sample sizes for the same remote sensing images, particularly the Sentinel-2 Multispectral Imager (MSI). In this study, we examined and compared the performances of the RF, kNN, and SVM classifiers for land use/cover classification using Sentinel-2 image data. An area of 30 × 30 km² within the Red River Delta of Vietnam with six land use/cover types was classified using 14 different training sample sizes, including balanced and imbalanced, from 50 to over 1250 pixels/class. All classification results showed a high overall accuracy (OA) ranging from 90% to 95%. Among the three classifiers and 14 sub-datasets, SVM produced the highest OA with the least sensitivity to the training sample sizes, followed consecutively by RF and kNN. In relation to the sample size, all three classifiers showed a similar and high OA (over 93.85%) when the training sample size was large enough, i.e., greater than 750 pixels/class or representing an area of approximately 0.25% of the total study area. The high accuracy was achieved with both imbalanced and balanced datasets.
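The experimental design described above, training one classifier on sets of increasing size and scoring each against a fixed test set, can be sketched as follows. This is an illustration only: it uses synthetic 2-D Gaussian "pixels" for three hypothetical classes rather than Sentinel-2 data, covers kNN alone, and the sample sizes, class centres and function names are assumptions made for the example.

```python
import random
from collections import Counter


def make_samples(rng, n_per_class, centers, spread=1.0):
    """Draw n_per_class synthetic 2-D points around each class centre."""
    samples = []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            samples.append(((rng.gauss(cx, spread), rng.gauss(cy, spread)), label))
    return samples


def knn_predict(train, point, k=5):
    """Majority vote among the k training samples nearest to point."""
    nearest = sorted(
        train,
        key=lambda s: (s[0][0] - point[0]) ** 2 + (s[0][1] - point[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def overall_accuracy(train, test, k=5):
    """Fraction of test samples whose predicted class matches the true class."""
    hits = sum(knn_predict(train, p, k) == y for p, y in test)
    return hits / len(test)


rng = random.Random(42)
centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]  # hypothetical class centres
test_set = make_samples(rng, 50, centers)        # fixed test set for all runs
sizes = [5, 20, 80]                              # training samples per class

oa_by_size = {
    n: overall_accuracy(make_samples(rng, n, centers), test_set) for n in sizes
}
for n, oa in oa_by_size.items():
    print(f"{n:>3} samples/class -> OA = {oa:.3f}")
```

Holding the test set fixed while only the training size varies is what makes the accuracies comparable across runs.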
Discontinuity Detection in the Shield Metal Arc Welding Process
Cocota, José Alberto Naves; Garcia, Gabriel Carvalho; da Costa, Adilson Rodrigues; de Lima, Milton Sérgio Fernandes; Rocha, Filipe Augusto Santos; Freitas, Gustavo Medeiros
2017-01-01
This work proposes a new methodology for the detection of discontinuities in the weld bead applied in Shielded Metal Arc Welding (SMAW) processes. The detection system is based on two sensors—a microphone and piezoelectric—that acquire acoustic emissions generated during the welding. The feature vectors extracted from the sensor dataset are used to construct classifier models. The approaches based on Artificial Neural Network (ANN) and Support Vector Machine (SVM) classifiers are able to identify with a high accuracy the three proposed weld bead classes: desirable weld bead, shrinkage cavity and burn through discontinuities. Experimental results illustrate the system’s high accuracy, greater than 90% for each class. A novel Hierarchical Support Vector Machine (HSVM) structure is proposed to make feasible the use of this system in industrial environments. This approach presented 96.6% overall accuracy. Given the simplicity of the equipment involved, this system can be applied in the metal transformation industries. PMID:28489045
Discontinuity Detection in the Shield Metal Arc Welding Process.
Cocota, José Alberto Naves; Garcia, Gabriel Carvalho; da Costa, Adilson Rodrigues; de Lima, Milton Sérgio Fernandes; Rocha, Filipe Augusto Santos; Freitas, Gustavo Medeiros
2017-05-10
This work proposes a new methodology for the detection of discontinuities in the weld bead applied in Shielded Metal Arc Welding (SMAW) processes. The detection system is based on two sensors (a microphone and a piezoelectric sensor) that acquire acoustic emissions generated during the welding. The feature vectors extracted from the sensor dataset are used to construct classifier models. The approaches based on Artificial Neural Network (ANN) and Support Vector Machine (SVM) classifiers are able to identify with a high accuracy the three proposed weld bead classes: desirable weld bead, shrinkage cavity and burn through discontinuities. Experimental results illustrate the system's high accuracy, greater than 90% for each class. A novel Hierarchical Support Vector Machine (HSVM) structure is proposed to make feasible the use of this system in industrial environments. This approach presented 96.6% overall accuracy. Given the simplicity of the equipment involved, this system can be applied in the metal transformation industries.
Spatial and spectral analysis of corneal epithelium injury using hyperspectral images
NASA Astrophysics Data System (ADS)
Md Noor, Siti Salwa; Michael, Kaleena; Marshall, Stephen; Ren, Jinchang
2017-12-01
Eye assessment is essential in preventing blindness. Currently, the existing methods to assess corneal epithelium injury are complex and require expert knowledge. Hence, we have introduced a non-invasive technique using hyperspectral imaging (HSI) and an image analysis algorithm of corneal epithelium injury. Three groups of images were compared and analyzed, namely healthy eyes, injured eyes, and injured eyes with stain. Dimensionality reduction using principal component analysis (PCA) was applied to reduce massive data and redundancies. The first 10 principal components (PCs) were selected for further processing. The mean vector of 10 PCs with 45 pairs of all combinations was computed and sent to two classifiers. A quadratic Bayes normal classifier (QDC) and a support vector classifier (SVC) were used in this study to discriminate the eleven eyes into three groups. As a result, the combined classifier of QDC and SVC showed optimal performance with 2D PCA features (2DPCA-QDSVC) and was utilized to classify normal and abnormal tissues, using color image segmentation. The result was compared with human segmentation. The outcome showed that the proposed algorithm produced extremely promising results to assist the clinician in quantifying a cornea injury.
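The figure of 45 pairs presumably comes from taking every unordered pair among the 10 retained PCs, since C(10, 2) = 45. A short check of that arithmetic, with illustrative variable names that are not from the paper:

```python
from itertools import combinations

# Labels for the 10 retained principal components (illustrative numbering).
pcs = list(range(1, 11))

# Every unordered pair of distinct PCs: (1, 2), (1, 3), ..., (9, 10).
pairs = list(combinations(pcs, 2))

print(len(pairs))  # 45
```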
NASA Astrophysics Data System (ADS)
Ohba, Masamichi; Nohara, Daisuke; Kadokura, Shinji
2016-04-01
Severe storms or other extreme weather events can interrupt the rotation of wind turbines on a large scale, causing unexpected "wind ramp events". In this study, we present an application of self-organizing maps (SOMs) for climatological attribution of wind ramp events and their probabilistic prediction. The SOM is an automatic data-mining clustering technique that allows us to summarize a high-dimensional data space in terms of a set of reference vectors. The SOM is applied to analyze the relationship between atmospheric patterns over Japan and wind power generation. It is applied to sea level pressure derived from the JRA55 reanalysis over the target area (the Tohoku region of Japan), yielding a two-dimensional lattice of weather patterns (WPs) classified over the 1977-2013 period. For comparison with the atmospheric data, long-term wind power generation is reconstructed using the high-resolution surface observation network AMeDAS (Automated Meteorological Data Acquisition System) in Japan. Our analysis extracts seven typical WPs that are linked to frequent occurrences of wind ramp events. Probabilistic forecasts of wind power generation and ramps are produced using the obtained SOM. The probabilities are derived from multiple SOM lattices by matching the output of the TIGGE multi-model global forecast to the WPs on the lattices. Since this method effectively accounts for the empirical uncertainties in the historical data, wind power generation and ramps are forecast probabilistically from the forecasts of global models. The forecasts of wind power generation and ramp events show relatively good skill scores under the downscaling technique. The results of this study are expected to provide better guidance to the user community and to contribute to the future development of a system operation model for the transmission grid operator.
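To make the idea of summarizing data by a lattice of reference vectors concrete, here is a minimal SOM sketch. It is an illustration under stated assumptions, not the authors' configuration. Three synthetic 2-D clusters stand in for sea level pressure fields, the lattice is 2 × 3, and the linearly decaying learning rate and Gaussian neighbourhood are common textbook choices that the abstract does not specify.

```python
import math
import random


def best_matching_unit(nodes, x):
    """Return the lattice coordinate whose reference vector is closest to x."""
    return min(nodes, key=lambda k: sum((w - v) ** 2 for w, v in zip(nodes[k], x)))


def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a rows x cols SOM; returns {(row, col): reference vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise each lattice node with a copy of a randomly chosen sample.
    nodes = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.3)  # shrinking neighbourhood
            bmu = best_matching_unit(nodes, x)
            for (r, c), w in nodes.items():
                d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                # Pull the node toward x, more strongly near the BMU.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return nodes


# Synthetic stand-in for atmospheric fields: three clusters of 2-D points.
rng = random.Random(1)
data = [
    [rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)]
    for cx, cy in [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
    for _ in range(30)
]

nodes = train_som(data, rows=2, cols=3)

# Assign every sample to its best-matching unit, i.e. its "pattern".
counts = {}
for x in data:
    bmu = best_matching_unit(nodes, x)
    counts[bmu] = counts.get(bmu, 0) + 1
print(counts)
```

Dividing each count by the number of samples gives the empirical frequency of each pattern, which is the kind of quantity a pattern-based probabilistic forecast would build on.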
Detection of distorted frames in retinal video-sequences via machine learning
NASA Astrophysics Data System (ADS)
Kolar, Radim; Liberdova, Ivana; Odstrcilik, Jan; Hracho, Michal; Tornow, Ralf P.
2017-07-01
This paper describes the detection of distorted frames in retinal sequences based on a set of global features extracted from each frame. The feature vector is subsequently used in a classification step, in which three types of classifiers are tested. The best classification accuracy, 96%, was achieved with the support vector machine approach.