Semi-supervised prediction of gene regulatory networks using machine learning algorithms.
Patel, Nihir; Wang, Jason T L
2015-10-01
Use of computational methods to predict gene regulatory networks (GRNs) from gene expression data is a challenging task. Many studies have been conducted using unsupervised methods to fulfill the task; however, such methods usually yield low prediction accuracies due to the lack of training data. In this article, we propose semi-supervised methods for GRN prediction by utilizing two machine learning algorithms, namely, support vector machines (SVM) and random forests (RF). The semi-supervised methods make use of unlabelled data for training. We investigated inductive and transductive learning approaches, both of which adopt an iterative procedure to obtain reliable negative training data from the unlabelled data. We then applied our semi-supervised methods to gene expression data of Escherichia coli and Saccharomyces cerevisiae, and evaluated the performance of our methods using the expression data. Our analysis indicated that the transductive learning approach outperformed the inductive learning approach for both organisms. However, there was no conclusive difference identified in the performance of SVM and RF. Experimental results also showed that the proposed semi-supervised methods performed better than existing supervised methods for both organisms.
Machine learning applications in genetics and genomics.
Libbrecht, Maxwell W; Noble, William Stafford
2015-06-01
The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
Active semi-supervised learning method with hybrid deep belief networks.
Zhou, Shusen; Chen, Qingcai; Wang, Xiaolong
2014-01-01
In this paper, we develop a novel semi-supervised learning algorithm called active hybrid deep belief networks (AHD), to address the semi-supervised sentiment classification problem with deep learning. First, we construct the previous several hidden layers using restricted Boltzmann machines (RBM), which can reduce the dimension and abstract the information of the reviews quickly. Second, we construct the following hidden layers using convolutional restricted Boltzmann machines (CRBM), which can abstract the information of reviews effectively. Third, the constructed deep architecture is fine-tuned by gradient-descent based supervised learning with an exponential loss function. Finally, active learning method is combined based on the proposed deep architecture. We did several experiments on five sentiment classification datasets, and show that AHD is competitive with previous semi-supervised learning algorithm. Experiments are also conducted to verify the effectiveness of our proposed method with different number of labeled reviews and unlabeled reviews respectively.
Cao, Peng; Liu, Xiaoli; Bao, Hang; Yang, Jinzhu; Zhao, Dazhe
2015-01-01
The false-positive reduction (FPR) is a crucial step in the computer aided detection system for the breast. The issues of imbalanced data distribution and the limitation of labeled samples complicate the classification procedure. To overcome these challenges, we propose oversampling and semi-supervised learning methods based on the restricted Boltzmann machines (RBMs) to solve the classification of imbalanced data with a few labeled samples. To evaluate the proposed method, we conducted a comprehensive performance study and compared its results with the commonly used techniques. Experiments on benchmark dataset of DDSM demonstrate the effectiveness of the RBMs based oversampling and semi-supervised learning method in terms of geometric mean (G-mean) for false positive reduction in Breast CAD.
An immune-inspired semi-supervised algorithm for breast cancer diagnosis.
Peng, Lingxi; Chen, Wenbin; Zhou, Wubai; Li, Fufang; Yang, Jin; Zhang, Jiandong
2016-10-01
Breast cancer is the most frequently and world widely diagnosed life-threatening cancer, which is the leading cause of cancer death among women. Early accurate diagnosis can be a big plus in treating breast cancer. Researchers have approached this problem using various data mining and machine learning techniques such as support vector machine, artificial neural network, etc. The computer immunology is also an intelligent method inspired by biological immune system, which has been successfully applied in pattern recognition, combination optimization, machine learning, etc. However, most of these diagnosis methods belong to a supervised diagnosis method. It is very expensive to obtain labeled data in biology and medicine. In this paper, we seamlessly integrate the state-of-the-art research on life science with artificial intelligence, and propose a semi-supervised learning algorithm to reduce the need for labeled data. We use two well-known benchmark breast cancer datasets in our study, which are acquired from the UCI machine learning repository. Extensive experiments are conducted and evaluated on those two datasets. Our experimental results demonstrate the effectiveness and efficiency of our proposed algorithm, which proves that our algorithm is a promising automatic diagnosis method for breast cancer. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Lu, Shen; Xia, Yong; Cai, Tom Weidong; Feng, David Dagan
2015-01-01
Dementia, Alzheimer's disease (AD) in particular is a global problem and big threat to the aging population. An image based computer-aided dementia diagnosis method is needed to providing doctors help during medical image examination. Many machine learning based dementia classification methods using medical imaging have been proposed and most of them achieve accurate results. However, most of these methods make use of supervised learning requiring fully labeled image dataset, which usually is not practical in real clinical environment. Using large amount of unlabeled images can improve the dementia classification performance. In this study we propose a new semi-supervised dementia classification method based on random manifold learning with affinity regularization. Three groups of spatial features are extracted from positron emission tomography (PET) images to construct an unsupervised random forest which is then used to regularize the manifold learning objective function. The proposed method, stat-of-the-art Laplacian support vector machine (LapSVM) and supervised SVM are applied to classify AD and normal controls (NC). The experiment results show that learning with unlabeled images indeed improves the classification performance. And our method outperforms LapSVM on the same dataset.
Perry, Thomas Ernest; Zha, Hongyuan; Zhou, Ke; Frias, Patricio; Zeng, Dadan; Braunstein, Mark
2014-02-01
Electronic health records possess critical predictive information for machine-learning-based diagnostic aids. However, many traditional machine learning methods fail to simultaneously integrate textual data into the prediction process because of its high dimensionality. In this paper, we present a supervised method using Laplacian Eigenmaps to enable existing machine learning methods to estimate both low-dimensional representations of textual data and accurate predictors based on these low-dimensional representations at the same time. We present a supervised Laplacian Eigenmap method to enhance predictive models by embedding textual predictors into a low-dimensional latent space, which preserves the local similarities among textual data in high-dimensional space. The proposed implementation performs alternating optimization using gradient descent. For the evaluation, we applied our method to over 2000 patient records from a large single-center pediatric cardiology practice to predict if patients were diagnosed with cardiac disease. In our experiments, we consider relatively short textual descriptions because of data availability. We compared our method with latent semantic indexing, latent Dirichlet allocation, and local Fisher discriminant analysis. The results were assessed using four metrics: the area under the receiver operating characteristic curve (AUC), Matthews correlation coefficient (MCC), specificity, and sensitivity. The results indicate that supervised Laplacian Eigenmaps was the highest performing method in our study, achieving 0.782 and 0.374 for AUC and MCC, respectively. Supervised Laplacian Eigenmaps showed an increase of 8.16% in AUC and 20.6% in MCC over the baseline that excluded textual data and a 2.69% and 5.35% increase in AUC and MCC, respectively, over unsupervised Laplacian Eigenmaps. As a solution, we present a supervised Laplacian Eigenmap method to embed textual predictors into a low-dimensional Euclidean space. This method allows many existing machine learning predictors to effectively and efficiently capture the potential of textual predictors, especially those based on short texts.
In-situ trainable intrusion detection system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Symons, Christopher T.; Beaver, Justin M.; Gillen, Rob
A computer implemented method detects intrusions using a computer by analyzing network traffic. The method includes a semi-supervised learning module connected to a network node. The learning module uses labeled and unlabeled data to train a semi-supervised machine learning sensor. The method records events that include a feature set made up of unauthorized intrusions and benign computer requests. The method identifies at least some of the benign computer requests that occur during the recording of the events while treating the remainder of the data as unlabeled. The method trains the semi-supervised learning module at the network node in-situ, such thatmore » the semi-supervised learning modules may identify malicious traffic without relying on specific rules, signatures, or anomaly detection.« less
Newton Methods for Large Scale Problems in Machine Learning
ERIC Educational Resources Information Center
Hansen, Samantha Leigh
2014-01-01
The focus of this thesis is on practical ways of designing optimization algorithms for minimizing large-scale nonlinear functions with applications in machine learning. Chapter 1 introduces the overarching ideas in the thesis. Chapters 2 and 3 are geared towards supervised machine learning applications that involve minimizing a sum of loss…
Elkhoudary, Mahmoud M; Naguib, Ibrahim A; Abdel Salam, Randa A; Hadad, Ghada M
2017-05-01
Four accurate, sensitive and reliable stability indicating chemometric methods were developed for the quantitative determination of Agomelatine (AGM) whether in pure form or in pharmaceutical formulations. Two supervised learning machines' methods; linear artificial neural networks (PC-linANN) preceded by principle component analysis and linear support vector regression (linSVR), were compared with two principle component based methods; principle component regression (PCR) as well as partial least squares (PLS) for the spectrofluorimetric determination of AGM and its degradants. The results showed the benefits behind using linear learning machines' methods and the inherent merits of their algorithms in handling overlapped noisy spectral data especially during the challenging determination of AGM alkaline and acidic degradants (DG1 and DG2). Relative mean squared error of prediction (RMSEP) for the proposed models in the determination of AGM were 1.68, 1.72, 0.68 and 0.22 for PCR, PLS, SVR and PC-linANN; respectively. The results showed the superiority of supervised learning machines' methods over principle component based methods. Besides, the results suggested that linANN is the method of choice for determination of components in low amounts with similar overlapped spectra and narrow linearity range. Comparison between the proposed chemometric models and a reported HPLC method revealed the comparable performance and quantification power of the proposed models.
Multilayer Extreme Learning Machine With Subnetwork Nodes for Representation Learning.
Yang, Yimin; Wu, Q M Jonathan
2016-11-01
The extreme learning machine (ELM), which was originally proposed for "generalized" single-hidden layer feedforward neural networks, provides efficient unified learning solutions for the applications of clustering, regression, and classification. It presents competitive accuracy with superb efficiency in many applications. However, ELM with subnetwork nodes architecture has not attracted much research attentions. Recently, many methods have been proposed for supervised/unsupervised dimension reduction or representation learning, but these methods normally only work for one type of problem. This paper studies the general architecture of multilayer ELM (ML-ELM) with subnetwork nodes, showing that: 1) the proposed method provides a representation learning platform with unsupervised/supervised and compressed/sparse representation learning and 2) experimental results on ten image datasets and 16 classification datasets show that, compared to other conventional feature learning methods, the proposed ML-ELM with subnetwork nodes performs competitively or much better than other feature learning methods.
Machine learning for neuroimaging with scikit-learn.
Abraham, Alexandre; Pedregosa, Fabian; Eickenberg, Michael; Gervais, Philippe; Mueller, Andreas; Kossaifi, Jean; Gramfort, Alexandre; Thirion, Bertrand; Varoquaux, Gaël
2014-01-01
Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g., multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g., resting state functional MRI) or find sub-populations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps. Scikit-learn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain.
Machine learning for neuroimaging with scikit-learn
Abraham, Alexandre; Pedregosa, Fabian; Eickenberg, Michael; Gervais, Philippe; Mueller, Andreas; Kossaifi, Jean; Gramfort, Alexandre; Thirion, Bertrand; Varoquaux, Gaël
2014-01-01
Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g., multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g., resting state functional MRI) or find sub-populations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps. Scikit-learn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain. PMID:24600388
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pullum, Laura L; Symons, Christopher T
2011-01-01
Machine learning is used in many applications, from machine vision to speech recognition to decision support systems, and is used to test applications. However, though much has been done to evaluate the performance of machine learning algorithms, little has been done to verify the algorithms or examine their failure modes. Moreover, complex learning frameworks often require stepping beyond black box evaluation to distinguish between errors based on natural limits on learning and errors that arise from mistakes in implementation. We present a conceptual architecture, failure model and taxonomy, and failure modes and effects analysis (FMEA) of a semi-supervised, multi-modal learningmore » system, and provide specific examples from its use in a radiological analysis assistant system. The goal of the research described in this paper is to provide a foundation from which dependability analysis of systems using semi-supervised, multi-modal learning can be conducted. The methods presented provide a first step towards that overall goal.« less
Sweeney, Elizabeth M.; Vogelstein, Joshua T.; Cuzzocreo, Jennifer L.; Calabresi, Peter A.; Reich, Daniel S.; Crainiceanu, Ciprian M.; Shinohara, Russell T.
2014-01-01
Machine learning is a popular method for mining and analyzing large collections of medical data. We focus on a particular problem from medical research, supervised multiple sclerosis (MS) lesion segmentation in structural magnetic resonance imaging (MRI). We examine the extent to which the choice of machine learning or classification algorithm and feature extraction function impacts the performance of lesion segmentation methods. As quantitative measures derived from structural MRI are important clinical tools for research into the pathophysiology and natural history of MS, the development of automated lesion segmentation methods is an active research field. Yet, little is known about what drives performance of these methods. We evaluate the performance of automated MS lesion segmentation methods, which consist of a supervised classification algorithm composed with a feature extraction function. These feature extraction functions act on the observed T1-weighted (T1-w), T2-weighted (T2-w) and fluid-attenuated inversion recovery (FLAIR) MRI voxel intensities. Each MRI study has a manual lesion segmentation that we use to train and validate the supervised classification algorithms. Our main finding is that the differences in predictive performance are due more to differences in the feature vectors, rather than the machine learning or classification algorithms. Features that incorporate information from neighboring voxels in the brain were found to increase performance substantially. For lesion segmentation, we conclude that it is better to use simple, interpretable, and fast algorithms, such as logistic regression, linear discriminant analysis, and quadratic discriminant analysis, and to develop the features to improve performance. PMID:24781953
Sweeney, Elizabeth M; Vogelstein, Joshua T; Cuzzocreo, Jennifer L; Calabresi, Peter A; Reich, Daniel S; Crainiceanu, Ciprian M; Shinohara, Russell T
2014-01-01
Machine learning is a popular method for mining and analyzing large collections of medical data. We focus on a particular problem from medical research, supervised multiple sclerosis (MS) lesion segmentation in structural magnetic resonance imaging (MRI). We examine the extent to which the choice of machine learning or classification algorithm and feature extraction function impacts the performance of lesion segmentation methods. As quantitative measures derived from structural MRI are important clinical tools for research into the pathophysiology and natural history of MS, the development of automated lesion segmentation methods is an active research field. Yet, little is known about what drives performance of these methods. We evaluate the performance of automated MS lesion segmentation methods, which consist of a supervised classification algorithm composed with a feature extraction function. These feature extraction functions act on the observed T1-weighted (T1-w), T2-weighted (T2-w) and fluid-attenuated inversion recovery (FLAIR) MRI voxel intensities. Each MRI study has a manual lesion segmentation that we use to train and validate the supervised classification algorithms. Our main finding is that the differences in predictive performance are due more to differences in the feature vectors, rather than the machine learning or classification algorithms. Features that incorporate information from neighboring voxels in the brain were found to increase performance substantially. For lesion segmentation, we conclude that it is better to use simple, interpretable, and fast algorithms, such as logistic regression, linear discriminant analysis, and quadratic discriminant analysis, and to develop the features to improve performance.
Supervised Machine Learning for Population Genetics: A New Paradigm
Schrider, Daniel R.; Kern, Andrew D.
2018-01-01
As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics. PMID:29331490
Self-Supervised Chinese Ontology Learning from Online Encyclopedias
Shao, Zhiqing; Ruan, Tong
2014-01-01
Constructing ontology manually is a time-consuming, error-prone, and tedious task. We present SSCO, a self-supervised learning based chinese ontology, which contains about 255 thousand concepts, 5 million entities, and 40 million facts. We explore the three largest online Chinese encyclopedias for ontology learning and describe how to transfer the structured knowledge in encyclopedias, including article titles, category labels, redirection pages, taxonomy systems, and InfoBox modules, into ontological form. In order to avoid the errors in encyclopedias and enrich the learnt ontology, we also apply some machine learning based methods. First, we proof that the self-supervised machine learning method is practicable in Chinese relation extraction (at least for synonymy and hyponymy) statistically and experimentally and train some self-supervised models (SVMs and CRFs) for synonymy extraction, concept-subconcept relation extraction, and concept-instance relation extraction; the advantages of our methods are that all training examples are automatically generated from the structural information of encyclopedias and a few general heuristic rules. Finally, we evaluate SSCO in two aspects, scale and precision; manual evaluation results show that the ontology has excellent precision, and high coverage is concluded by comparing SSCO with other famous ontologies and knowledge bases; the experiment results also indicate that the self-supervised models obviously enrich SSCO. PMID:24715819
Self-supervised Chinese ontology learning from online encyclopedias.
Hu, Fanghuai; Shao, Zhiqing; Ruan, Tong
2014-01-01
Constructing ontology manually is a time-consuming, error-prone, and tedious task. We present SSCO, a self-supervised learning based chinese ontology, which contains about 255 thousand concepts, 5 million entities, and 40 million facts. We explore the three largest online Chinese encyclopedias for ontology learning and describe how to transfer the structured knowledge in encyclopedias, including article titles, category labels, redirection pages, taxonomy systems, and InfoBox modules, into ontological form. In order to avoid the errors in encyclopedias and enrich the learnt ontology, we also apply some machine learning based methods. First, we proof that the self-supervised machine learning method is practicable in Chinese relation extraction (at least for synonymy and hyponymy) statistically and experimentally and train some self-supervised models (SVMs and CRFs) for synonymy extraction, concept-subconcept relation extraction, and concept-instance relation extraction; the advantages of our methods are that all training examples are automatically generated from the structural information of encyclopedias and a few general heuristic rules. Finally, we evaluate SSCO in two aspects, scale and precision; manual evaluation results show that the ontology has excellent precision, and high coverage is concluded by comparing SSCO with other famous ontologies and knowledge bases; the experiment results also indicate that the self-supervised models obviously enrich SSCO.
Source localization in an ocean waveguide using supervised machine learning.
Niu, Haiqiang; Reeves, Emma; Gerstoft, Peter
2017-09-01
Source localization in ocean acoustics is posed as a machine learning problem in which data-driven methods learn source ranges directly from observed acoustic data. The pressure received by a vertical linear array is preprocessed by constructing a normalized sample covariance matrix and used as the input for three machine learning methods: feed-forward neural networks (FNN), support vector machines (SVM), and random forests (RF). The range estimation problem is solved both as a classification problem and as a regression problem by these three machine learning algorithms. The results of range estimation for the Noise09 experiment are compared for FNN, SVM, RF, and conventional matched-field processing and demonstrate the potential of machine learning for underwater source localization.
Human semi-supervised learning.
Gibson, Bryan R; Rogers, Timothy T; Zhu, Xiaojin
2013-01-01
Most empirical work in human categorization has studied learning in either fully supervised or fully unsupervised scenarios. Most real-world learning scenarios, however, are semi-supervised: Learners receive a great deal of unlabeled information from the world, coupled with occasional experiences in which items are directly labeled by a knowledgeable source. A large body of work in machine learning has investigated how learning can exploit both labeled and unlabeled data provided to a learner. Using equivalences between models found in human categorization and machine learning research, we explain how these semi-supervised techniques can be applied to human learning. A series of experiments are described which show that semi-supervised learning models prove useful for explaining human behavior when exposed to both labeled and unlabeled data. We then discuss some machine learning models that do not have familiar human categorization counterparts. Finally, we discuss some challenges yet to be addressed in the use of semi-supervised models for modeling human categorization. Copyright © 2013 Cognitive Science Society, Inc.
NASA Astrophysics Data System (ADS)
Govorov, Michael; Gienko, Gennady; Putrenko, Viktor
2018-05-01
In this paper, several supervised machine learning algorithms were explored to define homogeneous regions of con-centration of uranium in surface waters in Ukraine using multiple environmental parameters. The previous study was focused on finding the primary environmental parameters related to uranium in ground waters using several methods of spatial statistics and unsupervised classification. At this step, we refined the regionalization using Artifi-cial Neural Networks (ANN) techniques including Multilayer Perceptron (MLP), Radial Basis Function (RBF), and Convolutional Neural Network (CNN). The study is focused on building local ANN models which may significantly improve the prediction results of machine learning algorithms by taking into considerations non-stationarity and autocorrelation in spatial data.
Detecting Visually Observable Disease Symptoms from Faces.
Wang, Kuan; Luo, Jiebo
2016-12-01
Recent years have witnessed an increasing interest in the application of machine learning to clinical informatics and healthcare systems. A significant amount of research has been done on healthcare systems based on supervised learning. In this study, we present a generalized solution to detect visually observable symptoms on faces using semi-supervised anomaly detection combined with machine vision algorithms. We rely on the disease-related statistical facts to detect abnormalities and classify them into multiple categories to narrow down the possible medical reasons of detecting. Our method is in contrast with most existing approaches, which are limited by the availability of labeled training data required for supervised learning, and therefore offers the major advantage of flagging any unusual and visually observable symptoms.
A deep learning and novelty detection framework for rapid phenotyping in high-content screening
Sommer, Christoph; Hoefler, Rudolf; Samwer, Matthias; Gerlich, Daniel W.
2017-01-01
Supervised machine learning is a powerful and widely used method for analyzing high-content screening data. Despite its accuracy, efficiency, and versatility, supervised machine learning has drawbacks, most notably its dependence on a priori knowledge of expected phenotypes and time-consuming classifier training. We provide a solution to these limitations with CellCognition Explorer, a generic novelty detection and deep learning framework. Application to several large-scale screening data sets on nuclear and mitotic cell morphologies demonstrates that CellCognition Explorer enables discovery of rare phenotypes without user training, which has broad implications for improved assay development in high-content screening. PMID:28954863
Salvatore, C; Cerasa, A; Castiglioni, I; Gallivanone, F; Augimeri, A; Lopez, M; Arabia, G; Morelli, M; Gilardi, M C; Quattrone, A
2014-01-30
Supervised machine learning has been proposed as a revolutionary approach for identifying sensitive medical image biomarkers (or combination of them) allowing for automatic diagnosis of individual subjects. The aim of this work was to assess the feasibility of a supervised machine learning algorithm for the assisted diagnosis of patients with clinically diagnosed Parkinson's disease (PD) and Progressive Supranuclear Palsy (PSP). Morphological T1-weighted Magnetic Resonance Images (MRIs) of PD patients (28), PSP patients (28) and healthy control subjects (28) were used by a supervised machine learning algorithm based on the combination of Principal Components Analysis as feature extraction technique and on Support Vector Machines as classification algorithm. The algorithm was able to obtain voxel-based morphological biomarkers of PD and PSP. The algorithm allowed individual diagnosis of PD versus controls, PSP versus controls and PSP versus PD with an Accuracy, Specificity and Sensitivity>90%. Voxels influencing classification between PD and PSP patients involved midbrain, pons, corpus callosum and thalamus, four critical regions known to be strongly involved in the pathophysiological mechanisms of PSP. Classification accuracy of individual PSP patients was consistent with previous manual morphological metrics and with other supervised machine learning application to MRI data, whereas accuracy in the detection of individual PD patients was significantly higher with our classification method. The algorithm provides excellent discrimination of PD patients from PSP patients at an individual level, thus encouraging the application of computer-based diagnosis in clinical practice. Copyright © 2013 Elsevier B.V. All rights reserved.
A review of supervised machine learning applied to ageing research.
Fabris, Fabio; Magalhães, João Pedro de; Freitas, Alex A
2017-04-01
Broadly speaking, supervised machine learning is the computational task of learning correlations between variables in annotated data (the training set), and using this information to create a predictive model capable of inferring annotations for new data, whose annotations are not known. Ageing is a complex process that affects nearly all animal species. This process can be studied at several levels of abstraction, in different organisms and with different objectives in mind. Not surprisingly, the diversity of the supervised machine learning algorithms applied to answer biological questions reflects the complexities of the underlying ageing processes being studied. Many works using supervised machine learning to study the ageing process have been recently published, so it is timely to review these works, to discuss their main findings and weaknesses. In summary, the main findings of the reviewed papers are: the link between specific types of DNA repair and ageing; ageing-related proteins tend to be highly connected and seem to play a central role in molecular pathways; ageing/longevity is linked with autophagy and apoptosis, nutrient receptor genes, and copper and iron ion transport. Additionally, several biomarkers of ageing were found by machine learning. Despite some interesting machine learning results, we also identified a weakness of current works on this topic: only one of the reviewed papers has corroborated the computational results of machine learning algorithms through wet-lab experiments. In conclusion, supervised machine learning has contributed to advance our knowledge and has provided novel insights on ageing, yet future work should have a greater emphasis in validating the predictions.
Feng, Jingwen; Feng, Tong; Yang, Chengwen; Wang, Wei; Sa, Yu; Feng, Yuanming
2018-06-01
This study was to explore the feasibility of prediction and classification of cells in different stages of apoptosis with a stain-free method based on diffraction images and supervised machine learning. Apoptosis was induced in human chronic myelogenous leukemia K562 cells by cis-platinum (DDP). A newly developed technique of polarization diffraction imaging flow cytometry (p-DIFC) was performed to acquire diffraction images of the cells in three different statuses (viable, early apoptotic and late apoptotic/necrotic) after cell separation through fluorescence activated cell sorting with Annexin V-PE and SYTOX® Green double staining. The texture features of the diffraction images were extracted with in-house software based on the Gray-level co-occurrence matrix algorithm to generate datasets for cell classification with supervised machine learning method. Therefore, this new method has been verified in hydrogen peroxide induced apoptosis model of HL-60. Results show that accuracy of higher than 90% was achieved respectively in independent test datasets from each cell type based on logistic regression with ridge estimators, which indicated that p-DIFC system has a great potential in predicting and classifying cells in different stages of apoptosis.
Parodi, Stefano; Manneschi, Chiara; Verda, Damiano; Ferrari, Enrico; Muselli, Marco
2018-03-01
This study evaluates the performance of a set of machine learning techniques in predicting the prognosis of Hodgkin's lymphoma using clinical factors and gene expression data. Analysed samples from 130 Hodgkin's lymphoma patients included a small set of clinical variables and more than 54,000 gene features. Machine learning classifiers included three black-box algorithms ( k-nearest neighbour, Artificial Neural Network, and Support Vector Machine) and two methods based on intelligible rules (Decision Tree and the innovative Logic Learning Machine method). Support Vector Machine clearly outperformed any of the other methods. Among the two rule-based algorithms, Logic Learning Machine performed better and identified a set of simple intelligible rules based on a combination of clinical variables and gene expressions. Decision Tree identified a non-coding gene ( XIST) involved in the early phases of X chromosome inactivation that was overexpressed in females and in non-relapsed patients. XIST expression might be responsible for the better prognosis of female Hodgkin's lymphoma patients.
Extracting microRNA-gene relations from biomedical literature using distant supervision
Clarke, Luka A.; Couto, Francisco M.
2017-01-01
Many biomedical relation extraction approaches are based on supervised machine learning, requiring an annotated corpus. Distant supervision aims at training a classifier by combining a knowledge base with a corpus, reducing the amount of manual effort necessary. This is particularly useful for biomedicine because many databases and ontologies have been made available for many biological processes, while the availability of annotated corpora is still limited. We studied the extraction of microRNA-gene relations from text. MicroRNA regulation is an important biological process due to its close association with human diseases. The proposed method, IBRel, is based on distantly supervised multi-instance learning. We evaluated IBRel on three datasets, and the results were compared with a co-occurrence approach as well as a supervised machine learning algorithm. While supervised learning outperformed on two of those datasets, IBRel obtained an F-score 28.3 percentage points higher on the dataset for which there was no training set developed specifically. To demonstrate the applicability of IBRel, we used it to extract 27 miRNA-gene relations from recently published papers about cystic fibrosis. Our results demonstrate that our method can be successfully used to extract relations from literature about a biological process without an annotated corpus. The source code and data used in this study are available at https://github.com/AndreLamurias/IBRel. PMID:28263989
Extracting microRNA-gene relations from biomedical literature using distant supervision.
Lamurias, Andre; Clarke, Luka A; Couto, Francisco M
2017-01-01
Many biomedical relation extraction approaches are based on supervised machine learning, requiring an annotated corpus. Distant supervision aims at training a classifier by combining a knowledge base with a corpus, reducing the amount of manual effort necessary. This is particularly useful for biomedicine because many databases and ontologies have been made available for many biological processes, while the availability of annotated corpora is still limited. We studied the extraction of microRNA-gene relations from text. MicroRNA regulation is an important biological process due to its close association with human diseases. The proposed method, IBRel, is based on distantly supervised multi-instance learning. We evaluated IBRel on three datasets, and the results were compared with a co-occurrence approach as well as a supervised machine learning algorithm. While supervised learning outperformed on two of those datasets, IBRel obtained an F-score 28.3 percentage points higher on the dataset for which there was no training set developed specifically. To demonstrate the applicability of IBRel, we used it to extract 27 miRNA-gene relations from recently published papers about cystic fibrosis. Our results demonstrate that our method can be successfully used to extract relations from literature about a biological process without an annotated corpus. The source code and data used in this study are available at https://github.com/AndreLamurias/IBRel.
Supervised DNA Barcodes species classification: analysis, comparisons and results
2014-01-01
Background Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms. Methods In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods. Results A software that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human interpretable classification model. Rule-based methods have slightly inferior classification performances, but deliver the species specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performances with respect to the traditional DNA Barcode classification methods. On empirical data their classification performances are at a comparable level to the other methods. Conclusions The classification analysis shows that supervised machine learning methods are promising candidates for handling with success the DNA Barcoding species classification problem, obtaining excellent performances. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community. PMID:24721333
NASA Technical Reports Server (NTRS)
Garay, Michael J.; Mazzoni, Dominic; Davies, Roger; Wagstaff, Kiri
2004-01-01
Support Vector Machines (SVMs) are a type of supervised learning algorith,, other examples of which are Artificial Neural Networks (ANNs), Decision Trees, and Naive Bayesian Classifiers. Supervised learning algorithms are used to classify objects labled by a 'supervisor' - typically a human 'expert.'.
Wire connector classification with machine vision and a novel hybrid SVM
NASA Astrophysics Data System (ADS)
Chauhan, Vedang; Joshi, Keyur D.; Surgenor, Brian W.
2018-04-01
A machine vision-based system has been developed and tested that uses a novel hybrid Support Vector Machine (SVM) in a part inspection application with clear plastic wire connectors. The application required the system to differentiate between 4 different known styles of connectors plus one unknown style, for a total of 5 classes. The requirement to handle an unknown class is what necessitated the hybrid approach. The system was trained with the 4 known classes and tested with 5 classes (the 4 known plus the 1 unknown). The hybrid classification approach used two layers of SVMs: one layer was semi-supervised and the other layer was supervised. The semi-supervised SVM was a special case of unsupervised machine learning that classified test images as one of the 4 known classes (to accept) or as the unknown class (to reject). The supervised SVM classified test images as one of the 4 known classes and consequently would give false positives (FPs). Two methods were tested. The difference between the methods was that the order of the layers was switched. The method with the semi-supervised layer first gave an accuracy of 80% with 20% FPs. The method with the supervised layer first gave an accuracy of 98% with 0% FPs. Further work is being conducted to see if the hybrid approach works with other applications that have an unknown class requirement.
Use of Machine Learning to Identify Children with Autism and Their Motor Abnormalities
ERIC Educational Resources Information Center
Crippa, Alessandro; Salvatore, Christian; Perego, Paolo; Forti, Sara; Nobile, Maria; Molteni, Massimo; Castiglioni, Isabella
2015-01-01
In the present work, we have undertaken a proof-of-concept study to determine whether a simple upper-limb movement could be useful to accurately classify low-functioning children with autism spectrum disorder (ASD) aged 2-4. To answer this question, we developed a supervised machine-learning method to correctly discriminate 15 preschool children…
Assessment of various supervised learning algorithms using different performance metrics
NASA Astrophysics Data System (ADS)
Susheel Kumar, S. M.; Laxkar, Deepak; Adhikari, Sourav; Vijayarajan, V.
2017-11-01
Our work brings out comparison based on the performance of supervised machine learning algorithms on a binary classification task. The supervised machine learning algorithms which are taken into consideration in the following work are namely Support Vector Machine(SVM), Decision Tree(DT), K Nearest Neighbour (KNN), Naïve Bayes(NB) and Random Forest(RF). This paper mostly focuses on comparing the performance of above mentioned algorithms on one binary classification task by analysing the Metrics such as Accuracy, F-Measure, G-Measure, Precision, Misclassification Rate, False Positive Rate, True Positive Rate, Specificity, Prevalence.
Quantum-Enhanced Machine Learning
NASA Astrophysics Data System (ADS)
Dunjko, Vedran; Taylor, Jacob M.; Briegel, Hans J.
2016-09-01
The emerging field of quantum machine learning has the potential to substantially aid in the problems and scope of artificial intelligence. This is only enhanced by recent successes in the field of classical machine learning. In this work we propose an approach for the systematic treatment of machine learning, from the perspective of quantum information. Our approach is general and covers all three main branches of machine learning: supervised, unsupervised, and reinforcement learning. While quantum improvements in supervised and unsupervised learning have been reported, reinforcement learning has received much less attention. Within our approach, we tackle the problem of quantum enhancements in reinforcement learning as well, and propose a systematic scheme for providing improvements. As an example, we show that quadratic improvements in learning efficiency, and exponential improvements in performance over limited time periods, can be obtained for a broad class of learning problems.
Entanglement-Based Machine Learning on a Quantum Computer
NASA Astrophysics Data System (ADS)
Cai, X.-D.; Wu, D.; Su, Z.-E.; Chen, M.-C.; Wang, X.-L.; Li, Li; Liu, N.-L.; Lu, C.-Y.; Pan, J.-W.
2015-03-01
Machine learning, a branch of artificial intelligence, learns from previous experience to optimize performance, which is ubiquitous in various fields such as computer sciences, financial analysis, robotics, and bioinformatics. A challenge is that machine learning with the rapidly growing "big data" could become intractable for classical computers. Recently, quantum machine learning algorithms [Lloyd, Mohseni, and Rebentrost, arXiv.1307.0411] were proposed which could offer an exponential speedup over classical algorithms. Here, we report the first experimental entanglement-based classification of two-, four-, and eight-dimensional vectors to different clusters using a small-scale photonic quantum computer, which are then used to implement supervised and unsupervised machine learning. The results demonstrate the working principle of using quantum computers to manipulate and classify high-dimensional vectors, the core mathematical routine in machine learning. The method can, in principle, be scaled to larger numbers of qubits, and may provide a new route to accelerate machine learning.
Johnson, Nathan T; Dhroso, Andi; Hughes, Katelyn J; Korkin, Dmitry
2018-06-25
The extent to which the genes are expressed in the cell can be simplistically defined as a function of one or more factors of the environment, lifestyle, and genetics. RNA sequencing (RNA-Seq) is becoming a prevalent approach to quantify gene expression, and is expected to gain better insights to a number of biological and biomedical questions, compared to the DNA microarrays. Most importantly, RNA-Seq allows to quantify expression at the gene and alternative splicing isoform levels. However, leveraging the RNA-Seq data requires development of new data mining and analytics methods. Supervised machine learning methods are commonly used approaches for biological data analysis, and have recently gained attention for their applications to the RNA-Seq data. In this work, we assess the utility of supervised learning methods trained on RNA-Seq data for a diverse range of biological classification tasks. We hypothesize that the isoform-level expression data is more informative for biological classification tasks than the gene-level expression data. Our large-scale assessment is done through utilizing multiple datasets, organisms, lab groups, and RNA-Seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-Seq datasets and include over 2,000 samples that come from multiple organisms, lab groups, and RNA-Seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes and, the pathological tumor stage for the samples from the cancerous tissue. For each classification problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the isoform-based classifiers outperform or are comparable with gene expression based methods. The top-performing supervised learning techniques reached a near perfect classification accuracy, demonstrating the utility of supervised learning for RNA-Seq based data analysis. Published by Cold Spring Harbor Laboratory Press for the RNA Society.
NASA Astrophysics Data System (ADS)
Jarabo-Amores, María-Pilar; la Mata-Moya, David de; Gil-Pita, Roberto; Rosa-Zurera, Manuel
2013-12-01
The application of supervised learning machines trained to minimize the Cross-Entropy error to radar detection is explored in this article. The detector is implemented with a learning machine that implements a discriminant function, which output is compared to a threshold selected to fix a desired probability of false alarm. The study is based on the calculation of the function the learning machine approximates to during training, and the application of a sufficient condition for a discriminant function to be used to approximate the optimum Neyman-Pearson (NP) detector. In this article, the function a supervised learning machine approximates to after being trained to minimize the Cross-Entropy error is obtained. This discriminant function can be used to implement the NP detector, which maximizes the probability of detection, maintaining the probability of false alarm below or equal to a predefined value. Some experiments about signal detection using neural networks are also presented to test the validity of the study.
Machine learning methods in chemoinformatics
Mitchell, John B O
2014-01-01
Machine learning algorithms are generally developed in computer science or adjacent disciplines and find their way into chemical modeling by a process of diffusion. Though particular machine learning methods are popular in chemoinformatics and quantitative structure–activity relationships (QSAR), many others exist in the technical literature. This discussion is methods-based and focused on some algorithms that chemoinformatics researchers frequently use. It makes no claim to be exhaustive. We concentrate on methods for supervised learning, predicting the unknown property values of a test set of instances, usually molecules, based on the known values for a training set. Particularly relevant approaches include Artificial Neural Networks, Random Forest, Support Vector Machine, k-Nearest Neighbors and naïve Bayes classifiers. WIREs Comput Mol Sci 2014, 4:468–481. How to cite this article: WIREs Comput Mol Sci 2014, 4:468–481. doi:10.1002/wcms.1183 PMID:25285160
Pekkala, Timo; Hall, Anette; Lötjönen, Jyrki; Mattila, Jussi; Soininen, Hilkka; Ngandu, Tiia; Laatikainen, Tiina; Kivipelto, Miia; Solomon, Alina
2017-01-01
This study aimed to develop a late-life dementia prediction model using a novel validated supervised machine learning method, the Disease State Index (DSI), in the Finnish population-based CAIDE study. The CAIDE study was based on previous population-based midlife surveys. CAIDE participants were re-examined twice in late-life, and the first late-life re-examination was used as baseline for the present study. The main study population included 709 cognitively normal subjects at first re-examination who returned to the second re-examination up to 10 years later (incident dementia n = 39). An extended population (n = 1009, incident dementia 151) included non-participants/non-survivors (national registers data). DSI was used to develop a dementia index based on first re-examination assessments. Performance in predicting dementia was assessed as area under the ROC curve (AUC). AUCs for DSI were 0.79 and 0.75 for main and extended populations. Included predictors were cognition, vascular factors, age, subjective memory complaints, and APOE genotype. The supervised machine learning method performed well in identifying comprehensive profiles for predicting dementia development up to 10 years later. DSI could thus be useful for identifying individuals who are most at risk and may benefit from dementia prevention interventions.
Supervised machine learning and active learning in classification of radiology reports.
Nguyen, Dung H M; Patrick, Jon D
2014-01-01
This paper presents an automated system for classifying the results of imaging examinations (CT, MRI, positron emission tomography) into reportable and non-reportable cancer cases. This system is part of an industrial-strength processing pipeline built to extract content from radiology reports for use in the Victorian Cancer Registry. In addition to traditional supervised learning methods such as conditional random fields and support vector machines, active learning (AL) approaches were investigated to optimize training production and further improve classification performance. The project involved two pilot sites in Victoria, Australia (Lake Imaging (Ballarat) and Peter MacCallum Cancer Centre (Melbourne)) and, in collaboration with the NSW Central Registry, one pilot site at Westmead Hospital (Sydney). The reportability classifier performance achieved 98.25% sensitivity and 96.14% specificity on the cancer registry's held-out test set. Up to 92% of training data needed for supervised machine learning can be saved by AL. AL is a promising method for optimizing the supervised training production used in classification of radiology reports. When an AL strategy is applied during the data selection process, the cost of manual classification can be reduced significantly. The most important practical application of the reportability classifier is that it can dramatically reduce human effort in identifying relevant reports from the large imaging pool for further investigation of cancer. The classifier is built on a large real-world dataset and can achieve high performance in filtering relevant reports to support cancer registries. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
Pekkala, Timo; Hall, Anette; Lötjönen, Jyrki; Mattila, Jussi; Soininen, Hilkka; Ngandu, Tiia; Laatikainen, Tiina; Kivipelto, Miia; Solomon, Alina
2016-01-01
Background and objective: This study aimed to develop a late-life dementia prediction model using a novel validated supervised machine learning method, the Disease State Index (DSI), in the Finnish population-based CAIDE study. Methods: The CAIDE study was based on previous population-based midlife surveys. CAIDE participants were re-examined twice in late-life, and the first late-life re-examination was used as baseline for the present study. The main study population included 709 cognitively normal subjects at first re-examination who returned to the second re-examination up to 10 years later (incident dementia n = 39). An extended population (n = 1009, incident dementia 151) included non-participants/non-survivors (national registers data). DSI was used to develop a dementia index based on first re-examination assessments. Performance in predicting dementia was assessed as area under the ROC curve (AUC). Results: AUCs for DSI were 0.79 and 0.75 for main and extended populations. Included predictors were cognition, vascular factors, age, subjective memory complaints, and APOE genotype. Conclusion: The supervised machine learning method performed well in identifying comprehensive profiles for predicting dementia development up to 10 years later. DSI could thus be useful for identifying individuals who are most at risk and may benefit from dementia prevention interventions. PMID:27802228
Semi-supervised protein subcellular localization.
Xu, Qian; Hu, Derek Hao; Xue, Hong; Yu, Weichuan; Yang, Qiang
2009-01-30
Protein subcellular localization is concerned with predicting the location of a protein within a cell using computational method. The location information can indicate key functionalities of proteins. Accurate predictions of subcellular localizations of protein can aid the prediction of protein function and genome annotation, as well as the identification of drug targets. Computational methods based on machine learning, such as support vector machine approaches, have already been widely used in the prediction of protein subcellular localization. However, a major drawback of these machine learning-based approaches is that a large amount of data should be labeled in order to let the prediction system learn a classifier of good generalization ability. However, in real world cases, it is laborious, expensive and time-consuming to experimentally determine the subcellular localization of a protein and prepare instances of labeled data. In this paper, we present an approach based on a new learning framework, semi-supervised learning, which can use much fewer labeled instances to construct a high quality prediction model. We construct an initial classifier using a small set of labeled examples first, and then use unlabeled instances to refine the classifier for future predictions. Experimental results show that our methods can effectively reduce the workload for labeling data using the unlabeled data. Our method is shown to enhance the state-of-the-art prediction results of SVM classifiers by more than 10%.
NASA Astrophysics Data System (ADS)
Dang, Nguyen Tuan; Akai-Kasada, Megumi; Asai, Tetsuya; Saito, Akira; Kuwahara, Yuji; Hokkaido University Collaboration
2015-03-01
Machine learning using the artificial neuron network research is supposed to be the best way to understand how the human brain trains itself to process information. In this study, we have successfully developed the programs using supervised machine learning algorithm. However, these supervised learning processes for the neuron network required the very strong computing configuration. Derivation from the necessity of increasing in computing ability and in reduction of power consumption, accelerator circuits become critical. To develop such accelerator circuits using supervised machine learning algorithm, conducting polymer micro/nanowires growing process was realized and applied as a synaptic weigh controller. In this work, high conductivity Polypyrrole (PPy) and Poly (3, 4 - ethylenedioxythiophene) PEDOT wires were potentiostatically grown crosslinking the designated electrodes, which were prefabricated by lithography, when appropriate square wave AC voltage and appropriate frequency were applied. Micro/nanowire growing process emulated the neurotransmitter release process of synapses inside a biological neuron and wire's resistance variation during the growing process was preferred to as the variation of synaptic weigh in machine learning algorithm. In a cooperation with Graduate School of Information Science and Technology, Hokkaido University.
Semi-supervised and unsupervised extreme learning machines.
Huang, Gao; Song, Shiji; Gupta, Jatinder N D; Wu, Cheng
2014-12-01
Extreme learning machines (ELMs) have proven to be efficient and effective learning mechanisms for pattern classification and regression. However, ELMs are primarily applied to supervised learning problems. Only a few existing research papers have used ELMs to explore unlabeled data. In this paper, we extend ELMs for both semi-supervised and unsupervised tasks based on the manifold regularization, thus greatly expanding the applicability of ELMs. The key advantages of the proposed algorithms are as follows: 1) both the semi-supervised ELM (SS-ELM) and the unsupervised ELM (US-ELM) exhibit learning capability and computational efficiency of ELMs; 2) both algorithms naturally handle multiclass classification or multicluster clustering; and 3) both algorithms are inductive and can handle unseen data at test time directly. Moreover, it is shown in this paper that all the supervised, semi-supervised, and unsupervised ELMs can actually be put into a unified framework. This provides new perspectives for understanding the mechanism of random feature mapping, which is the key concept in ELM theory. Empirical study on a wide range of data sets demonstrates that the proposed algorithms are competitive with the state-of-the-art semi-supervised or unsupervised learning algorithms in terms of accuracy and efficiency.
A Hybrid Supervised/Unsupervised Machine Learning Approach to Solar Flare Prediction
NASA Astrophysics Data System (ADS)
Benvenuto, Federico; Piana, Michele; Campi, Cristina; Massone, Anna Maria
2018-01-01
This paper introduces a novel method for flare forecasting, combining prediction accuracy with the ability to identify the most relevant predictive variables. This result is obtained by means of a two-step approach: first, a supervised regularization method for regression, namely, LASSO is applied, where a sparsity-enhancing penalty term allows the identification of the significance with which each data feature contributes to the prediction; then, an unsupervised fuzzy clustering technique for classification, namely, Fuzzy C-Means, is applied, where the regression outcome is partitioned through the minimization of a cost function and without focusing on the optimization of a specific skill score. This approach is therefore hybrid, since it combines supervised and unsupervised learning; realizes classification in an automatic, skill-score-independent way; and provides effective prediction performances even in the case of imbalanced data sets. Its prediction power is verified against NOAA Space Weather Prediction Center data, using as a test set, data in the range between 1996 August and 2010 December and as training set, data in the range between 1988 December and 1996 June. To validate the method, we computed several skill scores typically utilized in flare prediction and compared the values provided by the hybrid approach with the ones provided by several standard (non-hybrid) machine learning methods. The results showed that the hybrid approach performs classification better than all other supervised methods and with an effectiveness comparable to the one of clustering methods; but, in addition, it provides a reliable ranking of the weights with which the data properties contribute to the forecast.
Supervised machine learning for analysing spectra of exoplanetary atmospheres
NASA Astrophysics Data System (ADS)
Márquez-Neila, Pablo; Fisher, Chloe; Sznitman, Raphael; Heng, Kevin
2018-06-01
The use of machine learning is becoming ubiquitous in astronomy1-3, but remains rare in the study of the atmospheres of exoplanets. Given the spectrum of an exoplanetary atmosphere, a multi-parameter space is swept through in real time to find the best-fit model4-6. Known as atmospheric retrieval, this technique originates in the Earth and planetary sciences7. Such methods are very time-consuming, and by necessity there is a compromise between physical and chemical realism and computational feasibility. Machine learning has previously been used to determine which molecules to include in the model, but the retrieval itself was still performed using standard methods8. Here, we report an adaptation of the `random forest' method of supervised machine learning9,10, trained on a precomputed grid of atmospheric models, which retrieves full posterior distributions of the abundances of molecules and the cloud opacity. The use of a precomputed grid allows a large part of the computational burden to be shifted offline. We demonstrate our technique on a transmission spectrum of the hot gas-giant exoplanet WASP-12b using a five-parameter model (temperature, a constant cloud opacity and the volume mixing ratios or relative abundances of molecules of water, ammonia and hydrogen cyanide)11. We obtain results consistent with the standard nested-sampling retrieval method. We also estimate the sensitivity of the measured spectrum to the model parameters, and we are able to quantify the information content of the spectrum. Our method can be straightforwardly applied using more sophisticated atmospheric models to interpret an ensemble of spectra without having to retrain the random forest.
Machine Learning Methods for Attack Detection in the Smart Grid.
Ozay, Mete; Esnaola, Inaki; Yarman Vural, Fatos Tunay; Kulkarni, Sanjeev R; Poor, H Vincent
2016-08-01
Attack detection problems in the smart grid are posed as statistical learning problems for different attack scenarios in which the measurements are observed in batch or online settings. In this approach, machine learning algorithms are used to classify measurements as being either secure or attacked. An attack detection framework is provided to exploit any available prior knowledge about the system and surmount constraints arising from the sparse structure of the problem in the proposed approach. Well-known batch and online learning algorithms (supervised and semisupervised) are employed with decision- and feature-level fusion to model the attack detection problem. The relationships between statistical and geometric properties of attack vectors employed in the attack scenarios and learning algorithms are analyzed to detect unobservable attacks using statistical learning methods. The proposed algorithms are examined on various IEEE test systems. Experimental analyses show that machine learning algorithms can detect attacks with performances higher than attack detection algorithms that employ state vector estimation methods in the proposed attack detection framework.
Classification of ROTSE Variable Stars using Machine Learning
NASA Astrophysics Data System (ADS)
Wozniak, P. R.; Akerlof, C.; Amrose, S.; Brumby, S.; Casperson, D.; Gisler, G.; Kehoe, R.; Lee, B.; Marshall, S.; McGowan, K. E.; McKay, T.; Perkins, S.; Priedhorsky, W.; Rykoff, E.; Smith, D. A.; Theiler, J.; Vestrand, W. T.; Wren, J.; ROTSE Collaboration
2001-12-01
We evaluate several Machine Learning algorithms as potential tools for automated classification of variable stars. Using the ROTSE sample of ~1800 variables from a pilot study of 5% of the whole sky, we compare the effectiveness of a supervised technique (Support Vector Machines, SVM) versus unsupervised methods (K-means and Autoclass). There are 8 types of variables in the sample: RR Lyr AB, RR Lyr C, Delta Scuti, Cepheids, detached eclipsing binaries, contact binaries, Miras and LPVs. Preliminary results suggest a very high ( ~95%) efficiency of SVM in isolating a few best defined classes against the rest of the sample, and good accuracy ( ~70-75%) for all classes considered simultaneously. This includes some degeneracies, irreducible with the information at hand. Supervised methods naturally outperform unsupervised methods, in terms of final error rate, but unsupervised methods offer many advantages for large sets of unlabeled data. Therefore, both types of methods should be considered as promising tools for mining vast variability surveys. We project that there are more than 30,000 periodic variables in the ROTSE-I data base covering the entire local sky between V=10 and 15.5 mag. This sample size is already stretching the time capabilities of human analysts.
Barua, Shaibal; Begum, Shahina; Ahmed, Mobyen Uddin
2015-01-01
Machine learning algorithms play an important role in computer science research. Recent advancement in sensor data collection in clinical sciences lead to a complex, heterogeneous data processing, and analysis for patient diagnosis and prognosis. Diagnosis and treatment of patients based on manual analysis of these sensor data are difficult and time consuming. Therefore, development of Knowledge-based systems to support clinicians in decision-making is important. However, it is necessary to perform experimental work to compare performances of different machine learning methods to help to select appropriate method for a specific characteristic of data sets. This paper compares classification performance of three popular machine learning methods i.e., case-based reasoning, neutral networks and support vector machine to diagnose stress of vehicle drivers using finger temperature and heart rate variability. The experimental results show that case-based reasoning outperforms other two methods in terms of classification accuracy. Case-based reasoning has achieved 80% and 86% accuracy to classify stress using finger temperature and heart rate variability. On contrary, both neural network and support vector machine have achieved less than 80% accuracy by using both physiological signals.
Held, Elizabeth; Cape, Joshua; Tintle, Nathan
2016-01-01
Machine learning methods continue to show promise in the analysis of data from genetic association studies because of the high number of variables relative to the number of observations. However, few best practices exist for the application of these methods. We extend a recently proposed supervised machine learning approach for predicting disease risk by genotypes to be able to incorporate gene expression data and rare variants. We then apply 2 different versions of the approach (radial and linear support vector machines) to simulated data from Genetic Analysis Workshop 19 and compare performance to logistic regression. Method performance was not radically different across the 3 methods, although the linear support vector machine tended to show small gains in predictive ability relative to a radial support vector machine and logistic regression. Importantly, as the number of genes in the models was increased, even when those genes contained causal rare variants, model predictive ability showed a statistically significant decrease in performance for both the radial support vector machine and logistic regression. The linear support vector machine showed more robust performance to the inclusion of additional genes. Further work is needed to evaluate machine learning approaches on larger samples and to evaluate the relative improvement in model prediction from the incorporation of gene expression data.
Kruse, Christian
2018-06-01
To review current practices and technologies within the scope of "Big Data" that can further our understanding of diabetes mellitus and osteoporosis from large volumes of data. "Big Data" techniques involving supervised machine learning, unsupervised machine learning, and deep learning image analysis are presented with examples of current literature. Supervised machine learning can allow us to better predict diabetes-induced osteoporosis and understand relative predictor importance of diabetes-affected bone tissue. Unsupervised machine learning can allow us to understand patterns in data between diabetic pathophysiology and altered bone metabolism. Image analysis using deep learning can allow us to be less dependent on surrogate predictors and use large volumes of images to classify diabetes-induced osteoporosis and predict future outcomes directly from images. "Big Data" techniques herald new possibilities to understand diabetes-induced osteoporosis and ascertain our current ability to classify, understand, and predict this condition.
Weakly supervised classification in high energy physics
Dery, Lucio Mwinmaarong; Nachman, Benjamin; Rubbo, Francesco; ...
2017-05-01
As machine learning algorithms become increasingly sophisticated to exploit subtle features of the data, they often become more dependent on simulations. Here, this paper presents a new approach called weakly supervised classification in which class proportions are the only input into the machine learning algorithm. Using one of the most challenging binary classification tasks in high energy physics $-$ quark versus gluon tagging $-$ we show that weakly supervised classification can match the performance of fully supervised algorithms. Furthermore, by design, the new algorithm is insensitive to any mis-modeling of discriminating features in the data by the simulation. Weakly supervisedmore » classification is a general procedure that can be applied to a wide variety of learning problems to boost performance and robustness when detailed simulations are not reliable or not available.« less
Weakly supervised classification in high energy physics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dery, Lucio Mwinmaarong; Nachman, Benjamin; Rubbo, Francesco
As machine learning algorithms become increasingly sophisticated to exploit subtle features of the data, they often become more dependent on simulations. Here, this paper presents a new approach called weakly supervised classification in which class proportions are the only input into the machine learning algorithm. Using one of the most challenging binary classification tasks in high energy physics $-$ quark versus gluon tagging $-$ we show that weakly supervised classification can match the performance of fully supervised algorithms. Furthermore, by design, the new algorithm is insensitive to any mis-modeling of discriminating features in the data by the simulation. Weakly supervisedmore » classification is a general procedure that can be applied to a wide variety of learning problems to boost performance and robustness when detailed simulations are not reliable or not available.« less
(Machine) learning to do more with less
NASA Astrophysics Data System (ADS)
Cohen, Timothy; Freytsis, Marat; Ostdiek, Bryan
2018-02-01
Determining the best method for training a machine learning algorithm is critical to maximizing its ability to classify data. In this paper, we compare the standard "fully supervised" approach (which relies on knowledge of event-by-event truth-level labels) with a recent proposal that instead utilizes class ratios as the only discriminating information provided during training. This so-called "weakly supervised" technique has access to less information than the fully supervised method and yet is still able to yield impressive discriminating power. In addition, weak supervision seems particularly well suited to particle physics since quantum mechanics is incompatible with the notion of mapping an individual event onto any single Feynman diagram. We examine the technique in detail — both analytically and numerically — with a focus on the robustness to issues of mischaracterizing the training samples. Weakly supervised networks turn out to be remarkably insensitive to a class of systematic mismodeling. Furthermore, we demonstrate that the event level outputs for weakly versus fully supervised networks are probing different kinematics, even though the numerical quality metrics are essentially identical. This implies that it should be possible to improve the overall classification ability by combining the output from the two types of networks. For concreteness, we apply this technology to a signature of beyond the Standard Model physics to demonstrate that all these impressive features continue to hold in a scenario of relevance to the LHC. Example code is provided on GitHub.
Supervised Learning Based Hypothesis Generation from Biomedical Literature.
Sang, Shengtian; Yang, Zhihao; Li, Zongyao; Lin, Hongfei
2015-01-01
Nowadays, the amount of biomedical literatures is growing at an explosive speed, and there is much useful knowledge undiscovered in this literature. Researchers can form biomedical hypotheses through mining these works. In this paper, we propose a supervised learning based approach to generate hypotheses from biomedical literature. This approach splits the traditional processing of hypothesis generation with classic ABC model into AB model and BC model which are constructed with supervised learning method. Compared with the concept cooccurrence and grammar engineering-based approaches like SemRep, machine learning based models usually can achieve better performance in information extraction (IE) from texts. Then through combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from literature. The experimental results on the three classic Swanson hypotheses show that our approach outperforms SemRep system.
Optimizing area under the ROC curve using semi-supervised learning
Wang, Shijun; Li, Diana; Petrick, Nicholas; Sahiner, Berkman; Linguraru, Marius George; Summers, Ronald M.
2014-01-01
Receiver operating characteristic (ROC) analysis is a standard methodology to evaluate the performance of a binary classification system. The area under the ROC curve (AUC) is a performance metric that summarizes how well a classifier separates two classes. Traditional AUC optimization techniques are supervised learning methods that utilize only labeled data (i.e., the true class is known for all data) to train the classifiers. In this work, inspired by semi-supervised and transductive learning, we propose two new AUC optimization algorithms hereby referred to as semi-supervised learning receiver operating characteristic (SSLROC) algorithms, which utilize unlabeled test samples in classifier training to maximize AUC. Unlabeled samples are incorporated into the AUC optimization process, and their ranking relationships to labeled positive and negative training samples are considered as optimization constraints. The introduced test samples will cause the learned decision boundary in a multidimensional feature space to adapt not only to the distribution of labeled training data, but also to the distribution of unlabeled test data. We formulate the semi-supervised AUC optimization problem as a semi-definite programming problem based on the margin maximization theory. The proposed methods SSLROC1 (1-norm) and SSLROC2 (2-norm) were evaluated using 34 (determined by power analysis) randomly selected datasets from the University of California, Irvine machine learning repository. Wilcoxon signed rank tests showed that the proposed methods achieved significant improvement compared with state-of-the-art methods. The proposed methods were also applied to a CT colonography dataset for colonic polyp classification and showed promising results.1 PMID:25395692
Optimizing area under the ROC curve using semi-supervised learning.
Wang, Shijun; Li, Diana; Petrick, Nicholas; Sahiner, Berkman; Linguraru, Marius George; Summers, Ronald M
2015-01-01
Receiver operating characteristic (ROC) analysis is a standard methodology to evaluate the performance of a binary classification system. The area under the ROC curve (AUC) is a performance metric that summarizes how well a classifier separates two classes. Traditional AUC optimization techniques are supervised learning methods that utilize only labeled data (i.e., the true class is known for all data) to train the classifiers. In this work, inspired by semi-supervised and transductive learning, we propose two new AUC optimization algorithms hereby referred to as semi-supervised learning receiver operating characteristic (SSLROC) algorithms, which utilize unlabeled test samples in classifier training to maximize AUC. Unlabeled samples are incorporated into the AUC optimization process, and their ranking relationships to labeled positive and negative training samples are considered as optimization constraints. The introduced test samples will cause the learned decision boundary in a multidimensional feature space to adapt not only to the distribution of labeled training data, but also to the distribution of unlabeled test data. We formulate the semi-supervised AUC optimization problem as a semi-definite programming problem based on the margin maximization theory. The proposed methods SSLROC1 (1-norm) and SSLROC2 (2-norm) were evaluated using 34 (determined by power analysis) randomly selected datasets from the University of California, Irvine machine learning repository. Wilcoxon signed rank tests showed that the proposed methods achieved significant improvement compared with state-of-the-art methods. The proposed methods were also applied to a CT colonography dataset for colonic polyp classification and showed promising results.
Memarian, Negar; Kim, Sally; Dewar, Sandra; Engel, Jerome; Staba, Richard J
2015-09-01
This study sought to predict postsurgical seizure freedom from pre-operative diagnostic test results and clinical information using a rapid automated approach, based on supervised learning methods in patients with drug-resistant focal seizures suspected to begin in temporal lobe. We applied machine learning, specifically a combination of mutual information-based feature selection and supervised learning classifiers on multimodal data, to predict surgery outcome retrospectively in 20 presurgical patients (13 female; mean age±SD, in years 33±9.7 for females, and 35.3±9.4 for males) who were diagnosed with mesial temporal lobe epilepsy (MTLE) and subsequently underwent standard anteromesial temporal lobectomy. The main advantage of the present work over previous studies is the inclusion of the extent of ipsilateral neocortical gray matter atrophy and spatiotemporal properties of depth electrode-recorded seizures as training features for individual patient surgery planning. A maximum relevance minimum redundancy (mRMR) feature selector identified the following features as the most informative predictors of postsurgical seizure freedom in this study's sample of patients: family history of epilepsy, ictal EEG onset pattern (positive correlation with seizure freedom), MRI-based gray matter thickness reduction in the hemisphere ipsilateral to seizure onset, proportion of seizures that first appeared in ipsilateral amygdala to total seizures, age, epilepsy duration, delay in the spread of ipsilateral ictal discharges from site of onset, gender, and number of electrode contacts at seizure onset (negative correlation with seizure freedom). Using these features in combination with a least square support vector machine (LS-SVM) classifier compared to other commonly used classifiers resulted in very high surgical outcome prediction accuracy (95%). Supervised machine learning using multimodal compared to unimodal data accurately predicted postsurgical outcome in patients with atypical MTLE. Published by Elsevier Ltd.
Learning About Climate and Atmospheric Models Through Machine Learning
NASA Astrophysics Data System (ADS)
Lucas, D. D.
2017-12-01
From the analysis of ensemble variability to improving simulation performance, machine learning algorithms can play a powerful role in understanding the behavior of atmospheric and climate models. To learn about model behavior, we create training and testing data sets through ensemble techniques that sample different model configurations and values of input parameters, and then use supervised machine learning to map the relationships between the inputs and outputs. Following this procedure, we have used support vector machines, random forests, gradient boosting and other methods to investigate a variety of atmospheric and climate model phenomena. We have used machine learning to predict simulation crashes, estimate the probability density function of climate sensitivity, optimize simulations of the Madden Julian oscillation, assess the impacts of weather and emissions uncertainty on atmospheric dispersion, and quantify the effects of model resolution changes on precipitation. This presentation highlights recent examples of our applications of machine learning to improve the understanding of climate and atmospheric models. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Joint learning of labels and distance metric.
Liu, Bo; Wang, Meng; Hong, Richang; Zha, Zhengjun; Hua, Xian-Sheng
2010-06-01
Machine learning algorithms frequently suffer from the insufficiency of training data and the usage of inappropriate distance metric. In this paper, we propose a joint learning of labels and distance metric (JLLDM) approach, which is able to simultaneously address the two difficulties. In comparison with the existing semi-supervised learning and distance metric learning methods that focus only on label prediction or distance metric construction, the JLLDM algorithm optimizes the labels of unlabeled samples and a Mahalanobis distance metric in a unified scheme. The advantage of JLLDM is multifold: 1) the problem of training data insufficiency can be tackled; 2) a good distance metric can be constructed with only very few training samples; and 3) no radius parameter is needed since the algorithm automatically determines the scale of the metric. Extensive experiments are conducted to compare the JLLDM approach with different semi-supervised learning and distance metric learning methods, and empirical results demonstrate its effectiveness.
Shahriyari, Leili
2017-11-03
One of the main challenges in machine learning (ML) is choosing an appropriate normalization method. Here, we examine the effect of various normalization methods on analyzing FPKM upper quartile (FPKM-UQ) RNA sequencing data sets. We collect the HTSeq-FPKM-UQ files of patients with colon adenocarcinoma from TCGA-COAD project. We compare three most common normalization methods: scaling, standardizing using z-score and vector normalization by visualizing the normalized data set and evaluating the performance of 12 supervised learning algorithms on the normalized data set. Additionally, for each of these normalization methods, we use two different normalization strategies: normalizing samples (files) or normalizing features (genes). Regardless of normalization methods, a support vector machine (SVM) model with the radial basis function kernel had the maximum accuracy (78%) in predicting the vital status of the patients. However, the fitting time of SVM depended on the normalization methods, and it reached its minimum fitting time when files were normalized to the unit length. Furthermore, among all 12 learning algorithms and 6 different normalization techniques, the Bernoulli naive Bayes model after standardizing files had the best performance in terms of maximizing the accuracy as well as minimizing the fitting time. We also investigated the effect of dimensionality reduction methods on the performance of the supervised ML algorithms. Reducing the dimension of the data set did not increase the maximum accuracy of 78%. However, it leaded to discovery of the 7SK RNA gene expression as a predictor of survival in patients with colon adenocarcinoma with accuracy of 78%. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
NASA Astrophysics Data System (ADS)
Ruske, S. T.; Topping, D. O.; Foot, V. E.; Kaye, P. H.; Stanley, W. R.; Morse, A. P.; Crawford, I.; Gallagher, M. W.
2016-12-01
Characterisation of bio-aerosols has important implications within Environment and Public Health sectors. Recent developments in Ultra-Violet Light Induced Fluorescence (UV-LIF) detectors such as the Wideband Integrated bio-aerosol Spectrometer (WIBS) and the newly introduced Multiparameter bio-aerosol Spectrometer (MBS) has allowed for the real time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal Spores and pollen. This new generation of instruments has enabled ever-larger data sets to be compiled with the aim of studying more complex environments, yet the algorithms used for specie classification remain largely invalidated. It is therefore imperative that we validate the performance of different algorithms that can be used for the task of classification, which is the focus of this study. For unsupervised learning we test Hierarchical Agglomerative Clustering with various different linkages. For supervised learning, ten methods were tested; including decision trees, ensemble methods: Random Forests, Gradient Boosting and AdaBoost; two implementations for support vector machines: libsvm and liblinear; Gaussian methods: Gaussian naïve Bayesian, quadratic and linear discriminant analysis and finally the k-nearest neighbours algorithm. The methods were applied to two different data sets measured using a new Multiparameter bio-aerosol Spectrometer. We find that clustering, in general, performs slightly worse than the supervised learning methods correctly classifying, at best, only 72.7 and 91.1 percent for the two data sets. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 88.1 and 97.8 percent of the testing data respectively across the two data sets. We discuss the wider relevance of these results with regards to challenging existing classification in real-world environments.
Intelligible machine learning with malibu.
Langlois, Robert E; Lu, Hui
2008-01-01
malibu is an open-source machine learning work-bench developed in C/C++ for high-performance real-world applications, namely bioinformatics and medical informatics. It leverages third-party machine learning implementations for more robust bug-free software. This workbench handles several well-studied supervised machine learning problems including classification, regression, importance-weighted classification and multiple-instance learning. The malibu interface was designed to create reproducible experiments ideally run in a remote and/or command line environment. The software can be found at: http://proteomics.bioengr. uic.edu/malibu/index.html.
NASA Astrophysics Data System (ADS)
Cruz-Roa, Angel; Arevalo, John; Basavanhally, Ajay; Madabhushi, Anant; González, Fabio
2015-01-01
Learning data representations directly from the data itself is an approach that has shown great success in different pattern recognition problems, outperforming state-of-the-art feature extraction schemes for different tasks in computer vision, speech recognition and natural language processing. Representation learning applies unsupervised and supervised machine learning methods to large amounts of data to find building-blocks that better represent the information in it. Digitized histopathology images represents a very good testbed for representation learning since it involves large amounts of high complex, visual data. This paper presents a comparative evaluation of different supervised and unsupervised representation learning architectures to specifically address open questions on what type of learning architectures (deep or shallow), type of learning (unsupervised or supervised) is optimal. In this paper we limit ourselves to addressing these questions in the context of distinguishing between anaplastic and non-anaplastic medulloblastomas from routine haematoxylin and eosin stained images. The unsupervised approaches evaluated were sparse autoencoders and topographic reconstruct independent component analysis, and the supervised approach was convolutional neural networks. Experimental results show that shallow architectures with more neurons are better than deeper architectures without taking into account local space invariances and that topographic constraints provide useful invariant features in scale and rotations for efficient tumor differentiation.
Contemporary machine learning: techniques for practitioners in the physical sciences
NASA Astrophysics Data System (ADS)
Spears, Brian
2017-10-01
Machine learning is the science of using computers to find relationships in data without explicitly knowing or programming those relationships in advance. Often without realizing it, we employ machine learning every day as we use our phones or drive our cars. Over the last few years, machine learning has found increasingly broad application in the physical sciences. This most often involves building a model relationship between a dependent, measurable output and an associated set of controllable, but complicated, independent inputs. The methods are applicable both to experimental observations and to databases of simulated output from large, detailed numerical simulations. In this tutorial, we will present an overview of current tools and techniques in machine learning - a jumping-off point for researchers interested in using machine learning to advance their work. We will discuss supervised learning techniques for modeling complicated functions, beginning with familiar regression schemes, then advancing to more sophisticated decision trees, modern neural networks, and deep learning methods. Next, we will cover unsupervised learning and techniques for reducing the dimensionality of input spaces and for clustering data. We'll show example applications from both magnetic and inertial confinement fusion. Along the way, we will describe methods for practitioners to help ensure that their models generalize from their training data to as-yet-unseen test data. We will finally point out some limitations to modern machine learning and speculate on some ways that practitioners from the physical sciences may be particularly suited to help. This work was performed by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
de Ávila, Maurício Boff; Xavier, Mariana Morrone; Pintro, Val Oliveira; de Azevedo, Walter Filgueira
2017-12-09
Here we report the development of a machine-learning model to predict binding affinity based on the crystallographic structures of protein-ligand complexes. We used an ensemble of crystallographic structures (resolution better than 1.5 Å resolution) for which half-maximal inhibitory concentration (IC 50 ) data is available. Polynomial scoring functions were built using as explanatory variables the energy terms present in the MolDock and PLANTS scoring functions. Prediction performance was tested and the supervised machine learning models showed improvement in the prediction power, when compared with PLANTS and MolDock scoring functions. In addition, the machine-learning model was applied to predict binding affinity of CDK2, which showed a better performance when compared with AutoDock4, AutoDock Vina, MolDock, and PLANTS scores. Copyright © 2017 Elsevier Inc. All rights reserved.
An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.
Liu, Yuzhe; Gopalakrishnan, Vanathi
2017-03-01
Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
NASA Astrophysics Data System (ADS)
Rai, A.; Minsker, B. S.
2016-12-01
In this work we introduce a novel dataset GRID: GReen Infrastructure Detection Dataset and a framework for identifying urban green storm water infrastructure (GI) designs (wetlands/ponds, urban trees, and rain gardens/bioswales) from social media and satellite aerial images using computer vision and machine learning methods. Along with the hydrologic benefits of GI, such as reducing runoff volumes and urban heat islands, GI also provides important socio-economic benefits such as stress recovery and community cohesion. However, GI is installed by many different parties and cities typically do not know where GI is located, making study of its impacts or siting new GI difficult. We use object recognition learning methods (template matching, sliding window approach, and Random Hough Forest method) and supervised machine learning algorithms (e.g., support vector machines) as initial screening approaches to detect potential GI sites, which can then be investigated in more detail using on-site surveys. Training data were collected from GPS locations of Flickr and Instagram image postings and Amazon Mechanical Turk identification of each GI type. Sliding window method outperformed other methods and achieved an average F measure, which is combined metric for precision and recall performance measure of 0.78.
Porosity estimation by semi-supervised learning with sparsely available labeled samples
NASA Astrophysics Data System (ADS)
Lima, Luiz Alberto; Görnitz, Nico; Varella, Luiz Eduardo; Vellasco, Marley; Müller, Klaus-Robert; Nakajima, Shinichi
2017-09-01
This paper addresses the porosity estimation problem from seismic impedance volumes and porosity samples located in a small group of exploratory wells. Regression methods, trained on the impedance as inputs and the porosity as output labels, generally suffer from extremely expensive (and hence sparsely available) porosity samples. To optimally make use of the valuable porosity data, a semi-supervised machine learning method was proposed, Transductive Conditional Random Field Regression (TCRFR), showing good performance (Görnitz et al., 2017). TCRFR, however, still requires more labeled data than those usually available, which creates a gap when applying the method to the porosity estimation problem in realistic situations. In this paper, we aim to fill this gap by introducing two graph-based preprocessing techniques, which adapt the original TCRFR for extremely weakly supervised scenarios. Our new method outperforms the previous automatic estimation methods on synthetic data and provides a comparable result to the manual labored, time-consuming geostatistics approach on real data, proving its potential as a practical industrial tool.
Artificial Intelligence in Cardiology.
Johnson, Kipp W; Torres Soto, Jessica; Glicksberg, Benjamin S; Shameer, Khader; Miotto, Riccardo; Ali, Mohsin; Ashley, Euan; Dudley, Joel T
2018-06-12
Artificial intelligence and machine learning are poised to influence nearly every aspect of the human condition, and cardiology is not an exception to this trend. This paper provides a guide for clinicians on relevant aspects of artificial intelligence and machine learning, reviews selected applications of these methods in cardiology to date, and identifies how cardiovascular medicine could incorporate artificial intelligence in the future. In particular, the paper first reviews predictive modeling concepts relevant to cardiology such as feature selection and frequent pitfalls such as improper dichotomization. Second, it discusses common algorithms used in supervised learning and reviews selected applications in cardiology and related disciplines. Third, it describes the advent of deep learning and related methods collectively called unsupervised learning, provides contextual examples both in general medicine and in cardiovascular medicine, and then explains how these methods could be applied to enable precision cardiology and improve patient outcomes. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.
Timoshenko, Janis; Lu, Deyu; Lin, Yuewei; ...
2017-09-29
Tracking the structure of heterogeneous catalysts under operando conditions remains a challenge due to the paucity of experimental techniques that can provide atomic-level information for catalytic metal species. Here we report on the use of X-ray absorption near edge structure (XANES) spectroscopy and supervised machine learning (SML) for refining the three-dimensional geometry of metal catalysts. SML is used to unravel the hidden relationship between the XANES features and catalyst geometry. To train our SML method, we rely on ab-initio XANES simulations. Our approach allows one to solve the structure of a metal catalyst from its experimental XANES, as demonstrated heremore » by reconstructing the average size, shape and morphology of well-defined platinum nanoparticles. This method is applicable to the determination of the nanoparticle structure in operando studies and can be generalized to other nanoscale systems. In conclusion, it also allows on-the-fly XANES analysis, and is a promising approach for high-throughput and time-dependent studies.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Timoshenko, Janis; Lu, Deyu; Lin, Yuewei
Tracking the structure of heterogeneous catalysts under operando conditions remains a challenge due to the paucity of experimental techniques that can provide atomic-level information for catalytic metal species. Here we report on the use of X-ray absorption near edge structure (XANES) spectroscopy and supervised machine learning (SML) for refining the three-dimensional geometry of metal catalysts. SML is used to unravel the hidden relationship between the XANES features and catalyst geometry. To train our SML method, we rely on ab-initio XANES simulations. Our approach allows one to solve the structure of a metal catalyst from its experimental XANES, as demonstrated heremore » by reconstructing the average size, shape and morphology of well-defined platinum nanoparticles. This method is applicable to the determination of the nanoparticle structure in operando studies and can be generalized to other nanoscale systems. In conclusion, it also allows on-the-fly XANES analysis, and is a promising approach for high-throughput and time-dependent studies.« less
Dual Low-Rank Pursuit: Learning Salient Features for Saliency Detection.
Lang, Congyan; Feng, Jiashi; Feng, Songhe; Wang, Jingdong; Yan, Shuicheng
2016-06-01
Saliency detection is an important procedure for machines to understand visual world as humans do. In this paper, we consider a specific saliency detection problem of predicting human eye fixations when they freely view natural images, and propose a novel dual low-rank pursuit (DLRP) method. DLRP learns saliency-aware feature transformations by utilizing available supervision information and constructs discriminative bases for effectively detecting human fixation points under the popular low-rank and sparsity-pursuit framework. Benefiting from the embedded high-level information in the supervised learning process, DLRP is able to predict fixations accurately without performing the expensive object segmentation as in the previous works. Comprehensive experiments clearly show the superiority of the proposed DLRP method over the established state-of-the-art methods. We also empirically demonstrate that DLRP provides stronger generalization performance across different data sets and inherits the advantages of both the bottom-up- and top-down-based saliency detection methods.
Ellis, Katherine; Godbole, Suneeta; Marshall, Simon; Lanckriet, Gert; Staudenmayer, John; Kerr, Jacqueline
2014-01-01
Active travel is an important area in physical activity research, but objective measurement of active travel is still difficult. Automated methods to measure travel behaviors will improve research in this area. In this paper, we present a supervised machine learning method for transportation mode prediction from global positioning system (GPS) and accelerometer data. We collected a dataset of about 150 h of GPS and accelerometer data from two research assistants following a protocol of prescribed trips consisting of five activities: bicycling, riding in a vehicle, walking, sitting, and standing. We extracted 49 features from 1-min windows of this data. We compared the performance of several machine learning algorithms and chose a random forest algorithm to classify the transportation mode. We used a moving average output filter to smooth the output predictions over time. The random forest algorithm achieved 89.8% cross-validated accuracy on this dataset. Adding the moving average filter to smooth output predictions increased the cross-validated accuracy to 91.9%. Machine learning methods are a viable approach for automating measurement of active travel, particularly for measuring travel activities that traditional accelerometer data processing methods misclassify, such as bicycling and vehicle travel.
NASA Astrophysics Data System (ADS)
Bonfanti, C. E.; Stewart, J.; Lee, Y. J.; Govett, M.; Trailovic, L.; Etherton, B.
2017-12-01
One of the National Oceanic and Atmospheric Administration (NOAA) goals is to provide timely and reliable weather forecasts to support important decisions when and where people need it for safety, emergencies, planning for day-to-day activities. Satellite data is essential for areas lacking in-situ observations for use as initial conditions in Numerical Weather Prediction (NWP) Models, such as spans of the ocean or remote areas of land. Currently only about 7% of total received satellite data is selected for use and from that, an even smaller percentage ever are assimilated into NWP models. With machine learning, the computational and time costs needed for satellite data selection can be greatly reduced. We study various machine learning approaches to process orders of magnitude more satellite data in significantly less time allowing for a greater quantity and more intelligent selection of data to be used for assimilation purposes. Given the future launches of satellites in the upcoming years, machine learning is capable of being applied for better selection of Regions of Interest (ROI) in the magnitudes more of satellite data that will be received. This paper discusses the background of machine learning methods as applied to weather forecasting and the challenges of creating a "labeled dataset" for training and testing purposes. In the training stage of supervised machine learning, labeled data are important to identify a ROI as either true or false so that the model knows what signatures in satellite data to identify. Authors have selected cyclones, including tropical cyclones and mid-latitude lows, as ROI for their machine learning purposes and created a labeled dataset of true or false for ROI from Global Forecast System (GFS) reanalysis data. A dataset like this does not yet exist and given the need for a high quantity of samples, is was decided this was best done with automation. This process was done by developing a program similar to the National Center for Environmental Prediction (NCEP) tropical cyclone tracker by Marchok that was used to identify cyclones based off its physical characteristics. We will discuss the methods and challenges to creating this dataset and the dataset's use for our current supervised machine learning model as well as use for future work on events such as convection initiation.
NASA Astrophysics Data System (ADS)
Pickman, Yishai; Dunn-Walters, Deborah; Mehr, Ramit
2013-10-01
Complementarity-determining region 3 (CDR3) is the most hyper-variable region in B cell receptor (BCR) and T cell receptor (TCR) genes, and the most critical structure in antigen recognition and thereby in determining the fates of developing and responding lymphocytes. There are millions of different TCR Vβ chain or BCR heavy chain CDR3 sequences in human blood. Even now, when high-throughput sequencing becomes widely used, CDR3 length distributions (also called spectratypes) are still a much quicker and cheaper method of assessing repertoire diversity. However, distribution complexity and the large amount of information per sample (e.g. 32 distributions of the TCRα chain, and 24 of TCRβ) calls for the use of machine learning tools for full exploration. We have examined the ability of supervised machine learning, which uses computational models to find hidden patterns in predefined biological groups, to analyze CDR3 length distributions from various sources, and distinguish between experimental groups. We found that (a) splenic BCR CDR3 length distributions are characterized by low standard deviations and few local maxima, compared to peripheral blood distributions; (b) healthy elderly people's BCR CDR3 length distributions can be distinguished from those of the young; and (c) a machine learning model based on TCR CDR3 distribution features can detect myelodysplastic syndrome with approximately 93% accuracy. Overall, we demonstrate that using supervised machine learning methods can contribute to our understanding of lymphocyte repertoire diversity.
Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.
Chen, Qingyu; Zobel, Justin; Zhang, Xiuzhen; Verspoor, Karin
2016-01-01
First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as can experts. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. We developed and evaluated a supervised duplicate detection method based on an expert curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieve promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintains high accuracy and is more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from meta-data, sequence identity, and alignment quality impact performance most strongly. The study demonstrates machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All Data are available as described in the supplementary material.
Lynch, Chip M; Abdollahi, Behnaz; Fuqua, Joshua D; de Carlo, Alexandra R; Bartholomai, James A; Balgemann, Rayeanne N; van Berkel, Victor H; Frieboes, Hermann B
2017-12-01
Outcomes for cancer patients have been previously estimated by applying various machine learning techniques to large datasets such as the Surveillance, Epidemiology, and End Results (SEER) program database. In particular for lung cancer, it is not well understood which types of techniques would yield more predictive information, and which data attributes should be used in order to determine this information. In this study, a number of supervised learning techniques is applied to the SEER database to classify lung cancer patients in terms of survival, including linear regression, Decision Trees, Gradient Boosting Machines (GBM), Support Vector Machines (SVM), and a custom ensemble. Key data attributes in applying these methods include tumor grade, tumor size, gender, age, stage, and number of primaries, with the goal to enable comparison of predictive power between the various methods The prediction is treated like a continuous target, rather than a classification into categories, as a first step towards improving survival prediction. The results show that the predicted values agree with actual values for low to moderate survival times, which constitute the majority of the data. The best performing technique was the custom ensemble with a Root Mean Square Error (RMSE) value of 15.05. The most influential model within the custom ensemble was GBM, while Decision Trees may be inapplicable as it had too few discrete outputs. The results further show that among the five individual models generated, the most accurate was GBM with an RMSE value of 15.32. Although SVM underperformed with an RMSE value of 15.82, statistical analysis singles the SVM as the only model that generated a distinctive output. The results of the models are consistent with a classical Cox proportional hazards model used as a reference technique. We conclude that application of these supervised learning techniques to lung cancer data in the SEER database may be of use to estimate patient survival time with the ultimate goal to inform patient care decisions, and that the performance of these techniques with this particular dataset may be on par with that of classical methods. Copyright © 2017 Elsevier B.V. All rights reserved.
Memarian, Negar; Torre, Jared B.; Haltom, Kate E.; Stanton, Annette L.
2017-01-01
Abstract Affect labeling (putting feelings into words) is a form of incidental emotion regulation that could underpin some benefits of expressive writing (i.e. writing about negative experiences). Here, we show that neural responses during affect labeling predicted changes in psychological and physical well-being outcome measures 3 months later. Furthermore, neural activity of specific frontal regions and amygdala predicted those outcomes as a function of expressive writing. Using supervised learning (support vector machines regression), improvements in four measures of psychological and physical health (physical symptoms, depression, anxiety and life satisfaction) after an expressive writing intervention were predicted with an average of 0.85% prediction error [root mean square error (RMSE) %]. The predictions were significantly more accurate with machine learning than with the conventional generalized linear model method (average RMSE: 1.3%). Consistent with affect labeling research, right ventrolateral prefrontal cortex (RVLPFC) and amygdalae were top predictors of improvement in the four outcomes. Moreover, RVLPFC and left amygdala predicted benefits due to expressive writing in satisfaction with life and depression outcome measures, respectively. This study demonstrates the substantial merit of supervised machine learning for real-world outcome prediction in social and affective neuroscience. PMID:28992270
An introduction to kernel-based learning algorithms.
Müller, K R; Mika, S; Rätsch, G; Tsuda, K; Schölkopf, B
2001-01-01
This paper provides an introduction to support vector machines, kernel Fisher discriminant analysis, and kernel principal component analysis, as examples for successful kernel-based learning methods. We first give a short background about Vapnik-Chervonenkis theory and kernel feature spaces and then proceed to kernel based learning in supervised and unsupervised scenarios including practical and algorithmic considerations. We illustrate the usefulness of kernel algorithms by discussing applications such as optical character recognition and DNA analysis.
SemiBoost: boosting for semi-supervised learning.
Mallapragada, Pavan Kumar; Jin, Rong; Jain, Anil K; Liu, Yi
2009-11-01
Semi-supervised learning has attracted a significant amount of attention in pattern recognition and machine learning. Most previous studies have focused on designing special algorithms to effectively exploit the unlabeled data in conjunction with labeled data. Our goal is to improve the classification accuracy of any given supervised learning algorithm by using the available unlabeled examples. We call this as the Semi-supervised improvement problem, to distinguish the proposed approach from the existing approaches. We design a metasemi-supervised learning algorithm that wraps around the underlying supervised algorithm and improves its performance using unlabeled data. This problem is particularly important when we need to train a supervised learning algorithm with a limited number of labeled examples and a multitude of unlabeled examples. We present a boosting framework for semi-supervised learning, termed as SemiBoost. The key advantages of the proposed semi-supervised learning approach are: 1) performance improvement of any supervised learning algorithm with a multitude of unlabeled data, 2) efficient computation by the iterative boosting algorithm, and 3) exploiting both manifold and cluster assumption in training classification models. An empirical study on 16 different data sets and text categorization demonstrates that the proposed framework improves the performance of several commonly used supervised learning algorithms, given a large number of unlabeled examples. We also show that the performance of the proposed algorithm, SemiBoost, is comparable to the state-of-the-art semi-supervised learning algorithms.
Extracting laboratory test information from biomedical text
Kang, Yanna Shen; Kayaalp, Mehmet
2013-01-01
Background: No previous study reported the efficacy of current natural language processing (NLP) methods for extracting laboratory test information from narrative documents. This study investigates the pathology informatics question of how accurately such information can be extracted from text with the current tools and techniques, especially machine learning and symbolic NLP methods. The study data came from a text corpus maintained by the U.S. Food and Drug Administration, containing a rich set of information on laboratory tests and test devices. Methods: The authors developed a symbolic information extraction (SIE) system to extract device and test specific information about four types of laboratory test entities: Specimens, analytes, units of measures and detection limits. They compared the performance of SIE and three prominent machine learning based NLP systems, LingPipe, GATE and BANNER, each implementing a distinct supervised machine learning method, hidden Markov models, support vector machines and conditional random fields, respectively. Results: Machine learning systems recognized laboratory test entities with moderately high recall, but low precision rates. Their recall rates were relatively higher when the number of distinct entity values (e.g., the spectrum of specimens) was very limited or when lexical morphology of the entity was distinctive (as in units of measures), yet SIE outperformed them with statistically significant margins on extracting specimen, analyte and detection limit information in both precision and F-measure. Its high recall performance was statistically significant on analyte information extraction. Conclusions: Despite its shortcomings against machine learning methods, a well-tailored symbolic system may better discern relevancy among a pile of information of the same type and may outperform a machine learning system by tapping into lexically non-local contextual information such as the document structure. PMID:24083058
NASA Astrophysics Data System (ADS)
Uezu, Tatsuya; Kiyokawa, Shuji
2016-06-01
We investigate the supervised batch learning of Boolean functions expressed by a two-layer perceptron with a tree-like structure. We adopt continuous weights (spherical model) and the Gibbs algorithm. We study the Parity and And machines and two types of noise, input and output noise, together with the noiseless case. We assume that only the teacher suffers from noise. By using the replica method, we derive the saddle point equations for order parameters under the replica symmetric (RS) ansatz. We study the critical value αC of the loading rate α above which the learning phase exists for cases with and without noise. We find that αC is nonzero for the Parity machine, while it is zero for the And machine. We derive the exponents barβ of order parameters expressed as (α - α C)bar{β} when α is near to αC. Furthermore, in the Parity machine, when noise exists, we find a spin glass solution, in which the overlap between the teacher and student vectors is zero but that between student vectors is nonzero. We perform Markov chain Monte Carlo simulations by simulated annealing and also by exchange Monte Carlo simulations in both machines. In the Parity machine, we study the de Almeida-Thouless stability, and by comparing theoretical and numerical results, we find that there exist parameter regions where the RS solution is unstable, and that the spin glass solution is metastable or unstable. We also study asymptotic learning behavior for large α and derive the exponents hat{β } of order parameters expressed as α - hat{β } when α is large in both machines. By simulated annealing simulations, we confirm these results and conclude that learning takes place for the input noise case with any noise amplitude and for the output noise case when the probability that the teacher's output is reversed is less than one-half.
Jiang, Yizhang; Wu, Dongrui; Deng, Zhaohong; Qian, Pengjiang; Wang, Jun; Wang, Guanjin; Chung, Fu-Lai; Choi, Kup-Sze; Wang, Shitong
2017-12-01
Recognition of epileptic seizures from offline EEG signals is very important in clinical diagnosis of epilepsy. Compared with manual labeling of EEG signals by doctors, machine learning approaches can be faster and more consistent. However, the classification accuracy is usually not satisfactory for two main reasons: the distributions of the data used for training and testing may be different, and the amount of training data may not be enough. In addition, most machine learning approaches generate black-box models that are difficult to interpret. In this paper, we integrate transductive transfer learning, semi-supervised learning and TSK fuzzy system to tackle these three problems. More specifically, we use transfer learning to reduce the discrepancy in data distribution between the training and testing data, employ semi-supervised learning to use the unlabeled testing data to remedy the shortage of training data, and adopt TSK fuzzy system to increase model interpretability. Two learning algorithms are proposed to train the system. Our experimental results show that the proposed approaches can achieve better performance than many state-of-the-art seizure classification algorithms.
Implementing Machine Learning in Radiology Practice and Research.
Kohli, Marc; Prevedello, Luciano M; Filice, Ross W; Geis, J Raymond
2017-04-01
The purposes of this article are to describe concepts that radiologists should understand to evaluate machine learning projects, including common algorithms, supervised as opposed to unsupervised techniques, statistical pitfalls, and data considerations for training and evaluation, and to briefly describe ethical dilemmas and legal risk. Machine learning includes a broad class of computer programs that improve with experience. The complexity of creating, training, and monitoring machine learning indicates that the success of the algorithms will require radiologist involvement for years to come, leading to engagement rather than replacement.
Testing and Validating Machine Learning Classifiers by Metamorphic Testing☆
Xie, Xiaoyuan; Ho, Joshua W. K.; Murphy, Christian; Kaiser, Gail; Xu, Baowen; Chen, Tsong Yueh
2011-01-01
Machine Learning algorithms have provided core functionality to many application domains - such as bioinformatics, computational linguistics, etc. However, it is difficult to detect faults in such applications because often there is no “test oracle” to verify the correctness of the computed outputs. To help address the software quality, in this paper we present a technique for testing the implementations of machine learning classification algorithms which support such applications. Our approach is based on the technique “metamorphic testing”, which has been shown to be effective to alleviate the oracle problem. Also presented include a case study on a real-world machine learning application framework, and a discussion of how programmers implementing machine learning algorithms can avoid the common pitfalls discovered in our study. We also conduct mutation analysis and cross-validation, which reveal that our method has high effectiveness in killing mutants, and that observing expected cross-validation result alone is not sufficiently effective to detect faults in a supervised classification program. The effectiveness of metamorphic testing is further confirmed by the detection of real faults in a popular open-source classification program. PMID:21532969
Ellis, Katherine; Godbole, Suneeta; Marshall, Simon; Lanckriet, Gert; Staudenmayer, John; Kerr, Jacqueline
2014-01-01
Background: Active travel is an important area in physical activity research, but objective measurement of active travel is still difficult. Automated methods to measure travel behaviors will improve research in this area. In this paper, we present a supervised machine learning method for transportation mode prediction from global positioning system (GPS) and accelerometer data. Methods: We collected a dataset of about 150 h of GPS and accelerometer data from two research assistants following a protocol of prescribed trips consisting of five activities: bicycling, riding in a vehicle, walking, sitting, and standing. We extracted 49 features from 1-min windows of this data. We compared the performance of several machine learning algorithms and chose a random forest algorithm to classify the transportation mode. We used a moving average output filter to smooth the output predictions over time. Results: The random forest algorithm achieved 89.8% cross-validated accuracy on this dataset. Adding the moving average filter to smooth output predictions increased the cross-validated accuracy to 91.9%. Conclusion: Machine learning methods are a viable approach for automating measurement of active travel, particularly for measuring travel activities that traditional accelerometer data processing methods misclassify, such as bicycling and vehicle travel. PMID:24795875
Employing Machine-Learning Methods to Study Young Stellar Objects
NASA Astrophysics Data System (ADS)
Moore, Nicholas
2018-01-01
Vast amounts of data exist in the astronomical data archives, and yet a large number of sources remain unclassified. We developed a multi-wavelength pipeline to classify infrared sources. The pipeline uses supervised machine learning methods to classify objects into the appropriate categories. The program is fed data that is already classified to train it, and is then applied to unknown catalogues. The primary use for such a pipeline is the rapid classification and cataloging of data that would take a much longer time to classify otherwise. While our primary goal is to study young stellar objects (YSOs), the applications extend beyond the scope of this project. We present preliminary results from our analysis and discuss future applications.
Object recognition through a multi-mode fiber
NASA Astrophysics Data System (ADS)
Takagi, Ryosuke; Horisaki, Ryoichi; Tanida, Jun
2017-04-01
We present a method of recognizing an object through a multi-mode fiber. A number of speckle patterns transmitted through a multi-mode fiber are provided to a classifier based on machine learning. We experimentally demonstrated binary classification of face and non-face targets based on the method. The measurement process of the experimental setup was random and nonlinear because a multi-mode fiber is a typical strongly scattering medium and any reference light was not used in our setup. Comparisons between three supervised learning methods, support vector machine, adaptive boosting, and neural network, are also provided. All of those learning methods achieved high accuracy rates at about 90% for the classification. The approach presented here can realize a compact and smart optical sensor. It is practically useful for medical applications, such as endoscopy. Also our study indicated a promising utilization of artificial intelligence, which has rapidly progressed, for reducing optical and computational costs in optical sensing systems.
Amis, Gregory P; Carpenter, Gail A
2010-03-01
Computational models of learning typically train on labeled input patterns (supervised learning), unlabeled input patterns (unsupervised learning), or a combination of the two (semi-supervised learning). In each case input patterns have a fixed number of features throughout training and testing. Human and machine learning contexts present additional opportunities for expanding incomplete knowledge from formal training, via self-directed learning that incorporates features not previously experienced. This article defines a new self-supervised learning paradigm to address these richer learning contexts, introducing a neural network called self-supervised ARTMAP. Self-supervised learning integrates knowledge from a teacher (labeled patterns with some features), knowledge from the environment (unlabeled patterns with more features), and knowledge from internal model activation (self-labeled patterns). Self-supervised ARTMAP learns about novel features from unlabeled patterns without destroying partial knowledge previously acquired from labeled patterns. A category selection function bases system predictions on known features, and distributed network activation scales unlabeled learning to prediction confidence. Slow distributed learning on unlabeled patterns focuses on novel features and confident predictions, defining classification boundaries that were ambiguous in the labeled patterns. Self-supervised ARTMAP improves test accuracy on illustrative low-dimensional problems and on high-dimensional benchmarks. Model code and benchmark data are available from: http://techlab.eu.edu/SSART/. Copyright 2009 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Re, Matteo; Valentini, Giorgio
2012-03-01
Ensemble methods are statistical and computational learning procedures reminiscent of the human social learning behavior of seeking several opinions before making any crucial decision. The idea of combining the opinions of different "experts" to obtain an overall “ensemble” decision is rooted in our culture at least from the classical age of ancient Greece, and it has been formalized during the Enlightenment with the Condorcet Jury Theorem[45]), which proved that the judgment of a committee is superior to those of individuals, provided the individuals have reasonable competence. Ensembles are sets of learning machines that combine in some way their decisions, or their learning algorithms, or different views of data, or other specific characteristics to obtain more reliable and more accurate predictions in supervised and unsupervised learning problems [48,116]. A simple example is represented by the majority vote ensemble, by which the decisions of different learning machines are combined, and the class that receives the majority of “votes” (i.e., the class predicted by the majority of the learning machines) is the class predicted by the overall ensemble [158]. In the literature, a plethora of terms other than ensembles has been used, such as fusion, combination, aggregation, and committee, to indicate sets of learning machines that work together to solve a machine learning problem [19,40,56,66,99,108,123], but in this chapter we maintain the term ensemble in its widest meaning, in order to include the whole range of combination methods. Nowadays, ensemble methods represent one of the main current research lines in machine learning [48,116], and the interest of the research community on ensemble methods is witnessed by conferences and workshops specifically devoted to ensembles, first of all the multiple classifier systems (MCS) conference organized by Roli, Kittler, Windeatt, and other researchers of this area [14,62,85,149,173]. Several theories have been proposed to explain the characteristics and the successful application of ensembles to different application domains. For instance, Allwein, Schapire, and Singer interpreted the improved generalization capabilities of ensembles of learning machines in the framework of large margin classifiers [4,177], Kleinberg in the context of stochastic discrimination theory [112], and Breiman and Friedman in the light of the bias-variance analysis borrowed from classical statistics [21,70]. Empirical studies showed that both in classification and regression problems, ensembles improve on single learning machines, and moreover large experimental studies compared the effectiveness of different ensemble methods on benchmark data sets [10,11,49,188]. The interest in this research area is motivated also by the availability of very fast computers and networks of workstations at a relatively low cost that allow the implementation and the experimentation of complex ensemble methods using off-the-shelf computer platforms. However, as explained in Section 26.2 there are deeper reasons to use ensembles of learning machines, motivated by the intrinsic characteristics of the ensemble methods. The main aim of this chapter is to introduce ensemble methods and to provide an overview and a bibliography of the main areas of research, without pretending to be exhaustive or to explain the detailed characteristics of each ensemble method. The paper is organized as follows. In the next section, the main theoretical and practical reasons for combining multiple learners are introduced. Section 26.3 depicts the main taxonomies on ensemble methods proposed in the literature. In Section 26.4 and 26.5, we present an overview of the main supervised ensemble methods reported in the literature, adopting a simple taxonomy, originally proposed in Ref. [201]. Applications of ensemble methods are only marginally considered, but a specific section on some relevant applications of ensemble methods in astronomy and astrophysics has been added (Section 26.6). The conclusion (Section 26.7) ends this paper and lists some issues not covered in this work.
Machine learning for outcome prediction of acute ischemic stroke post intra-arterial therapy.
Asadi, Hamed; Dowling, Richard; Yan, Bernard; Mitchell, Peter
2014-01-01
Stroke is a major cause of death and disability. Accurately predicting stroke outcome from a set of predictive variables may identify high-risk patients and guide treatment approaches, leading to decreased morbidity. Logistic regression models allow for the identification and validation of predictive variables. However, advanced machine learning algorithms offer an alternative, in particular, for large-scale multi-institutional data, with the advantage of easily incorporating newly available data to improve prediction performance. Our aim was to design and compare different machine learning methods, capable of predicting the outcome of endovascular intervention in acute anterior circulation ischaemic stroke. We conducted a retrospective study of a prospectively collected database of acute ischaemic stroke treated by endovascular intervention. Using SPSS®, MATLAB®, and Rapidminer®, classical statistics as well as artificial neural network and support vector algorithms were applied to design a supervised machine capable of classifying these predictors into potential good and poor outcomes. These algorithms were trained, validated and tested using randomly divided data. We included 107 consecutive acute anterior circulation ischaemic stroke patients treated by endovascular technique. Sixty-six were male and the mean age of 65.3. All the available demographic, procedural and clinical factors were included into the models. The final confusion matrix of the neural network, demonstrated an overall congruency of ∼ 80% between the target and output classes, with favourable receiving operative characteristics. However, after optimisation, the support vector machine had a relatively better performance, with a root mean squared error of 2.064 (SD: ± 0.408). We showed promising accuracy of outcome prediction, using supervised machine learning algorithms, with potential for incorporation of larger multicenter datasets, likely further improving prediction. Finally, we propose that a robust machine learning system can potentially optimise the selection process for endovascular versus medical treatment in the management of acute stroke.
NASA Technical Reports Server (NTRS)
Oza, Nikunj C.
2011-01-01
A supervised learning task involves constructing a mapping from input data (normally described by several features) to the appropriate outputs. Within supervised learning, one type of task is a classification learning task, in which each output is one or more classes to which the input belongs. In supervised learning, a set of training examples---examples with known output values---is used by a learning algorithm to generate a model. This model is intended to approximate the mapping between the inputs and outputs. This model can be used to generate predicted outputs for inputs that have not been seen before. For example, we may have data consisting of observations of sunspots. In a classification learning task, our goal may be to learn to classify sunspots into one of several types. Each example may correspond to one candidate sunspot with various measurements or just an image. A learning algorithm would use the supplied examples to generate a model that approximates the mapping between each supplied set of measurements and the type of sunspot. This model can then be used to classify previously unseen sunspots based on the candidate's measurements. This chapter discusses methods to perform machine learning, with examples involving astronomy.
Conditional High-Order Boltzmann Machines for Supervised Relation Learning.
Huang, Yan; Wang, Wei; Wang, Liang; Tan, Tieniu
2017-09-01
Relation learning is a fundamental problem in many vision tasks. Recently, high-order Boltzmann machine and its variants have shown their great potentials in learning various types of data relation in a range of tasks. But most of these models are learned in an unsupervised way, i.e., without using relation class labels, which are not very discriminative for some challenging tasks, e.g., face verification. In this paper, with the goal to perform supervised relation learning, we introduce relation class labels into conventional high-order multiplicative interactions with pairwise input samples, and propose a conditional high-order Boltzmann Machine (CHBM), which can learn to classify the data relation in a binary classification way. To be able to deal with more complex data relation, we develop two improved variants of CHBM: 1) latent CHBM, which jointly performs relation feature learning and classification, by using a set of latent variables to block the pathway from pairwise input samples to output relation labels and 2) gated CHBM, which untangles factors of variation in data relation, by exploiting a set of latent variables to multiplicatively gate the classification of CHBM. To reduce the large number of model parameters generated by the multiplicative interactions, we approximately factorize high-order parameter tensors into multiple matrices. Then, we develop efficient supervised learning algorithms, by first pretraining the models using joint likelihood to provide good parameter initialization, and then finetuning them using conditional likelihood to enhance the discriminant ability. We apply the proposed models to a series of tasks including invariant recognition, face verification, and action similarity labeling. Experimental results demonstrate that by exploiting supervised relation labels, our models can greatly improve the performance.
Semi-supervised vibration-based classification and condition monitoring of compressors
NASA Astrophysics Data System (ADS)
Potočnik, Primož; Govekar, Edvard
2017-09-01
Semi-supervised vibration-based classification and condition monitoring of the reciprocating compressors installed in refrigeration appliances is proposed in this paper. The method addresses the problem of industrial condition monitoring where prior class definitions are often not available or difficult to obtain from local experts. The proposed method combines feature extraction, principal component analysis, and statistical analysis for the extraction of initial class representatives, and compares the capability of various classification methods, including discriminant analysis (DA), neural networks (NN), support vector machines (SVM), and extreme learning machines (ELM). The use of the method is demonstrated on a case study which was based on industrially acquired vibration measurements of reciprocating compressors during the production of refrigeration appliances. The paper presents a comparative qualitative analysis of the applied classifiers, confirming the good performance of several nonlinear classifiers. If the model parameters are properly selected, then very good classification performance can be obtained from NN trained by Bayesian regularization, SVM and ELM classifiers. The method can be effectively applied for the industrial condition monitoring of compressors.
NASA Technical Reports Server (NTRS)
Mazzoni, Dominic; Wagstaff, Kiri; Bornstein, Benjamin; Tang, Nghia; Roden, Joseph
2006-01-01
PixelLearn is an integrated user-interface computer program for classifying pixels in scientific images. Heretofore, training a machine-learning algorithm to classify pixels in images has been tedious and difficult. PixelLearn provides a graphical user interface that makes it faster and more intuitive, leading to more interactive exploration of image data sets. PixelLearn also provides image-enhancement controls to make it easier to see subtle details in images. PixelLearn opens images or sets of images in a variety of common scientific file formats and enables the user to interact with several supervised or unsupervised machine-learning pixel-classifying algorithms while the user continues to browse through the images. The machinelearning algorithms in PixelLearn use advanced clustering and classification methods that enable accuracy much higher than is achievable by most other software previously available for this purpose. PixelLearn is written in portable C++ and runs natively on computers running Linux, Windows, or Mac OS X.
Applying active learning to supervised word sense disambiguation in MEDLINE
Chen, Yukun; Cao, Hongxin; Mei, Qiaozhu; Zheng, Kai; Xu, Hua
2013-01-01
Objectives This study was to assess whether active learning strategies can be integrated with supervised word sense disambiguation (WSD) methods, thus reducing the number of annotated samples, while keeping or improving the quality of disambiguation models. Methods We developed support vector machine (SVM) classifiers to disambiguate 197 ambiguous terms and abbreviations in the MSH WSD collection. Three different uncertainty sampling-based active learning algorithms were implemented with the SVM classifiers and were compared with a passive learner (PL) based on random sampling. For each ambiguous term and each learning algorithm, a learning curve that plots the accuracy computed from the test set as a function of the number of annotated samples used in the model was generated. The area under the learning curve (ALC) was used as the primary metric for evaluation. Results Our experiments demonstrated that active learners (ALs) significantly outperformed the PL, showing better performance for 177 out of 197 (89.8%) WSD tasks. Further analysis showed that to achieve an average accuracy of 90%, the PL needed 38 annotated samples, while the ALs needed only 24, a 37% reduction in annotation effort. Moreover, we analyzed cases where active learning algorithms did not achieve superior performance and identified three causes: (1) poor models in the early learning stage; (2) easy WSD cases; and (3) difficult WSD cases, which provide useful insight for future improvements. Conclusions This study demonstrated that integrating active learning strategies with supervised WSD methods could effectively reduce annotation cost and improve the disambiguation models. PMID:23364851
Brain-state invariant thalamo-cortical coordination revealed by non-linear encoders.
Viejo, Guillaume; Cortier, Thomas; Peyrache, Adrien
2018-03-01
Understanding how neurons cooperate to integrate sensory inputs and guide behavior is a fundamental problem in neuroscience. A large body of methods have been developed to study neuronal firing at the single cell and population levels, generally seeking interpretability as well as predictivity. However, these methods are usually confronted with the lack of ground-truth necessary to validate the approach. Here, using neuronal data from the head-direction (HD) system, we present evidence demonstrating how gradient boosted trees, a non-linear and supervised Machine Learning tool, can learn the relationship between behavioral parameters and neuronal responses with high accuracy by optimizing the information rate. Interestingly, and unlike other classes of Machine Learning methods, the intrinsic structure of the trees can be interpreted in relation to behavior (e.g. to recover the tuning curves) or to study how neurons cooperate with their peers in the network. We show how the method, unlike linear analysis, reveals that the coordination in thalamo-cortical circuits is qualitatively the same during wakefulness and sleep, indicating a brain-state independent feed-forward circuit. Machine Learning tools thus open new avenues for benchmarking model-based characterization of spike trains.
Brain-state invariant thalamo-cortical coordination revealed by non-linear encoders
Cortier, Thomas; Peyrache, Adrien
2018-01-01
Understanding how neurons cooperate to integrate sensory inputs and guide behavior is a fundamental problem in neuroscience. A large body of methods have been developed to study neuronal firing at the single cell and population levels, generally seeking interpretability as well as predictivity. However, these methods are usually confronted with the lack of ground-truth necessary to validate the approach. Here, using neuronal data from the head-direction (HD) system, we present evidence demonstrating how gradient boosted trees, a non-linear and supervised Machine Learning tool, can learn the relationship between behavioral parameters and neuronal responses with high accuracy by optimizing the information rate. Interestingly, and unlike other classes of Machine Learning methods, the intrinsic structure of the trees can be interpreted in relation to behavior (e.g. to recover the tuning curves) or to study how neurons cooperate with their peers in the network. We show how the method, unlike linear analysis, reveals that the coordination in thalamo-cortical circuits is qualitatively the same during wakefulness and sleep, indicating a brain-state independent feed-forward circuit. Machine Learning tools thus open new avenues for benchmarking model-based characterization of spike trains. PMID:29565979
Sampling algorithms for validation of supervised learning models for Ising-like systems
NASA Astrophysics Data System (ADS)
Portman, Nataliya; Tamblyn, Isaac
2017-12-01
In this paper, we build and explore supervised learning models of ferromagnetic system behavior, using Monte-Carlo sampling of the spin configuration space generated by the 2D Ising model. Given the enormous size of the space of all possible Ising model realizations, the question arises as to how to choose a reasonable number of samples that will form physically meaningful and non-intersecting training and testing datasets. Here, we propose a sampling technique called ;ID-MH; that uses the Metropolis-Hastings algorithm creating Markov process across energy levels within the predefined configuration subspace. We show that application of this method retains phase transitions in both training and testing datasets and serves the purpose of validation of a machine learning algorithm. For larger lattice dimensions, ID-MH is not feasible as it requires knowledge of the complete configuration space. As such, we develop a new ;block-ID; sampling strategy: it decomposes the given structure into square blocks with lattice dimension N ≤ 5 and uses ID-MH sampling of candidate blocks. Further comparison of the performance of commonly used machine learning methods such as random forests, decision trees, k nearest neighbors and artificial neural networks shows that the PCA-based Decision Tree regressor is the most accurate predictor of magnetizations of the Ising model. For energies, however, the accuracy of prediction is not satisfactory, highlighting the need to consider more algorithmically complex methods (e.g., deep learning).
Memarian, Negar; Torre, Jared B; Haltom, Kate E; Stanton, Annette L; Lieberman, Matthew D
2017-09-01
Affect labeling (putting feelings into words) is a form of incidental emotion regulation that could underpin some benefits of expressive writing (i.e. writing about negative experiences). Here, we show that neural responses during affect labeling predicted changes in psychological and physical well-being outcome measures 3 months later. Furthermore, neural activity of specific frontal regions and amygdala predicted those outcomes as a function of expressive writing. Using supervised learning (support vector machines regression), improvements in four measures of psychological and physical health (physical symptoms, depression, anxiety and life satisfaction) after an expressive writing intervention were predicted with an average of 0.85% prediction error [root mean square error (RMSE) %]. The predictions were significantly more accurate with machine learning than with the conventional generalized linear model method (average RMSE: 1.3%). Consistent with affect labeling research, right ventrolateral prefrontal cortex (RVLPFC) and amygdalae were top predictors of improvement in the four outcomes. Moreover, RVLPFC and left amygdala predicted benefits due to expressive writing in satisfaction with life and depression outcome measures, respectively. This study demonstrates the substantial merit of supervised machine learning for real-world outcome prediction in social and affective neuroscience. © The Author (2017). Published by Oxford University Press.
Auction dynamics: A volume constrained MBO scheme
NASA Astrophysics Data System (ADS)
Jacobs, Matt; Merkurjev, Ekaterina; Esedoǧlu, Selim
2018-02-01
We show how auction algorithms, originally developed for the assignment problem, can be utilized in Merriman, Bence, and Osher's threshold dynamics scheme to simulate multi-phase motion by mean curvature in the presence of equality and inequality volume constraints on the individual phases. The resulting algorithms are highly efficient and robust, and can be used in simulations ranging from minimal partition problems in Euclidean space to semi-supervised machine learning via clustering on graphs. In the case of the latter application, numerous experimental results on benchmark machine learning datasets show that our approach exceeds the performance of current state-of-the-art methods, while requiring a fraction of the computation time.
Automatic microseismic event picking via unsupervised machine learning
NASA Astrophysics Data System (ADS)
Chen, Yangkang
2018-01-01
Effective and efficient arrival picking plays an important role in microseismic and earthquake data processing and imaging. Widely used short-term-average long-term-average ratio (STA/LTA) based arrival picking algorithms suffer from the sensitivity to moderate-to-strong random ambient noise. To make the state-of-the-art arrival picking approaches effective, microseismic data need to be first pre-processed, for example, removing sufficient amount of noise, and second analysed by arrival pickers. To conquer the noise issue in arrival picking for weak microseismic or earthquake event, I leverage the machine learning techniques to help recognizing seismic waveforms in microseismic or earthquake data. Because of the dependency of supervised machine learning algorithm on large volume of well-designed training data, I utilize an unsupervised machine learning algorithm to help cluster the time samples into two groups, that is, waveform points and non-waveform points. The fuzzy clustering algorithm has been demonstrated to be effective for such purpose. A group of synthetic, real microseismic and earthquake data sets with different levels of complexity show that the proposed method is much more robust than the state-of-the-art STA/LTA method in picking microseismic events, even in the case of moderately strong background noise.
NASA Astrophysics Data System (ADS)
Zhao, Lili; Yin, Jianping; Yuan, Lihuan; Liu, Qiang; Li, Kuan; Qiu, Minghui
2017-07-01
Automatic detection of abnormal cells from cervical smear images is extremely demanded in annual diagnosis of women's cervical cancer. For this medical cell recognition problem, there are three different feature sections, namely cytology morphology, nuclear chromatin pathology and region intensity. The challenges of this problem come from feature combination s and classification accurately and efficiently. Thus, we propose an efficient abnormal cervical cell detection system based on multi-instance extreme learning machine (MI-ELM) to deal with above two questions in one unified framework. MI-ELM is one of the most promising supervised learning classifiers which can deal with several feature sections and realistic classification problems analytically. Experiment results over Herlev dataset demonstrate that the proposed method outperforms three traditional methods for two-class classification in terms of well accuracy and less time.
Prototype Vector Machine for Large Scale Semi-Supervised Learning
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Kai; Kwok, James T.; Parvin, Bahram
2009-04-29
Practicaldataminingrarelyfalls exactlyinto the supervisedlearning scenario. Rather, the growing amount of unlabeled data poses a big challenge to large-scale semi-supervised learning (SSL). We note that the computationalintensivenessofgraph-based SSLarises largely from the manifold or graph regularization, which in turn lead to large models that are dificult to handle. To alleviate this, we proposed the prototype vector machine (PVM), a highlyscalable,graph-based algorithm for large-scale SSL. Our key innovation is the use of"prototypes vectors" for effcient approximation on both the graph-based regularizer and model representation. The choice of prototypes are grounded upon two important criteria: they not only perform effective low-rank approximation of themore » kernel matrix, but also span a model suffering the minimum information loss compared with the complete model. We demonstrate encouraging performance and appealing scaling properties of the PVM on a number of machine learning benchmark data sets.« less
NASA Astrophysics Data System (ADS)
Fatehi, Moslem; Asadi, Hooshang H.
2017-04-01
In this study, the application of a transductive support vector machine (TSVM), an innovative semi-supervised learning algorithm, has been proposed for mapping the potential drill targets at a detailed exploration stage. The semi-supervised learning method is a hybrid of supervised and unsupervised learning approach that simultaneously uses both training and non-training data to design a classifier. By using the TSVM algorithm, exploration layers at the Dalli porphyry Cu-Au deposit in the central Iran were integrated to locate the boundary of the Cu-Au mineralization for further drilling. By applying this algorithm on the non-training (unlabeled) and limited training (labeled) Dalli exploration data, the study area was classified in two domains of Cu-Au ore and waste. Then, the results were validated by the earlier block models created, using the available borehole and trench data. In addition to TSVM, the support vector machine (SVM) algorithm was also implemented on the study area for comparison. Thirty percent of the labeled exploration data was used to evaluate the performance of these two algorithms. The results revealed 87 percent correct recognition accuracy for the TSVM algorithm and 82 percent for the SVM algorithm. The deepest inclined borehole, recently drilled in the western part of the Dalli deposit, indicated that the boundary of Cu-Au mineralization, as identified by the TSVM algorithm, was only 15 m off from the actual boundary intersected by this borehole. According to the results of the TSVM algorithm, six new boreholes were suggested for further drilling at the Dalli deposit. This study showed that the TSVM algorithm could be a useful tool for enhancing the mineralization zones and consequently, ensuring a more accurate drill hole planning.
Awaysheh, Abdullah; Wilcke, Jeffrey; Elvinger, François; Rees, Loren; Fan, Weiguo; Zimmerman, Kurt L
2016-11-01
Inflammatory bowel disease (IBD) and alimentary lymphoma (ALA) are common gastrointestinal diseases in cats. The very similar clinical signs and histopathologic features of these diseases make the distinction between them diagnostically challenging. We tested the use of supervised machine-learning algorithms to differentiate between the 2 diseases using data generated from noninvasive diagnostic tests. Three prediction models were developed using 3 machine-learning algorithms: naive Bayes, decision trees, and artificial neural networks. The models were trained and tested on data from complete blood count (CBC) and serum chemistry (SC) results for the following 3 groups of client-owned cats: normal, inflammatory bowel disease (IBD), or alimentary lymphoma (ALA). Naive Bayes and artificial neural networks achieved higher classification accuracy (sensitivities of 70.8% and 69.2%, respectively) than the decision tree algorithm (63%, p < 0.0001). The areas under the receiver-operating characteristic curve for classifying cases into the 3 categories was 83% by naive Bayes, 79% by decision tree, and 82% by artificial neural networks. Prediction models using machine learning provided a method for distinguishing between ALA-IBD, ALA-normal, and IBD-normal. The naive Bayes and artificial neural networks classifiers used 10 and 4 of the CBC and SC variables, respectively, to outperform the C4.5 decision tree, which used 5 CBC and SC variables in classifying cats into the 3 classes. These models can provide another noninvasive diagnostic tool to assist clinicians with differentiating between IBD and ALA, and between diseased and nondiseased cats. © 2016 The Author(s).
Assessing the druggability of protein-protein interactions by a supervised machine-learning method.
Sugaya, Nobuyoshi; Ikeda, Kazuyoshi
2009-08-25
Protein-protein interactions (PPIs) are challenging but attractive targets of small molecule drugs for therapeutic interventions of human diseases. In this era of rapid accumulation of PPI data, there is great need for a methodology that can efficiently select drug target PPIs by holistically assessing the druggability of PPIs. To address this need, we propose here a novel approach based on a supervised machine-learning method, support vector machine (SVM). To assess the druggability of the PPIs, 69 attributes were selected to cover a wide range of structural, drug and chemical, and functional information on the PPIs. These attributes were used as feature vectors in the SVM-based method. Thirty PPIs known to be druggable were carefully selected from previous studies; these were used as positive instances. Our approach was applied to 1,295 human PPIs with tertiary structures of their protein complexes already solved. The best SVM model constructed discriminated the already-known target PPIs from others at an accuracy of 81% (sensitivity, 82%; specificity, 79%) in cross-validation. Among the attributes, the two with the greatest discriminative power in the best SVM model were the number of interacting proteins and the number of pathways. Using the model, we predicted several promising candidates for druggable PPIs, such as SMAD4/SKI. As more PPI data are accumulated in the near future, our method will have increased ability to accelerate the discovery of druggable PPIs.
Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning
Goerner-Potvin, Patricia; Morin, Andreanne; Shao, Xiaojian; Pastinen, Tomi
2017-01-01
Motivation: Many peak detection algorithms have been proposed for ChIP-seq data analysis, but it is not obvious which algorithm and what parameters are optimal for any given dataset. In contrast, regions with and without obvious peaks can be easily labeled by visual inspection of aligned read counts in a genome browser. We propose a supervised machine learning approach for ChIP-seq data analysis, using labels that encode qualitative judgments about which genomic regions contain or do not contain peaks. The main idea is to manually label a small subset of the genome, and then learn a model that makes consistent peak predictions on the rest of the genome. Results: We created 7 new histone mark datasets with 12 826 visually determined labels, and analyzed 3 existing transcription factor datasets. We observed that default peak detection parameters yield high false positive rates, which can be reduced by learning parameters using a relatively small training set of labeled data from the same experiment type. We also observed that labels from different people are highly consistent. Overall, these data indicate that our supervised labeling method is useful for quantitatively training and testing peak detection algorithms. Availability and Implementation: Labeled histone mark data http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/, R package to compute the label error of predicted peaks https://github.com/tdhock/PeakError Contacts: toby.hocking@mail.mcgill.ca or guil.bourque@mcgill.ca Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27797775
Optimizing ChIP-seq peak detectors using visual labels and supervised machine learning.
Hocking, Toby Dylan; Goerner-Potvin, Patricia; Morin, Andreanne; Shao, Xiaojian; Pastinen, Tomi; Bourque, Guillaume
2017-02-15
Many peak detection algorithms have been proposed for ChIP-seq data analysis, but it is not obvious which algorithm and what parameters are optimal for any given dataset. In contrast, regions with and without obvious peaks can be easily labeled by visual inspection of aligned read counts in a genome browser. We propose a supervised machine learning approach for ChIP-seq data analysis, using labels that encode qualitative judgments about which genomic regions contain or do not contain peaks. The main idea is to manually label a small subset of the genome, and then learn a model that makes consistent peak predictions on the rest of the genome. We created 7 new histone mark datasets with 12 826 visually determined labels, and analyzed 3 existing transcription factor datasets. We observed that default peak detection parameters yield high false positive rates, which can be reduced by learning parameters using a relatively small training set of labeled data from the same experiment type. We also observed that labels from different people are highly consistent. Overall, these data indicate that our supervised labeling method is useful for quantitatively training and testing peak detection algorithms. Labeled histone mark data http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/ , R package to compute the label error of predicted peaks https://github.com/tdhock/PeakError. toby.hocking@mail.mcgill.ca or guil.bourque@mcgill.ca. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Indonesian name matching using machine learning supervised approach
NASA Astrophysics Data System (ADS)
Alifikri, Mohamad; Arif Bijaksana, Moch.
2018-03-01
Most existing name matching methods are developed for English language and so they cover the characteristics of this language. Up to this moment, there is no specific one has been designed and implemented for Indonesian names. The purpose of this thesis is to develop Indonesian name matching dataset as a contribution to academic research and to propose suitable feature set by utilizing combination of context of name strings and its permute-winkler score. Machine learning classification algorithms is taken as the method for performing name matching. Based on the experiments, by using tuned Random Forest algorithm and proposed features, there is an improvement of matching performance by approximately 1.7% and it is able to reduce until 70% misclassification result of the state of the arts methods. This improving performance makes the matching system more effective and reduces the risk of misclassified matches.
Semi-Supervised Multi-View Learning for Gene Network Reconstruction
Ceci, Michelangelo; Pio, Gianvito; Kuzmanovski, Vladimir; Džeroski, Sašo
2015-01-01
The task of gene regulatory network reconstruction from high-throughput data is receiving increasing attention in recent years. As a consequence, many inference methods for solving this task have been proposed in the literature. It has been recently observed, however, that no single inference method performs optimally across all datasets. It has also been shown that the integration of predictions from multiple inference methods is more robust and shows high performance across diverse datasets. Inspired by this research, in this paper, we propose a machine learning solution which learns to combine predictions from multiple inference methods. While this approach adds additional complexity to the inference process, we expect it would also carry substantial benefits. These would come from the automatic adaptation to patterns on the outputs of individual inference methods, so that it is possible to identify regulatory interactions more reliably when these patterns occur. This article demonstrates the benefits (in terms of accuracy of the reconstructed networks) of the proposed method, which exploits an iterative, semi-supervised ensemble-based algorithm. The algorithm learns to combine the interactions predicted by many different inference methods in the multi-view learning setting. The empirical evaluation of the proposed algorithm on a prokaryotic model organism (E. coli) and on a eukaryotic model organism (S. cerevisiae) clearly shows improved performance over the state of the art methods. The results indicate that gene regulatory network reconstruction for the real datasets is more difficult for S. cerevisiae than for E. coli. The software, all the datasets used in the experiments and all the results are available for download at the following link: http://figshare.com/articles/Semi_supervised_Multi_View_Learning_for_Gene_Network_Reconstruction/1604827. PMID:26641091
Extracting PICO Sentences from Clinical Trial Reports using Supervised Distant Supervision
Wallace, Byron C.; Kuiper, Joël; Sharma, Aakash; Zhu, Mingxi (Brian); Marshall, Iain J.
2016-01-01
Systematic reviews underpin Evidence Based Medicine (EBM) by addressing precise clinical questions via comprehensive synthesis of all relevant published evidence. Authors of systematic reviews typically define a Population/Problem, Intervention, Comparator, and Outcome (a PICO criteria) of interest, and then retrieve, appraise and synthesize results from all reports of clinical trials that meet these criteria. Identifying PICO elements in the full-texts of trial reports is thus a critical yet time-consuming step in the systematic review process. We seek to expedite evidence synthesis by developing machine learning models to automatically extract sentences from articles relevant to PICO elements. Collecting a large corpus of training data for this task would be prohibitively expensive. Therefore, we derive distant supervision (DS) with which to train models using previously conducted reviews. DS entails heuristically deriving ‘soft’ labels from an available structured resource. However, we have access only to unstructured, free-text summaries of PICO elements for corresponding articles; we must derive from these the desired sentence-level annotations. To this end, we propose a novel method – supervised distant supervision (SDS) – that uses a small amount of direct supervision to better exploit a large corpus of distantly labeled instances by learning to pseudo-annotate articles using the available DS. We show that this approach tends to outperform existing methods with respect to automated PICO extraction. PMID:27746703
Applying active learning to supervised word sense disambiguation in MEDLINE.
Chen, Yukun; Cao, Hongxin; Mei, Qiaozhu; Zheng, Kai; Xu, Hua
2013-01-01
This study was to assess whether active learning strategies can be integrated with supervised word sense disambiguation (WSD) methods, thus reducing the number of annotated samples, while keeping or improving the quality of disambiguation models. We developed support vector machine (SVM) classifiers to disambiguate 197 ambiguous terms and abbreviations in the MSH WSD collection. Three different uncertainty sampling-based active learning algorithms were implemented with the SVM classifiers and were compared with a passive learner (PL) based on random sampling. For each ambiguous term and each learning algorithm, a learning curve that plots the accuracy computed from the test set as a function of the number of annotated samples used in the model was generated. The area under the learning curve (ALC) was used as the primary metric for evaluation. Our experiments demonstrated that active learners (ALs) significantly outperformed the PL, showing better performance for 177 out of 197 (89.8%) WSD tasks. Further analysis showed that to achieve an average accuracy of 90%, the PL needed 38 annotated samples, while the ALs needed only 24, a 37% reduction in annotation effort. Moreover, we analyzed cases where active learning algorithms did not achieve superior performance and identified three causes: (1) poor models in the early learning stage; (2) easy WSD cases; and (3) difficult WSD cases, which provide useful insight for future improvements. This study demonstrated that integrating active learning strategies with supervised WSD methods could effectively reduce annotation cost and improve the disambiguation models.
Spatially Regularized Machine Learning for Task and Resting-state fMRI
Song, Xiaomu; Panych, Lawrence P.; Chen, Nan-kuei
2015-01-01
Background Reliable mapping of brain function across sessions and/or subjects in task- and resting-state has been a critical challenge for quantitative fMRI studies although it has been intensively addressed in the past decades. New Method A spatially regularized support vector machine (SVM) technique was developed for the reliable brain mapping in task- and resting-state. Unlike most existing SVM-based brain mapping techniques, which implement supervised classifications of specific brain functional states or disorders, the proposed method performs a semi-supervised classification for the general brain function mapping where spatial correlation of fMRI is integrated into the SVM learning. The method can adapt to intra- and inter-subject variations induced by fMRI nonstationarity, and identify a true boundary between active and inactive voxels, or between functionally connected and unconnected voxels in a feature space. Results The method was evaluated using synthetic and experimental data at the individual and group level. Multiple features were evaluated in terms of their contributions to the spatially regularized SVM learning. Reliable mapping results in both task- and resting-state were obtained from individual subjects and at the group level. Comparison with Existing Methods A comparison study was performed with independent component analysis, general linear model, and correlation analysis methods. Experimental results indicate that the proposed method can provide a better or comparable mapping performance at the individual and group level. Conclusions The proposed method can provide accurate and reliable mapping of brain function in task- and resting-state, and is applicable to a variety of quantitative fMRI studies. PMID:26470627
Application of Metamorphic Testing to Supervised Classifiers
Xie, Xiaoyuan; Ho, Joshua; Kaiser, Gail; Xu, Baowen; Chen, Tsong Yueh
2010-01-01
Many applications in the field of scientific computing - such as computational biology, computational linguistics, and others - depend on Machine Learning algorithms to provide important core functionality to support solutions in the particular problem domains. However, it is difficult to test such applications because often there is no “test oracle” to indicate what the correct output should be for arbitrary input. To help address the quality of such software, in this paper we present a technique for testing the implementations of supervised machine learning classification algorithms on which such scientific computing software depends. Our technique is based on an approach called “metamorphic testing”, which has been shown to be effective in such cases. More importantly, we demonstrate that our technique not only serves the purpose of verification, but also can be applied in validation. In addition to presenting our technique, we describe a case study we performed on a real-world machine learning application framework, and discuss how programmers implementing machine learning algorithms can avoid the common pitfalls discovered in our study. We also discuss how our findings can be of use to other areas outside scientific computing, as well. PMID:21243103
Das, Samarjit; Amoedo, Breogan; De la Torre, Fernando; Hodgins, Jessica
2012-01-01
In this paper, we propose to use a weakly supervised machine learning framework for automatic detection of Parkinson's Disease motor symptoms in daily living environments. Our primary goal is to develop a monitoring system capable of being used outside of controlled laboratory settings. Such a system would enable us to track medication cycles at home and provide valuable clinical feedback. Most of the relevant prior works involve supervised learning frameworks (e.g., Support Vector Machines). However, in-home monitoring provides only coarse ground truth information about symptom occurrences, making it very hard to adapt and train supervised learning classifiers for symptom detection. We address this challenge by formulating symptom detection under incomplete ground truth information as a multiple instance learning (MIL) problem. MIL is a weakly supervised learning framework that does not require exact instances of symptom occurrences for training; rather, it learns from approximate time intervals within which a symptom might or might not have occurred on a given day. Once trained, the MIL detector was able to spot symptom-prone time windows on other days and approximately localize the symptom instances. We monitored two Parkinson's disease (PD) patients, each for four days with a set of five triaxial accelerometers and utilized a MIL algorithm based on axis parallel rectangle (APR) fitting in the feature space. We were able to detect subject specific symptoms (e.g. dyskinesia) that conformed with a daily log maintained by the patients.
Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees.
Williams, Philip H; Eyles, Rod; Weiller, Georg
2012-01-01
MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require "read count" to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA(∗) duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.
Machine learning approach for the outcome prediction of temporal lobe epilepsy surgery.
Armañanzas, Rubén; Alonso-Nanclares, Lidia; Defelipe-Oroquieta, Jesús; Kastanauskaite, Asta; de Sola, Rafael G; Defelipe, Javier; Bielza, Concha; Larrañaga, Pedro
2013-01-01
Epilepsy surgery is effective in reducing both the number and frequency of seizures, particularly in temporal lobe epilepsy (TLE). Nevertheless, a significant proportion of these patients continue suffering seizures after surgery. Here we used a machine learning approach to predict the outcome of epilepsy surgery based on supervised classification data mining taking into account not only the common clinical variables, but also pathological and neuropsychological evaluations. We have generated models capable of predicting whether a patient with TLE secondary to hippocampal sclerosis will fully recover from epilepsy or not. The machine learning analysis revealed that outcome could be predicted with an estimated accuracy of almost 90% using some clinical and neuropsychological features. Importantly, not all the features were needed to perform the prediction; some of them proved to be irrelevant to the prognosis. Personality style was found to be one of the key features to predict the outcome. Although we examined relatively few cases, findings were verified across all data, showing that the machine learning approach described in the present study may be a powerful method. Since neuropsychological assessment of epileptic patients is a standard protocol in the pre-surgical evaluation, we propose to include these specific psychological tests and machine learning tools to improve the selection of candidates for epilepsy surgery.
Machine Learning Approach for the Outcome Prediction of Temporal Lobe Epilepsy Surgery
DeFelipe-Oroquieta, Jesús; Kastanauskaite, Asta; de Sola, Rafael G.; DeFelipe, Javier; Bielza, Concha; Larrañaga, Pedro
2013-01-01
Epilepsy surgery is effective in reducing both the number and frequency of seizures, particularly in temporal lobe epilepsy (TLE). Nevertheless, a significant proportion of these patients continue suffering seizures after surgery. Here we used a machine learning approach to predict the outcome of epilepsy surgery based on supervised classification data mining taking into account not only the common clinical variables, but also pathological and neuropsychological evaluations. We have generated models capable of predicting whether a patient with TLE secondary to hippocampal sclerosis will fully recover from epilepsy or not. The machine learning analysis revealed that outcome could be predicted with an estimated accuracy of almost 90% using some clinical and neuropsychological features. Importantly, not all the features were needed to perform the prediction; some of them proved to be irrelevant to the prognosis. Personality style was found to be one of the key features to predict the outcome. Although we examined relatively few cases, findings were verified across all data, showing that the machine learning approach described in the present study may be a powerful method. Since neuropsychological assessment of epileptic patients is a standard protocol in the pre-surgical evaluation, we propose to include these specific psychological tests and machine learning tools to improve the selection of candidates for epilepsy surgery. PMID:23646148
NASA Astrophysics Data System (ADS)
Cavuoti, S.; Tortora, C.; Brescia, M.; Longo, G.; Radovich, M.; Napolitano, N. R.; Amaro, V.; Vellucci, C.; La Barbera, F.; Getman, F.; Grado, A.
2017-04-01
Photometric redshifts (photo-z) are fundamental in galaxy surveys to address different topics, from gravitational lensing and dark matter distribution to galaxy evolution. The Kilo Degree Survey (KiDS), I.e. the European Southern Observatory (ESO) public survey on the VLT Survey Telescope (VST), provides the unprecedented opportunity to exploit a large galaxy data set with an exceptional image quality and depth in the optical wavebands. Using a KiDS subset of about 25000 galaxies with measured spectroscopic redshifts, we have derived photo-z using (I) three different empirical methods based on supervised machine learning; (II) the Bayesian photometric redshift model (or BPZ); and (III) a classical spectral energy distribution (SED) template fitting procedure (LE PHARE). We confirm that, in the regions of the photometric parameter space properly sampled by the spectroscopic templates, machine learning methods provide better redshift estimates, with a lower scatter and a smaller fraction of outliers. SED fitting techniques, however, provide useful information on the galaxy spectral type, which can be effectively used to constrain systematic errors and to better characterize potential catastrophic outliers. Such classification is then used to specialize the training of regression machine learning models, by demonstrating that a hybrid approach, involving SED fitting and machine learning in a single collaborative framework, can be effectively used to improve the accuracy of photo-z estimates.
Wang, Yuanjia; Chen, Tianle; Zeng, Donglin
2016-01-01
Learning risk scores to predict dichotomous or continuous outcomes using machine learning approaches has been studied extensively. However, how to learn risk scores for time-to-event outcomes subject to right censoring has received little attention until recently. Existing approaches rely on inverse probability weighting or rank-based regression, which may be inefficient. In this paper, we develop a new support vector hazards machine (SVHM) approach to predict censored outcomes. Our method is based on predicting the counting process associated with the time-to-event outcomes among subjects at risk via a series of support vector machines. Introducing counting processes to represent time-to-event data leads to a connection between support vector machines in supervised learning and hazards regression in standard survival analysis. To account for different at risk populations at observed event times, a time-varying offset is used in estimating risk scores. The resulting optimization is a convex quadratic programming problem that can easily incorporate non-linearity using kernel trick. We demonstrate an interesting link from the profiled empirical risk function of SVHM to the Cox partial likelihood. We then formally show that SVHM is optimal in discriminating covariate-specific hazard function from population average hazard function, and establish the consistency and learning rate of the predicted risk using the estimated risk scores. Simulation studies show improved prediction accuracy of the event times using SVHM compared to existing machine learning methods and standard conventional approaches. Finally, we analyze two real world biomedical study data where we use clinical markers and neuroimaging biomarkers to predict age-at-onset of a disease, and demonstrate superiority of SVHM in distinguishing high risk versus low risk subjects.
Hepworth, Philip J.; Nefedov, Alexey V.; Muchnik, Ilya B.; Morgan, Kenton L.
2012-01-01
Machine-learning algorithms pervade our daily lives. In epidemiology, supervised machine learning has the potential for classification, diagnosis and risk factor identification. Here, we report the use of support vector machine learning to identify the features associated with hock burn on commercial broiler farms, using routinely collected farm management data. These data lend themselves to analysis using machine-learning techniques. Hock burn, dermatitis of the skin over the hock, is an important indicator of broiler health and welfare. Remarkably, this classifier can predict the occurrence of high hock burn prevalence with accuracy of 0.78 on unseen data, as measured by the area under the receiver operating characteristic curve. We also compare the results with those obtained by standard multi-variable logistic regression and suggest that this technique provides new insights into the data. This novel application of a machine-learning algorithm, embedded in poultry management systems could offer significant improvements in broiler health and welfare worldwide. PMID:22319115
Hepworth, Philip J; Nefedov, Alexey V; Muchnik, Ilya B; Morgan, Kenton L
2012-08-07
Machine-learning algorithms pervade our daily lives. In epidemiology, supervised machine learning has the potential for classification, diagnosis and risk factor identification. Here, we report the use of support vector machine learning to identify the features associated with hock burn on commercial broiler farms, using routinely collected farm management data. These data lend themselves to analysis using machine-learning techniques. Hock burn, dermatitis of the skin over the hock, is an important indicator of broiler health and welfare. Remarkably, this classifier can predict the occurrence of high hock burn prevalence with accuracy of 0.78 on unseen data, as measured by the area under the receiver operating characteristic curve. We also compare the results with those obtained by standard multi-variable logistic regression and suggest that this technique provides new insights into the data. This novel application of a machine-learning algorithm, embedded in poultry management systems could offer significant improvements in broiler health and welfare worldwide.
Exploring Genome-Wide Expression Profiles Using Machine Learning Techniques.
Kebschull, Moritz; Papapanou, Panos N
2017-01-01
Although contemporary high-throughput -omics methods produce high-dimensional data, the resulting wealth of information is difficult to assess using traditional statistical procedures. Machine learning methods facilitate the detection of additional patterns, beyond the mere identification of lists of features that differ between groups.Here, we demonstrate the utility of (1) supervised classification algorithms in class validation, and (2) unsupervised clustering in class discovery. We use data from our previous work that described the transcriptional profiles of gingival tissue samples obtained from subjects suffering from chronic or aggressive periodontitis (1) to test whether the two diagnostic entities were also characterized by differences on the molecular level, and (2) to search for a novel, alternative classification of periodontitis based on the tissue transcriptomes.Using machine learning technology, we provide evidence for diagnostic imprecision in the currently accepted classification of periodontitis, and demonstrate that a novel, alternative classification based on differences in gingival tissue transcriptomes is feasible. The outlined procedures allow for the unbiased interrogation of high-dimensional datasets for characteristic underlying classes, and are applicable to a broad range of -omics data.
An illustration of new methods in machine condition monitoring, Part I: stochastic resonance
NASA Astrophysics Data System (ADS)
Worden, K.; Antoniadou, I.; Marchesiello, S.; Mba, C.; Garibaldi, L.
2017-05-01
There have been many recent developments in the application of data-based methods to machine condition monitoring. A powerful methodology based on machine learning has emerged, where diagnostics are based on a two-step procedure: extraction of damage-sensitive features, followed by unsupervised learning (novelty detection) or supervised learning (classification). The objective of the current pair of papers is simply to illustrate one state-of-the-art procedure for each step, using synthetic data representative of reality in terms of size and complexity. The first paper in the pair will deal with feature extraction. Although some papers have appeared in the recent past considering stochastic resonance as a means of amplifying damage information in signals, they have largely relied on ad hoc specifications of the resonator used. In contrast, the current paper will adopt a principled optimisation-based approach to the resonator design. The paper will also show that a discrete dynamical system can provide all the benefits of a continuous system, but also provide a considerable speed-up in terms of simulation time in order to facilitate the optimisation approach.
Bias correction for selecting the minimal-error classifier from many machine learning models.
Ding, Ying; Tang, Shaowu; Liao, Serena G; Jia, Jia; Oesterreich, Steffi; Lin, Yan; Tseng, George C
2014-11-15
Supervised machine learning is commonly applied in genomic research to construct a classifier from the training data that is generalizable to predict independent testing data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists in general. It has been a common practice to apply many machine learning methods and report the method that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30-60) risk reporting a falsely small cross-validation error rate that could not be validated later in independent cohorts. In this article, we illustrated the probabilistic framework of the problem and explored the statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and Tibshirani-Tibshirani procedure. All methods were compared in simulation datasets, five moderate size real datasets and two large breast cancer datasets. The result showed that IPL outperforms the other methods in bias correction with smaller variance, and it has an additional advantage to extrapolate error estimates for larger sample sizes, a practical feature to recommend whether more samples should be recruited to improve the classifier and accuracy. An R package 'MLbias' and all source files are publicly available. tsenglab.biostat.pitt.edu/software.htm. ctseng@pitt.edu Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Machine Learning and Data Mining Methods in Diabetes Research.
Kavakiotis, Ioannis; Tsave, Olga; Salifoglou, Athanasios; Maglaveras, Nicos; Vlahavas, Ioannis; Chouvarda, Ioanna
2017-01-01
The remarkable advances in biotechnology and health sciences have led to a significant production of data, such as high throughput genetic data and clinical information, generated from large Electronic Health Records (EHRs). To this end, application of machine learning and data mining methods in biosciences is presently, more than ever before, vital and indispensable in efforts to transform intelligently all available information into valuable knowledge. Diabetes mellitus (DM) is defined as a group of metabolic disorders exerting significant pressure on human health worldwide. Extensive research in all aspects of diabetes (diagnosis, etiopathophysiology, therapy, etc.) has led to the generation of huge amounts of data. The aim of the present study is to conduct a systematic review of the applications of machine learning, data mining techniques and tools in the field of diabetes research with respect to a) Prediction and Diagnosis, b) Diabetic Complications, c) Genetic Background and Environment, and e) Health Care and Management with the first category appearing to be the most popular. A wide range of machine learning algorithms were employed. In general, 85% of those used were characterized by supervised learning approaches and 15% by unsupervised ones, and more specifically, association rules. Support vector machines (SVM) arise as the most successful and widely used algorithm. Concerning the type of data, clinical datasets were mainly used. The title applications in the selected articles project the usefulness of extracting valuable knowledge leading to new hypotheses targeting deeper understanding and further investigation in DM.
Zeng, Irene Sui Lan; Lumley, Thomas
2018-01-01
Integrated omics is becoming a new channel for investigating the complex molecular system in modern biological science and sets a foundation for systematic learning for precision medicine. The statistical/machine learning methods that have emerged in the past decade for integrated omics are not only innovative but also multidisciplinary with integrated knowledge in biology, medicine, statistics, machine learning, and artificial intelligence. Here, we review the nontrivial classes of learning methods from the statistical aspects and streamline these learning methods within the statistical learning framework. The intriguing findings from the review are that the methods used are generalizable to other disciplines with complex systematic structure, and the integrated omics is part of an integrated information science which has collated and integrated different types of information for inferences and decision making. We review the statistical learning methods of exploratory and supervised learning from 42 publications. We also discuss the strengths and limitations of the extended principal component analysis, cluster analysis, network analysis, and regression methods. Statistical techniques such as penalization for sparsity induction when there are fewer observations than the number of features and using Bayesian approach when there are prior knowledge to be integrated are also included in the commentary. For the completeness of the review, a table of currently available software and packages from 23 publications for omics are summarized in the appendix.
Lise, Stefano; Archambeau, Cedric; Pontil, Massimiliano; Jones, David T
2009-10-30
Alanine scanning mutagenesis is a powerful experimental methodology for investigating the structural and energetic characteristics of protein complexes. Individual amino-acids are systematically mutated to alanine and changes in free energy of binding (DeltaDeltaG) measured. Several experiments have shown that protein-protein interactions are critically dependent on just a few residues ("hot spots") at the interface. Hot spots make a dominant contribution to the free energy of binding and if mutated they can disrupt the interaction. As mutagenesis studies require significant experimental efforts, there is a need for accurate and reliable computational methods. Such methods would also add to our understanding of the determinants of affinity and specificity in protein-protein recognition. We present a novel computational strategy to identify hot spot residues, given the structure of a complex. We consider the basic energetic terms that contribute to hot spot interactions, i.e. van der Waals potentials, solvation energy, hydrogen bonds and Coulomb electrostatics. We treat them as input features and use machine learning algorithms such as Support Vector Machines and Gaussian Processes to optimally combine and integrate them, based on a set of training examples of alanine mutations. We show that our approach is effective in predicting hot spots and it compares favourably to other available methods. In particular we find the best performances using Transductive Support Vector Machines, a semi-supervised learning scheme. When hot spots are defined as those residues for which DeltaDeltaG >or= 2 kcal/mol, our method achieves a precision and a recall respectively of 56% and 65%. We have developed an hybrid scheme in which energy terms are used as input features of machine learning models. This strategy combines the strengths of machine learning and energy-based methods. Although so far these two types of approaches have mainly been applied separately to biomolecular problems, the results of our investigation indicate that there are substantial benefits to be gained by their integration.
NASA Astrophysics Data System (ADS)
Eisenhart, T.; Josset, L.; Rising, J. A.; Devineni, N.; Lall, U.
2017-12-01
In the wake of recent water crises, the need to understand and predict the risk of water stress in urban and rural areas has grown. This understanding has the potential to improve decision making in public resource management, policy making, risk management and investment decisions. Assuming an underlying relationship between urban and rural water stress and observable features, we apply Deep Learning and Supervised Learning models to uncover hidden nonlinear patterns from spatiotemporal datasets. Results of interest includes prediction accuracy on extreme categories (i.e. urban areas highly prone to water stress) and not solely the average risk for urban or rural area, which adds complexity to the tuning of model parameters. We first label urban water stressed counties using annual water quality violations and compile a comprehensive spatiotemporal dataset that captures the yearly evolution of climatic, demographic and economic factors of more than 3,000 US counties over the 1980-2010 period. As county-level data reporting is not done on a yearly basis, we test multiple imputation methods to get around the issue of missing data. Using Python libraries, TensorFlow and scikit-learn, we apply and compare the ability of, amongst other methods, Recurrent Neural Networks (testing both LSTM and GRU cells), Convolutional Neural Networks and Support Vector Machines to predict urban water stress. We evaluate the performance of those models over multiple time spans and combine methods to diminish the risk of overfitting and increase prediction power on test sets. This methodology seeks to identify hidden nonlinear patterns to assess the predominant data features that influence urban and rural water stress. Results from this application at the national scale will assess the performance of deep learning models to predict water stress risk areas across all US counties and will highlight a predominant Machine Learning method for modeling water stress risk using spatiotemporal data.
Machine learning properties of materials and molecules with entropy-regularized kernels
NASA Astrophysics Data System (ADS)
Ceriotti, Michele; Bartók, Albert; CsáNyi, GáBor; de, Sandip
Application of machine-learning methods to physics, chemistry and materials science is gaining traction as a strategy to obtain accurate predictions of the properties of matter at a fraction of the typical cost of quantum mechanical electronic structure calculations. In this endeavor, one can leverage general-purpose frameworks for supervised-learning. It is however very important that the input data - for instance the positions of atoms in a molecule or solid - is processed into a form that reflects all the underlying physical symmetries of the problem, and that possesses the regularity properties that are required by machine-learning algorithms. Here we introduce a general strategy to build a representation of this kind. We will start from existing approaches to compare local environments (basically, groups of atoms), and combine them using techniques borrowed from optimal transport theory, discussing the relation between this idea and additive energy decompositions. We will present a few examples demonstrating the potential of this approach as a tool to predict molecular and materials' properties with an accuracy on par with state-of-the-art electronic structure methods. MARVEL NCCR (Swiss National Science Foundation) and ERC StG HBMAP (European Research Council, G.A. 677013).
On the convergence of nanotechnology and Big Data analysis for computer-aided diagnosis.
Rodrigues, Jose F; Paulovich, Fernando V; de Oliveira, Maria Cf; de Oliveira, Osvaldo N
2016-04-01
An overview is provided of the challenges involved in building computer-aided diagnosis systems capable of precise medical diagnostics based on integration and interpretation of data from different sources and formats. The availability of massive amounts of data and computational methods associated with the Big Data paradigm has brought hope that such systems may soon be available in routine clinical practices, which is not the case today. We focus on visual and machine learning analysis of medical data acquired with varied nanotech-based techniques and on methods for Big Data infrastructure. Because diagnosis is essentially a classification task, we address the machine learning techniques with supervised and unsupervised classification, making a critical assessment of the progress already made in the medical field and the prospects for the near future. We also advocate that successful computer-aided diagnosis requires a merge of methods and concepts from nanotechnology and Big Data analysis.
Supervised filters for EEG signal in naturally occurring epilepsy forecasting.
Muñoz-Almaraz, Francisco Javier; Zamora-Martínez, Francisco; Botella-Rocamora, Paloma; Pardo, Juan
2017-01-01
Nearly 1% of the global population has Epilepsy. Forecasting epileptic seizures with an acceptable confidence level, could improve the disease treatment and thus the lifestyle of the people who suffer it. To do that the electroencephalogram (EEG) signal is usually studied through spectral power band filtering, but this paper proposes an alternative novel method of preprocessing the EEG signal based on supervised filters. Such filters have been employed in a machine learning algorithm, such as the K-Nearest Neighbor (KNN), to improve the prediction of seizures. The proposed solution extends with this novel approach an algorithm that was submitted to win the third prize of an international Data Science challenge promoted by Kaggle contest platform and the American Epilepsy Society, the Epilepsy Foundation, National Institutes of Health (NIH) and Mayo Clinic. A formal description of these preprocessing methods is presented and a detailed analysis in terms of Receiver Operating Characteristics (ROC) curve and Area Under ROC curve is performed. The obtained results show statistical significant improvements when compared with the spectral power band filtering (PBF) typical baseline. A trend between performance and the dataset size is observed, suggesting that the supervised filters bring better information, compared to the conventional PBF filters, as the dataset grows in terms of monitored variables (sensors) and time length. The paper demonstrates a better accuracy in forecasting when new filters are employed and its main contribution is in the field of machine learning algorithms to develop more accurate predictive systems.
Supervised filters for EEG signal in naturally occurring epilepsy forecasting
2017-01-01
Nearly 1% of the global population has Epilepsy. Forecasting epileptic seizures with an acceptable confidence level, could improve the disease treatment and thus the lifestyle of the people who suffer it. To do that the electroencephalogram (EEG) signal is usually studied through spectral power band filtering, but this paper proposes an alternative novel method of preprocessing the EEG signal based on supervised filters. Such filters have been employed in a machine learning algorithm, such as the K-Nearest Neighbor (KNN), to improve the prediction of seizures. The proposed solution extends with this novel approach an algorithm that was submitted to win the third prize of an international Data Science challenge promoted by Kaggle contest platform and the American Epilepsy Society, the Epilepsy Foundation, National Institutes of Health (NIH) and Mayo Clinic. A formal description of these preprocessing methods is presented and a detailed analysis in terms of Receiver Operating Characteristics (ROC) curve and Area Under ROC curve is performed. The obtained results show statistical significant improvements when compared with the spectral power band filtering (PBF) typical baseline. A trend between performance and the dataset size is observed, suggesting that the supervised filters bring better information, compared to the conventional PBF filters, as the dataset grows in terms of monitored variables (sensors) and time length. The paper demonstrates a better accuracy in forecasting when new filters are employed and its main contribution is in the field of machine learning algorithms to develop more accurate predictive systems. PMID:28632737
Semi-Supervised Active Learning for Sound Classification in Hybrid Learning Environments.
Han, Wenjing; Coutinho, Eduardo; Ruan, Huabin; Li, Haifeng; Schuller, Björn; Yu, Xiaojie; Zhu, Xuan
2016-01-01
Coping with scarcity of labeled data is a common problem in sound classification tasks. Approaches for classifying sounds are commonly based on supervised learning algorithms, which require labeled data which is often scarce and leads to models that do not generalize well. In this paper, we make an efficient combination of confidence-based Active Learning and Self-Training with the aim of minimizing the need for human annotation for sound classification model training. The proposed method pre-processes the instances that are ready for labeling by calculating their classifier confidence scores, and then delivers the candidates with lower scores to human annotators, and those with high scores are automatically labeled by the machine. We demonstrate the feasibility and efficacy of this method in two practical scenarios: pool-based and stream-based processing. Extensive experimental results indicate that our approach requires significantly less labeled instances to reach the same performance in both scenarios compared to Passive Learning, Active Learning and Self-Training. A reduction of 52.2% in human labeled instances is achieved in both of the pool-based and stream-based scenarios on a sound classification task considering 16,930 sound instances.
Semi-Supervised Active Learning for Sound Classification in Hybrid Learning Environments
Han, Wenjing; Coutinho, Eduardo; Li, Haifeng; Schuller, Björn; Yu, Xiaojie; Zhu, Xuan
2016-01-01
Coping with scarcity of labeled data is a common problem in sound classification tasks. Approaches for classifying sounds are commonly based on supervised learning algorithms, which require labeled data which is often scarce and leads to models that do not generalize well. In this paper, we make an efficient combination of confidence-based Active Learning and Self-Training with the aim of minimizing the need for human annotation for sound classification model training. The proposed method pre-processes the instances that are ready for labeling by calculating their classifier confidence scores, and then delivers the candidates with lower scores to human annotators, and those with high scores are automatically labeled by the machine. We demonstrate the feasibility and efficacy of this method in two practical scenarios: pool-based and stream-based processing. Extensive experimental results indicate that our approach requires significantly less labeled instances to reach the same performance in both scenarios compared to Passive Learning, Active Learning and Self-Training. A reduction of 52.2% in human labeled instances is achieved in both of the pool-based and stream-based scenarios on a sound classification task considering 16,930 sound instances. PMID:27627768
Data Programming: Creating Large Training Sets, Quickly.
Ratner, Alexander; De Sa, Christopher; Wu, Sen; Selsam, Daniel; Ré, Christopher
2016-12-01
Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions , which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable.
Data Programming: Creating Large Training Sets, Quickly
Ratner, Alexander; De Sa, Christopher; Wu, Sen; Selsam, Daniel; Ré, Christopher
2018-01-01
Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can “denoise” the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable. PMID:29872252
Author Detection on a Mobile Phone
2011-03-01
handwriting , and to mine sales data for profitable trends. Two broad categories of machine learning are supervised learn- ing and unsupervised learning...evaluation,” AI 2006: Advances in Artificial Intelligence, p. 1015–1021, 2006. [23] “Gartner says worldwide mobile phone sales grew 17 per cent in first
Kindermans, Pieter-Jan; Tangermann, Michael; Müller, Klaus-Robert; Schrauwen, Benjamin
2014-06-01
Most BCIs have to undergo a calibration session in which data is recorded to train decoders with machine learning. Only recently zero-training methods have become a subject of study. This work proposes a probabilistic framework for BCI applications which exploit event-related potentials (ERPs). For the example of a visual P300 speller we show how the framework harvests the structure suitable to solve the decoding task by (a) transfer learning, (b) unsupervised adaptation, (c) language model and (d) dynamic stopping. A simulation study compares the proposed probabilistic zero framework (using transfer learning and task structure) to a state-of-the-art supervised model on n = 22 subjects. The individual influence of the involved components (a)-(d) are investigated. Without any need for a calibration session, the probabilistic zero-training framework with inter-subject transfer learning shows excellent performance--competitive to a state-of-the-art supervised method using calibration. Its decoding quality is carried mainly by the effect of transfer learning in combination with continuous unsupervised adaptation. A high-performing zero-training BCI is within reach for one of the most popular BCI paradigms: ERP spelling. Recording calibration data for a supervised BCI would require valuable time which is lost for spelling. The time spent on calibration would allow a novel user to spell 29 symbols with our unsupervised approach. It could be of use for various clinical and non-clinical ERP-applications of BCI.
NASA Astrophysics Data System (ADS)
Kindermans, Pieter-Jan; Tangermann, Michael; Müller, Klaus-Robert; Schrauwen, Benjamin
2014-06-01
Objective. Most BCIs have to undergo a calibration session in which data is recorded to train decoders with machine learning. Only recently zero-training methods have become a subject of study. This work proposes a probabilistic framework for BCI applications which exploit event-related potentials (ERPs). For the example of a visual P300 speller we show how the framework harvests the structure suitable to solve the decoding task by (a) transfer learning, (b) unsupervised adaptation, (c) language model and (d) dynamic stopping. Approach. A simulation study compares the proposed probabilistic zero framework (using transfer learning and task structure) to a state-of-the-art supervised model on n = 22 subjects. The individual influence of the involved components (a)-(d) are investigated. Main results. Without any need for a calibration session, the probabilistic zero-training framework with inter-subject transfer learning shows excellent performance—competitive to a state-of-the-art supervised method using calibration. Its decoding quality is carried mainly by the effect of transfer learning in combination with continuous unsupervised adaptation. Significance. A high-performing zero-training BCI is within reach for one of the most popular BCI paradigms: ERP spelling. Recording calibration data for a supervised BCI would require valuable time which is lost for spelling. The time spent on calibration would allow a novel user to spell 29 symbols with our unsupervised approach. It could be of use for various clinical and non-clinical ERP-applications of BCI.
Semi-supervised morphosyntactic classification of Old Icelandic.
Urban, Kryztof; Tangherlini, Timothy R; Vijūnas, Aurelijus; Broadwell, Peter M
2014-01-01
We present IceMorph, a semi-supervised morphosyntactic analyzer of Old Icelandic. In addition to machine-read corpora and dictionaries, it applies a small set of declension prototypes to map corpus words to dictionary entries. A web-based GUI allows expert users to modify and augment data through an online process. A machine learning module incorporates prototype data, edit-distance metrics, and expert feedback to continuously update part-of-speech and morphosyntactic classification. An advantage of the analyzer is its ability to achieve competitive classification accuracy with minimum training data.
Weng, Wei-Hung; Wagholikar, Kavishwar B; McCray, Alexa T; Szolovits, Peter; Chueh, Henry C
2017-12-01
The medical subdomain of a clinical note, such as cardiology or neurology, is useful content-derived metadata for developing machine learning downstream applications. To classify the medical subdomain of a note accurately, we have constructed a machine learning-based natural language processing (NLP) pipeline and developed medical subdomain classifiers based on the content of the note. We constructed the pipeline using the clinical NLP system, clinical Text Analysis and Knowledge Extraction System (cTAKES), the Unified Medical Language System (UMLS) Metathesaurus, Semantic Network, and learning algorithms to extract features from two datasets - clinical notes from Integrating Data for Analysis, Anonymization, and Sharing (iDASH) data repository (n = 431) and Massachusetts General Hospital (MGH) (n = 91,237), and built medical subdomain classifiers with different combinations of data representation methods and supervised learning algorithms. We evaluated the performance of classifiers and their portability across the two datasets. The convolutional recurrent neural network with neural word embeddings trained-medical subdomain classifier yielded the best performance measurement on iDASH and MGH datasets with area under receiver operating characteristic curve (AUC) of 0.975 and 0.991, and F1 scores of 0.845 and 0.870, respectively. Considering better clinical interpretability, linear support vector machine-trained medical subdomain classifier using hybrid bag-of-words and clinically relevant UMLS concepts as the feature representation, with term frequency-inverse document frequency (tf-idf)-weighting, outperformed other shallow learning classifiers on iDASH and MGH datasets with AUC of 0.957 and 0.964, and F1 scores of 0.932 and 0.934 respectively. We trained classifiers on one dataset, applied to the other dataset and yielded the threshold of F1 score of 0.7 in classifiers for half of the medical subdomains we studied. Our study shows that a supervised learning-based NLP approach is useful to develop medical subdomain classifiers. The deep learning algorithm with distributed word representation yields better performance yet shallow learning algorithms with the word and concept representation achieves comparable performance with better clinical interpretability. Portable classifiers may also be used across datasets from different institutions.
Incorporating conditional random fields and active learning to improve sentiment identification.
Zhang, Kunpeng; Xie, Yusheng; Yang, Yi; Sun, Aaron; Liu, Hengchang; Choudhary, Alok
2014-10-01
Many machine learning, statistical, and computational linguistic methods have been developed to identify sentiment of sentences in documents, yielding promising results. However, most of state-of-the-art methods focus on individual sentences and ignore the impact of context on the meaning of a sentence. In this paper, we propose a method based on conditional random fields to incorporate sentence structure and context information in addition to syntactic information for improving sentiment identification. We also investigate how human interaction affects the accuracy of sentiment labeling using limited training data. We propose and evaluate two different active learning strategies for labeling sentiment data. Our experiments with the proposed approach demonstrate a 5%-15% improvement in accuracy on Amazon customer reviews compared to existing supervised learning and rule-based methods. Copyright © 2014 Elsevier Ltd. All rights reserved.
Active appearance model and deep learning for more accurate prostate segmentation on MRI
NASA Astrophysics Data System (ADS)
Cheng, Ruida; Roth, Holger R.; Lu, Le; Wang, Shijun; Turkbey, Baris; Gandler, William; McCreedy, Evan S.; Agarwal, Harsh K.; Choyke, Peter; Summers, Ronald M.; McAuliffe, Matthew J.
2016-03-01
Prostate segmentation on 3D MR images is a challenging task due to image artifacts, large inter-patient prostate shape and texture variability, and lack of a clear prostate boundary specifically at apex and base levels. We propose a supervised machine learning model that combines atlas based Active Appearance Model (AAM) with a Deep Learning model to segment the prostate on MR images. The performance of the segmentation method is evaluated on 20 unseen MR image datasets. The proposed method combining AAM and Deep Learning achieves a mean Dice Similarity Coefficient (DSC) of 0.925 for whole 3D MR images of the prostate using axial cross-sections. The proposed model utilizes the adaptive atlas-based AAM model and Deep Learning to achieve significant segmentation accuracy.
Genetic Classification of Populations Using Supervised Learning
Bridges, Michael; Heron, Elizabeth A.; O'Dushlaine, Colm; Segurado, Ricardo; Morris, Derek; Corvin, Aiden; Gill, Michael; Pinto, Carlos
2011-01-01
There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations which are geographically separated, case–control studies and quality control (when participants in a study have been genotyped at different laboratories). This latter application is of particular importance in the era of large scale genome wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations. are termed unsupervised. Supervised methods, on the other hand are able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods, (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large scale genome wide association studies. PMID:21589856
Time-Frequency Learning Machines for Nonstationarity Detection Using Surrogates
NASA Astrophysics Data System (ADS)
Borgnat, Pierre; Flandrin, Patrick; Richard, Cédric; Ferrari, André; Amoud, Hassan; Honeine, Paul
2012-03-01
Time-frequency representations provide a powerful tool for nonstationary signal analysis and classification, supporting a wide range of applications [12]. As opposed to conventional Fourier analysis, these techniques reveal the evolution in time of the spectral content of signals. In Ref. [7,38], time-frequency analysis is used to test stationarity of any signal. The proposed method consists of a comparison between global and local time-frequency features. The originality is to make use of a family of stationary surrogate signals for defining the null hypothesis of stationarity and, based upon this information, to derive statistical tests. An open question remains, however, about how to choose relevant time-frequency features. Over the last decade, a number of new pattern recognition methods based on reproducing kernels have been introduced. These learning machines have gained popularity due to their conceptual simplicity and their outstanding performance [30]. Initiated by Vapnik’s support vector machines (SVM) [35], they offer now a wide class of supervised and unsupervised learning algorithms. In Ref. [17-19], the authors have shown how the most effective and innovative learning machines can be tuned to operate in the time-frequency domain. This chapter follows this line of research by taking advantage of learning machines to test and quantify stationarity. Based on one-class SVM, our approach uses the entire time-frequency representation and does not require arbitrary feature extraction. Applied to a set of surrogates, it provides the domain boundary that includes most of these stationarized signals. This allows us to test the stationarity of the signal under investigation. This chapter is organized as follows. In Section 22.2, we introduce the surrogate data method to generate stationarized signals, namely, the null hypothesis of stationarity. The concept of time-frequency learning machines is presented in Section 22.3, and applied to one-class SVM in order to derive a stationarity test in Section 22.4. The relevance of the latter is illustrated by simulation results in Section 22.5.
Adaptive maritime video surveillance
NASA Astrophysics Data System (ADS)
Gupta, Kalyan Moy; Aha, David W.; Hartley, Ralph; Moore, Philip G.
2009-05-01
Maritime assets such as ports, harbors, and vessels are vulnerable to a variety of near-shore threats such as small-boat attacks. Currently, such vulnerabilities are addressed predominantly by watchstanders and manual video surveillance, which is manpower intensive. Automatic maritime video surveillance techniques are being introduced to reduce manpower costs, but they have limited functionality and performance. For example, they only detect simple events such as perimeter breaches and cannot predict emerging threats. They also generate too many false alerts and cannot explain their reasoning. To overcome these limitations, we are developing the Maritime Activity Analysis Workbench (MAAW), which will be a mixed-initiative real-time maritime video surveillance tool that uses an integrated supervised machine learning approach to label independent and coordinated maritime activities. It uses the same information to predict anomalous behavior and explain its reasoning; this is an important capability for watchstander training and for collecting performance feedback. In this paper, we describe MAAW's functional architecture, which includes the following pipeline of components: (1) a video acquisition and preprocessing component that detects and tracks vessels in video images, (2) a vessel categorization and activity labeling component that uses standard and relational supervised machine learning methods to label maritime activities, and (3) an ontology-guided vessel and maritime activity annotator to enable subject matter experts (e.g., watchstanders) to provide feedback and supervision to the system. We report our findings from a preliminary system evaluation on river traffic video.
Towards large-scale FAME-based bacterial species identification using machine learning techniques.
Slabbinck, Bram; De Baets, Bernard; Dawyndt, Peter; De Vos, Paul
2009-05-01
In the last decade, bacterial taxonomy witnessed a huge expansion. The swift pace of bacterial species (re-)definitions has a serious impact on the accuracy and completeness of first-line identification methods. Consequently, back-end identification libraries need to be synchronized with the List of Prokaryotic names with Standing in Nomenclature. In this study, we focus on bacterial fatty acid methyl ester (FAME) profiling as a broadly used first-line identification method. From the BAME@LMG database, we have selected FAME profiles of individual strains belonging to the genera Bacillus, Paenibacillus and Pseudomonas. Only those profiles resulting from standard growth conditions have been retained. The corresponding data set covers 74, 44 and 95 validly published bacterial species, respectively, represented by 961, 378 and 1673 standard FAME profiles. Through the application of machine learning techniques in a supervised strategy, different computational models have been built for genus and species identification. Three techniques have been considered: artificial neural networks, random forests and support vector machines. Nearly perfect identification has been achieved at genus level. Notwithstanding the known limited discriminative power of FAME analysis for species identification, the computational models have resulted in good species identification results for the three genera. For Bacillus, Paenibacillus and Pseudomonas, random forests have resulted in sensitivity values, respectively, 0.847, 0.901 and 0.708. The random forests models outperform those of the other machine learning techniques. Moreover, our machine learning approach also outperformed the Sherlock MIS (MIDI Inc., Newark, DE, USA). These results show that machine learning proves very useful for FAME-based bacterial species identification. Besides good bacterial identification at species level, speed and ease of taxonomic synchronization are major advantages of this computational species identification strategy.
Estimating False Positive Contamination in Crater Annotations from Citizen Science Data
NASA Astrophysics Data System (ADS)
Tar, P. D.; Bugiolacchi, R.; Thacker, N. A.; Gilmour, J. D.
2017-01-01
Web-based citizen science often involves the classification of image features by large numbers of minimally trained volunteers, such as the identification of lunar impact craters under the Moon Zoo project. Whilst such approaches facilitate the analysis of large image data sets, the inexperience of users and ambiguity in image content can lead to contamination from false positive identifications. We give an approach, using Linear Poisson Models and image template matching, that can quantify levels of false positive contamination in citizen science Moon Zoo crater annotations. Linear Poisson Models are a form of machine learning which supports predictive error modelling and goodness-of-fits, unlike most alternative machine learning methods. The proposed supervised learning system can reduce the variability in crater counts whilst providing predictive error assessments of estimated quantities of remaining true verses false annotations. In an area of research influenced by human subjectivity, the proposed method provides a level of objectivity through the utilisation of image evidence, guided by candidate crater identifications.
RG-inspired machine learning for lattice field theory
NASA Astrophysics Data System (ADS)
Foreman, Sam; Giedt, Joel; Meurice, Yannick; Unmuth-Yockey, Judah
2018-03-01
Machine learning has been a fast growing field of research in several areas dealing with large datasets. We report recent attempts to use renormalization group (RG) ideas in the context of machine learning. We examine coarse graining procedures for perceptron models designed to identify the digits of the MNIST data. We discuss the correspondence between principal components analysis (PCA) and RG flows across the transition for worm configurations of the 2D Ising model. Preliminary results regarding the logarithmic divergence of the leading PCA eigenvalue were presented at the conference. More generally, we discuss the relationship between PCA and observables in Monte Carlo simulations and the possibility of reducing the number of learning parameters in supervised learning based on RG inspired hierarchical ansatzes.
FRaC: a feature-modeling approach for semi-supervised and unsupervised anomaly detection.
Noto, Keith; Brodley, Carla; Slonim, Donna
2012-01-01
Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called "normal" instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection. The unsupervised anomaly detection task is different: Given unlabeled, mostly-normal data, identify the anomalies among them. Many real-world machine learning tasks, including many fraud and intrusion detection tasks, are unsupervised because it is impractical (or impossible) to verify all of the training data. We recently presented FRaC, a new approach for semi-supervised anomaly detection. FRaC is based on using normal instances to build an ensemble of feature models, and then identifying instances that disagree with those models as anomalous. In this paper, we investigate the behavior of FRaC experimentally and explain why FRaC is so successful. We also show that FRaC is a superior approach for the unsupervised as well as the semi-supervised anomaly detection task, compared to well-known state-of-the-art anomaly detection methods, LOF and one-class support vector machines, and to an existing feature-modeling approach.
FRaC: a feature-modeling approach for semi-supervised and unsupervised anomaly detection
Brodley, Carla; Slonim, Donna
2011-01-01
Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called “normal” instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection. The unsupervised anomaly detection task is different: Given unlabeled, mostly-normal data, identify the anomalies among them. Many real-world machine learning tasks, including many fraud and intrusion detection tasks, are unsupervised because it is impractical (or impossible) to verify all of the training data. We recently presented FRaC, a new approach for semi-supervised anomaly detection. FRaC is based on using normal instances to build an ensemble of feature models, and then identifying instances that disagree with those models as anomalous. In this paper, we investigate the behavior of FRaC experimentally and explain why FRaC is so successful. We also show that FRaC is a superior approach for the unsupervised as well as the semi-supervised anomaly detection task, compared to well-known state-of-the-art anomaly detection methods, LOF and one-class support vector machines, and to an existing feature-modeling approach. PMID:22639542
NASA Astrophysics Data System (ADS)
Harrison, Benjamin; Sandiford, Mike; McLaren, Sandra
2016-04-01
Supervised machine learning algorithms attempt to build a predictive model using empirical data. Their aim is to take a known set of input data along with known responses to the data, and adaptively train a model to generate predictions for new data inputs. A key attraction to their use is the ability to perform as function approximators where the definition of an explicit relationship between variables is infeasible. We present a novel means of estimating thermal conductivity using a supervised self-organising map algorithm, trained on about 150 thermal conductivity measurements, and using a suite of five electric logs common to 14 boreholes. A key motivation of the study was to supplement the small number of direct measurements of thermal conductivity with the decades of borehole data acquired in the Gippsland Basin to produce more confident calculations of surface heat flow. A previous attempt to generate estimates from well-log data in the Gippsland Basin using classic petrophysical log interpretation methods was able to produce reasonable synthetic thermal conductivity logs for only four boreholes. The current study has extended this to a further ten boreholes. Interesting outcomes from the study are: the method appears stable at very low sample sizes (< ~100); the SOM permits quantitative analysis of essentially qualitative uncalibrated well-log data; and the method's moderate success at prediction with minimal effort tuning the algorithm's parameters.
Nandi, Sutanu; Subramanian, Abhishek; Sarkar, Ram Rup
2017-07-25
Prediction of essential genes helps to identify a minimal set of genes that are absolutely required for the appropriate functioning and survival of a cell. The available machine learning techniques for essential gene prediction have inherent problems, like imbalanced provision of training datasets, biased choice of the best model for a given balanced dataset, choice of a complex machine learning algorithm, and data-based automated selection of biologically relevant features for classification. Here, we propose a simple support vector machine-based learning strategy for the prediction of essential genes in Escherichia coli K-12 MG1655 metabolism that integrates a non-conventional combination of an appropriate sample balanced training set, a unique organism-specific genotype, phenotype attributes that characterize essential genes, and optimal parameters of the learning algorithm to generate the best machine learning model (the model with the highest accuracy among all the models trained for different sample training sets). For the first time, we also introduce flux-coupled metabolic subnetwork-based features for enhancing the classification performance. Our strategy proves to be superior as compared to previous SVM-based strategies in obtaining a biologically relevant classification of genes with high sensitivity and specificity. This methodology was also trained with datasets of other recent supervised classification techniques for essential gene classification and tested using reported test datasets. The testing accuracy was always high as compared to the known techniques, proving that our method outperforms known methods. Observations from our study indicate that essential genes are conserved among homologous bacterial species, demonstrate high codon usage bias, GC content and gene expression, and predominantly possess a tendency to form physiological flux modules in metabolism.
Task-driven dictionary learning.
Mairal, Julien; Bach, Francis; Ponce, Jean
2012-04-01
Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience, and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that these models are well suited to restoration tasks. In this context, learning the dictionary amounts to solving a large-scale matrix factorization problem, which can be done efficiently with classical optimization tools. The same approach has also been used for learning features from data for other purposes, e.g., image classification, but tuning the dictionary in a supervised way for these tasks has proven to be more difficult. In this paper, we present a general formulation for supervised dictionary learning adapted to a wide variety of tasks, and present an efficient algorithm for solving the corresponding optimization problem. Experiments on handwritten digit classification, digital art identification, nonlinear inverse image problems, and compressed sensing demonstrate that our approach is effective in large-scale settings, and is well suited to supervised and semi-supervised classification, as well as regression tasks for data that admit sparse representations.
NASA Astrophysics Data System (ADS)
Ruske, Simon; Topping, David O.; Foot, Virginia E.; Kaye, Paul H.; Stanley, Warren R.; Crawford, Ian; Morse, Andrew P.; Gallagher, Martin W.
2017-03-01
Characterisation of bioaerosols has important implications within environment and public health sectors. Recent developments in ultraviolet light-induced fluorescence (UV-LIF) detectors such as the Wideband Integrated Bioaerosol Spectrometer (WIBS) and the newly introduced Multiparameter Bioaerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen.This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification.For unsupervised learning we tested hierarchical agglomerative clustering with various different linkages. For supervised learning, 11 methods were tested, including decision trees, ensemble methods (random forests, gradient boosting and AdaBoost), two implementations for support vector machines (libsvm and liblinear) and Gaussian methods (Gaussian naïve Bayesian, quadratic and linear discriminant analysis, the k-nearest neighbours algorithm and artificial neural networks).The methods were applied to two different data sets produced using the new MBS, which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. The first data set contained mixed PSLs and the second contained a variety of laboratory-generated aerosol.Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 67. 6 and 91. 1 % for the two data sets respectively. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 82. 8 and 98. 27 % of the testing data, respectively, across the two data sets.A possible alternative to gradient boosting is neural networks. We do however note that this method requires much more user input than the other methods, and we suggest that further research should be conducted using this method, especially using parallelised hardware such as the GPU, which would allow for larger networks to be trained, which could possibly yield better results.We also saw that some methods, such as clustering, failed to utilise the additional shape information provided by the instrument, whilst for others, such as the decision trees, ensemble methods and neural networks, improved performance could be attained with the inclusion of such information.
Kinase Identification with Supervised Laplacian Regularized Least Squares
Zhang, He; Wang, Minghui
2015-01-01
Phosphorylation is catalyzed by protein kinases and is irreplaceable in regulating biological processes. Identification of phosphorylation sites with their corresponding kinases contributes to the understanding of molecular mechanisms. Mass spectrometry analysis of phosphor-proteomes generates a large number of phosphorylated sites. However, experimental methods are costly and time-consuming, and most phosphorylation sites determined by experimental methods lack kinase information. Therefore, computational methods are urgently needed to address the kinase identification problem. To this end, we propose a new kernel-based machine learning method called Supervised Laplacian Regularized Least Squares (SLapRLS), which adopts a new method to construct kernels based on the similarity matrix and minimizes both structure risk and overall inconsistency between labels and similarities. The results predicted using both Phospho.ELM and an additional independent test dataset indicate that SLapRLS can more effectively identify kinases compared to other existing algorithms. PMID:26448296
Kinase Identification with Supervised Laplacian Regularized Least Squares.
Li, Ao; Xu, Xiaoyi; Zhang, He; Wang, Minghui
2015-01-01
Phosphorylation is catalyzed by protein kinases and is irreplaceable in regulating biological processes. Identification of phosphorylation sites with their corresponding kinases contributes to the understanding of molecular mechanisms. Mass spectrometry analysis of phosphor-proteomes generates a large number of phosphorylated sites. However, experimental methods are costly and time-consuming, and most phosphorylation sites determined by experimental methods lack kinase information. Therefore, computational methods are urgently needed to address the kinase identification problem. To this end, we propose a new kernel-based machine learning method called Supervised Laplacian Regularized Least Squares (SLapRLS), which adopts a new method to construct kernels based on the similarity matrix and minimizes both structure risk and overall inconsistency between labels and similarities. The results predicted using both Phospho.ELM and an additional independent test dataset indicate that SLapRLS can more effectively identify kinases compared to other existing algorithms.
Sleep in patients with disorders of consciousness characterized by means of machine learning
Lechinger, Julia; Wislowska, Malgorzata; Blume, Christine; Ott, Peter; Wegenkittl, Stefan; del Giudice, Renata; Heib, Dominik P. J.; Mayer, Helmut A.; Laureys, Steven; Pichler, Gerald; Schabus, Manuel
2018-01-01
Sleep has been proposed to indicate preserved residual brain functioning in patients suffering from disorders of consciousness (DOC) after awakening from coma. However, a reliable characterization of sleep patterns in this clinical population continues to be challenging given severely altered brain oscillations, frequent and extended artifacts in clinical recordings and the absence of established staging criteria. In the present study, we try to address these issues and investigate the usefulness of a multivariate machine learning technique based on permutation entropy, a complexity measure. Specifically, we used long-term polysomnography (PSG), along with video recordings in day and night periods in a sample of 23 DOC; 12 patients were diagnosed as Unresponsive Wakefulness Syndrome (UWS) and 11 were diagnosed as Minimally Conscious State (MCS). Eight hour PSG recordings of healthy sleepers (N = 26) were additionally used for training and setting parameters of supervised and unsupervised model, respectively. In DOC, the supervised classification (wake, N1, N2, N3 or REM) was validated using simultaneous videos which identified periods with prolonged eye opening or eye closure.The supervised classification revealed that out of the 23 subjects, 11 patients (5 MCS and 6 UWS) yielded highly accurate classification with an average F1-score of 0.87 representing high overlap between the classifier predicting sleep (i.e. one of the 4 sleep stages) and closed eyes. Furthermore, the unsupervised approach revealed a more complex pattern of sleep-wake stages during the night period in the MCS group, as evidenced by the presence of several distinct clusters. In contrast, in UWS patients no such clustering was found. Altogether, we present a novel data-driven method, based on machine learning that can be used to gain new and unambiguous insights into sleep organization and residual brain functioning of patients with DOC. PMID:29293607
Analysis of Small Muscle Movement Effects on EEG Signals
2016-12-22
research. Finally, I wish to extend my appreciation to my dear family and girlfriend, in Turkey for their deep trust and devotion. Erhan E YANTERI...and it hasn’t been fully discovered and understood yet. Advances in cognitive neuroscience and brain imaging technologies have started to provide us...Since we have the reference channels and EMG data, this detection problem can be solved by a supervised machine learning method. The level of small
High-order distance-based multiview stochastic learning in image classification.
Yu, Jun; Rui, Yong; Tang, Yuan Yan; Tao, Dacheng
2014-12-01
How do we find all images in a larger set of images which have a specific content? Or estimate the position of a specific object relative to the camera? Image classification methods, like support vector machine (supervised) and transductive support vector machine (semi-supervised), are invaluable tools for the applications of content-based image retrieval, pose estimation, and optical character recognition. However, these methods only can handle the images represented by single feature. In many cases, different features (or multiview data) can be obtained, and how to efficiently utilize them is a challenge. It is inappropriate for the traditionally concatenating schema to link features of different views into a long vector. The reason is each view has its specific statistical property and physical interpretation. In this paper, we propose a high-order distance-based multiview stochastic learning (HD-MSL) method for image classification. HD-MSL effectively combines varied features into a unified representation and integrates the labeling information based on a probabilistic framework. In comparison with the existing strategies, our approach adopts the high-order distance obtained from the hypergraph to replace pairwise distance in estimating the probability matrix of data distribution. In addition, the proposed approach can automatically learn a combination coefficient for each view, which plays an important role in utilizing the complementary information of multiview data. An alternative optimization is designed to solve the objective functions of HD-MSL and obtain different views on coefficients and classification scores simultaneously. Experiments on two real world datasets demonstrate the effectiveness of HD-MSL in image classification.
Fatigue Level Estimation of Bill Based on Acoustic Signal Feature by Supervised SOM
NASA Astrophysics Data System (ADS)
Teranishi, Masaru; Omatu, Sigeru; Kosaka, Toshihisa
Fatigued bills have harmful influence on daily operation of Automated Teller Machine(ATM). To make the fatigued bills classification more efficient, development of an automatic fatigued bill classification method is desired. We propose a new method to estimate bending rigidity of bill from acoustic signal feature of banking machines. The estimated bending rigidities are used as continuous fatigue level for classification of fatigued bill. By using the supervised Self-Organizing Map(supervised SOM), we estimate the bending rigidity from only the acoustic energy pattern effectively. The experimental result with real bill samples shows the effectiveness of the proposed method.
Sharma, Ram C; Hara, Keitarou; Hirayama, Hidetake
2017-01-01
This paper presents the performance and evaluation of a number of machine learning classifiers for the discrimination between the vegetation physiognomic classes using the satellite based time-series of the surface reflectance data. Discrimination of six vegetation physiognomic classes, Evergreen Coniferous Forest, Evergreen Broadleaf Forest, Deciduous Coniferous Forest, Deciduous Broadleaf Forest, Shrubs, and Herbs, was dealt with in the research. Rich-feature data were prepared from time-series of the satellite data for the discrimination and cross-validation of the vegetation physiognomic types using machine learning approach. A set of machine learning experiments comprised of a number of supervised classifiers with different model parameters was conducted to assess how the discrimination of vegetation physiognomic classes varies with classifiers, input features, and ground truth data size. The performance of each experiment was evaluated by using the 10-fold cross-validation method. Experiment using the Random Forests classifier provided highest overall accuracy (0.81) and kappa coefficient (0.78). However, accuracy metrics did not vary much with experiments. Accuracy metrics were found to be very sensitive to input features and size of ground truth data. The results obtained in the research are expected to be useful for improving the vegetation physiognomic mapping in Japan.
Classifying black and white spruce pollen using layered machine learning.
Punyasena, Surangi W; Tcheng, David K; Wesseln, Cassandra; Mueller, Pietra G
2012-11-01
Pollen is among the most ubiquitous of terrestrial fossils, preserving an extended record of vegetation change. However, this temporal continuity comes with a taxonomic tradeoff. Analytical methods that improve the taxonomic precision of pollen identifications would expand the research questions that could be addressed by pollen, in fields such as paleoecology, paleoclimatology, biostratigraphy, melissopalynology, and forensics. We developed a supervised, layered, instance-based machine-learning classification system that uses leave-one-out bias optimization and discriminates among small variations in pollen shape, size, and texture. We tested our system on black and white spruce, two paleoclimatically significant taxa in the North American Quaternary. We achieved > 93% grain-to-grain classification accuracies in a series of experiments with both fossil and reference material. More significantly, when applied to Quaternary samples, the learning system was able to replicate the count proportions of a human expert (R(2) = 0.78, P = 0.007), with one key difference - the machine achieved these ratios by including larger numbers of grains with low-confidence identifications. Our results demonstrate the capability of machine-learning systems to solve the most challenging palynological classification problem, the discrimination of congeneric species, extending the capabilities of the pollen analyst and improving the taxonomic resolution of the palynological record. © 2012 The Authors. New Phytologist © 2012 New Phytologist Trust.
Box, Simon
2014-01-01
Optimal switching of traffic lights on a network of junctions is a computationally intractable problem. In this research, road traffic networks containing signallized junctions are simulated. A computer game interface is used to enable a human ‘player’ to control the traffic light settings on the junctions within the simulation. A supervised learning approach, based on simple neural network classifiers can be used to capture human player's strategies in the game and thus develop a human-trained machine control (HuTMaC) system that approaches human levels of performance. Experiments conducted within the simulation compare the performance of HuTMaC to two well-established traffic-responsive control systems that are widely deployed in the developed world and also to a temporal difference learning-based control method. In all experiments, HuTMaC outperforms the other control methods in terms of average delay and variance over delay. The conclusion is that these results add weight to the suggestion that HuTMaC may be a viable alternative, or supplemental method, to approximate optimization for some practical engineering control problems where the optimal strategy is computationally intractable. PMID:26064570
Box, Simon
2014-12-01
Optimal switching of traffic lights on a network of junctions is a computationally intractable problem. In this research, road traffic networks containing signallized junctions are simulated. A computer game interface is used to enable a human 'player' to control the traffic light settings on the junctions within the simulation. A supervised learning approach, based on simple neural network classifiers can be used to capture human player's strategies in the game and thus develop a human-trained machine control (HuTMaC) system that approaches human levels of performance. Experiments conducted within the simulation compare the performance of HuTMaC to two well-established traffic-responsive control systems that are widely deployed in the developed world and also to a temporal difference learning-based control method. In all experiments, HuTMaC outperforms the other control methods in terms of average delay and variance over delay. The conclusion is that these results add weight to the suggestion that HuTMaC may be a viable alternative, or supplemental method, to approximate optimization for some practical engineering control problems where the optimal strategy is computationally intractable.
Data-driven advice for applying machine learning to bioinformatics problems
Olson, Randal S.; La Cava, William; Mustahsan, Zairah; Varik, Akshay; Moore, Jason H.
2017-01-01
As the bioinformatics field grows, it must keep pace not only with new data but with new algorithms. Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems. PMID:29218881
Two-layer contractive encodings for learning stable nonlinear features.
Schulz, Hannes; Cho, Kyunghyun; Raiko, Tapani; Behnke, Sven
2015-04-01
Unsupervised learning of feature hierarchies is often a good strategy to initialize deep architectures for supervised learning. Most existing deep learning methods build these feature hierarchies layer by layer in a greedy fashion using either auto-encoders or restricted Boltzmann machines. Both yield encoders which compute linear projections of input followed by a smooth thresholding function. In this work, we demonstrate that these encoders fail to find stable features when the required computation is in the exclusive-or class. To overcome this limitation, we propose a two-layer encoder which is less restricted in the type of features it can learn. The proposed encoder is regularized by an extension of previous work on contractive regularization. This proposed two-layer contractive encoder potentially poses a more difficult optimization problem, and we further propose to linearly transform hidden neurons of the encoder to make learning easier. We demonstrate the advantages of the two-layer encoders qualitatively on artificially constructed datasets as well as commonly used benchmark datasets. We also conduct experiments on a semi-supervised learning task and show the benefits of the proposed two-layer encoders trained with the linear transformation of perceptrons. Copyright © 2014 Elsevier Ltd. All rights reserved.
A Novel Method for Satellite Maneuver Prediction
NASA Astrophysics Data System (ADS)
Shabarekh, C.; Kent-Bryant, J.; Keselman, G.; Mitidis, A.
2016-09-01
A space operations tradecraft consisting of detect-track-characterize-catalog is insufficient for maintaining Space Situational Awareness (SSA) as space becomes increasingly congested and contested. In this paper, we apply analytical methodology from the Geospatial-Intelligence (GEOINT) community to a key challenge in SSA: predicting where and when a satellite may maneuver in the future. We developed a machine learning approach to probabilistically characterize Patterns of Life (PoL) for geosynchronous (GEO) satellites. PoL are repeatable, predictable behaviors that an object exhibits within a context and is driven by spatio-temporal, relational, environmental and physical constraints. An example of PoL are station-keeping maneuvers in GEO which become generally predictable as the satellite re-positions itself to account for orbital perturbations. In an earlier publication, we demonstrated the ability to probabilistically predict maneuvers of the Galaxy 15 (NORAD ID: 28884) satellite with high confidence eight days in advance of the actual maneuver. Additionally, we were able to detect deviations from expected PoL within hours of the predicted maneuver [6]. This was done with a custom unsupervised machine learning algorithm, the Interval Similarity Model (ISM), which learns repeating intervals of maneuver patterns from unlabeled historical observations and then predicts future maneuvers. In this paper, we introduce a supervised machine learning algorithm that works in conjunction with the ISM to produce a probabilistic distribution of when future maneuvers will occur. The supervised approach uses a Support Vector Machine (SVM) to process the orbit state whereas the ISM processes the temporal intervals between maneuvers and the physics-based characteristics of the maneuvers. This multiple model approach capitalizes on the mathematical strengths of each respective algorithm while incorporating multiple features and inputs. Initial findings indicate that the combined approach can predict 70% of maneuver times within 3 days of a true maneuver time and 22% of maneuver times within 24 hours of a maneuver. We have also been able to detect deviations from expected maneuver patterns up to a week in advance.
Learning locality preserving graph from data.
Zhang, Yan-Ming; Huang, Kaizhu; Hou, Xinwen; Liu, Cheng-Lin
2014-11-01
Machine learning based on graph representation, or manifold learning, has attracted great interest in recent years. As the discrete approximation of data manifold, the graph plays a crucial role in these kinds of learning approaches. In this paper, we propose a novel learning method for graph construction, which is distinct from previous methods in that it solves an optimization problem with the aim of directly preserving the local information of the original data set. We show that the proposed objective has close connections with the popular Laplacian Eigenmap problem, and is hence well justified. The optimization turns out to be a quadratic programming problem with n(n-1)/2 variables (n is the number of data points). Exploiting the sparsity of the graph, we further propose a more efficient cutting plane algorithm to solve the problem, making the method better scalable in practice. In the context of clustering and semi-supervised learning, we demonstrated the advantages of our proposed method by experiments.
Plaza-Leiva, Victoria; Gomez-Ruiz, Jose Antonio; Mandow, Anthony; García-Cerezo, Alfonso
2017-03-15
Improving the effectiveness of spatial shape features classification from 3D lidar data is very relevant because it is largely used as a fundamental step towards higher level scene understanding challenges of autonomous vehicles and terrestrial robots. In this sense, computing neighborhood for points in dense scans becomes a costly process for both training and classification. This paper proposes a new general framework for implementing and comparing different supervised learning classifiers with a simple voxel-based neighborhood computation where points in each non-overlapping voxel in a regular grid are assigned to the same class by considering features within a support region defined by the voxel itself. The contribution provides offline training and online classification procedures as well as five alternative feature vector definitions based on principal component analysis for scatter, tubular and planar shapes. Moreover, the feasibility of this approach is evaluated by implementing a neural network (NN) method previously proposed by the authors as well as three other supervised learning classifiers found in scene processing methods: support vector machines (SVM), Gaussian processes (GP), and Gaussian mixture models (GMM). A comparative performance analysis is presented using real point clouds from both natural and urban environments and two different 3D rangefinders (a tilting Hokuyo UTM-30LX and a Riegl). Classification performance metrics and processing time measurements confirm the benefits of the NN classifier and the feasibility of voxel-based neighborhood.
Generating a Spanish Affective Dictionary with Supervised Learning Techniques
ERIC Educational Resources Information Center
Bermudez-Gonzalez, Daniel; Miranda-Jiménez, Sabino; García-Moreno, Raúl-Ulises; Calderón-Nepamuceno, Dora
2016-01-01
Nowadays, machine learning techniques are being used in several Natural Language Processing (NLP) tasks such as Opinion Mining (OM). OM is used to analyse and determine the affective orientation of texts. Usually, OM approaches use affective dictionaries in order to conduct sentiment analysis. These lexicons are labeled manually with affective…
Jimeno Yepes, Antonio
2017-09-01
Word sense disambiguation helps identifying the proper sense of ambiguous words in text. With large terminologies such as the UMLS Metathesaurus ambiguities appear and highly effective disambiguation methods are required. Supervised learning algorithm methods are used as one of the approaches to perform disambiguation. Features extracted from the context of an ambiguous word are used to identify the proper sense of such a word. The type of features have an impact on machine learning methods, thus affect disambiguation performance. In this work, we have evaluated several types of features derived from the context of the ambiguous word and we have explored as well more global features derived from MEDLINE using word embeddings. Results show that word embeddings improve the performance of more traditional features and allow as well using recurrent neural network classifiers based on Long-Short Term Memory (LSTM) nodes. The combination of unigrams and word embeddings with an SVM sets a new state of the art performance with a macro accuracy of 95.97 in the MSH WSD data set. Copyright © 2017 Elsevier Inc. All rights reserved.
Stanescu, Ana; Caragea, Doina
2015-01-01
Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
2015-01-01
Background Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. Results Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. Conclusions In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework. PMID:26356316
Inferring physical properties of galaxies from their emission-line spectra
NASA Astrophysics Data System (ADS)
Ucci, G.; Ferrara, A.; Gallerani, S.; Pallottini, A.
2017-02-01
We present a new approach based on Supervised Machine Learning algorithms to infer key physical properties of galaxies (density, metallicity, column density and ionization parameter) from their emission-line spectra. We introduce a numerical code (called GAME, GAlaxy Machine learning for Emission lines) implementing this method and test it extensively. GAME delivers excellent predictive performances, especially for estimates of metallicity and column densities. We compare GAME with the most widely used diagnostics (e.g. R23, [N II] λ6584/Hα indicators) showing that it provides much better accuracy and wider applicability range. GAME is particularly suitable for use in combination with Integral Field Unit spectroscopy, both for rest-frame optical/UV nebular lines and far-infrared/sub-millimeter lines arising from photodissociation regions. Finally, GAME can also be applied to the analysis of synthetic galaxy maps built from numerical simulations.
Nonlinear Deep Kernel Learning for Image Annotation.
Jiu, Mingyuan; Sahbi, Hichem
2017-02-08
Multiple kernel learning (MKL) is a widely used technique for kernel design. Its principle consists in learning, for a given support vector classifier, the most suitable convex (or sparse) linear combination of standard elementary kernels. However, these combinations are shallow and often powerless to capture the actual similarity between highly semantic data, especially for challenging classification tasks such as image annotation. In this paper, we redefine multiple kernels using deep multi-layer networks. In this new contribution, a deep multiple kernel is recursively defined as a multi-layered combination of nonlinear activation functions, each one involves a combination of several elementary or intermediate kernels, and results into a positive semi-definite deep kernel. We propose four different frameworks in order to learn the weights of these networks: supervised, unsupervised, kernel-based semisupervised and Laplacian-based semi-supervised. When plugged into support vector machines (SVMs), the resulting deep kernel networks show clear gain, compared to several shallow kernels for the task of image annotation. Extensive experiments and analysis on the challenging ImageCLEF photo annotation benchmark, the COREL5k database and the Banana dataset validate the effectiveness of the proposed method.
CHISSL: A Human-Machine Collaboration Space for Unsupervised Learning
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arendt, Dustin L.; Komurlu, Caner; Blaha, Leslie M.
We developed CHISSL, a human-machine interface that utilizes supervised machine learning in an unsupervised context to help the user group unlabeled instances by her own mental model. The user primarily interacts via correction (moving a misplaced instance into its correct group) or confirmation (accepting that an instance is placed in its correct group). Concurrent with the user's interactions, CHISSL trains a classification model guided by the user's grouping of the data. It then predicts the group of unlabeled instances and arranges some of these alongside the instances manually organized by the user. We hypothesize that this mode of human andmore » machine collaboration is more effective than Active Learning, wherein the machine decides for itself which instances should be labeled by the user. We found supporting evidence for this hypothesis in a pilot study where we applied CHISSL to organize a collection of handwritten digits.« less
Perceptron ensemble of graph-based positive-unlabeled learning for disease gene identification.
Jowkar, Gholam-Hossein; Mansoori, Eghbal G
2016-10-01
Identification of disease genes, using computational methods, is an important issue in biomedical and bioinformatics research. According to observations that diseases with the same or similar phenotype have the same biological characteristics, researchers have tried to identify genes by using machine learning tools. In recent attempts, some semi-supervised learning methods, called positive-unlabeled learning, is used for disease gene identification. In this paper, we present a Perceptron ensemble of graph-based positive-unlabeled learning (PEGPUL) on three types of biological attributes: gene ontologies, protein domains and protein-protein interaction networks. In our method, a reliable set of positive and negative genes are extracted using co-training schema. Then, the similarity graph of genes is built using metric learning by concentrating on multi-rank-walk method to perform inference from labeled genes. At last, a Perceptron ensemble is learned from three weighted classifiers: multilevel support vector machine, k-nearest neighbor and decision tree. The main contributions of this paper are: (i) incorporating the statistical properties of gene data through choosing proper metrics, (ii) statistical evaluation of biological features, and (iii) noise robustness characteristic of PEGPUL via using multilevel schema. In order to assess PEGPUL, we have applied it on 12950 disease genes with 949 positive genes from six class of diseases and 12001 unlabeled genes. Compared with some popular disease gene identification methods, the experimental results show that PEGPUL has reasonable performance. Copyright © 2016 Elsevier Ltd. All rights reserved.
Knowledge based word-concept model estimation and refinement for biomedical text mining.
Jimeno Yepes, Antonio; Berlanga, Rafael
2015-02-01
Text mining of scientific literature has been essential for setting up large public biomedical databases, which are being widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KB) has enabled a myriad of machine learning methods for different text mining related tasks. Unfortunately, KBs have not been devised for text mining tasks but for human interpretation, thus performance of KB-based methods is usually lower when compared to supervised machine learning methods. The disadvantage of supervised methods though is they require labeled training data and therefore not useful for large scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method not only takes into account the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method obtained a higher degree of accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches. Copyright © 2014 Elsevier Inc. All rights reserved.
Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert
2012-01-01
Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the “wisdom of the crowds.” Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., “funky jazz with saxophone,” “spooky electronica,” etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data. PMID:22460786
Inverse Problems in Geodynamics Using Machine Learning Algorithms
NASA Astrophysics Data System (ADS)
Shahnas, M. H.; Yuen, D. A.; Pysklywec, R. N.
2018-01-01
During the past few decades numerical studies have been widely employed to explore the style of circulation and mixing in the mantle of Earth and other planets. However, in geodynamical studies there are many properties from mineral physics, geochemistry, and petrology in these numerical models. Machine learning, as a computational statistic-related technique and a subfield of artificial intelligence, has rapidly emerged recently in many fields of sciences and engineering. We focus here on the application of supervised machine learning (SML) algorithms in predictions of mantle flow processes. Specifically, we emphasize on estimating mantle properties by employing machine learning techniques in solving an inverse problem. Using snapshots of numerical convection models as training samples, we enable machine learning models to determine the magnitude of the spin transition-induced density anomalies that can cause flow stagnation at midmantle depths. Employing support vector machine algorithms, we show that SML techniques can successfully predict the magnitude of mantle density anomalies and can also be used in characterizing mantle flow patterns. The technique can be extended to more complex geodynamic problems in mantle dynamics by employing deep learning algorithms for putting constraints on properties such as viscosity, elastic parameters, and the nature of thermal and chemical anomalies.
Game-powered machine learning.
Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert
2012-04-24
Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the "wisdom of the crowds." Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., "funky jazz with saxophone," "spooky electronica," etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data.
Label Information Guided Graph Construction for Semi-Supervised Learning.
Zhuang, Liansheng; Zhou, Zihan; Gao, Shenghua; Yin, Jingwen; Lin, Zhouchen; Ma, Yi
2017-09-01
In the literature, most existing graph-based semi-supervised learning methods only use the label information of observed samples in the label propagation stage, while ignoring such valuable information when learning the graph. In this paper, we argue that it is beneficial to consider the label information in the graph learning stage. Specifically, by enforcing the weight of edges between labeled samples of different classes to be zero, we explicitly incorporate the label information into the state-of-the-art graph learning methods, such as the low-rank representation (LRR), and propose a novel semi-supervised graph learning method called semi-supervised low-rank representation. This results in a convex optimization problem with linear constraints, which can be solved by the linearized alternating direction method. Though we take LRR as an example, our proposed method is in fact very general and can be applied to any self-representation graph learning methods. Experiment results on both synthetic and real data sets demonstrate that the proposed graph learning method can better capture the global geometric structure of the data, and therefore is more effective for semi-supervised learning tasks.
Separation of pulsar signals from noise using supervised machine learning algorithms
NASA Astrophysics Data System (ADS)
Bethapudi, S.; Desai, S.
2018-04-01
We evaluate the performance of four different machine learning (ML) algorithms: an Artificial Neural Network Multi-Layer Perceptron (ANN MLP), Adaboost, Gradient Boosting Classifier (GBC), and XGBoost, for the separation of pulsars from radio frequency interference (RFI) and other sources of noise, using a dataset obtained from the post-processing of a pulsar search pipeline. This dataset was previously used for the cross-validation of the SPINN-based machine learning engine, obtained from the reprocessing of the HTRU-S survey data (Morello et al., 2014). We have used the Synthetic Minority Over-sampling Technique (SMOTE) to deal with high-class imbalance in the dataset. We report a variety of quality scores from all four of these algorithms on both the non-SMOTE and SMOTE datasets. For all the above ML methods, we report high accuracy and G-mean for both the non-SMOTE and SMOTE cases. We study the feature importances using Adaboost, GBC, and XGBoost and also from the minimum Redundancy Maximum Relevance approach to report algorithm-agnostic feature ranking. From these methods, we find that the signal to noise of the folded profile to be the best feature. We find that all the ML algorithms report FPRs about an order of magnitude lower than the corresponding FPRs obtained in Morello et al. (2014), for the same recall value.
Sentiment classification technology based on Markov logic networks
NASA Astrophysics Data System (ADS)
He, Hui; Li, Zhigang; Yao, Chongchong; Zhang, Weizhe
2016-07-01
With diverse online media emerging, there is a growing concern of sentiment classification problem. At present, text sentiment classification mainly utilizes supervised machine learning methods, which feature certain domain dependency. On the basis of Markov logic networks (MLNs), this study proposed a cross-domain multi-task text sentiment classification method rooted in transfer learning. Through many-to-one knowledge transfer, labeled text sentiment classification, knowledge was successfully transferred into other domains, and the precision of the sentiment classification analysis in the text tendency domain was improved. The experimental results revealed the following: (1) the model based on a MLN demonstrated higher precision than the single individual learning plan model. (2) Multi-task transfer learning based on Markov logical networks could acquire more knowledge than self-domain learning. The cross-domain text sentiment classification model could significantly improve the precision and efficiency of text sentiment classification.
Improving semi-automated segmentation by integrating learning with active sampling
NASA Astrophysics Data System (ADS)
Huo, Jing; Okada, Kazunori; Brown, Matthew
2012-02-01
Interactive segmentation algorithms such as GrowCut usually require quite a few user interactions to perform well, and have poor repeatability. In this study, we developed a novel technique to boost the performance of the interactive segmentation method GrowCut involving: 1) a novel "focused sampling" approach for supervised learning, as opposed to conventional random sampling; 2) boosting GrowCut using the machine learned results. We applied the proposed technique to the glioblastoma multiforme (GBM) brain tumor segmentation, and evaluated on a dataset of ten cases from a multiple center pharmaceutical drug trial. The results showed that the proposed system has the potential to reduce user interaction while maintaining similar segmentation accuracy.
Safe semi-supervised learning based on weighted likelihood.
Kawakita, Masanori; Takeuchi, Jun'ichi
2014-05-01
We are interested in developing a safe semi-supervised learning that works in any situation. Semi-supervised learning postulates that n(') unlabeled data are available in addition to n labeled data. However, almost all of the previous semi-supervised methods require additional assumptions (not only unlabeled data) to make improvements on supervised learning. If such assumptions are not met, then the methods possibly perform worse than supervised learning. Sokolovska, Cappé, and Yvon (2008) proposed a semi-supervised method based on a weighted likelihood approach. They proved that this method asymptotically never performs worse than supervised learning (i.e., it is safe) without any assumption. Their method is attractive because it is easy to implement and is potentially general. Moreover, it is deeply related to a certain statistical paradox. However, the method of Sokolovska et al. (2008) assumes a very limited situation, i.e., classification, discrete covariates, n(')→∞ and a maximum likelihood estimator. In this paper, we extend their method by modifying the weight. We prove that our proposal is safe in a significantly wide range of situations as long as n≤n('). Further, we give a geometrical interpretation of the proof of safety through the relationship with the above-mentioned statistical paradox. Finally, we show that the above proposal is asymptotically safe even when n(')
Applying Machine Learning to Star Cluster Classification
NASA Astrophysics Data System (ADS)
Fedorenko, Kristina; Grasha, Kathryn; Calzetti, Daniela; Mahadevan, Sridhar
2016-01-01
Catalogs describing populations of star clusters are essential in investigating a range of important issues, from star formation to galaxy evolution. Star cluster catalogs are typically created in a two-step process: in the first step, a catalog of sources is automatically produced; in the second step, each of the extracted sources is visually inspected by 3-to-5 human classifiers and assigned a category. Classification by humans is labor-intensive and time consuming, thus it creates a bottleneck, and substantially slows down progress in star cluster research.We seek to automate the process of labeling star clusters (the second step) through applying supervised machine learning techniques. This will provide a fast, objective, and reproducible classification. Our data is HST (WFC3 and ACS) images of galaxies in the distance range of 3.5-12 Mpc, with a few thousand star clusters already classified by humans as a part of the LEGUS (Legacy ExtraGalactic UV Survey) project. The classification is based on 4 labels (Class 1 - symmetric, compact cluster; Class 2 - concentrated object with some degree of asymmetry; Class 3 - multiple peak system, diffuse; and Class 4 - spurious detection). We start by looking at basic machine learning methods such as decision trees. We then proceed to evaluate performance of more advanced techniques, focusing on convolutional neural networks and other Deep Learning methods. We analyze the results, and suggest several directions for further improvement.
Cole-Lewis, Heather; Varghese, Arun; Sanders, Amy; Schwarz, Mary; Pugatch, Jillian
2015-01-01
Background Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public’s knowledge, attitudes, and beliefs surrounding e-cigarettes, and subsequently guide public health interventions. Objective Our aim was to establish a supervised machine learning algorithm to build predictive classification models that assess Twitter data for a range of factors related to e-cigarettes. Methods Manual content analysis was conducted for 17,098 tweets. These tweets were coded for five categories: e-cigarette relevance, sentiment, user description, genre, and theme. Machine learning classification models were then built for each of these five categories, and word groupings (n-grams) were used to define the feature space for each classifier. Results Predictive performance scores for classification models indicated that the models correctly labeled the tweets with the appropriate variables between 68.40% and 99.34% of the time, and the percentage of maximum possible improvement over a random baseline that was achieved by the classification models ranged from 41.59% to 80.62%. Classifiers with the highest performance scores that also achieved the highest percentage of the maximum possible improvement over a random baseline were Policy/Government (performance: 0.94; % improvement: 80.62%), Relevance (performance: 0.94; % improvement: 75.26%), Ad or Promotion (performance: 0.89; % improvement: 72.69%), and Marketing (performance: 0.91; % improvement: 72.56%). The most appropriate word-grouping unit (n-gram) was 1 for the majority of classifiers. Performance continued to marginally increase with the size of the training dataset of manually annotated data, but eventually leveled off. Even at low dataset sizes of 4000 observations, performance characteristics were fairly sound. Conclusions Social media outlets like Twitter can uncover real-time snapshots of personal sentiment, knowledge, attitudes, and behavior that are not as accessible, at this scale, through any other offline platform. Using the vast data available through social media presents an opportunity for social science and public health methodologies to utilize computational methodologies to enhance and extend research and practice. This study was successful in automating a complex five-category manual content analysis of e-cigarette-related content on Twitter using machine learning techniques. The study details machine learning model specifications that provided the best accuracy for data related to e-cigarettes, as well as a replicable methodology to allow extension of these methods to additional topics. PMID:26307512
Supervised Learning Applied to Air Traffic Trajectory Classification
NASA Technical Reports Server (NTRS)
Bosson, Christabelle S.; Nikoleris, Tasos
2018-01-01
Given the recent increase of interest in introducing new vehicle types and missions into the National Airspace System, a transition towards a more autonomous air traffic control system is required in order to enable and handle increased density and complexity. This paper presents an exploratory effort of the needed autonomous capabilities by exploring supervised learning techniques in the context of aircraft trajectories. In particular, it focuses on the application of machine learning algorithms and neural network models to a runway recognition trajectory-classification study. It investigates the applicability and effectiveness of various classifiers using datasets containing trajectory records for a month of air traffic. A feature importance and sensitivity analysis are conducted to challenge the chosen time-based datasets and the ten selected features. The study demonstrates that classification accuracy levels of 90% and above can be reached in less than 40 seconds of training for most machine learning classifiers when one track data point, described by the ten selected features at a particular time step, per trajectory is used as input. It also shows that neural network models can achieve similar accuracy levels but at higher training time costs.
PlantRNA_Sniffer: A SVM-Based Workflow to Predict Long Intergenic Non-Coding RNAs in Plants.
Vieira, Lucas Maciel; Grativol, Clicia; Thiebaut, Flavia; Carvalho, Thais G; Hardoim, Pablo R; Hemerly, Adriana; Lifschitz, Sergio; Ferreira, Paulo Cavalcanti Gomes; Walter, Maria Emilia M T
2017-03-04
Non-coding RNAs (ncRNAs) constitute an important set of transcripts produced in the cells of organisms. Among them, there is a large amount of a particular class of long ncRNAs that are difficult to predict, the so-called long intergenic ncRNAs (lincRNAs), which might play essential roles in gene regulation and other cellular processes. Despite the importance of these lincRNAs, there is still a lack of biological knowledge and, currently, the few computational methods considered are so specific that they cannot be successfully applied to other species different from those that they have been originally designed to. Prediction of lncRNAs have been performed with machine learning techniques. Particularly, for lincRNA prediction, supervised learning methods have been explored in recent literature. As far as we know, there are no methods nor workflows specially designed to predict lincRNAs in plants. In this context, this work proposes a workflow to predict lincRNAs on plants, considering a workflow that includes known bioinformatics tools together with machine learning techniques, here a support vector machine (SVM). We discuss two case studies that allowed to identify novel lincRNAs, in sugarcane ( Saccharum spp.) and in maize ( Zea mays ). From the results, we also could identify differentially-expressed lincRNAs in sugarcane and maize plants submitted to pathogenic and beneficial microorganisms.
PlantRNA_Sniffer: A SVM-Based Workflow to Predict Long Intergenic Non-Coding RNAs in Plants
Vieira, Lucas Maciel; Grativol, Clicia; Thiebaut, Flavia; Carvalho, Thais G.; Hardoim, Pablo R.; Hemerly, Adriana; Lifschitz, Sergio; Ferreira, Paulo Cavalcanti Gomes; Walter, Maria Emilia M. T.
2017-01-01
Non-coding RNAs (ncRNAs) constitute an important set of transcripts produced in the cells of organisms. Among them, there is a large amount of a particular class of long ncRNAs that are difficult to predict, the so-called long intergenic ncRNAs (lincRNAs), which might play essential roles in gene regulation and other cellular processes. Despite the importance of these lincRNAs, there is still a lack of biological knowledge and, currently, the few computational methods considered are so specific that they cannot be successfully applied to other species different from those that they have been originally designed to. Prediction of lncRNAs have been performed with machine learning techniques. Particularly, for lincRNA prediction, supervised learning methods have been explored in recent literature. As far as we know, there are no methods nor workflows specially designed to predict lincRNAs in plants. In this context, this work proposes a workflow to predict lincRNAs on plants, considering a workflow that includes known bioinformatics tools together with machine learning techniques, here a support vector machine (SVM). We discuss two case studies that allowed to identify novel lincRNAs, in sugarcane (Saccharum spp.) and in maize (Zea mays). From the results, we also could identify differentially-expressed lincRNAs in sugarcane and maize plants submitted to pathogenic and beneficial microorganisms. PMID:29657283
Predicting human protein function with multi-task deep neural networks.
Fa, Rui; Cozzetto, Domenico; Wan, Cen; Jones, David T
2018-01-01
Machine learning methods for protein function prediction are urgently needed, especially now that a substantial fraction of known sequences remains unannotated despite the extensive use of functional assignments based on sequence similarity. One major bottleneck supervised learning faces in protein function prediction is the structured, multi-label nature of the problem, because biological roles are represented by lists of terms from hierarchically organised controlled vocabularies such as the Gene Ontology. In this work, we build on recent developments in the area of deep learning and investigate the usefulness of multi-task deep neural networks (MTDNN), which consist of upstream shared layers upon which are stacked in parallel as many independent modules (additional hidden layers with their own output units) as the number of output GO terms (the tasks). MTDNN learns individual tasks partially using shared representations and partially from task-specific characteristics. When no close homologues with experimentally validated functions can be identified, MTDNN gives more accurate predictions than baseline methods based on annotation frequencies in public databases or homology transfers. More importantly, the results show that MTDNN binary classification accuracy is higher than alternative machine learning-based methods that do not exploit commonalities and differences among prediction tasks. Interestingly, compared with a single-task predictor, the performance improvement is not linearly correlated with the number of tasks in MTDNN, but medium size models provide more improvement in our case. One of advantages of MTDNN is that given a set of features, there is no requirement for MTDNN to have a bootstrap feature selection procedure as what traditional machine learning algorithms do. Overall, the results indicate that the proposed MTDNN algorithm improves the performance of protein function prediction. On the other hand, there is still large room for deep learning techniques to further enhance prediction ability.
Unsupervised Learning —A Novel Clustering Method for Rolling Bearing Faults Identification
NASA Astrophysics Data System (ADS)
Kai, Li; Bo, Luo; Tao, Ma; Xuefeng, Yang; Guangming, Wang
2017-12-01
To promptly process the massive fault data and automatically provide accurate diagnosis results, numerous studies have been conducted on intelligent fault diagnosis of rolling bearing. Among these studies, such as artificial neural networks, support vector machines, decision trees and other supervised learning methods are used commonly. These methods can detect the failure of rolling bearing effectively, but to achieve better detection results, it often requires a lot of training samples. Based on above, a novel clustering method is proposed in this paper. This novel method is able to find the correct number of clusters automatically the effectiveness of the proposed method is validated using datasets from rolling element bearings. The diagnosis results show that the proposed method can accurately detect the fault types of small samples. Meanwhile, the diagnosis results are also relative high accuracy even for massive samples.
Psoriasis image representation using patch-based dictionary learning for erythema severity scoring.
George, Yasmeen; Aldeen, Mohammad; Garnavi, Rahil
2018-06-01
Psoriasis is a chronic skin disease which can be life-threatening. Accurate severity scoring helps dermatologists to decide on the treatment. In this paper, we present a semi-supervised computer-aided system for automatic erythema severity scoring in psoriasis images. Firstly, the unsupervised stage includes a novel image representation method. We construct a dictionary, which is then used in the sparse representation for local feature extraction. To acquire the final image representation vector, an aggregation method is exploited over the local features. Secondly, the supervised phase is where various multi-class machine learning (ML) classifiers are trained for erythema severity scoring. Finally, we compare the proposed system with two popular unsupervised feature extractor methods, namely: bag of visual words model (BoVWs) and AlexNet pretrained model. Root mean square error (RMSE) and F1 score are used as performance measures for the learned dictionaries and the trained ML models, respectively. A psoriasis image set consisting of 676 images, is used in this study. Experimental results demonstrate that the use of the proposed procedure can provide a setup where erythema scoring is accurate and consistent. Also, it is revealed that dictionaries with large number of atoms and small patch sizes yield the best representative erythema severity features. Further, random forest (RF) outperforms other classifiers with F1 score 0.71, followed by support vector machine (SVM) and boosting with 0.66 and 0.64 scores, respectively. Furthermore, the conducted comparative studies confirm the effectiveness of the proposed approach with improvement of 9% and 12% over BoVWs and AlexNet based features, respectively. Crown Copyright © 2018. Published by Elsevier Ltd. All rights reserved.
Brain-Machine Interface control of a robot arm using actor-critic rainforcement learning.
Pohlmeyer, Eric A; Mahmoudi, Babak; Geng, Shijia; Prins, Noeline; Sanchez, Justin C
2012-01-01
Here we demonstrate how a marmoset monkey can use a reinforcement learning (RL) Brain-Machine Interface (BMI) to effectively control the movements of a robot arm for a reaching task. In this work, an actor-critic RL algorithm used neural ensemble activity in the monkey's motor cortext to control the robot movements during a two-target decision task. This novel approach to decoding offers unique advantages for BMI control applications. Compared to supervised learning decoding methods, the actor-critic RL algorithm does not require an explicit set of training data to create a static control model, but rather it incrementally adapts the model parameters according to its current performance, in this case requiring only a very basic feedback signal. We show how this algorithm achieved high performance when mapping the monkey's neural states (94%) to robot actions, and only needed to experience a few trials before obtaining accurate real-time control of the robot arm. Since RL methods responsively adapt and adjust their parameters, they can provide a method to create BMIs that are robust against perturbations caused by changes in either the neural input space or the output actions they generate under different task requirements or goals.
Semi-supervised anomaly detection - towards model-independent searches of new physics
NASA Astrophysics Data System (ADS)
Kuusela, Mikael; Vatanen, Tommi; Malmi, Eric; Raiko, Tapani; Aaltonen, Timo; Nagai, Yoshikazu
2012-06-01
Most classification algorithms used in high energy physics fall under the category of supervised machine learning. Such methods require a training set containing both signal and background events and are prone to classification errors should this training data be systematically inaccurate for example due to the assumed MC model. To complement such model-dependent searches, we propose an algorithm based on semi-supervised anomaly detection techniques, which does not require a MC training sample for the signal data. We first model the background using a multivariate Gaussian mixture model. We then search for deviations from this model by fitting to the observations a mixture of the background model and a number of additional Gaussians. This allows us to perform pattern recognition of any anomalous excess over the background. We show by a comparison to neural network classifiers that such an approach is a lot more robust against misspecification of the signal MC than supervised classification. In cases where there is an unexpected signal, a neural network might fail to correctly identify it, while anomaly detection does not suffer from such a limitation. On the other hand, when there are no systematic errors in the training data, both methods perform comparably.
ERIC Educational Resources Information Center
Instructor, 1983
1983-01-01
Instructor's Computer-Using Teachers Board members give practical tips on how to get a classroom ready for a new computer, introduce students to the machine, and help them learn about programing and computer literacy. Safety, scheduling, and supervision requirements are noted. (PP)
Prediction of bacterial associations with plants using a supervised machine-learning approach.
Martínez-García, Pedro Manuel; López-Solanilla, Emilia; Ramos, Cayo; Rodríguez-Palenzuela, Pablo
2016-12-01
Recent scenarios of fresh produce contamination by human enteric pathogens have resulted in severe food-borne outbreaks, and a new paradigm has emerged stating that some human-associated bacteria can use plants as secondary hosts. As a consequence, there has been growing concern in the scientific community about these interactions that have not yet been elucidated. Since this is a relatively new area, there is a lack of strategies to address the problem of food-borne illnesses due to the ingestion of fruits and vegetables. In the present study, we performed specific genome annotations to train a supervised machine-learning model that allows for the identification of plant-associated bacteria with a precision of ∼93%. The application of our method to approximately 9500 genomes predicted several unknown interactions between well-known human pathogens and plants, and it also confirmed several cases for which evidence has been reported. We observed that factors involved in adhesion, the deconstruction of the plant cell wall and detoxifying activities were highlighted as the most predictive features. The application of our strategy to sequenced strains that are involved in food poisoning can be used as a primary screening tool to determine the possible causes of contaminations. © 2016 Society for Applied Microbiology and John Wiley & Sons Ltd.
A new supervised learning algorithm for spiking neurons.
Xu, Yan; Zeng, Xiaoqin; Zhong, Shuiming
2013-06-01
The purpose of supervised learning with temporal encoding for spiking neurons is to make the neurons emit a specific spike train encoded by the precise firing times of spikes. If only running time is considered, the supervised learning for a spiking neuron is equivalent to distinguishing the times of desired output spikes and the other time during the running process of the neuron through adjusting synaptic weights, which can be regarded as a classification problem. Based on this idea, this letter proposes a new supervised learning method for spiking neurons with temporal encoding; it first transforms the supervised learning into a classification problem and then solves the problem by using the perceptron learning rule. The experiment results show that the proposed method has higher learning accuracy and efficiency over the existing learning methods, so it is more powerful for solving complex and real-time problems.
Generalized SMO algorithm for SVM-based multitask learning.
Cai, Feng; Cherkassky, Vladimir
2012-06-01
Exploiting additional information to improve traditional inductive learning is an active research area in machine learning. In many supervised-learning applications, training data can be naturally separated into several groups, and incorporating this group information into learning may improve generalization. Recently, Vapnik proposed a general approach to formalizing such problems, known as "learning with structured data" and its support vector machine (SVM) based optimization formulation called SVM+. Liang and Cherkassky showed the connection between SVM+ and multitask learning (MTL) approaches in machine learning, and proposed an SVM-based formulation for MTL called SVM+MTL for classification. Training the SVM+MTL classifier requires the solution of a large quadratic programming optimization problem which scales as O(n(3)) with sample size n. So there is a need to develop computationally efficient algorithms for implementing SVM+MTL. This brief generalizes Platt's sequential minimal optimization (SMO) algorithm to the SVM+MTL setting. Empirical results show that, for typical SVM+MTL problems, the proposed generalized SMO achieves over 100 times speed-up, in comparison with general-purpose optimization routines.
Zhao, Xiaowei; Ning, Qiao; Chai, Haiting; Ma, Zhiqiang
2015-06-07
As a widespread type of protein post-translational modifications (PTMs), succinylation plays an important role in regulating protein conformation, function and physicochemical properties. Compared with the labor-intensive and time-consuming experimental approaches, computational predictions of succinylation sites are much desirable due to their convenient and fast speed. Currently, numerous computational models have been developed to identify PTMs sites through various types of two-class machine learning algorithms. These methods require both positive and negative samples for training. However, designation of the negative samples of PTMs was difficult and if it is not properly done can affect the performance of computational models dramatically. So that in this work, we implemented the first application of positive samples only learning (PSoL) algorithm to succinylation sites prediction problem, which was a special class of semi-supervised machine learning that used positive samples and unlabeled samples to train the model. Meanwhile, we proposed a novel succinylation sites computational predictor called SucPred (succinylation site predictor) by using multiple feature encoding schemes. Promising results were obtained by the SucPred predictor with an accuracy of 88.65% using 5-fold cross validation on the training dataset and an accuracy of 84.40% on the independent testing dataset, which demonstrated that the positive samples only learning algorithm presented here was particularly useful for identification of protein succinylation sites. Besides, the positive samples only learning algorithm can be applied to build predictors for other types of PTMs sites with ease. A web server for predicting succinylation sites was developed and was freely accessible at http://59.73.198.144:8088/SucPred/. Copyright © 2015 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Das, Siddhartha; Siopsis, George; Weedbrook, Christian
2018-02-01
With the significant advancement in quantum computation during the past couple of decades, the exploration of machine-learning subroutines using quantum strategies has become increasingly popular. Gaussian process regression is a widely used technique in supervised classical machine learning. Here we introduce an algorithm for Gaussian process regression using continuous-variable quantum systems that can be realized with technology based on photonic quantum computers under certain assumptions regarding distribution of data and availability of efficient quantum access. Our algorithm shows that by using a continuous-variable quantum computer a dramatic speedup in computing Gaussian process regression can be achieved, i.e., the possibility of exponentially reducing the time to compute. Furthermore, our results also include a continuous-variable quantum-assisted singular value decomposition method of nonsparse low rank matrices and forms an important subroutine in our Gaussian process regression algorithm.
A Cluster-then-label Semi-supervised Learning Approach for Pathology Image Classification.
Peikari, Mohammad; Salama, Sherine; Nofech-Mozes, Sharon; Martel, Anne L
2018-05-08
Completely labeled pathology datasets are often challenging and time-consuming to obtain. Semi-supervised learning (SSL) methods are able to learn from fewer labeled data points with the help of a large number of unlabeled data points. In this paper, we investigated the possibility of using clustering analysis to identify the underlying structure of the data space for SSL. A cluster-then-label method was proposed to identify high-density regions in the data space which were then used to help a supervised SVM in finding the decision boundary. We have compared our method with other supervised and semi-supervised state-of-the-art techniques using two different classification tasks applied to breast pathology datasets. We found that compared with other state-of-the-art supervised and semi-supervised methods, our SSL method is able to improve classification performance when a limited number of labeled data instances are made available. We also showed that it is important to examine the underlying distribution of the data space before applying SSL techniques to ensure semi-supervised learning assumptions are not violated by the data.
Gönen, Mehmet
2014-01-01
Coupled training of dimensionality reduction and classification is proposed previously to improve the prediction performance for single-label problems. Following this line of research, in this paper, we first introduce a novel Bayesian method that combines linear dimensionality reduction with linear binary classification for supervised multilabel learning and present a deterministic variational approximation algorithm to learn the proposed probabilistic model. We then extend the proposed method to find intrinsic dimensionality of the projected subspace using automatic relevance determination and to handle semi-supervised learning using a low-density assumption. We perform supervised learning experiments on four benchmark multilabel learning data sets by comparing our method with baseline linear dimensionality reduction algorithms. These experiments show that the proposed approach achieves good performance values in terms of hamming loss, average AUC, macro F1, and micro F1 on held-out test data. The low-dimensional embeddings obtained by our method are also very useful for exploratory data analysis. We also show the effectiveness of our approach in finding intrinsic subspace dimensionality and semi-supervised learning tasks. PMID:24532862
Gönen, Mehmet
2014-03-01
Coupled training of dimensionality reduction and classification is proposed previously to improve the prediction performance for single-label problems. Following this line of research, in this paper, we first introduce a novel Bayesian method that combines linear dimensionality reduction with linear binary classification for supervised multilabel learning and present a deterministic variational approximation algorithm to learn the proposed probabilistic model. We then extend the proposed method to find intrinsic dimensionality of the projected subspace using automatic relevance determination and to handle semi-supervised learning using a low-density assumption. We perform supervised learning experiments on four benchmark multilabel learning data sets by comparing our method with baseline linear dimensionality reduction algorithms. These experiments show that the proposed approach achieves good performance values in terms of hamming loss, average AUC, macro F 1 , and micro F 1 on held-out test data. The low-dimensional embeddings obtained by our method are also very useful for exploratory data analysis. We also show the effectiveness of our approach in finding intrinsic subspace dimensionality and semi-supervised learning tasks.
NASA Astrophysics Data System (ADS)
Reynen, Andrew; Audet, Pascal
2017-09-01
A new method using a machine learning technique is applied to event classification and detection at seismic networks. This method is applicable to a variety of network sizes and settings. The algorithm makes use of a small catalogue of known observations across the entire network. Two attributes, the polarization and frequency content, are used as input to regression. These attributes are extracted at predicted arrival times for P and S waves using only an approximate velocity model, as attributes are calculated over large time spans. This method of waveform characterization is shown to be able to distinguish between blasts and earthquakes with 99 per cent accuracy using a network of 13 stations located in Southern California. The combination of machine learning with generalized waveform features is further applied to event detection in Oklahoma, United States. The event detection algorithm makes use of a pair of unique seismic phases to locate events, with a precision directly related to the sampling rate of the generalized waveform features. Over a week of data from 30 stations in Oklahoma, United States are used to automatically detect 25 times more events than the catalogue of the local geological survey, with a false detection rate of less than 2 per cent. This method provides a highly confident way of detecting and locating events. Furthermore, a large number of seismic events can be automatically detected with low false alarm, allowing for a larger automatic event catalogue with a high degree of trust.
Plaza-Leiva, Victoria; Gomez-Ruiz, Jose Antonio; Mandow, Anthony; García-Cerezo, Alfonso
2017-01-01
Improving the effectiveness of spatial shape features classification from 3D lidar data is very relevant because it is largely used as a fundamental step towards higher level scene understanding challenges of autonomous vehicles and terrestrial robots. In this sense, computing neighborhood for points in dense scans becomes a costly process for both training and classification. This paper proposes a new general framework for implementing and comparing different supervised learning classifiers with a simple voxel-based neighborhood computation where points in each non-overlapping voxel in a regular grid are assigned to the same class by considering features within a support region defined by the voxel itself. The contribution provides offline training and online classification procedures as well as five alternative feature vector definitions based on principal component analysis for scatter, tubular and planar shapes. Moreover, the feasibility of this approach is evaluated by implementing a neural network (NN) method previously proposed by the authors as well as three other supervised learning classifiers found in scene processing methods: support vector machines (SVM), Gaussian processes (GP), and Gaussian mixture models (GMM). A comparative performance analysis is presented using real point clouds from both natural and urban environments and two different 3D rangefinders (a tilting Hokuyo UTM-30LX and a Riegl). Classification performance metrics and processing time measurements confirm the benefits of the NN classifier and the feasibility of voxel-based neighborhood. PMID:28294963
Advanced methods in NDE using machine learning approaches
NASA Astrophysics Data System (ADS)
Wunderlich, Christian; Tschöpe, Constanze; Duckhorn, Frank
2018-04-01
Machine learning (ML) methods and algorithms have been applied recently with great success in quality control and predictive maintenance. Its goal to build new and/or leverage existing algorithms to learn from training data and give accurate predictions, or to find patterns, particularly with new and unseen similar data, fits perfectly to Non-Destructive Evaluation. The advantages of ML in NDE are obvious in such tasks as pattern recognition in acoustic signals or automated processing of images from X-ray, Ultrasonics or optical methods. Fraunhofer IKTS is using machine learning algorithms in acoustic signal analysis. The approach had been applied to such a variety of tasks in quality assessment. The principal approach is based on acoustic signal processing with a primary and secondary analysis step followed by a cognitive system to create model data. Already in the second analysis steps unsupervised learning algorithms as principal component analysis are used to simplify data structures. In the cognitive part of the software further unsupervised and supervised learning algorithms will be trained. Later the sensor signals from unknown samples can be recognized and classified automatically by the algorithms trained before. Recently the IKTS team was able to transfer the software for signal processing and pattern recognition to a small printed circuit board (PCB). Still, algorithms will be trained on an ordinary PC; however, trained algorithms run on the Digital Signal Processor and the FPGA chip. The identical approach will be used for pattern recognition in image analysis of OCT pictures. Some key requirements have to be fulfilled, however. A sufficiently large set of training data, a high signal-to-noise ratio, and an optimized and exact fixation of components are required. The automated testing can be done subsequently by the machine. By integrating the test data of many components along the value chain further optimization including lifetime and durability prediction based on big data becomes possible, even if components are used in different versions or configurations. This is the promise behind German Industry 4.0.
Machine Learning Toolkit for Extreme Scale
DOE Office of Scientific and Technical Information (OSTI.GOV)
2014-03-31
Support Vector Machines (SVM) is a popular machine learning technique, which has been applied to a wide range of domains such as science, finance, and social networks for supervised learning. MaTEx undertakes the challenge of designing a scalable parallel SVM training algorithm for large scale systems, which includes commodity multi-core machines, tightly connected supercomputers and cloud computing systems. Several techniques are proposed for improved speed and memory space usage including adaptive and aggressive elimination of samples for faster convergence , and sparse format representation of data samples. Several heuristics for earliest possible to lazy elimination of non-contributing samples are consideredmore » in MaTEx. In many cases, where an early sample elimination might result in a false positive, low overhead mechanisms for reconstruction of key data structures are proposed. The proposed algorithm and heuristics are implemented and evaluated on various publicly available datasets« less
NASA Astrophysics Data System (ADS)
Arsenault, Louis-François; Neuberg, Richard; Hannah, Lauren A.; Millis, Andrew J.
2017-11-01
We present a supervised machine learning approach to the inversion of Fredholm integrals of the first kind as they arise, for example, in the analytic continuation problem of quantum many-body physics. The approach provides a natural regularization for the ill-conditioned inverse of the Fredholm kernel, as well as an efficient and stable treatment of constraints. The key observation is that the stability of the forward problem permits the construction of a large database of outputs for physically meaningful inputs. Applying machine learning to this database generates a regression function of controlled complexity, which returns approximate solutions for previously unseen inputs; the approximate solutions are then projected onto the subspace of functions satisfying relevant constraints. Under standard error metrics the method performs as well or better than the Maximum Entropy method for low input noise and is substantially more robust to increased input noise. We suggest that the methodology will be similarly effective for other problems involving a formally ill-conditioned inversion of an integral operator, provided that the forward problem can be efficiently solved.
Training Feedforward Neural Networks Using Symbiotic Organisms Search Algorithm.
Wu, Haizhou; Zhou, Yongquan; Luo, Qifang; Basset, Mohamed Abdel
2016-01-01
Symbiotic organisms search (SOS) is a new robust and powerful metaheuristic algorithm, which stimulates the symbiotic interaction strategies adopted by organisms to survive and propagate in the ecosystem. In the supervised learning area, it is a challenging task to present a satisfactory and efficient training algorithm for feedforward neural networks (FNNs). In this paper, SOS is employed as a new method for training FNNs. To investigate the performance of the aforementioned method, eight different datasets selected from the UCI machine learning repository are employed for experiment and the results are compared among seven metaheuristic algorithms. The results show that SOS performs better than other algorithms for training FNNs in terms of converging speed. It is also proven that an FNN trained by the method of SOS has better accuracy than most algorithms compared.
Machine learning and next-generation asteroid surveys
NASA Astrophysics Data System (ADS)
Nugent, Carrie R.; Dailey, John; Cutri, Roc M.; Masci, Frank J.; Mainzer, Amy K.
2017-10-01
Next-generation surveys such as NEOCam (Mainzer et al., 2016) will sift through tens of millions of point source detections daily to detect and discover asteroids. This requires new, more efficient techniques to distinguish between solar system objects, background stars and galaxies, and artifacts such as cosmic rays, scattered light and diffraction spikes.Supervised machine learning is a set of algorithms that allows computers to classify data on a training set, and then apply that classification to make predictions on new datasets. It has been employed by a broad range of fields, including computer vision, medical diagnoses, economics, and natural language processing. It has also been applied to astronomical datasets, including transient identification in the Palomar Transient Factory pipeline (Masci et al., 2016), and in the Pan-STARRS1 difference imaging (D. E. Wright et al., 2015).As part of the NEOCam extended phase A work we apply machine learning techniques to the problem of asteroid detection. Asteroid detection is an ideal application of supervised learning, as there is a wealth of metrics associated with each extracted source, and suitable training sets are easily created. Using the vetted NEOWISE dataset (E. L. Wright et al., 2010, Mainzer et al., 2011) as a proof-of-concept of this technique, we applied the python package sklearn. We report on reliability, feature set selection, and the suitability of various algorithms.
Machine learning enhanced optical distance sensor
NASA Astrophysics Data System (ADS)
Amin, M. Junaid; Riza, N. A.
2018-01-01
Presented for the first time is a machine learning enhanced optical distance sensor. The distance sensor is based on our previously demonstrated distance measurement technique that uses an Electronically Controlled Variable Focus Lens (ECVFL) with a laser source to illuminate a target plane with a controlled optical beam spot. This spot with varying spot sizes is viewed by an off-axis camera and the spot size data is processed to compute the distance. In particular, proposed and demonstrated in this paper is the use of a regularized polynomial regression based supervised machine learning algorithm to enhance the accuracy of the operational sensor. The algorithm uses the acquired features and corresponding labels that are the actual target distance values to train a machine learning model. The optimized training model is trained over a 1000 mm (or 1 m) experimental target distance range. Using the machine learning algorithm produces a training set and testing set distance measurement errors of <0.8 mm and <2.2 mm, respectively. The test measurement error is at least a factor of 4 improvement over our prior sensor demonstration without the use of machine learning. Applications for the proposed sensor include industrial scenario distance sensing where target material specific training models can be generated to realize low <1% measurement error distance measurements.
Cross-platform normalization of microarray and RNA-seq data for machine learning applications
Thompson, Jeffrey A.; Tan, Jie
2016-01-01
Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language. PMID:26844019
Identifying images of handwritten digits using deep learning in H2O
NASA Astrophysics Data System (ADS)
Sadhasivam, Jayakumar; Charanya, R.; Kumar, S. Harish; Srinivasan, A.
2017-11-01
Automatic digit recognition is of popular interest today. Deep learning techniques make it possible for object recognition in image data. Perceiving the digit has turned into a fundamental part as far as certifiable applications. Since, digits are composed in various styles in this way to distinguish the digit it is important to perceive and arrange it with the assistance of machine learning methods. This exploration depends on supervised learning vector quantization neural system arranged under counterfeit artificial neural network. The pictures of digits are perceived, prepared and tried. After the system is made digits are prepared utilizing preparing dataset vectors and testing is connected to the pictures of digits which are separated to each other by fragmenting the picture and resizing the digit picture as needs be for better precision.
Paiva, Joana S; Cardoso, João; Pereira, Tânia
2018-01-01
The main goal of this study was to develop an automatic method based on supervised learning methods, able to distinguish healthy from pathologic arterial pulse wave (APW), and those two from noisy waveforms (non-relevant segments of the signal), from the data acquired during a clinical examination with a novel optical system. The APW dataset analysed was composed by signals acquired in a clinical environment from a total of 213 subjects, including healthy volunteers and non-healthy patients. The signals were parameterised by means of 39pulse features: morphologic, time domain statistics, cross-correlation features, wavelet features. Multiclass Support Vector Machine Recursive Feature Elimination (SVM RFE) method was used to select the most relevant features. A comparative study was performed in order to evaluate the performance of the two classifiers: Support Vector Machine (SVM) and Artificial Neural Network (ANN). SVM achieved a statistically significant better performance for this problem with an average accuracy of 0.9917±0.0024 and a F-Measure of 0.9925±0.0019, in comparison with ANN, which reached the values of 0.9847±0.0032 and 0.9852±0.0031 for Accuracy and F-Measure, respectively. A significant difference was observed between the performances obtained with SVM classifier using a different number of features from the original set available. The comparison between SVM and NN allowed reassert the higher performance of SVM. The results obtained in this study showed the potential of the proposed method to differentiate those three important signal outcomes (healthy, pathologic and noise) and to reduce bias associated with clinical diagnosis of cardiovascular disease using APW. Copyright © 2017 Elsevier B.V. All rights reserved.
Lim, Dong Kyu; Long, Nguyen Phuoc; Mo, Changyeun; Dong, Ziyuan; Cui, Lingmei; Kim, Giyoung; Kwon, Sung Won
2017-10-01
The mixing of extraneous ingredients with original products is a common adulteration practice in food and herbal medicines. In particular, authenticity of white rice and its corresponding blended products has become a key issue in food industry. Accordingly, our current study aimed to develop and evaluate a novel discrimination method by combining targeted lipidomics with powerful supervised learning methods, and eventually introduce a platform to verify the authenticity of white rice. A total of 30 cultivars were collected, and 330 representative samples of white rice from Korea and China as well as seven mixing ratios were examined. Random forests (RF), support vector machines (SVM) with a radial basis function kernel, C5.0, model averaged neural network, and k-nearest neighbor classifiers were used for the classification. We achieved desired results, and the classifiers effectively differentiated white rice from Korea to blended samples with high prediction accuracy for the contamination ratio as low as five percent. In addition, RF and SVM classifiers were generally superior to and more robust than the other techniques. Our approach demonstrated that the relative differences in lysoGPLs can be successfully utilized to detect the adulterated mixing of white rice originating from different countries. In conclusion, the present study introduces a novel and high-throughput platform that can be applied to authenticate adulterated admixtures from original white rice samples. Copyright © 2017 Elsevier Ltd. All rights reserved.
Supervised Time Series Event Detector for Building Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
2016-04-13
A machine learning based approach is developed to detect events that have rarely been seen in the historical data. The data can include building energy consumption, sensor data, environmental data and any data that may affect the building's energy consumption. The algorithm is a modified nonlinear Bayesian support vector machine, which examines daily energy consumption profile, detect the days with abnormal events, and diagnose the cause of the events.
Wei, Ning; You, Jia; Friehs, Karl; Flaschel, Erwin; Nattkemper, Tim Wilhelm
2007-08-15
Fermentation industries would benefit from on-line monitoring of important parameters describing cell growth such as cell density and viability during fermentation processes. For this purpose, an in situ probe has been developed, which utilizes a dark field illumination unit to obtain high contrast images with an integrated CCD camera. To test the probe, brewer's yeast Saccharomyces cerevisiae is chosen as the target microorganism. Images of the yeast cells in the bioreactors are captured, processed, and analyzed automatically by means of mechatronics, image processing, and machine learning. Two support vector machine based classifiers are used for separating cells from background, and for distinguishing live from dead cells afterwards. The evaluation of the in situ experiments showed strong correlation between results obtained by the probe and those by widely accepted standard methods. Thus, the in situ probe has been proved to be a feasible device for on-line monitoring of both cell density and viability with high accuracy and stability. (c) 2007 Wiley Periodicals, Inc.
Virus Particle Detection by Convolutional Neural Network in Transmission Electron Microscopy Images.
Ito, Eisuke; Sato, Takaaki; Sano, Daisuke; Utagawa, Etsuko; Kato, Tsuyoshi
2018-06-01
A new computational method for the detection of virus particles in transmission electron microscopy (TEM) images is presented. Our approach is to use a convolutional neural network that transforms a TEM image to a probabilistic map that indicates where virus particles exist in the image. Our proposed approach automatically and simultaneously learns both discriminative features and classifier for virus particle detection by machine learning, in contrast to existing methods that are based on handcrafted features that yield many false positives and require several postprocessing steps. The detection performance of the proposed method was assessed against a dataset of TEM images containing feline calicivirus particles and compared with several existing detection methods, and the state-of-the-art performance of the developed method for detecting virus was demonstrated. Since our method is based on supervised learning that requires both the input images and their corresponding annotations, it is basically used for detection of already-known viruses. However, the method is highly flexible, and the convolutional networks can adapt themselves to any virus particles by learning automatically from an annotated dataset.
NASA Astrophysics Data System (ADS)
Radziszewski, Kacper
2017-10-01
The following paper presents the results of the research in the field of the machine learning, investigating the scope of application of the artificial neural networks algorithms as a tool in architectural design. The computational experiment was held using the backward propagation of errors method of training the artificial neural network, which was trained based on the geometry of the details of the Roman Corinthian order capital. During the experiment, as an input training data set, five local geometry parameters combined has given the best results: Theta, Pi, Rho in spherical coordinate system based on the capital volume centroid, followed by Z value of the Cartesian coordinate system and a distance from vertical planes created based on the capital symmetry. Additionally during the experiment, artificial neural network hidden layers optimal count and structure was found, giving results of the error below 0.2% for the mentioned before input parameters. Once successfully trained artificial network, was able to mimic the details composition on any other geometry type given. Despite of calculating the transformed geometry locally and separately for each of the thousands of surface points, system could create visually attractive and diverse, complex patterns. Designed tool, based on the supervised learning method of machine learning, gives possibility of generating new architectural forms- free of the designer’s imagination bounds. Implementing the infinitely broad computational methods of machine learning, or Artificial Intelligence in general, not only could accelerate and simplify the design process, but give an opportunity to explore never seen before, unpredictable forms or everyday architectural practice solutions.
Jing, Yankang; Bian, Yuemin; Hu, Ziheng; Wang, Lirong; Xie, Xiang-Qun Sean
2018-03-30
Over the last decade, deep learning (DL) methods have been extremely successful and widely used to develop artificial intelligence (AI) in almost every domain, especially after it achieved its proud record on computational Go. Compared to traditional machine learning (ML) algorithms, DL methods still have a long way to go to achieve recognition in small molecular drug discovery and development. And there is still lots of work to do for the popularization and application of DL for research purpose, e.g., for small molecule drug research and development. In this review, we mainly discussed several most powerful and mainstream architectures, including the convolutional neural network (CNN), recurrent neural network (RNN), and deep auto-encoder networks (DAENs), for supervised learning and nonsupervised learning; summarized most of the representative applications in small molecule drug design; and briefly introduced how DL methods were used in those applications. The discussion for the pros and cons of DL methods as well as the main challenges we need to tackle were also emphasized.
Predictive Models of target organ and Systemic toxicities (BOSC)
The objective of this work is to predict the hazard classification and point of departure (PoD) of untested chemicals in repeat-dose animal testing studies. We used supervised machine learning to objectively evaluate the predictive accuracy of different classification and regress...
Unsupervised learning on scientific ocean drilling datasets from the South China Sea
NASA Astrophysics Data System (ADS)
Tse, Kevin C.; Chiu, Hon-Chim; Tsang, Man-Yin; Li, Yiliang; Lam, Edmund Y.
2018-06-01
Unsupervised learning methods were applied to explore data patterns in multivariate geophysical datasets collected from ocean floor sediment core samples coming from scientific ocean drilling in the South China Sea. Compared to studies on similar datasets, but using supervised learning methods which are designed to make predictions based on sample training data, unsupervised learning methods require no a priori information and focus only on the input data. In this study, popular unsupervised learning methods including K-means, self-organizing maps, hierarchical clustering and random forest were coupled with different distance metrics to form exploratory data clusters. The resulting data clusters were externally validated with lithologic units and geologic time scales assigned to the datasets by conventional methods. Compact and connected data clusters displayed varying degrees of correspondence with existing classification by lithologic units and geologic time scales. K-means and self-organizing maps were observed to perform better with lithologic units while random forest corresponded best with geologic time scales. This study sets a pioneering example of how unsupervised machine learning methods can be used as an automatic processing tool for the increasingly high volume of scientific ocean drilling data.
Semi-Supervised Marginal Fisher Analysis for Hyperspectral Image Classification
NASA Astrophysics Data System (ADS)
Huang, H.; Liu, J.; Pan, Y.
2012-07-01
The problem of learning with both labeled and unlabeled examples arises frequently in Hyperspectral image (HSI) classification. While marginal Fisher analysis is a supervised method, which cannot be directly applied for Semi-supervised classification. In this paper, we proposed a novel method, called semi-supervised marginal Fisher analysis (SSMFA), to process HSI of natural scenes, which uses a combination of semi-supervised learning and manifold learning. In SSMFA, a new difference-based optimization objective function with unlabeled samples has been designed. SSMFA preserves the manifold structure of labeled and unlabeled samples in addition to separating labeled samples in different classes from each other. The semi-supervised method has an analytic form of the globally optimal solution, and it can be computed based on eigen decomposition. Classification experiments with a challenging HSI task demonstrate that this method outperforms current state-of-the-art HSI-classification methods.
Guo, Yufan; Silins, Ilona; Stenius, Ulla; Korhonen, Anna
2013-06-01
Techniques that are capable of automatically analyzing the information structure of scientific articles could be highly useful for improving information access to biomedical literature. However, most existing approaches rely on supervised machine learning (ML) and substantial labeled data that are expensive to develop and apply to different sub-fields of biomedicine. Recent research shows that minimal supervision is sufficient for fairly accurate information structure analysis of biomedical abstracts. However, is it realistic for full articles given their high linguistic and informational complexity? We introduce and release a novel corpus of 50 biomedical articles annotated according to the Argumentative Zoning (AZ) scheme, and investigate active learning with one of the most widely used ML models-Support Vector Machines (SVM)-on this corpus. Additionally, we introduce two novel applications that use AZ to support real-life literature review in biomedicine via question answering and summarization. We show that active learning with SVM trained on 500 labeled sentences (6% of the corpus) performs surprisingly well with the accuracy of 82%, just 2% lower than fully supervised learning. In our question answering task, biomedical researchers find relevant information significantly faster from AZ-annotated than unannotated articles. In the summarization task, sentences extracted from particular zones are significantly more similar to gold standard summaries than those extracted from particular sections of full articles. These results demonstrate that active learning of full articles' information structure is indeed realistic and the accuracy is high enough to support real-life literature review in biomedicine. The annotated corpus, our AZ classifier and the two novel applications are available at http://www.cl.cam.ac.uk/yg244/12bioinfo.html
Benchmarking protein classification algorithms via supervised cross-validation.
Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor
2008-04-24
Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
Quantum Support Vector Machine for Big Data Classification
NASA Astrophysics Data System (ADS)
Rebentrost, Patrick; Mohseni, Masoud; Lloyd, Seth
2014-09-01
Supervised machine learning is the classification of new data based on already classified training examples. In this work, we show that the support vector machine, an optimized binary classifier, can be implemented on a quantum computer, with complexity logarithmic in the size of the vectors and the number of training examples. In cases where classical sampling algorithms require polynomial time, an exponential speedup is obtained. At the core of this quantum big data algorithm is a nonsparse matrix exponentiation technique for efficiently performing a matrix inversion of the training data inner-product (kernel) matrix.
Training Feedforward Neural Networks Using Symbiotic Organisms Search Algorithm
Wu, Haizhou; Luo, Qifang
2016-01-01
Symbiotic organisms search (SOS) is a new robust and powerful metaheuristic algorithm, which stimulates the symbiotic interaction strategies adopted by organisms to survive and propagate in the ecosystem. In the supervised learning area, it is a challenging task to present a satisfactory and efficient training algorithm for feedforward neural networks (FNNs). In this paper, SOS is employed as a new method for training FNNs. To investigate the performance of the aforementioned method, eight different datasets selected from the UCI machine learning repository are employed for experiment and the results are compared among seven metaheuristic algorithms. The results show that SOS performs better than other algorithms for training FNNs in terms of converging speed. It is also proven that an FNN trained by the method of SOS has better accuracy than most algorithms compared. PMID:28105044
Machine learning algorithms for mode-of-action classification in toxicity assessment.
Zhang, Yile; Wong, Yau Shu; Deng, Jian; Anton, Cristina; Gabos, Stephan; Zhang, Weiping; Huang, Dorothy Yu; Jin, Can
2016-01-01
Real Time Cell Analysis (RTCA) technology is used to monitor cellular changes continuously over the entire exposure period. Combining with different testing concentrations, the profiles have potential in probing the mode of action (MOA) of the testing substances. In this paper, we present machine learning approaches for MOA assessment. Computational tools based on artificial neural network (ANN) and support vector machine (SVM) are developed to analyze the time-concentration response curves (TCRCs) of human cell lines responding to tested chemicals. The techniques are capable of learning data from given TCRCs with known MOA information and then making MOA classification for the unknown toxicity. A novel data processing step based on wavelet transform is introduced to extract important features from the original TCRC data. From the dose response curves, time interval leading to higher classification success rate can be selected as input to enhance the performance of the machine learning algorithm. This is particularly helpful when handling cases with limited and imbalanced data. The validation of the proposed method is demonstrated by the supervised learning algorithm applied to the exposure data of HepG2 cell line to 63 chemicals with 11 concentrations in each test case. Classification success rate in the range of 85 to 95 % are obtained using SVM for MOA classification with two clusters to cases up to four clusters. Wavelet transform is capable of capturing important features of TCRCs for MOA classification. The proposed SVM scheme incorporated with wavelet transform has a great potential for large scale MOA classification and high-through output chemical screening.
Hasan, Mehedi; Kotov, Alexander; Carcone, April; Dong, Ming; Naar, Sylvie; Hartlieb, Kathryn Brogan
2016-08-01
This study examines the effectiveness of state-of-the-art supervised machine learning methods in conjunction with different feature types for the task of automatic annotation of fragments of clinical text based on codebooks with a large number of categories. We used a collection of motivational interview transcripts consisting of 11,353 utterances, which were manually annotated by two human coders as the gold standard, and experimented with state-of-art classifiers, including Naïve Bayes, J48 Decision Tree, Support Vector Machine (SVM), Random Forest (RF), AdaBoost, DiscLDA, Conditional Random Fields (CRF) and Convolutional Neural Network (CNN) in conjunction with lexical, contextual (label of the previous utterance) and semantic (distribution of words in the utterance across the Linguistic Inquiry and Word Count dictionaries) features. We found out that, when the number of classes is large, the performance of CNN and CRF is inferior to SVM. When only lexical features were used, interview transcripts were automatically annotated by SVM with the highest classification accuracy among all classifiers of 70.8%, 61% and 53.7% based on the codebooks consisting of 17, 20 and 41 codes, respectively. Using contextual and semantic features, as well as their combination, in addition to lexical ones, improved the accuracy of SVM for annotation of utterances in motivational interview transcripts with a codebook consisting of 17 classes to 71.5%, 74.2%, and 75.1%, respectively. Our results demonstrate the potential of using machine learning methods in conjunction with lexical, semantic and contextual features for automatic annotation of clinical interview transcripts with near-human accuracy. Copyright © 2016 Elsevier Inc. All rights reserved.
Unsupervised active learning based on hierarchical graph-theoretic clustering.
Hu, Weiming; Hu, Wei; Xie, Nianhua; Maybank, Steve
2009-10-01
Most existing active learning approaches are supervised. Supervised active learning has the following problems: inefficiency in dealing with the semantic gap between the distribution of samples in the feature space and their labels, lack of ability in selecting new samples that belong to new categories that have not yet appeared in the training samples, and lack of adaptability to changes in the semantic interpretation of sample categories. To tackle these problems, we propose an unsupervised active learning framework based on hierarchical graph-theoretic clustering. In the framework, two promising graph-theoretic clustering algorithms, namely, dominant-set clustering and spectral clustering, are combined in a hierarchical fashion. Our framework has some advantages, such as ease of implementation, flexibility in architecture, and adaptability to changes in the labeling. Evaluations on data sets for network intrusion detection, image classification, and video classification have demonstrated that our active learning framework can effectively reduce the workload of manual classification while maintaining a high accuracy of automatic classification. It is shown that, overall, our framework outperforms the support-vector-machine-based supervised active learning, particularly in terms of dealing much more efficiently with new samples whose categories have not yet appeared in the training samples.
Web-conference supervision for advanced psychotherapy training: a practical guide.
Abbass, Allan; Arthey, Stephen; Elliott, Jason; Fedak, Tim; Nowoweiski, Dion; Markovski, Jasmina; Nowoweiski, Sarah
2011-06-01
The advent of readily accessible, inexpensive Web-conferencing applications has opened the door for distance psychotherapy supervision, using video recordings of treated clients. Although relatively new, this method of supervision is advantageous given the ease of use and low cost of various Internet applications. This method allows periodic supervision from point to point around the world, with no travel costs and no long gaps between direct training contacts. Web-conferencing permits face-to-face training so that the learner and supervisor can read each other's emotional responses while reviewing case material. It allows group learning from direct supervision to complement local peer-to-peer learning methods. In this article, we describe the relevant literature on this type of learning method, the practical points in its utilization, its limitations, and its benefits.
The effect of sample size and disease prevalence on supervised machine learning of narrative data.
McKnight, Lawrence K.; Wilcox, Adam; Hripcsak, George
2002-01-01
This paper examines the independent effects of outcome prevalence and training sample sizes on inductive learning performance. We trained 3 inductive learning algorithms (MC4, IB, and Naïve-Bayes) on 60 simulated datasets of parsed radiology text reports labeled with 6 disease states. Data sets were constructed to define positive outcome states at 4 prevalence rates (1, 5, 10, 25, and 50%) in training set sizes of 200 and 2,000 cases. We found that the effect of outcome prevalence is significant when outcome classes drop below 10% of cases. The effect appeared independent of sample size, induction algorithm used, or class label. Work is needed to identify methods of improving classifier performance when output classes are rare. PMID:12463878
Teacher and learner: Supervised and unsupervised learning in communities.
Shafto, Michael G; Seifert, Colleen M
2015-01-01
How far can teaching methods go to enhance learning? Optimal methods of teaching have been considered in research on supervised and unsupervised learning. Locally optimal methods are usually hybrids of teaching and self-directed approaches. The costs and benefits of specific methods have been shown to depend on the structure of the learning task, the learners, the teachers, and the environment.
NASA Astrophysics Data System (ADS)
Hramov, Alexander E.; Frolov, Nikita S.; Musatov, Vyachaslav Yu.
2018-02-01
In present work we studied features of the human brain states classification, corresponding to the real movements of hands and legs. For this purpose we used supervised learning algorithm based on feed-forward artificial neural networks (ANNs) with error back-propagation along with the support vector machine (SVM) method. We compared the quality of operator movements classification by means of EEG signals obtained experimentally in the absence of preliminary processing and after filtration in different ranges up to 25 Hz. It was shown that low-frequency filtering of multichannel EEG data significantly improved accuracy of operator movements classification.
Pizarro, Ricardo A; Cheng, Xi; Barnett, Alan; Lemaitre, Herve; Verchinski, Beth A; Goldman, Aaron L; Xiao, Ena; Luo, Qian; Berman, Karen F; Callicott, Joseph H; Weinberger, Daniel R; Mattay, Venkata S
2016-01-01
High-resolution three-dimensional magnetic resonance imaging (3D-MRI) is being increasingly used to delineate morphological changes underlying neuropsychiatric disorders. Unfortunately, artifacts frequently compromise the utility of 3D-MRI yielding irreproducible results, from both type I and type II errors. It is therefore critical to screen 3D-MRIs for artifacts before use. Currently, quality assessment involves slice-wise visual inspection of 3D-MRI volumes, a procedure that is both subjective and time consuming. Automating the quality rating of 3D-MRI could improve the efficiency and reproducibility of the procedure. The present study is one of the first efforts to apply a support vector machine (SVM) algorithm in the quality assessment of structural brain images, using global and region of interest (ROI) automated image quality features developed in-house. SVM is a supervised machine-learning algorithm that can predict the category of test datasets based on the knowledge acquired from a learning dataset. The performance (accuracy) of the automated SVM approach was assessed, by comparing the SVM-predicted quality labels to investigator-determined quality labels. The accuracy for classifying 1457 3D-MRI volumes from our database using the SVM approach is around 80%. These results are promising and illustrate the possibility of using SVM as an automated quality assessment tool for 3D-MRI.
Cole-Lewis, Heather; Varghese, Arun; Sanders, Amy; Schwarz, Mary; Pugatch, Jillian; Augustson, Erik
2015-08-25
Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public's knowledge, attitudes, and beliefs surrounding e-cigarettes, and subsequently guide public health interventions. Our aim was to establish a supervised machine learning algorithm to build predictive classification models that assess Twitter data for a range of factors related to e-cigarettes. Manual content analysis was conducted for 17,098 tweets. These tweets were coded for five categories: e-cigarette relevance, sentiment, user description, genre, and theme. Machine learning classification models were then built for each of these five categories, and word groupings (n-grams) were used to define the feature space for each classifier. Predictive performance scores for classification models indicated that the models correctly labeled the tweets with the appropriate variables between 68.40% and 99.34% of the time, and the percentage of maximum possible improvement over a random baseline that was achieved by the classification models ranged from 41.59% to 80.62%. Classifiers with the highest performance scores that also achieved the highest percentage of the maximum possible improvement over a random baseline were Policy/Government (performance: 0.94; % improvement: 80.62%), Relevance (performance: 0.94; % improvement: 75.26%), Ad or Promotion (performance: 0.89; % improvement: 72.69%), and Marketing (performance: 0.91; % improvement: 72.56%). The most appropriate word-grouping unit (n-gram) was 1 for the majority of classifiers. Performance continued to marginally increase with the size of the training dataset of manually annotated data, but eventually leveled off. Even at low dataset sizes of 4000 observations, performance characteristics were fairly sound. Social media outlets like Twitter can uncover real-time snapshots of personal sentiment, knowledge, attitudes, and behavior that are not as accessible, at this scale, through any other offline platform. Using the vast data available through social media presents an opportunity for social science and public health methodologies to utilize computational methodologies to enhance and extend research and practice. This study was successful in automating a complex five-category manual content analysis of e-cigarette-related content on Twitter using machine learning techniques. The study details machine learning model specifications that provided the best accuracy for data related to e-cigarettes, as well as a replicable methodology to allow extension of these methods to additional topics.
Machine Learning Techniques in Clinical Vision Sciences.
Caixinha, Miguel; Nunes, Sandrina
2017-01-01
This review presents and discusses the contribution of machine learning techniques for diagnosis and disease monitoring in the context of clinical vision science. Many ocular diseases leading to blindness can be halted or delayed when detected and treated at its earliest stages. With the recent developments in diagnostic devices, imaging and genomics, new sources of data for early disease detection and patients' management are now available. Machine learning techniques emerged in the biomedical sciences as clinical decision-support techniques to improve sensitivity and specificity of disease detection and monitoring, increasing objectively the clinical decision-making process. This manuscript presents a review in multimodal ocular disease diagnosis and monitoring based on machine learning approaches. In the first section, the technical issues related to the different machine learning approaches will be present. Machine learning techniques are used to automatically recognize complex patterns in a given dataset. These techniques allows creating homogeneous groups (unsupervised learning), or creating a classifier predicting group membership of new cases (supervised learning), when a group label is available for each case. To ensure a good performance of the machine learning techniques in a given dataset, all possible sources of bias should be removed or minimized. For that, the representativeness of the input dataset for the true population should be confirmed, the noise should be removed, the missing data should be treated and the data dimensionally (i.e., the number of parameters/features and the number of cases in the dataset) should be adjusted. The application of machine learning techniques in ocular disease diagnosis and monitoring will be presented and discussed in the second section of this manuscript. To show the clinical benefits of machine learning in clinical vision sciences, several examples will be presented in glaucoma, age-related macular degeneration, and diabetic retinopathy, these ocular pathologies being the major causes of irreversible visual impairment.
Supervised learning with decision margins in pools of spiking neurons.
Le Mouel, Charlotte; Harris, Kenneth D; Yger, Pierre
2014-10-01
Learning to categorise sensory inputs by generalising from a few examples whose category is precisely known is a crucial step for the brain to produce appropriate behavioural responses. At the neuronal level, this may be performed by adaptation of synaptic weights under the influence of a training signal, in order to group spiking patterns impinging on the neuron. Here we describe a framework that allows spiking neurons to perform such "supervised learning", using principles similar to the Support Vector Machine, a well-established and robust classifier. Using a hinge-loss error function, we show that requesting a margin similar to that of the SVM improves performance on linearly non-separable problems. Moreover, we show that using pools of neurons to discriminate categories can also increase the performance by sharing the load among neurons.
NASA Astrophysics Data System (ADS)
Karsi, Redouane; Zaim, Mounia; El Alami, Jamila
2017-07-01
Thanks to the development of the internet, a large community now has the possibility to communicate and express its opinions and preferences through multiple media such as blogs, forums, social networks and e-commerce sites. Today, it becomes clearer that opinions published on the web are a very valuable source for decision-making, so a rapidly growing field of research called “sentiment analysis” is born to address the problem of automatically determining the polarity (Positive, negative, neutral,…) of textual opinions. People expressing themselves in a particular domain often use specific domain language expressions, thus, building a classifier, which performs well in different domains is a challenging problem. The purpose of this paper is to evaluate the impact of domain for sentiment classification when using machine learning techniques. In our study three popular machine learning techniques: Support Vector Machines (SVM), Naive Bayes and K nearest neighbors(KNN) were applied on datasets collected from different domains. Experimental results show that Support Vector Machines outperforms other classifiers in all domains, since it achieved at least 74.75% accuracy with a standard deviation of 4,08.
Cross-Domain Semi-Supervised Learning Using Feature Formulation.
Xingquan Zhu
2011-12-01
Semi-Supervised Learning (SSL) traditionally makes use of unlabeled samples by including them into the training set through an automated labeling process. Such a primitive Semi-Supervised Learning (pSSL) approach suffers from a number of disadvantages including false labeling and incapable of utilizing out-of-domain samples. In this paper, we propose a formative Semi-Supervised Learning (fSSL) framework which explores hidden features between labeled and unlabeled samples to achieve semi-supervised learning. fSSL regards that both labeled and unlabeled samples are generated from some hidden concepts with labeling information partially observable for some samples. The key of the fSSL is to recover the hidden concepts, and take them as new features to link labeled and unlabeled samples for semi-supervised learning. Because unlabeled samples are only used to generate new features, but not to be explicitly included in the training set like pSSL does, fSSL overcomes the inherent disadvantages of the traditional pSSL methods, especially for samples not within the same domain as the labeled instances. Experimental results and comparisons demonstrate that fSSL significantly outperforms pSSL-based methods for both within-domain and cross-domain semi-supervised learning.
Ecological interactions and the Netflix problem.
Desjardins-Proulx, Philippe; Laigle, Idaline; Poisot, Timothée; Gravel, Dominique
2017-01-01
Species interactions are a key component of ecosystems but we generally have an incomplete picture of who-eats-who in a given community. Different techniques have been devised to predict species interactions using theoretical models or abundances. Here, we explore the K nearest neighbour approach, with a special emphasis on recommendation, along with a supervised machine learning technique. Recommenders are algorithms developed for companies like Netflix to predict whether a customer will like a product given the preferences of similar customers. These machine learning techniques are well-suited to study binary ecological interactions since they focus on positive-only data. By removing a prey from a predator, we find that recommenders can guess the missing prey around 50% of the times on the first try, with up to 881 possibilities. Traits do not improve significantly the results for the K nearest neighbour, although a simple test with a supervised learning approach (random forests) show we can predict interactions with high accuracy using only three traits per species. This result shows that binary interactions can be predicted without regard to the ecological community given only three variables: body mass and two variables for the species' phylogeny. These techniques are complementary, as recommenders can predict interactions in the absence of traits, using only information about other species' interactions, while supervised learning algorithms such as random forests base their predictions on traits only but do not exploit other species' interactions. Further work should focus on developing custom similarity measures specialized for ecology to improve the KNN algorithms and using richer data to capture indirect relationships between species.
Ecological interactions and the Netflix problem
Laigle, Idaline; Poisot, Timothée; Gravel, Dominique
2017-01-01
Species interactions are a key component of ecosystems but we generally have an incomplete picture of who-eats-who in a given community. Different techniques have been devised to predict species interactions using theoretical models or abundances. Here, we explore the K nearest neighbour approach, with a special emphasis on recommendation, along with a supervised machine learning technique. Recommenders are algorithms developed for companies like Netflix to predict whether a customer will like a product given the preferences of similar customers. These machine learning techniques are well-suited to study binary ecological interactions since they focus on positive-only data. By removing a prey from a predator, we find that recommenders can guess the missing prey around 50% of the times on the first try, with up to 881 possibilities. Traits do not improve significantly the results for the K nearest neighbour, although a simple test with a supervised learning approach (random forests) show we can predict interactions with high accuracy using only three traits per species. This result shows that binary interactions can be predicted without regard to the ecological community given only three variables: body mass and two variables for the species’ phylogeny. These techniques are complementary, as recommenders can predict interactions in the absence of traits, using only information about other species’ interactions, while supervised learning algorithms such as random forests base their predictions on traits only but do not exploit other species’ interactions. Further work should focus on developing custom similarity measures specialized for ecology to improve the KNN algorithms and using richer data to capture indirect relationships between species. PMID:28828250
McPhee, M J; Walmsley, B J; Skinner, B; Littler, B; Siddell, J P; Cafe, L M; Wilkins, J F; Oddy, V H; Alempijevic, A
2017-04-01
The objective of this study was to develop a proof of concept for using off-the-shelf Red Green Blue-Depth (RGB-D) Microsoft Kinect cameras to objectively assess P8 rump fat (P8 fat; mm) and muscle score (MS) traits in Angus cows and steers. Data from low and high muscled cattle (156 cows and 79 steers) were collected at multiple locations and time points. The following steps were required for the 3-dimensional (3D) image data and subsequent machine learning techniques to learn the traits: 1) reduce the high dimensionality of the point cloud data by extracting features from the input signals to produce a compact and representative feature vector, 2) perform global optimization of the signatures using machine learning algorithms and a parallel genetic algorithm, and 3) train a sensor model using regression-supervised learning techniques on the ultrasound P8 fat and the classified learning techniques for the assessed MS for each animal in the data set. The correlation of estimating hip height (cm) between visually measured and assessed 3D data from RGB-D cameras on cows and steers was 0.75 and 0.90, respectively. The supervised machine learning and global optimization approach correctly classified MS (mean [SD]) 80 (4.7) and 83% [6.6%] for cows and steers, respectively. Kappa tests of MS were 0.74 and 0.79 in cows and steers, respectively, indicating substantial agreement between visual assessment and the learning approaches of RGB-D camera images. A stratified 10-fold cross-validation for P8 fat did not find any differences in the mean bias ( = 0.62 and = 0.42 for cows and steers, respectively). The root mean square error of P8 fat was 1.54 and 1.00 mm for cows and steers, respectively. Additional data is required to strengthen the capacity of machine learning to estimate measured P8 fat and assessed MS. Data sets for and continental cattle are also required to broaden the use of 3D cameras to assess cattle. The results demonstrate the importance of capturing curvature as a form of representing body shape. A data-driven model from shape to trait has established a proof of concept using optimized machine learning techniques to assess P8 fat and MS in Angus cows and steers.
García Arroyo, Jose Luis; García Zapirain, Begoña
2014-01-01
By means of this study, a detection algorithm for the "pigment network" in dermoscopic images is presented, one of the most relevant indicators in the diagnosis of melanoma. The design of the algorithm consists of two blocks. In the first one, a machine learning process is carried out, allowing the generation of a set of rules which, when applied over the image, permit the construction of a mask with the pixels candidates to be part of the pigment network. In the second block, an analysis of the structures over this mask is carried out, searching for those corresponding to the pigment network and making the diagnosis, whether it has pigment network or not, and also generating the mask corresponding to this pattern, if any. The method was tested against a database of 220 images, obtaining 86% sensitivity and 81.67% specificity, which proves the reliability of the algorithm. © 2013 The Authors. Published by Elsevier Ltd. All rights reserved.
Learning disordered topological phases by statistical recovery of symmetry
NASA Astrophysics Data System (ADS)
Yoshioka, Nobuyuki; Akagi, Yutaka; Katsura, Hosho
2018-05-01
We apply the artificial neural network in a supervised manner to map out the quantum phase diagram of disordered topological superconductors in class DIII. Given the disorder that keeps the discrete symmetries of the ensemble as a whole, translational symmetry which is broken in the quasiparticle distribution individually is recovered statistically by taking an ensemble average. By using this, we classify the phases by the artificial neural network that learned the quasiparticle distribution in the clean limit and show that the result is totally consistent with the calculation by the transfer matrix method or noncommutative geometry approach. If all three phases, namely the Z2, trivial, and thermal metal phases, appear in the clean limit, the machine can classify them with high confidence over the entire phase diagram. If only the former two phases are present, we find that the machine remains confused in a certain region, leading us to conclude the detection of the unknown phase which is eventually identified as the thermal metal phase.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Möller, A.; Ruhlmann-Kleider, V.; Leloup, C.
In the era of large astronomical surveys, photometric classification of supernovae (SNe) has become an important research field due to limited spectroscopic resources for candidate follow-up and classification. In this work, we present a method to photometrically classify type Ia supernovae based on machine learning with redshifts that are derived from the SN light-curves. This method is implemented on real data from the SNLS deferred pipeline, a purely photometric pipeline that identifies SNe Ia at high-redshifts (0.2 < z < 1.1). Our method consists of two stages: feature extraction (obtaining the SN redshift from photometry and estimating light-curve shape parameters)more » and machine learning classification. We study the performance of different algorithms such as Random Forest and Boosted Decision Trees. We evaluate the performance using SN simulations and real data from the first 3 years of the Supernova Legacy Survey (SNLS), which contains large spectroscopically and photometrically classified type Ia samples. Using the Area Under the Curve (AUC) metric, where perfect classification is given by 1, we find that our best-performing classifier (Extreme Gradient Boosting Decision Tree) has an AUC of 0.98.We show that it is possible to obtain a large photometrically selected type Ia SN sample with an estimated contamination of less than 5%. When applied to data from the first three years of SNLS, we obtain 529 events. We investigate the differences between classifying simulated SNe, and real SN survey data. In particular, we find that applying a thorough set of selection cuts to the SN sample is essential for good classification. This work demonstrates for the first time the feasibility of machine learning classification in a high- z SN survey with application to real SN data.« less
An Efficient Semi-supervised Learning Approach to Predict SH2 Domain Mediated Interactions.
Kundu, Kousik; Backofen, Rolf
2017-01-01
Src homology 2 (SH2) domain is an important subclass of modular protein domains that plays an indispensable role in several biological processes in eukaryotes. SH2 domains specifically bind to the phosphotyrosine residue of their binding peptides to facilitate various molecular functions. For determining the subtle binding specificities of SH2 domains, it is very important to understand the intriguing mechanisms by which these domains recognize their target peptides in a complex cellular environment. There are several attempts have been made to predict SH2-peptide interactions using high-throughput data. However, these high-throughput data are often affected by a low signal to noise ratio. Furthermore, the prediction methods have several additional shortcomings, such as linearity problem, high computational complexity, etc. Thus, computational identification of SH2-peptide interactions using high-throughput data remains challenging. Here, we propose a machine learning approach based on an efficient semi-supervised learning technique for the prediction of 51 SH2 domain mediated interactions in the human proteome. In our study, we have successfully employed several strategies to tackle the major problems in computational identification of SH2-peptide interactions.
Supervised multimedia categorization
NASA Astrophysics Data System (ADS)
Aldershoff, Frank; Salden, Alfons H.; Iacob, Sorin M.; Kempen, Masja
2003-01-01
Static multimedia on the Web can already be hardly structured manually. Although unavoidable and necessary, manual annotation of dynamic multimedia becomes even less feasible when multimedia quickly changes in complexity, i.e. in volume, modality, and usage context. The latter context could be set by learning or other purposes of the multimedia material. This multimedia dynamics calls for categorisation systems that index, query and retrieve multimedia objects on the fly in a similar way as a human expert would. We present and demonstrate such a supervised dynamic multimedia object categorisation system. Our categorisation system comes about by continuously gauging it to a group of human experts who annotate raw multimedia for a certain domain ontology given a usage context. Thus effectively our system learns the categorisation behaviour of human experts. By inducing supervised multi-modal content and context-dependent potentials our categorisation system associates field strengths of raw dynamic multimedia object categorisations with those human experts would assign. After a sufficient long period of supervised machine learning we arrive at automated robust and discriminative multimedia categorisation. We demonstrate the usefulness and effectiveness of our multimedia categorisation system in retrieving semantically meaningful soccer-video fragments, in particular by taking advantage of multimodal and domain specific information and knowledge supplied by human experts.
Machine learning with naturally labeled data for identifying abbreviation definitions.
Yeganova, Lana; Comeau, Donald C; Wilbur, W John
2011-06-09
The rapid growth of biomedical literature requires accurate text analysis and text processing tools. Detecting abbreviations and identifying their definitions is an important component of such tools. Most existing approaches for the abbreviation definition identification task employ rule-based methods. While achieving high precision, rule-based methods are limited to the rules defined and fail to capture many uncommon definition patterns. Supervised learning techniques, which offer more flexibility in detecting abbreviation definitions, have also been applied to the problem. However, they require manually labeled training data. In this work, we develop a machine learning algorithm for abbreviation definition identification in text which makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly mixing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples. Then, the learned feature weights are used to identify the abbreviation full form. This approach does not require manually labeled training data. We evaluate the performance of our algorithm on the Ab3P, BIOADI and Medstract corpora. Our system demonstrated results that compare favourably to the existing Ab3P and BIOADI systems. We achieve an F-measure of 91.36% on Ab3P corpus, and an F-measure of 87.13% on BIOADI corpus which are superior to the results reported by Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.
Spiking neuron network Helmholtz machine.
Sountsov, Pavel; Miller, Paul
2015-01-01
An increasing amount of behavioral and neurophysiological data suggests that the brain performs optimal (or near-optimal) probabilistic inference and learning during perception and other tasks. Although many machine learning algorithms exist that perform inference and learning in an optimal way, the complete description of how one of those algorithms (or a novel algorithm) can be implemented in the brain is currently incomplete. There have been many proposed solutions that address how neurons can perform optimal inference but the question of how synaptic plasticity can implement optimal learning is rarely addressed. This paper aims to unify the two fields of probabilistic inference and synaptic plasticity by using a neuronal network of realistic model spiking neurons to implement a well-studied computational model called the Helmholtz Machine. The Helmholtz Machine is amenable to neural implementation as the algorithm it uses to learn its parameters, called the wake-sleep algorithm, uses a local delta learning rule. Our spiking-neuron network implements both the delta rule and a small example of a Helmholtz machine. This neuronal network can learn an internal model of continuous-valued training data sets without supervision. The network can also perform inference on the learned internal models. We show how various biophysical features of the neural implementation constrain the parameters of the wake-sleep algorithm, such as the duration of the wake and sleep phases of learning and the minimal sample duration. We examine the deviations from optimal performance and tie them to the properties of the synaptic plasticity rule.
Spiking neuron network Helmholtz machine
Sountsov, Pavel; Miller, Paul
2015-01-01
An increasing amount of behavioral and neurophysiological data suggests that the brain performs optimal (or near-optimal) probabilistic inference and learning during perception and other tasks. Although many machine learning algorithms exist that perform inference and learning in an optimal way, the complete description of how one of those algorithms (or a novel algorithm) can be implemented in the brain is currently incomplete. There have been many proposed solutions that address how neurons can perform optimal inference but the question of how synaptic plasticity can implement optimal learning is rarely addressed. This paper aims to unify the two fields of probabilistic inference and synaptic plasticity by using a neuronal network of realistic model spiking neurons to implement a well-studied computational model called the Helmholtz Machine. The Helmholtz Machine is amenable to neural implementation as the algorithm it uses to learn its parameters, called the wake-sleep algorithm, uses a local delta learning rule. Our spiking-neuron network implements both the delta rule and a small example of a Helmholtz machine. This neuronal network can learn an internal model of continuous-valued training data sets without supervision. The network can also perform inference on the learned internal models. We show how various biophysical features of the neural implementation constrain the parameters of the wake-sleep algorithm, such as the duration of the wake and sleep phases of learning and the minimal sample duration. We examine the deviations from optimal performance and tie them to the properties of the synaptic plasticity rule. PMID:25954191
Semi-Supervised Clustering for High-Dimensional and Sparse Features
ERIC Educational Resources Information Center
Yan, Su
2010-01-01
Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised where class labels are unknown a priori. In real application domains, however, some "weak" form of side…
ERIC Educational Resources Information Center
Carr, Linda
This student guide is intended to assist persons employed as supervisors in understanding and using communication equipment. Discussed in the first three sections are the following topics: producing and storing information (communicating, storing, and retrieving information and using word processors and talking machines); communicating information…
Imaging and machine learning techniques for diagnosis of Alzheimer's disease.
Mirzaei, Golrokh; Adeli, Anahita; Adeli, Hojjat
2016-12-01
Alzheimer's disease (AD) is a common health problem in elderly people. There has been considerable research toward the diagnosis and early detection of this disease in the past decade. The sensitivity of biomarkers and the accuracy of the detection techniques have been defined to be the key to an accurate diagnosis. This paper presents a state-of-the-art review of the research performed on the diagnosis of AD based on imaging and machine learning techniques. Different segmentation and machine learning techniques used for the diagnosis of AD are reviewed including thresholding, supervised and unsupervised learning, probabilistic techniques, Atlas-based approaches, and fusion of different image modalities. More recent and powerful classification techniques such as the enhanced probabilistic neural network of Ahmadlou and Adeli should be investigated with the goal of improving the diagnosis accuracy. A combination of different image modalities can help improve the diagnosis accuracy rate. Research is needed on the combination of modalities to discover multi-modal biomarkers.
Introduction to Concepts in Artificial Neural Networks
NASA Technical Reports Server (NTRS)
Niebur, Dagmar
1995-01-01
This introduction to artificial neural networks summarizes some basic concepts of computational neuroscience and the resulting models of artificial neurons. The terminology of biological and artificial neurons, biological and machine learning and neural processing is introduced. The concepts of supervised and unsupervised learning are explained with examples from the power system area. Finally, a taxonomy of different types of neurons and different classes of artificial neural networks is presented.
Drory, Ami; Li, Hongdong; Hartley, Richard
2017-04-11
We present a supervised machine learning approach for markerless estimation of human full-body kinematics for a cyclist from an unconstrained colour image. This approach is motivated by the limitations of existing marker-based approaches restricted by infrastructure, environmental conditions, and obtrusive markers. By using a discriminatively learned mixture-of-parts model, we construct a probabilistic tree representation to model the configuration and appearance of human body joints. During the learning stage, a Structured Support Vector Machine (SSVM) learns body parts appearance and spatial relations. In the testing stage, the learned models are employed to recover body pose via searching in a test image over a pyramid structure. We focus on the movement modality of cycling to demonstrate the efficacy of our approach. In natura estimation of cycling kinematics using images is challenging because of human interaction with a bicycle causing frequent occlusions. We make no assumptions in relation to the kinematic constraints of the model, nor the appearance of the scene. Our technique finds multiple quality hypotheses for the pose. We evaluate the precision of our method on two new datasets using loss functions. Our method achieves a score of 91.1 and 69.3 on mean Probability of Correct Keypoint (PCK) measure and 88.7 and 66.1 on the Average Precision of Keypoints (APK) measure for the frontal and sagittal datasets respectively. We conclude that our method opens new vistas to robust user-interaction free estimation of full body kinematics, a prerequisite to motion analysis. Copyright © 2017 Elsevier Ltd. All rights reserved.
Predicting novel microRNA: a comprehensive comparison of machine learning approaches.
Stegmayer, Georgina; Di Persia, Leandro E; Rubiolo, Mariano; Gerard, Matias; Pividori, Milton; Yones, Cristian; Bugnon, Leandro A; Rodriguez, Tadeo; Raad, Jonathan; Milone, Diego H
2018-05-23
The importance of microRNAs (miRNAs) is widely recognized in the community nowadays because these short segments of RNA can play several roles in almost all biological processes. The computational prediction of novel miRNAs involves training a classifier for identifying sequences having the highest chance of being precursors of miRNAs (pre-miRNAs). The big issue with this task is that well-known pre-miRNAs are usually few in comparison with the hundreds of thousands of candidate sequences in a genome, which results in high class imbalance. This imbalance has a strong influence on most standard classifiers, and if not properly addressed in the model and the experiments, not only performance reported can be completely unrealistic but also the classifier will not be able to work properly for pre-miRNA prediction. Besides, another important issue is that for most of the machine learning (ML) approaches already used (supervised methods), it is necessary to have both positive and negative examples. The selection of positive examples is straightforward (well-known pre-miRNAs). However, it is difficult to build a representative set of negative examples because they should be sequences with hairpin structure that do not contain a pre-miRNA. This review provides a comprehensive study and comparative assessment of methods from these two ML approaches for dealing with the prediction of novel pre-miRNAs: supervised and unsupervised training. We present and analyze the ML proposals that have appeared during the past 10 years in literature. They have been compared in several prediction tasks involving two model genomes and increasing imbalance levels. This work provides a review of existing ML approaches for pre-miRNA prediction and fair comparisons of the classifiers with same features and data sets, instead of just a revision of published software tools. The results and the discussion can help the community to select the most adequate bioinformatics approach according to the prediction task at hand. The comparative results obtained suggest that from low to mid-imbalance levels between classes, supervised methods can be the best. However, at very high imbalance levels, closer to real case scenarios, models including unsupervised and deep learning can provide better performance.
Evaluation of Semi-supervised Learning for Classification of Protein Crystallization Imagery.
Sigdel, Madhav; Dinç, İmren; Dinç, Semih; Sigdel, Madhu S; Pusey, Marc L; Aygün, Ramazan S
2014-03-01
In this paper, we investigate the performance of two wrapper methods for semi-supervised learning algorithms for classification of protein crystallization images with limited labeled images. Firstly, we evaluate the performance of semi-supervised approach using self-training with naïve Bayesian (NB) and sequential minimum optimization (SMO) as the base classifiers. The confidence values returned by these classifiers are used to select high confident predictions to be used for self-training. Secondly, we analyze the performance of Yet Another Two Stage Idea (YATSI) semi-supervised learning using NB, SMO, multilayer perceptron (MLP), J48 and random forest (RF) classifiers. These results are compared with the basic supervised learning using the same training sets. We perform our experiments on a dataset consisting of 2250 protein crystallization images for different proportions of training and test data. Our results indicate that NB and SMO using both self-training and YATSI semi-supervised approaches improve accuracies with respect to supervised learning. On the other hand, MLP, J48 and RF perform better using basic supervised learning. Overall, random forest classifier yields the best accuracy with supervised learning for our dataset.
NASA Astrophysics Data System (ADS)
Ratnam, T. C.; Ghosh, D. P.; Negash, B. M.
2018-05-01
Conventional reservoir modeling employs variograms to predict the spatial distribution of petrophysical properties. This study aims to improve property distribution by incorporating elastic wave properties. In this study, elastic wave properties obtained from seismic inversion are used as input for an artificial neural network to predict neutron porosity in between well locations. The method employed in this study is supervised learning based on available well logs. This method converts every seismic trace into a pseudo-well log, hence reducing the uncertainty between well locations. By incorporating the seismic response, the reliance on geostatistical methods such as variograms for the distribution of petrophysical properties is reduced drastically. The results of the artificial neural network show good correlation with the neutron porosity log which gives confidence for spatial prediction in areas where well logs are not available.
Incremental Transductive Learning Approaches to Schistosomiasis Vector Classification
NASA Astrophysics Data System (ADS)
Fusco, Terence; Bi, Yaxin; Wang, Haiying; Browne, Fiona
2016-08-01
The key issues pertaining to collection of epidemic disease data for our analysis purposes are that it is a labour intensive, time consuming and expensive process resulting in availability of sparse sample data which we use to develop prediction models. To address this sparse data issue, we present the novel Incremental Transductive methods to circumvent the data collection process by applying previously acquired data to provide consistent, confidence-based labelling alternatives to field survey research. We investigated various reasoning approaches for semi-supervised machine learning including Bayesian models for labelling data. The results show that using the proposed methods, we can label instances of data with a class of vector density at a high level of confidence. By applying the Liberal and Strict Training Approaches, we provide a labelling and classification alternative to standalone algorithms. The methods in this paper are components in the process of reducing the proliferation of the Schistosomiasis disease and its effects.
Lin, Chin; Hsu, Chia-Jung; Lou, Yu-Sheng; Yeh, Shih-Jen; Lee, Chia-Cheng; Su, Sui-Lung; Chen, Hsiang-Cheng
2017-11-06
Automated disease code classification using free-text medical information is important for public health surveillance. However, traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN). Our objective was to compare the performance of traditional pipelines (NLP plus supervised machine learning models) with that of word embedding combined with a CNN in conducting a classification task identifying International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis codes in discharge notes. We used 2 classification methods: (1) extracting from discharge notes some features (terms, n-gram phrases, and SNOMED CT categories) that we used to train a set of supervised machine learning models (support vector machine, random forests, and gradient boosting machine), and (2) building a feature matrix, by a pretrained word embedding model, that we used to train a CNN. We used these methods to identify the chapter-level ICD-10-CM diagnosis codes in a set of discharge notes. We conducted the evaluation using 103,390 discharge notes covering patients hospitalized from June 1, 2015 to January 31, 2017 in the Tri-Service General Hospital in Taipei, Taiwan. We used the receiver operating characteristic curve as an evaluation measure, and calculated the area under the curve (AUC) and F-measure as the global measure of effectiveness. In 5-fold cross-validation tests, our method had a higher testing accuracy (mean AUC 0.9696; mean F-measure 0.9086) than traditional NLP-based approaches (mean AUC range 0.8183-0.9571; mean F-measure range 0.5050-0.8739). A real-world simulation that split the training sample and the testing sample by date verified this result (mean AUC 0.9645; mean F-measure 0.9003 using the proposed method). Further analysis showed that the convolutional layers of the CNN effectively identified a large number of keywords and automatically extracted enough concepts to predict the diagnosis codes. Word embedding combined with a CNN showed outstanding performance compared with traditional methods, needing very little data preprocessing. This shows that future studies will not be limited by incomplete dictionaries. A large amount of unstructured information from free-text medical writing will be extracted by automated approaches in the future, and we believe that the health care field is about to enter the age of big data. ©Chin Lin, Chia-Jung Hsu, Yu-Sheng Lou, Shih-Jen Yeh, Chia-Cheng Lee, Sui-Lung Su, Hsiang-Cheng Chen. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 06.11.2017.
Machine learning for the New York City power grid.
Rudin, Cynthia; Waltz, David; Anderson, Roger N; Boulanger, Albert; Salleb-Aouissi, Ansaf; Chow, Maggie; Dutta, Haimonti; Gross, Philip N; Huang, Bert; Ierome, Steve; Isaac, Delfina F; Kressner, Arthur; Passonneau, Rebecca J; Radeva, Axinia; Wu, Leon
2012-02-01
Power companies can benefit from the use of knowledge discovery methods and statistical machine learning for preventive maintenance. We introduce a general process for transforming historical electrical grid data into models that aim to predict the risk of failures for components and systems. These models can be used directly by power companies to assist with prioritization of maintenance and repair work. Specialized versions of this process are used to produce 1) feeder failure rankings, 2) cable, joint, terminator, and transformer rankings, 3) feeder Mean Time Between Failure (MTBF) estimates, and 4) manhole events vulnerability rankings. The process in its most general form can handle diverse, noisy, sources that are historical (static), semi-real-time, or realtime, incorporates state-of-the-art machine learning algorithms for prioritization (supervised ranking or MTBF), and includes an evaluation of results via cross-validation and blind test. Above and beyond the ranked lists and MTBF estimates are business management interfaces that allow the prediction capability to be integrated directly into corporate planning and decision support; such interfaces rely on several important properties of our general modeling approach: that machine learning features are meaningful to domain experts, that the processing of data is transparent, and that prediction results are accurate enough to support sound decision making. We discuss the challenges in working with historical electrical grid data that were not designed for predictive purposes. The “rawness” of these data contrasts with the accuracy of the statistical models that can be obtained from the process; these models are sufficiently accurate to assist in maintaining New York City’s electrical grid.
Amaral, Jorge L M; Lopes, Agnaldo J; Jansen, José M; Faria, Alvaro C D; Melo, Pedro L
2013-12-01
The purpose of this study was to develop an automatic classifier to increase the accuracy of the forced oscillation technique (FOT) for diagnosing early respiratory abnormalities in smoking patients. The data consisted of FOT parameters obtained from 56 volunteers, 28 healthy and 28 smokers with low tobacco consumption. Many supervised learning techniques were investigated, including logistic linear classifiers, k nearest neighbor (KNN), neural networks and support vector machines (SVM). To evaluate performance, the ROC curve of the most accurate parameter was established as baseline. To determine the best input features and classifier parameters, we used genetic algorithms and a 10-fold cross-validation using the average area under the ROC curve (AUC). In the first experiment, the original FOT parameters were used as input. We observed a significant improvement in accuracy (KNN=0.89 and SVM=0.87) compared with the baseline (0.77). The second experiment performed a feature selection on the original FOT parameters. This selection did not cause any significant improvement in accuracy, but it was useful in identifying more adequate FOT parameters. In the third experiment, we performed a feature selection on the cross products of the FOT parameters. This selection resulted in a further increase in AUC (KNN=SVM=0.91), which allows for high diagnostic accuracy. In conclusion, machine learning classifiers can help identify early smoking-induced respiratory alterations. The use of FOT cross products and the search for the best features and classifier parameters can markedly improve the performance of machine learning classifiers. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Valizadegan, Hamed; Martin, Rodney; McCauliff, Sean D.; Jenkins, Jon Michael; Catanzarite, Joseph; Oza, Nikunj C.
2015-08-01
Building new catalogues of planetary candidates, astrophysical false alarms, and non-transiting phenomena is a challenging task that currently requires a reviewing team of astrophysicists and astronomers. These scientists need to examine more than 100 diagnostic metrics and associated graphics for each candidate exoplanet-transit-like signal to classify it into one of the three classes. Considering that the NASA Explorer Program's TESS mission and ESA's PLATO mission survey even a larger area of space, the classification of their transit-like signals is more time-consuming for human agents and a bottleneck to successfully construct the new catalogues in a timely manner. This encourages building automatic classification tools that can quickly and reliably classify the new signal data from these missions. The standard tool for building automatic classification systems is the supervised machine learning that requires a large set of highly accurate labeled examples in order to build an effective classifier. This requirement cannot be easily met for classifying transit-like signals because not only are existing labeled signals very limited, but also the current labels may not be reliable (because the labeling process is a subjective task). Our experiments with using different supervised classifiers to categorize transit-like signals verifies that the labeled signals are not rich enough to provide the classifier with enough power to generalize well beyond the observed cases (e.g. to unseen or test signals). That motivated us to utilize a new category of learning techniques, so-called semi-supervised learning, that combines the label information from the costly labeled signals, and distribution information from the cheaply available unlabeled signals in order to construct more effective classifiers. Our study on the Kepler Mission data shows that semi-supervised learning can significantly improve the result of multiple base classifiers (e.g. Support Vector Machines, AdaBoost, and Decision Tree) and is a good technique for automatic classification of exoplanet-transit-like signal.
Dawes, Timothy J W; de Marvao, Antonio; Shi, Wenzhe; Fletcher, Tristan; Watson, Geoffrey M J; Wharton, John; Rhodes, Christopher J; Howard, Luke S G E; Gibbs, J Simon R; Rueckert, Daniel; Cook, Stuart A; Wilkins, Martin R; O'Regan, Declan P
2017-05-01
Purpose To determine if patient survival and mechanisms of right ventricular failure in pulmonary hypertension could be predicted by using supervised machine learning of three-dimensional patterns of systolic cardiac motion. Materials and Methods The study was approved by a research ethics committee, and participants gave written informed consent. Two hundred fifty-six patients (143 women; mean age ± standard deviation, 63 years ± 17) with newly diagnosed pulmonary hypertension underwent cardiac magnetic resonance (MR) imaging, right-sided heart catheterization, and 6-minute walk testing with a median follow-up of 4.0 years. Semiautomated segmentation of short-axis cine images was used to create a three-dimensional model of right ventricular motion. Supervised principal components analysis was used to identify patterns of systolic motion that were most strongly predictive of survival. Survival prediction was assessed by using difference in median survival time and area under the curve with time-dependent receiver operating characteristic analysis for 1-year survival. Results At the end of follow-up, 36% of patients (93 of 256) died, and one underwent lung transplantation. Poor outcome was predicted by a loss of effective contraction in the septum and free wall, coupled with reduced basal longitudinal motion. When added to conventional imaging and hemodynamic, functional, and clinical markers, three-dimensional cardiac motion improved survival prediction (area under the receiver operating characteristic curve, 0.73 vs 0.60, respectively; P < .001) and provided greater differentiation according to difference in median survival time between high- and low-risk groups (13.8 vs 10.7 years, respectively; P < .001). Conclusion A machine-learning survival model that uses three-dimensional cardiac motion predicts outcome independent of conventional risk factors in patients with newly diagnosed pulmonary hypertension. Online supplemental material is available for this article.
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali
2017-01-01
Objectives Widespread implementation of electronic databases has improved the accessibility of plaintext clinical information for supplementary use. Numerous machine learning techniques, such as supervised machine learning approaches or ontology-based approaches, have been employed to obtain useful information from plaintext clinical data. This study proposes an automatic multi-class classification system to predict accident-related causes of death from plaintext autopsy reports through expert-driven feature selection with supervised automatic text classification decision models. Methods Accident-related autopsy reports were obtained from one of the largest hospital in Kuala Lumpur. These reports belong to nine different accident-related causes of death. Master feature vector was prepared by extracting features from the collected autopsy reports by using unigram with lexical categorization. This master feature vector was used to detect cause of death [according to internal classification of disease version 10 (ICD-10) classification system] through five automated feature selection schemes, proposed expert-driven approach, five subset sizes of features, and five machine learning classifiers. Model performance was evaluated using precisionM, recallM, F-measureM, accuracy, and area under ROC curve. Four baselines were used to compare the results with the proposed system. Results Random forest and J48 decision models parameterized using expert-driven feature selection yielded the highest evaluation measure approaching (85% to 90%) for most metrics by using a feature subset size of 30. The proposed system also showed approximately 14% to 16% improvement in the overall accuracy compared with the existing techniques and four baselines. Conclusion The proposed system is feasible and practical to use for automatic classification of ICD-10-related cause of death from autopsy reports. The proposed system assists pathologists to accurately and rapidly determine underlying cause of death based on autopsy findings. Furthermore, the proposed expert-driven feature selection approach and the findings are generally applicable to other kinds of plaintext clinical reports. PMID:28166263
Slip, David J.; Hocking, David P.; Harcourt, Robert G.
2016-01-01
Constructing activity budgets for marine animals when they are at sea and cannot be directly observed is challenging, but recent advances in bio-logging technology offer solutions to this problem. Accelerometers can potentially identify a wide range of behaviours for animals based on unique patterns of acceleration. However, when analysing data derived from accelerometers, there are many statistical techniques available which when applied to different data sets produce different classification accuracies. We investigated a selection of supervised machine learning methods for interpreting behavioural data from captive otariids (fur seals and sea lions). We conducted controlled experiments with 12 seals, where their behaviours were filmed while they were wearing 3-axis accelerometers. From video we identified 26 behaviours that could be grouped into one of four categories (foraging, resting, travelling and grooming) representing key behaviour states for wild seals. We used data from 10 seals to train four predictive classification models: stochastic gradient boosting (GBM), random forests, support vector machine using four different kernels and a baseline model: penalised logistic regression. We then took the best parameters from each model and cross-validated the results on the two seals unseen so far. We also investigated the influence of feature statistics (describing some characteristic of the seal), testing the models both with and without these. Cross-validation accuracies were lower than training accuracy, but the SVM with a polynomial kernel was still able to classify seal behaviour with high accuracy (>70%). Adding feature statistics improved accuracies across all models tested. Most categories of behaviour -resting, grooming and feeding—were all predicted with reasonable accuracy (52–81%) by the SVM while travelling was poorly categorised (31–41%). These results show that model selection is important when classifying behaviour and that by using animal characteristics we can strengthen the overall accuracy. PMID:28002450
NASA Astrophysics Data System (ADS)
Levchuk, Georgiy; Colonna-Romano, John; Eslami, Mohammed
2017-05-01
The United States increasingly relies on cyber-physical systems to conduct military and commercial operations. Attacks on these systems have increased dramatically around the globe. The attackers constantly change their methods, making state-of-the-art commercial and military intrusion detection systems ineffective. In this paper, we present a model to identify functional behavior of network devices from netflow traces. Our model includes two innovations. First, we define novel features for a host IP using detection of application graph patterns in IP's host graph constructed from 5-min aggregated packet flows. Second, we present the first application, to the best of our knowledge, of Graph Semi-Supervised Learning (GSSL) to the space of IP behavior classification. Using a cyber-attack dataset collected from NetFlow packet traces, we show that GSSL trained with only 20% of the data achieves higher attack detection rates than Support Vector Machines (SVM) and Naïve Bayes (NB) classifiers trained with 80% of data points. We also show how to improve detection quality by filtering out web browsing data, and conclude with discussion of future research directions.
Supervised detection of exoplanets in high-contrast imaging sequences
NASA Astrophysics Data System (ADS)
Gomez Gonzalez, C. A.; Absil, O.; Van Droogenbroeck, M.
2018-06-01
Context. Post-processing algorithms play a key role in pushing the detection limits of high-contrast imaging (HCI) instruments. State-of-the-art image processing approaches for HCI enable the production of science-ready images relying on unsupervised learning techniques, such as low-rank approximations, for generating a model point spread function (PSF) and subtracting the residual starlight and speckle noise. Aims: In order to maximize the detection rate of HCI instruments and survey campaigns, advanced algorithms with higher sensitivities to faint companions are needed, especially for the speckle-dominated innermost region of the images. Methods: We propose a reformulation of the exoplanet detection task (for ADI sequences) that builds on well-established machine learning techniques to take HCI post-processing from an unsupervised to a supervised learning context. In this new framework, we present algorithmic solutions using two different discriminative models: SODIRF (random forests) and SODINN (neural networks). We test these algorithms on real ADI datasets from VLT/NACO and VLT/SPHERE HCI instruments. We then assess their performances by injecting fake companions and using receiver operating characteristic analysis. This is done in comparison with state-of-the-art ADI algorithms, such as ADI principal component analysis (ADI-PCA). Results: This study shows the improved sensitivity versus specificity trade-off of the proposed supervised detection approach. At the diffraction limit, SODINN improves the true positive rate by a factor ranging from 2 to 10 (depending on the dataset and angular separation) with respect to ADI-PCA when working at the same false-positive level. Conclusions: The proposed supervised detection framework outperforms state-of-the-art techniques in the task of discriminating planet signal from speckles. In addition, it offers the possibility of re-processing existing HCI databases to maximize their scientific return and potentially improve the demographics of directly imaged exoplanets.
Multisubject Learning for Common Spatial Patterns in Motor-Imagery BCI
Devlaminck, Dieter; Wyns, Bart; Grosse-Wentrup, Moritz; Otte, Georges; Santens, Patrick
2011-01-01
Motor-imagery-based brain-computer interfaces (BCIs) commonly use the common spatial pattern filter (CSP) as preprocessing step before feature extraction and classification. The CSP method is a supervised algorithm and therefore needs subject-specific training data for calibration, which is very time consuming to collect. In order to reduce the amount of calibration data that is needed for a new subject, one can apply multitask (from now on called multisubject) machine learning techniques to the preprocessing phase. Here, the goal of multisubject learning is to learn a spatial filter for a new subject based on its own data and that of other subjects. This paper outlines the details of the multitask CSP algorithm and shows results on two data sets. In certain subjects a clear improvement can be seen, especially when the number of training trials is relatively low. PMID:22007194
Dong, Yadong; Sun, Yongqi; Qin, Chao
2018-01-01
The existing protein complex detection methods can be broadly divided into two categories: unsupervised and supervised learning methods. Most of the unsupervised learning methods assume that protein complexes are in dense regions of protein-protein interaction (PPI) networks even though many true complexes are not dense subgraphs. Supervised learning methods utilize the informative properties of known complexes; they often extract features from existing complexes and then use the features to train a classification model. The trained model is used to guide the search process for new complexes. However, insufficient extracted features, noise in the PPI data and the incompleteness of complex data make the classification model imprecise. Consequently, the classification model is not sufficient for guiding the detection of complexes. Therefore, we propose a new robust score function that combines the classification model with local structural information. Based on the score function, we provide a search method that works both forwards and backwards. The results from experiments on six benchmark PPI datasets and three protein complex datasets show that our approach can achieve better performance compared with the state-of-the-art supervised, semi-supervised and unsupervised methods for protein complex detection, occasionally significantly outperforming such methods.
Automatic Earthquake Detection by Active Learning
NASA Astrophysics Data System (ADS)
Bergen, K.; Beroza, G. C.
2017-12-01
In recent years, advances in machine learning have transformed fields such as image recognition, natural language processing and recommender systems. Many of these performance gains have relied on the availability of large, labeled data sets to train high-accuracy models; labeled data sets are those for which each sample includes a target class label, such as waveforms tagged as either earthquakes or noise. Earthquake seismologists are increasingly leveraging machine learning and data mining techniques to detect and analyze weak earthquake signals in large seismic data sets. One of the challenges in applying machine learning to seismic data sets is the limited labeled data problem; learning algorithms need to be given examples of earthquake waveforms, but the number of known events, taken from earthquake catalogs, may be insufficient to build an accurate detector. Furthermore, earthquake catalogs are known to be incomplete, resulting in training data that may be biased towards larger events and contain inaccurate labels. This challenge is compounded by the class imbalance problem; the events of interest, earthquakes, are infrequent relative to noise in continuous data sets, and many learning algorithms perform poorly on rare classes. In this work, we investigate the use of active learning for automatic earthquake detection. Active learning is a type of semi-supervised machine learning that uses a human-in-the-loop approach to strategically supplement a small initial training set. The learning algorithm incorporates domain expertise through interaction between a human expert and the algorithm, with the algorithm actively posing queries to the user to improve detection performance. We demonstrate the potential of active machine learning to improve earthquake detection performance with limited available training data.
Deep learning for brain tumor classification
NASA Astrophysics Data System (ADS)
Paul, Justin S.; Plassard, Andrew J.; Landman, Bennett A.; Fabbri, Daniel
2017-03-01
Recent research has shown that deep learning methods have performed well on supervised machine learning, image classification tasks. The purpose of this study is to apply deep learning methods to classify brain images with different tumor types: meningioma, glioma, and pituitary. A dataset was publicly released containing 3,064 T1-weighted contrast enhanced MRI (CE-MRI) brain images from 233 patients with either meningioma, glioma, or pituitary tumors split across axial, coronal, or sagittal planes. This research focuses on the 989 axial images from 191 patients in order to avoid confusing the neural networks with three different planes containing the same diagnosis. Two types of neural networks were used in classification: fully connected and convolutional neural networks. Within these two categories, further tests were computed via the augmentation of the original 512×512 axial images. Training neural networks over the axial data has proven to be accurate in its classifications with an average five-fold cross validation of 91.43% on the best trained neural network. This result demonstrates that a more general method (i.e. deep learning) can outperform specialized methods that require image dilation and ring-forming subregions on tumors.
Improving orbit prediction accuracy through supervised machine learning
NASA Astrophysics Data System (ADS)
Peng, Hao; Bai, Xiaoli
2018-05-01
Due to the lack of information such as the space environment condition and resident space objects' (RSOs') body characteristics, current orbit predictions that are solely grounded on physics-based models may fail to achieve required accuracy for collision avoidance and have led to satellite collisions already. This paper presents a methodology to predict RSOs' trajectories with higher accuracy than that of the current methods. Inspired by the machine learning (ML) theory through which the models are learned based on large amounts of observed data and the prediction is conducted without explicitly modeling space objects and space environment, the proposed ML approach integrates physics-based orbit prediction algorithms with a learning-based process that focuses on reducing the prediction errors. Using a simulation-based space catalog environment as the test bed, the paper demonstrates three types of generalization capability for the proposed ML approach: (1) the ML model can be used to improve the same RSO's orbit information that is not available during the learning process but shares the same time interval as the training data; (2) the ML model can be used to improve predictions of the same RSO at future epochs; and (3) the ML model based on a RSO can be applied to other RSOs that share some common features.
Kaufhold, John P; Tsai, Philbert S; Blinder, Pablo; Kleinfeld, David
2012-08-01
A graph of tissue vasculature is an essential requirement to model the exchange of gasses and nutriments between the blood and cells in the brain. Such a graph is derived from a vectorized representation of anatomical data, provides a map of all vessels as vertices and segments, and may include the location of nonvascular components, such as neuronal and glial somata. Yet vectorized data sets typically contain erroneous gaps, spurious endpoints, and spuriously merged strands. Current methods to correct such defects only address the issue of connecting gaps and further require manual tuning of parameters in a high dimensional algorithm. To address these shortcomings, we introduce a supervised machine learning method that (1) connects vessel gaps by "learned threshold relaxation"; (2) removes spurious segments by "learning to eliminate deletion candidate strands"; and (3) enforces consistency in the joint space of learned vascular graph corrections through "consistency learning." Human operators are only required to label individual objects they recognize in a training set and are not burdened with tuning parameters. The supervised learning procedure examines the geometry and topology of features in the neighborhood of each vessel segment under consideration. We demonstrate the effectiveness of these methods on four sets of microvascular data, each with >800(3) voxels, obtained with all optical histology of mouse tissue and vectorization by state-of-the-art techniques in image segmentation. Through statistically validated sampling and analysis in terms of precision recall curves, we find that learning with bagged boosted decision trees reduces equal-error error rates for threshold relaxation by 5-21% and strand elimination performance by 18-57%. We benchmark generalization performance across datasets; while improvements vary between data sets, learning always leads to a useful reduction in error rates. Overall, learning is shown to more than halve the total error rate, and therefore, human time spent manually correcting such vectorizations. Copyright © 2012 Elsevier B.V. All rights reserved.
Kaufhold, John P.; Tsai, Philbert S.; Blinder, Pablo; Kleinfeld, David
2012-01-01
A graph of tissue vasculature is an essential requirement to model the exchange of gasses and nutriments between the blood and cells in the brain. Such a graph is derived from a vectorized representation of anatomical data, provides a map of all vessels as vertices and segments, and may include the location of nonvascular components, such as neuronal and glial somata. Yet vectorized data sets typically contain erroneous gaps, spurious endpoints, and spuriously merged strands. Current methods to correct such defects only address the issue of connecting gaps and further require manual tuning of parameters in a high dimensional algorithm. To address these shortcomings, we introduce a supervised machine learning method that (1) connects vessel gaps by “learned threshold relaxation”; (2) removes spurious segments by “learning to eliminate deletion candidate strands”; and (3) enforces consistency in the joint space of learned vascular graph corrections through “consistency learning.” Human operators are only required to label individual objects they recognize in a training set and are not burdened with tuning parameters. The supervised learning procedure examines the geometry and topology of features in the neighborhood of each vessel segment under consideration. We demonstrate the effectiveness of these methods on four sets of microvascular data, each with > 8003 voxels, obtained with all optical histology of mouse tissue and vectorization by state-of-the-art techniques in image segmentation. Through statistically validated sampling and analysis in terms of precision recall curves, we find that learning with bagged boosted decision trees reduces equal-error error rates for threshold relaxation by 5 to 21 % and strand elimination performance by 18 to 57 %. We benchmark generalization performance across datasets; while improvements vary between data sets, learning always leads to a useful reduction in error rates. Overall, learning is shown to more than halve the total error rate, and therefore, human time spent manually correcting such vectorizations. PMID:22854035
An Incremental Type-2 Meta-Cognitive Extreme Learning Machine.
Pratama, Mahardhika; Zhang, Guangquan; Er, Meng Joo; Anavatti, Sreenatha
2017-02-01
Existing extreme learning algorithm have not taken into account four issues: 1) complexity; 2) uncertainty; 3) concept drift; and 4) high dimensionality. A novel incremental type-2 meta-cognitive extreme learning machine (ELM) called evolving type-2 ELM (eT2ELM) is proposed to cope with the four issues in this paper. The eT2ELM presents three main pillars of human meta-cognition: 1) what-to-learn; 2) how-to-learn; and 3) when-to-learn. The what-to-learn component selects important training samples for model updates by virtue of the online certainty-based active learning method, which renders eT2ELM as a semi-supervised classifier. The how-to-learn element develops a synergy between extreme learning theory and the evolving concept, whereby the hidden nodes can be generated and pruned automatically from data streams with no tuning of hidden nodes. The when-to-learn constituent makes use of the standard sample reserved strategy. A generalized interval type-2 fuzzy neural network is also put forward as a cognitive component, in which a hidden node is built upon the interval type-2 multivariate Gaussian function while exploiting a subset of Chebyshev series in the output node. The efficacy of the proposed eT2ELM is numerically validated in 12 data streams containing various concept drifts. The numerical results are confirmed by thorough statistical tests, where the eT2ELM demonstrates the most encouraging numerical results in delivering reliable prediction, while sustaining low complexity.
Webly-Supervised Fine-Grained Visual Categorization via Deep Domain Adaptation.
Xu, Zhe; Huang, Shaoli; Zhang, Ya; Tao, Dacheng
2018-05-01
Learning visual representations from web data has recently attracted attention for object recognition. Previous studies have mainly focused on overcoming label noise and data bias and have shown promising results by learning directly from web data. However, we argue that it might be better to transfer knowledge from existing human labeling resources to improve performance at nearly no additional cost. In this paper, we propose a new semi-supervised method for learning via web data. Our method has the unique design of exploiting strong supervision, i.e., in addition to standard image-level labels, our method also utilizes detailed annotations including object bounding boxes and part landmarks. By transferring as much knowledge as possible from existing strongly supervised datasets to weakly supervised web images, our method can benefit from sophisticated object recognition algorithms and overcome several typical problems found in webly-supervised learning. We consider the problem of fine-grained visual categorization, in which existing training resources are scarce, as our main research objective. Comprehensive experimentation and extensive analysis demonstrate encouraging performance of the proposed approach, which, at the same time, delivers a new pipeline for fine-grained visual categorization that is likely to be highly effective for real-world applications.
Global Optimization Ensemble Model for Classification Methods
Anwar, Hina; Qamar, Usman; Muzaffar Qureshi, Abdul Wahab
2014-01-01
Supervised learning is the process of data mining for deducing rules from training datasets. A broad array of supervised learning algorithms exists, every one of them with its own advantages and drawbacks. There are some basic issues that affect the accuracy of classifier while solving a supervised learning problem, like bias-variance tradeoff, dimensionality of input space, and noise in the input data space. All these problems affect the accuracy of classifier and are the reason that there is no global optimal method for classification. There is not any generalized improvement method that can increase the accuracy of any classifier while addressing all the problems stated above. This paper proposes a global optimization ensemble model for classification methods (GMC) that can improve the overall accuracy for supervised learning problems. The experimental results on various public datasets showed that the proposed model improved the accuracy of the classification models from 1% to 30% depending upon the algorithm complexity. PMID:24883382
Garcia-Chimeno, Yolanda; Garcia-Zapirain, Begonya
2015-01-01
The classification of subjects' pathologies enables a rigorousness to be applied to the treatment of certain pathologies, as doctors on occasions play with so many variables that they can end up confusing some illnesses with others. Thanks to Machine Learning techniques applied to a health-record database, it is possible to make using our algorithm. hClass contains a non-linear classification of either a supervised, non-supervised or semi-supervised type. The machine is configured using other techniques such as validation of the set to be classified (cross-validation), reduction in features (PCA) and committees for assessing the various classifiers. The tool is easy to use, and the sample matrix and features that one wishes to classify, the number of iterations and the subjects who are going to be used to train the machine all need to be introduced as inputs. As a result, the success rate is shown either via a classifier or via a committee if one has been formed. A 90% success rate is obtained in the ADABoost classifier and 89.7% in the case of a committee (comprising three classifiers) when PCA is applied. This tool can be expanded to allow the user to totally characterise the classifiers by adjusting them to each classification use.
Methods of Sparse Modeling and Dimensionality Reduction to Deal with Big Data
2015-04-01
supervised learning (c). Our framework consists of two separate phases: (a) first find an initial space in an unsupervised manner; then (b) utilize label...model that can learn thousands of topics from a large set of documents and infer the topic mixture of each document, 2) a supervised dimension reduction...model that can learn thousands of topics from a large set of documents and infer the topic mixture of each document, (i) a method of supervised
Using Computational Text Classification for Qualitative Research and Evaluation in Extension
ERIC Educational Resources Information Center
Smith, Justin G.; Tissing, Reid
2018-01-01
This article introduces a process for computational text classification that can be used in a variety of qualitative research and evaluation settings. The process leverages supervised machine learning based on an implementation of a multinomial Bayesian classifier. Applied to a community of inquiry framework, the algorithm was used to identify…
Evaluation of Semi-supervised Learning for Classification of Protein Crystallization Imagery
Sigdel, Madhav; Dinç, İmren; Dinç, Semih; Sigdel, Madhu S.; Pusey, Marc L.; Aygün, Ramazan S.
2015-01-01
In this paper, we investigate the performance of two wrapper methods for semi-supervised learning algorithms for classification of protein crystallization images with limited labeled images. Firstly, we evaluate the performance of semi-supervised approach using self-training with naïve Bayesian (NB) and sequential minimum optimization (SMO) as the base classifiers. The confidence values returned by these classifiers are used to select high confident predictions to be used for self-training. Secondly, we analyze the performance of Yet Another Two Stage Idea (YATSI) semi-supervised learning using NB, SMO, multilayer perceptron (MLP), J48 and random forest (RF) classifiers. These results are compared with the basic supervised learning using the same training sets. We perform our experiments on a dataset consisting of 2250 protein crystallization images for different proportions of training and test data. Our results indicate that NB and SMO using both self-training and YATSI semi-supervised approaches improve accuracies with respect to supervised learning. On the other hand, MLP, J48 and RF perform better using basic supervised learning. Overall, random forest classifier yields the best accuracy with supervised learning for our dataset. PMID:25914518
Gharehbaghi, Arash; Linden, Maria
2017-10-12
This paper presents a novel method for learning the cyclic contents of stochastic time series: the deep time-growing neural network (DTGNN). The DTGNN combines supervised and unsupervised methods in different levels of learning for an enhanced performance. It is employed by a multiscale learning structure to classify cyclic time series (CTS), in which the dynamic contents of the time series are preserved in an efficient manner. This paper suggests a systematic procedure for finding the design parameter of the classification method for a one-versus-multiple class application. A novel validation method is also suggested for evaluating the structural risk, both in a quantitative and a qualitative manner. The effect of the DTGNN on the performance of the classifier is statistically validated through the repeated random subsampling using different sets of CTS, from different medical applications. The validation involves four medical databases, comprised of 108 recordings of the electroencephalogram signal, 90 recordings of the electromyogram signal, 130 recordings of the heart sound signal, and 50 recordings of the respiratory sound signal. Results of the statistical validations show that the DTGNN significantly improves the performance of the classification and also exhibits an optimal structural risk.
Confidence Preserving Machine for Facial Action Unit Detection
Zeng, Jiabei; Chu, Wen-Sheng; De la Torre, Fernando; Cohn, Jeffrey F.; Xiong, Zhang
2016-01-01
Facial action unit (AU) detection from video has been a long-standing problem in automated facial expression analysis. While progress has been made, accurate detection of facial AUs remains challenging due to ubiquitous sources of errors, such as inter-personal variability, pose, and low-intensity AUs. In this paper, we refer to samples causing such errors as hard samples, and the remaining as easy samples. To address learning with the hard samples, we propose the Confidence Preserving Machine (CPM), a novel two-stage learning framework that combines multiple classifiers following an “easy-to-hard” strategy. During the training stage, CPM learns two confident classifiers. Each classifier focuses on separating easy samples of one class from all else, and thus preserves confidence on predicting each class. During the testing stage, the confident classifiers provide “virtual labels” for easy test samples. Given the virtual labels, we propose a quasi-semi-supervised (QSS) learning strategy to learn a person-specific (PS) classifier. The QSS strategy employs a spatio-temporal smoothness that encourages similar predictions for samples within a spatio-temporal neighborhood. In addition, to further improve detection performance, we introduce two CPM extensions: iCPM that iteratively augments training samples to train the confident classifiers, and kCPM that kernelizes the original CPM model to promote nonlinearity. Experiments on four spontaneous datasets GFT [15], BP4D [56], DISFA [42], and RU-FACS [3] illustrate the benefits of the proposed CPM models over baseline methods and state-of-the-art semisupervised learning and transfer learning methods. PMID:27479964
Inverse analysis of turbidites by machine learning
NASA Astrophysics Data System (ADS)
Naruse, H.; Nakao, K.
2017-12-01
This study aims to propose a method to estimate paleo-hydraulic conditions of turbidity currents from ancient turbidites by using machine-learning technique. In this method, numerical simulation was repeated under various initial conditions, which produces a data set of characteristic features of turbidites. Then, this data set of turbidites is used for supervised training of a deep-learning neural network (NN). Quantities of characteristic features of turbidites in the training data set are given to input nodes of NN, and output nodes are expected to provide the estimates of initial condition of the turbidity current. The optimization of weight coefficients of NN is then conducted to reduce root-mean-square of the difference between the true conditions and the output values of NN. The empirical relationship with numerical results and the initial conditions is explored in this method, and the discovered relationship is used for inversion of turbidity currents. This machine learning can potentially produce NN that estimates paleo-hydraulic conditions from data of ancient turbidites. We produced a preliminary implementation of this methodology. A forward model based on 1D shallow-water equations with a correction of density-stratification effect was employed. This model calculates a behavior of a surge-like turbidity current transporting mixed-size sediment, and outputs spatial distribution of volume per unit area of each grain-size class on the uniform slope. Grain-size distribution was discretized 3 classes. Numerical simulation was repeated 1000 times, and thus 1000 beds of turbidites were used as the training data for NN that has 21000 input nodes and 5 output nodes with two hidden-layers. After the machine learning finished, independent simulations were conducted 200 times in order to evaluate the performance of NN. As a result of this test, the initial conditions of validation data were successfully reconstructed by NN. The estimated values show very small deviation from the true parameters. Comparing to previous inverse modeling of turbidity currents, our methodology is superior especially in the efficiency of computation. Also, our methodology has advantage in extensibility and applicability to various sediment transport processes such as pyroclastic flows or debris flows.
Pashaei, Elnaz; Ozen, Mustafa; Aydin, Nizamettin
2015-08-01
Improving accuracy of supervised classification algorithms in biomedical applications is one of active area of research. In this study, we improve the performance of Particle Swarm Optimization (PSO) combined with C4.5 decision tree (PSO+C4.5) classifier by applying Boosted C5.0 decision tree as the fitness function. To evaluate the effectiveness of our proposed method, it is implemented on 1 microarray dataset and 5 different medical data sets obtained from UCI machine learning databases. Moreover, the results of PSO + Boosted C5.0 implementation are compared to eight well-known benchmark classification methods (PSO+C4.5, support vector machine under the kernel of Radial Basis Function, Classification And Regression Tree (CART), C4.5 decision tree, C5.0 decision tree, Boosted C5.0 decision tree, Naive Bayes and Weighted K-Nearest neighbor). Repeated five-fold cross-validation method was used to justify the performance of classifiers. Experimental results show that our proposed method not only improve the performance of PSO+C4.5 but also obtains higher classification accuracy compared to the other classification methods.
Machine Learning Topological Invariants with Neural Networks
NASA Astrophysics Data System (ADS)
Zhang, Pengfei; Shen, Huitao; Zhai, Hui
2018-02-01
In this Letter we supervisedly train neural networks to distinguish different topological phases in the context of topological band insulators. After training with Hamiltonians of one-dimensional insulators with chiral symmetry, the neural network can predict their topological winding numbers with nearly 100% accuracy, even for Hamiltonians with larger winding numbers that are not included in the training data. These results show a remarkable success that the neural network can capture the global and nonlinear topological features of quantum phases from local inputs. By opening up the neural network, we confirm that the network does learn the discrete version of the winding number formula. We also make a couple of remarks regarding the role of the symmetry and the opposite effect of regularization techniques when applying machine learning to physical systems.
Greene, Casey S; Tan, Jie; Ung, Matthew; Moore, Jason H; Cheng, Chao
2014-12-01
Recent technological advances allow for high throughput profiling of biological systems in a cost-efficient manner. The low cost of data generation is leading us to the "big data" era. The availability of big data provides unprecedented opportunities but also raises new challenges for data mining and analysis. In this review, we introduce key concepts in the analysis of big data, including both "machine learning" algorithms as well as "unsupervised" and "supervised" examples of each. We note packages for the R programming language that are available to perform machine learning analyses. In addition to programming based solutions, we review webservers that allow users with limited or no programming background to perform these analyses on large data compendia. © 2014 Wiley Periodicals, Inc.
Dipnall, J F; Pasco, J A; Berk, M; Williams, L J; Dodd, S; Jacka, F N; Meyer, D
2017-01-01
Key lifestyle-environ risk factors are operative for depression, but it is unclear how risk factors cluster. Machine-learning (ML) algorithms exist that learn, extract, identify and map underlying patterns to identify groupings of depressed individuals without constraints. The aim of this research was to use a large epidemiological study to identify and characterise depression clusters through "Graphing lifestyle-environs using machine-learning methods" (GLUMM). Two ML algorithms were implemented: unsupervised Self-organised mapping (SOM) to create GLUMM clusters and a supervised boosted regression algorithm to describe clusters. Ninety-six "lifestyle-environ" variables were used from the National health and nutrition examination study (2009-2010). Multivariate logistic regression validated clusters and controlled for possible sociodemographic confounders. The SOM identified two GLUMM cluster solutions. These solutions contained one dominant depressed cluster (GLUMM5-1, GLUMM7-1). Equal proportions of members in each cluster rated as highly depressed (17%). Alcohol consumption and demographics validated clusters. Boosted regression identified GLUMM5-1 as more informative than GLUMM7-1. Members were more likely to: have problems sleeping; unhealthy eating; ≤2 years in their home; an old home; perceive themselves underweight; exposed to work fumes; experienced sex at ≤14 years; not perform moderate recreational activities. A positive relationship between GLUMM5-1 (OR: 7.50, P<0.001) and GLUMM7-1 (OR: 7.88, P<0.001) with depression was found, with significant interactions with those married/living with partner (P=0.001). Using ML based GLUMM to form ordered depressive clusters from multitudinous lifestyle-environ variables enabled a deeper exploration of the heterogeneous data to uncover better understandings into relationships between the complex mental health factors. Copyright © 2016 Elsevier Masson SAS. All rights reserved.
Fabelo, Himar; Ortega, Samuel; Ravi, Daniele; Kiran, B Ravi; Sosa, Coralia; Bulters, Diederik; Callicó, Gustavo M; Bulstrode, Harry; Szolna, Adam; Piñeiro, Juan F; Kabwama, Silvester; Madroñal, Daniel; Lazcano, Raquel; J-O'Shanahan, Aruma; Bisshopp, Sara; Hernández, María; Báez, Abelardo; Yang, Guang-Zhong; Stanciulescu, Bogdan; Salvador, Rubén; Juárez, Eduardo; Sarmiento, Roberto
2018-01-01
Surgery for brain cancer is a major problem in neurosurgery. The diffuse infiltration into the surrounding normal brain by these tumors makes their accurate identification by the naked eye difficult. Since surgery is the common treatment for brain cancer, an accurate radical resection of the tumor leads to improved survival rates for patients. However, the identification of the tumor boundaries during surgery is challenging. Hyperspectral imaging is a non-contact, non-ionizing and non-invasive technique suitable for medical diagnosis. This study presents the development of a novel classification method taking into account the spatial and spectral characteristics of the hyperspectral images to help neurosurgeons to accurately determine the tumor boundaries in surgical-time during the resection, avoiding excessive excision of normal tissue or unintentionally leaving residual tumor. The algorithm proposed in this study to approach an efficient solution consists of a hybrid framework that combines both supervised and unsupervised machine learning methods. Firstly, a supervised pixel-wise classification using a Support Vector Machine classifier is performed. The generated classification map is spatially homogenized using a one-band representation of the HS cube, employing the Fixed Reference t-Stochastic Neighbors Embedding dimensional reduction algorithm, and performing a K-Nearest Neighbors filtering. The information generated by the supervised stage is combined with a segmentation map obtained via unsupervised clustering employing a Hierarchical K-Means algorithm. The fusion is performed using a majority voting approach that associates each cluster with a certain class. To evaluate the proposed approach, five hyperspectral images of surface of the brain affected by glioblastoma tumor in vivo from five different patients have been used. The final classification maps obtained have been analyzed and validated by specialists. These preliminary results are promising, obtaining an accurate delineation of the tumor area.
Kabwama, Silvester; Madroñal, Daniel; Lazcano, Raquel; J-O’Shanahan, Aruma; Bisshopp, Sara; Hernández, María; Báez, Abelardo; Yang, Guang-Zhong; Stanciulescu, Bogdan; Salvador, Rubén; Juárez, Eduardo; Sarmiento, Roberto
2018-01-01
Surgery for brain cancer is a major problem in neurosurgery. The diffuse infiltration into the surrounding normal brain by these tumors makes their accurate identification by the naked eye difficult. Since surgery is the common treatment for brain cancer, an accurate radical resection of the tumor leads to improved survival rates for patients. However, the identification of the tumor boundaries during surgery is challenging. Hyperspectral imaging is a non-contact, non-ionizing and non-invasive technique suitable for medical diagnosis. This study presents the development of a novel classification method taking into account the spatial and spectral characteristics of the hyperspectral images to help neurosurgeons to accurately determine the tumor boundaries in surgical-time during the resection, avoiding excessive excision of normal tissue or unintentionally leaving residual tumor. The algorithm proposed in this study to approach an efficient solution consists of a hybrid framework that combines both supervised and unsupervised machine learning methods. Firstly, a supervised pixel-wise classification using a Support Vector Machine classifier is performed. The generated classification map is spatially homogenized using a one-band representation of the HS cube, employing the Fixed Reference t-Stochastic Neighbors Embedding dimensional reduction algorithm, and performing a K-Nearest Neighbors filtering. The information generated by the supervised stage is combined with a segmentation map obtained via unsupervised clustering employing a Hierarchical K-Means algorithm. The fusion is performed using a majority voting approach that associates each cluster with a certain class. To evaluate the proposed approach, five hyperspectral images of surface of the brain affected by glioblastoma tumor in vivo from five different patients have been used. The final classification maps obtained have been analyzed and validated by specialists. These preliminary results are promising, obtaining an accurate delineation of the tumor area. PMID:29554126
Mocanu, Decebal Constantin; Mocanu, Elena; Stone, Peter; Nguyen, Phuong H; Gibescu, Madeleine; Liotta, Antonio
2018-06-19
Through the success of deep learning in various domains, artificial neural networks are currently among the most used artificial intelligence methods. Taking inspiration from the network properties of biological neural networks (e.g. sparsity, scale-freeness), we argue that (contrary to general practice) artificial neural networks, too, should not have fully-connected layers. Here we propose sparse evolutionary training of artificial neural networks, an algorithm which evolves an initial sparse topology (Erdős-Rényi random graph) of two consecutive layers of neurons into a scale-free topology, during learning. Our method replaces artificial neural networks fully-connected layers with sparse ones before training, reducing quadratically the number of parameters, with no decrease in accuracy. We demonstrate our claims on restricted Boltzmann machines, multi-layer perceptrons, and convolutional neural networks for unsupervised and supervised learning on 15 datasets. Our approach has the potential to enable artificial neural networks to scale up beyond what is currently possible.
SKYNET: an efficient and robust neural network training tool for machine learning in astronomy
NASA Astrophysics Data System (ADS)
Graff, Philip; Feroz, Farhan; Hobson, Michael P.; Lasenby, Anthony
2014-06-01
We present the first public release of our generic neural network training algorithm, called SKYNET. This efficient and robust machine learning tool is able to train large and deep feed-forward neural networks, including autoencoders, for use in a wide range of supervised and unsupervised learning applications, such as regression, classification, density estimation, clustering and dimensionality reduction. SKYNET uses a `pre-training' method to obtain a set of network parameters that has empirically been shown to be close to a good solution, followed by further optimization using a regularized variant of Newton's method, where the level of regularization is determined and adjusted automatically; the latter uses second-order derivative information to improve convergence, but without the need to evaluate or store the full Hessian matrix, by using a fast approximate method to calculate Hessian-vector products. This combination of methods allows for the training of complicated networks that are difficult to optimize using standard backpropagation techniques. SKYNET employs convergence criteria that naturally prevent overfitting, and also includes a fast algorithm for estimating the accuracy of network outputs. The utility and flexibility of SKYNET are demonstrated by application to a number of toy problems, and to astronomical problems focusing on the recovery of structure from blurred and noisy images, the identification of gamma-ray bursters, and the compression and denoising of galaxy images. The SKYNET software, which is implemented in standard ANSI C and fully parallelized using MPI, is available at http://www.mrao.cam.ac.uk/software/skynet/.
CompareSVM: supervised, Support Vector Machine (SVM) inference of gene regularity networks.
Gillani, Zeeshan; Akash, Muhammad Sajid Hamid; Rahaman, M D Matiur; Chen, Ming
2014-11-30
Predication of gene regularity network (GRN) from expression data is a challenging task. There are many methods that have been developed to address this challenge ranging from supervised to unsupervised methods. Most promising methods are based on support vector machine (SVM). There is a need for comprehensive analysis on prediction accuracy of supervised method SVM using different kernels on different biological experimental conditions and network size. We developed a tool (CompareSVM) based on SVM to compare different kernel methods for inference of GRN. Using CompareSVM, we investigated and evaluated different SVM kernel methods on simulated datasets of microarray of different sizes in detail. The results obtained from CompareSVM showed that accuracy of inference method depends upon the nature of experimental condition and size of the network. For network with nodes (<200) and average (over all sizes of networks), SVM Gaussian kernel outperform on knockout, knockdown, and multifactorial datasets compared to all the other inference methods. For network with large number of nodes (~500), choice of inference method depend upon nature of experimental condition. CompareSVM is available at http://bis.zju.edu.cn/CompareSVM/ .
Ross, Elsie Gyang; Shah, Nigam H; Dalman, Ronald L; Nead, Kevin T; Cooke, John P; Leeper, Nicholas J
2016-11-01
A key aspect of the precision medicine effort is the development of informatics tools that can analyze and interpret "big data" sets in an automated and adaptive fashion while providing accurate and actionable clinical information. The aims of this study were to develop machine learning algorithms for the identification of disease and the prognostication of mortality risk and to determine whether such models perform better than classical statistical analyses. Focusing on peripheral artery disease (PAD), patient data were derived from a prospective, observational study of 1755 patients who presented for elective coronary angiography. We employed multiple supervised machine learning algorithms and used diverse clinical, demographic, imaging, and genomic information in a hypothesis-free manner to build models that could identify patients with PAD and predict future mortality. Comparison was made to standard stepwise linear regression models. Our machine-learned models outperformed stepwise logistic regression models both for the identification of patients with PAD (area under the curve, 0.87 vs 0.76, respectively; P = .03) and for the prediction of future mortality (area under the curve, 0.76 vs 0.65, respectively; P = .10). Both machine-learned models were markedly better calibrated than the stepwise logistic regression models, thus providing more accurate disease and mortality risk estimates. Machine learning approaches can produce more accurate disease classification and prediction models. These tools may prove clinically useful for the automated identification of patients with highly morbid diseases for which aggressive risk factor management can improve outcomes. Copyright © 2016 Society for Vascular Surgery. Published by Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Jamal, Wasifa; Das, Saptarshi; Oprescu, Ioana-Anastasia; Maharatna, Koushik; Apicella, Fabio; Sicca, Federico
2014-08-01
Objective. The paper investigates the presence of autism using the functional brain connectivity measures derived from electro-encephalogram (EEG) of children during face perception tasks. Approach. Phase synchronized patterns from 128-channel EEG signals are obtained for typical children and children with autism spectrum disorder (ASD). The phase synchronized states or synchrostates temporally switch amongst themselves as an underlying process for the completion of a particular cognitive task. We used 12 subjects in each group (ASD and typical) for analyzing their EEG while processing fearful, happy and neutral faces. The minimal and maximally occurring synchrostates for each subject are chosen for extraction of brain connectivity features, which are used for classification between these two groups of subjects. Among different supervised learning techniques, we here explored the discriminant analysis and support vector machine both with polynomial kernels for the classification task. Main results. The leave one out cross-validation of the classification algorithm gives 94.7% accuracy as the best performance with corresponding sensitivity and specificity values as 85.7% and 100% respectively. Significance. The proposed method gives high classification accuracies and outperforms other contemporary research results. The effectiveness of the proposed method for classification of autistic and typical children suggests the possibility of using it on a larger population to validate it for clinical practice.
On-line Gibbs learning. II. Application to perceptron and multilayer networks
NASA Astrophysics Data System (ADS)
Kim, J. W.; Sompolinsky, H.
1998-08-01
In the preceding paper (``On-line Gibbs Learning. I. General Theory'') we have presented the on-line Gibbs algorithm (OLGA) and studied analytically its asymptotic convergence. In this paper we apply OLGA to on-line supervised learning in several network architectures: a single-layer perceptron, two-layer committee machine, and a winner-takes-all (WTA) classifier. The behavior of OLGA for a single-layer perceptron is studied both analytically and numerically for a variety of rules: a realizable perceptron rule, a perceptron rule corrupted by output and input noise, and a rule generated by a committee machine. The two-layer committee machine is studied numerically for the cases of learning a realizable rule as well as a rule that is corrupted by output noise. The WTA network is studied numerically for the case of a realizable rule. The asymptotic results reported in this paper agree with the predictions of the general theory of OLGA presented in paper I. In all the studied cases, OLGA converges to a set of weights that minimizes the generalization error. When the learning rate is chosen as a power law with an optimal power, OLGA converges with a power law that is the same as that of batch learning.
Development of machine learning models to predict inhibition of 3-dehydroquinate dehydratase.
de Ávila, Maurício Boff; de Azevedo, Walter Filgueira
2018-04-20
In this study, we describe the development of new machine learning models to predict inhibition of the enzyme 3-dehydroquinate dehydratase (DHQD). This enzyme is the third step of the shikimate pathway and is responsible for the synthesis of chorismate, which is a natural precursor of aromatic amino acids. The enzymes of shikimate pathway are absent in humans, which make them protein targets for the design of antimicrobial drugs. We focus our study on the crystallographic structures of DHQD in complex with competitive inhibitors, for which experimental inhibition constant data is available. Application of supervised machine learning techniques was able to elaborate a robust DHQD-targeted model to predict binding affinity. Combination of high-resolution crystallographic structures and binding information indicates that the prevalence of intermolecular electrostatic interactions between DHQD and competitive inhibitors is of pivotal importance for the binding affinity against this enzyme. The present findings can be used to speed up virtual screening studies focused on the DHQD structure. © 2018 John Wiley & Sons A/S.
Meng, Jun; Shi, Lin; Luan, Yushi
2014-01-01
Background Confident identification of microRNA-target interactions is significant for studying the function of microRNA (miRNA). Although some computational miRNA target prediction methods have been proposed for plants, results of various methods tend to be inconsistent and usually lead to more false positive. To address these issues, we developed an integrated model for identifying plant miRNA–target interactions. Results Three online miRNA target prediction toolkits and machine learning algorithms were integrated to identify and analyze Arabidopsis thaliana miRNA-target interactions. Principle component analysis (PCA) feature extraction and self-training technology were introduced to improve the performance. Results showed that the proposed model outperformed the previously existing methods. The results were validated by using degradome sequencing supported Arabidopsis thaliana miRNA-target interactions. The proposed model constructed on Arabidopsis thaliana was run over Oryza sativa and Vitis vinifera to demonstrate that our model is effective for other plant species. Conclusions The integrated model of online predictors and local PCA-SVM classifier gained credible and high quality miRNA-target interactions. The supervised learning algorithm of PCA-SVM classifier was employed in plant miRNA target identification for the first time. Its performance can be substantially improved if more experimentally proved training samples are provided. PMID:25051153
NASA Astrophysics Data System (ADS)
Piretzidis, Dimitrios; Sra, Gurveer; Karantaidis, George; Sideris, Michael G.
2017-04-01
A new method for identifying correlated errors in Gravity Recovery and Climate Experiment (GRACE) monthly harmonic coefficients has been developed and tested. Correlated errors are present in the differences between monthly GRACE solutions, and can be suppressed using a de-correlation filter. In principle, the de-correlation filter should be implemented only on coefficient series with correlated errors to avoid losing useful geophysical information. In previous studies, two main methods of implementing the de-correlation filter have been utilized. In the first one, the de-correlation filter is implemented starting from a specific minimum order until the maximum order of the monthly solution examined. In the second one, the de-correlation filter is implemented only on specific coefficient series, the selection of which is based on statistical testing. The method proposed in the present study exploits the capabilities of supervised machine learning algorithms such as neural networks and support vector machines (SVMs). The pattern of correlated errors can be described by several numerical and geometric features of the harmonic coefficient series. The features of extreme cases of both correlated and uncorrelated coefficients are extracted and used for the training of the machine learning algorithms. The trained machine learning algorithms are later used to identify correlated errors and provide the probability of a coefficient series to be correlated. Regarding SVMs algorithms, an extensive study is performed with various kernel functions in order to find the optimal training model for prediction. The selection of the optimal training model is based on the classification accuracy of the trained SVM algorithm on the same samples used for training. Results show excellent performance of all algorithms with a classification accuracy of 97% - 100% on a pre-selected set of training samples, both in the validation stage of the training procedure and in the subsequent use of the trained algorithms to classify independent coefficients. This accuracy is also confirmed by the external validation of the trained algorithms using the hydrology model GLDAS NOAH. The proposed method meet the requirement of identifying and de-correlating only coefficients with correlated errors. Also, there is no need of applying statistical testing or other techniques that require prior de-correlation of the harmonic coefficients.
Blood detection in wireless capsule endoscope images based on salient superpixels.
Iakovidis, Dimitris K; Chatzis, Dimitris; Chrysanthopoulos, Panos; Koulaouzidis, Anastasios
2015-08-01
Wireless capsule endoscopy (WCE) enables screening of the gastrointestinal (GI) tract with a miniature, optical endoscope packed within a small swallowable capsule, wirelessly transmitting color images. In this paper we propose a novel method for automatic blood detection in contemporary WCE images. Blood is an alarming indication for the presence of pathologies requiring further treatment. The proposed method is based on a new definition of superpixel saliency. The saliency of superpixels is assessed upon their color, enabling the identification of image regions that are likely to contain blood. The blood patterns are recognized by their color features using a supervised learning machine. Experiments performed on a public dataset using automatically selected first-order statistical features from various color components indicate that the proposed method outperforms state-of-the-art methods.
Competitive Deep-Belief Networks for Underwater Acoustic Target Recognition
Shen, Sheng; Yao, Xiaohui; Sheng, Meiping; Wang, Chen
2018-01-01
Underwater acoustic target recognition based on ship-radiated noise belongs to the small-sample-size recognition problems. A competitive deep-belief network is proposed to learn features with more discriminative information from labeled and unlabeled samples. The proposed model consists of four stages: (1) A standard restricted Boltzmann machine is pretrained using a large number of unlabeled data to initialize its parameters; (2) the hidden units are grouped according to categories, which provides an initial clustering model for competitive learning; (3) competitive training and back-propagation algorithms are used to update the parameters to accomplish the task of clustering; (4) by applying layer-wise training and supervised fine-tuning, a deep neural network is built to obtain features. Experimental results show that the proposed method can achieve classification accuracy of 90.89%, which is 8.95% higher than the accuracy obtained by the compared methods. In addition, the highest accuracy of our method is obtained with fewer features than other methods. PMID:29570642
The Practice of Supervision for Professional Learning: The Example of Future Forensic Specialists
ERIC Educational Resources Information Center
Köpsén, Susanne; Nyström, Sofia
2015-01-01
Supervision intended to support learning is of great interest in professional knowledge development. No single definition governs the implementation and enactment of supervision because of different conditions, intentions, and pedagogical approaches. Uncertainty exists at a time when knowledge and methods are undergoing constant development. This…
Semi-supervised learning for photometric supernova classification
NASA Astrophysics Data System (ADS)
Richards, Joseph W.; Homrighausen, Darren; Freeman, Peter E.; Schafer, Chad M.; Poznanski, Dovi
2012-01-01
We present a semi-supervised method for photometric supernova typing. Our approach is to first use the non-linear dimension reduction technique diffusion map to detect structure in a data base of supernova light curves and subsequently employ random forest classification on a spectroscopically confirmed training set to learn a model that can predict the type of each newly observed supernova. We demonstrate that this is an effective method for supernova typing. As supernova numbers increase, our semi-supervised method efficiently utilizes this information to improve classification, a property not enjoyed by template-based methods. Applied to supernova data simulated by Kessler et al. to mimic those of the Dark Energy Survey, our methods achieve (cross-validated) 95 per cent Type Ia purity and 87 per cent Type Ia efficiency on the spectroscopic sample, but only 50 per cent Type Ia purity and 50 per cent efficiency on the photometric sample due to their spectroscopic follow-up strategy. To improve the performance on the photometric sample, we search for better spectroscopic follow-up procedures by studying the sensitivity of our machine-learned supernova classification on the specific strategy used to obtain training sets. With a fixed amount of spectroscopic follow-up time, we find that, despite collecting data on a smaller number of supernovae, deeper magnitude-limited spectroscopic surveys are better for producing training sets. For supernova Ia (II-P) typing, we obtain a 44 per cent (1 per cent) increase in purity to 72 per cent (87 per cent) and 30 per cent (162 per cent) increase in efficiency to 65 per cent (84 per cent) of the sample using a 25th (24.5th) magnitude-limited survey instead of the shallower spectroscopic sample used in the original simulations. When redshift information is available, we incorporate it into our analysis using a novel method of altering the diffusion map representation of the supernovae. Incorporating host redshifts leads to a 5 per cent improvement in Type Ia purity and 13 per cent improvement in Type Ia efficiency. A web service for the supernova classification method used in this paper can be found at .
Automatic Classification of Time-variable X-Ray Sources
NASA Astrophysics Data System (ADS)
Lo, Kitty K.; Farrell, Sean; Murphy, Tara; Gaensler, B. M.
2014-05-01
To maximize the discovery potential of future synoptic surveys, especially in the field of transient science, it will be necessary to use automatic classification to identify some of the astronomical sources. The data mining technique of supervised classification is suitable for this problem. Here, we present a supervised learning method to automatically classify variable X-ray sources in the Second XMM-Newton Serendipitous Source Catalog (2XMMi-DR2). Random Forest is our classifier of choice since it is one of the most accurate learning algorithms available. Our training set consists of 873 variable sources and their features are derived from time series, spectra, and other multi-wavelength contextual information. The 10 fold cross validation accuracy of the training data is ~97% on a 7 class data set. We applied the trained classification model to 411 unknown variable 2XMM sources to produce a probabilistically classified catalog. Using the classification margin and the Random Forest derived outlier measure, we identified 12 anomalous sources, of which 2XMM J180658.7-500250 appears to be the most unusual source in the sample. Its X-ray spectra is suggestive of a ultraluminous X-ray source but its variability makes it highly unusual. Machine-learned classification and anomaly detection will facilitate scientific discoveries in the era of all-sky surveys.
Semi-supervised tracking of extreme weather events in global spatio-temporal climate datasets
NASA Astrophysics Data System (ADS)
Kim, S. K.; Prabhat, M.; Williams, D. N.
2017-12-01
Deep neural networks have been successfully applied to solve problem to detect extreme weather events in large scale climate datasets and attend superior performance that overshadows all previous hand-crafted methods. Recent work has shown that multichannel spatiotemporal encoder-decoder CNN architecture is able to localize events in semi-supervised bounding box. Motivated by this work, we propose new learning metric based on Variational Auto-Encoders (VAE) and Long-Short-Term-Memory (LSTM) to track extreme weather events in spatio-temporal dataset. We consider spatio-temporal object tracking problems as learning probabilistic distribution of continuous latent features of auto-encoder using stochastic variational inference. For this, we assume that our datasets are i.i.d and latent features is able to be modeled by Gaussian distribution. In proposed metric, we first train VAE to generate approximate posterior given multichannel climate input with an extreme climate event at fixed time. Then, we predict bounding box, location and class of extreme climate events using convolutional layers given input concatenating three features including embedding, sampled mean and standard deviation. Lastly, we train LSTM with concatenated input to learn timely information of dataset by recurrently feeding output back to next time-step's input of VAE. Our contribution is two-fold. First, we show the first semi-supervised end-to-end architecture based on VAE to track extreme weather events which can apply to massive scaled unlabeled climate datasets. Second, the information of timely movement of events is considered for bounding box prediction using LSTM which can improve accuracy of localization. To our knowledge, this technique has not been explored neither in climate community or in Machine Learning community.
Active Learning Strategies for Phenotypic Profiling of High-Content Screens.
Smith, Kevin; Horvath, Peter
2014-06-01
High-content screening is a powerful method to discover new drugs and carry out basic biological research. Increasingly, high-content screens have come to rely on supervised machine learning (SML) to perform automatic phenotypic classification as an essential step of the analysis. However, this comes at a cost, namely, the labeled examples required to train the predictive model. Classification performance increases with the number of labeled examples, and because labeling examples demands time from an expert, the training process represents a significant time investment. Active learning strategies attempt to overcome this bottleneck by presenting the most relevant examples to the annotator, thereby achieving high accuracy while minimizing the cost of obtaining labeled data. In this article, we investigate the impact of active learning on single-cell-based phenotype recognition, using data from three large-scale RNA interference high-content screens representing diverse phenotypic profiling problems. We consider several combinations of active learning strategies and popular SML methods. Our results show that active learning significantly reduces the time cost and can be used to reveal the same phenotypic targets identified using SML. We also identify combinations of active learning strategies and SML methods which perform better than others on the phenotypic profiling problems we studied. © 2014 Society for Laboratory Automation and Screening.
NASA Astrophysics Data System (ADS)
Mahvash Mohammadi, Neda; Hezarkhani, Ardeshir
2018-07-01
Classification of mineralised zones is an important factor for the analysis of economic deposits. In this paper, the support vector machine (SVM), a supervised learning algorithm, based on subsurface data is proposed for classification of mineralised zones in the Takht-e-Gonbad porphyry Cu-deposit (SE Iran). The effects of the input features are evaluated via calculating the accuracy rates on the SVM performance. Ultimately, the SVM model, is developed based on input features namely lithology, alteration, mineralisation, the level and, radial basis function (RBF) as a kernel function. Moreover, the optimal amount of parameters λ and C, using n-fold cross-validation method, are calculated at level 0.001 and 0.01 respectively. The accuracy of this model is 0.931 for classification of mineralised zones in the Takht-e-Gonbad porphyry deposit. The results of the study confirm the efficiency of SVM method for classification the mineralised zones.
Lu, Ake Tzu-Hui; Austin, Erin; Bonner, Ashley; Huang, Hsin-Hsiung; Cantor, Rita M
2014-09-01
Machine learning methods (MLMs), designed to develop models using high-dimensional predictors, have been used to analyze genome-wide genetic and genomic data to predict risks for complex traits. We summarize the results from six contributions to our Genetic Analysis Workshop 18 working group; these investigators applied MLMs and data mining to analyses of rare and common genetic variants measured in pedigrees. To develop risk profiles, group members analyzed blood pressure traits along with single-nucleotide polymorphisms and rare variant genotypes derived from sequence and imputation analyses in large Mexican American pedigrees. Supervised MLMs included penalized regression with varying penalties, support vector machines, and permanental classification. Unsupervised MLMs included sparse principal components analysis and sparse graphical models. Entropy-based components analyses were also used to mine these data. None of the investigators fully capitalized on the genetic information provided by the complete pedigrees. Their approaches either corrected for the nonindependence of the individuals within the pedigrees or analyzed only those who were independent. Some methods allowed for covariate adjustment, whereas others did not. We evaluated these methods using a variety of metrics. Four contributors conducted primary analyses on the real data, and the other two research groups used the simulated data with and without knowledge of the underlying simulation model. One group used the answers to the simulated data to assess power and type I errors. Although the MLMs applied were substantially different, each research group concluded that MLMs have advantages over standard statistical approaches with these high-dimensional data. © 2014 WILEY PERIODICALS, INC.
Fusion And Inference From Multiple And Massive Disparate Distributed Dynamic Data Sets
2017-07-01
principled methodology for two-sample graph testing; designed a provably almost-surely perfect vertex clustering algorithm for block model graphs; proved...3.7 Semi-Supervised Clustering Methodology ...................................................................... 9 3.8 Robust Hypothesis Testing...dimensional Euclidean space – allows the full arsenal of statistical and machine learning methodology for multivariate Euclidean data to be deployed for
Mining protein function from text using term-based support vector machines
Rice, Simon B; Nenadic, Goran; Stapley, Benjamin J
2005-01-01
Background Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. PMID:15960835
Identifying residential neighbourhood types from settlement points in a machine learning approach.
Jochem, Warren C; Bird, Tomas J; Tatem, Andrew J
2018-05-01
Remote sensing techniques are now commonly applied to map and monitor urban land uses to measure growth and to assist with development and planning. Recent work in this area has highlighted the use of textures and other spatial features that can be measured in very high spatial resolution imagery. Far less attention has been given to using geospatial vector data (i.e. points, lines, polygons) to map land uses. This paper presents an approach to distinguish residential settlement types (regular vs. irregular) using an existing database of settlement points locating structures. Nine data features describing the density, distance, angles, and spacing of the settlement points are calculated at multiple spatial scales. These data are analysed alone and with five common remote sensing measures on elevation, slope, vegetation, and nighttime lights in a supervised machine learning approach to classify land use areas. The method was tested in seven provinces of Afghanistan (Balkh, Helmand, Herat, Kabul, Kandahar, Kunduz, Nangarhar). Overall accuracy ranged from 78% in Kandahar to 90% in Nangarhar. This research demonstrates the potential to accurately map land uses from even the simplest representation of structures.
Malay sentiment analysis based on combined classification approaches and Senti-lexicon algorithm.
Al-Saffar, Ahmed; Awang, Suryanti; Tao, Hai; Omar, Nazlia; Al-Saiagh, Wafaa; Al-Bared, Mohammed
2018-01-01
Sentiment analysis techniques are increasingly exploited to categorize the opinion text to one or more predefined sentiment classes for the creation and automated maintenance of review-aggregation websites. In this paper, a Malay sentiment analysis classification model is proposed to improve classification performances based on the semantic orientation and machine learning approaches. First, a total of 2,478 Malay sentiment-lexicon phrases and words are assigned with a synonym and stored with the help of more than one Malay native speaker, and the polarity is manually allotted with a score. In addition, the supervised machine learning approaches and lexicon knowledge method are combined for Malay sentiment classification with evaluating thirteen features. Finally, three individual classifiers and a combined classifier are used to evaluate the classification accuracy. In experimental results, a wide-range of comparative experiments is conducted on a Malay Reviews Corpus (MRC), and it demonstrates that the feature extraction improves the performance of Malay sentiment analysis based on the combined classification. However, the results depend on three factors, the features, the number of features and the classification approach.
Malay sentiment analysis based on combined classification approaches and Senti-lexicon algorithm
Awang, Suryanti; Tao, Hai; Omar, Nazlia; Al-Saiagh, Wafaa; Al-bared, Mohammed
2018-01-01
Sentiment analysis techniques are increasingly exploited to categorize the opinion text to one or more predefined sentiment classes for the creation and automated maintenance of review-aggregation websites. In this paper, a Malay sentiment analysis classification model is proposed to improve classification performances based on the semantic orientation and machine learning approaches. First, a total of 2,478 Malay sentiment-lexicon phrases and words are assigned with a synonym and stored with the help of more than one Malay native speaker, and the polarity is manually allotted with a score. In addition, the supervised machine learning approaches and lexicon knowledge method are combined for Malay sentiment classification with evaluating thirteen features. Finally, three individual classifiers and a combined classifier are used to evaluate the classification accuracy. In experimental results, a wide-range of comparative experiments is conducted on a Malay Reviews Corpus (MRC), and it demonstrates that the feature extraction improves the performance of Malay sentiment analysis based on the combined classification. However, the results depend on three factors, the features, the number of features and the classification approach. PMID:29684036
Exploiting the potential of unlabeled endoscopic video data with self-supervised learning.
Ross, Tobias; Zimmerer, David; Vemuri, Anant; Isensee, Fabian; Wiesenfarth, Manuel; Bodenstedt, Sebastian; Both, Fabian; Kessler, Philip; Wagner, Martin; Müller, Beat; Kenngott, Hannes; Speidel, Stefanie; Kopp-Schneider, Annette; Maier-Hein, Klaus; Maier-Hein, Lena
2018-06-01
Surgical data science is a new research field that aims to observe all aspects of the patient treatment process in order to provide the right assistance at the right time. Due to the breakthrough successes of deep learning-based solutions for automatic image annotation, the availability of reference annotations for algorithm training is becoming a major bottleneck in the field. The purpose of this paper was to investigate the concept of self-supervised learning to address this issue. Our approach is guided by the hypothesis that unlabeled video data can be used to learn a representation of the target domain that boosts the performance of state-of-the-art machine learning algorithms when used for pre-training. Core of the method is an auxiliary task based on raw endoscopic video data of the target domain that is used to initialize the convolutional neural network (CNN) for the target task. In this paper, we propose the re-colorization of medical images with a conditional generative adversarial network (cGAN)-based architecture as auxiliary task. A variant of the method involves a second pre-training step based on labeled data for the target task from a related domain. We validate both variants using medical instrument segmentation as target task. The proposed approach can be used to radically reduce the manual annotation effort involved in training CNNs. Compared to the baseline approach of generating annotated data from scratch, our method decreases exploratively the number of labeled images by up to 75% without sacrificing performance. Our method also outperforms alternative methods for CNN pre-training, such as pre-training on publicly available non-medical (COCO) or medical data (MICCAI EndoVis2017 challenge) using the target task (in this instance: segmentation). As it makes efficient use of available (non-)public and (un-)labeled data, the approach has the potential to become a valuable tool for CNN (pre-)training.
Supervised Learning for Dynamical System Learning.
Hefny, Ahmed; Downey, Carlton; Gordon, Geoffrey J
2015-01-01
Recently there has been substantial interest in spectral methods for learning dynamical systems. These methods are popular since they often offer a good tradeoff between computational and statistical efficiency. Unfortunately, they can be difficult to use and extend in practice: e.g., they can make it difficult to incorporate prior information such as sparsity or structure. To address this problem, we present a new view of dynamical system learning: we show how to learn dynamical systems by solving a sequence of ordinary supervised learning problems, thereby allowing users to incorporate prior knowledge via standard techniques such as L 1 regularization. Many existing spectral methods are special cases of this new framework, using linear regression as the supervised learner. We demonstrate the effectiveness of our framework by showing examples where nonlinear regression or lasso let us learn better state representations than plain linear regression does; the correctness of these instances follows directly from our general analysis.
NASA Astrophysics Data System (ADS)
Othman, Arsalan A.; Gloaguen, Richard
2017-09-01
Lithological mapping in mountainous regions is often impeded by limited accessibility due to relief. This study aims to evaluate (1) the performance of different supervised classification approaches using remote sensing data and (2) the use of additional information such as geomorphology. We exemplify the methodology in the Bardi-Zard area in NE Iraq, a part of the Zagros Fold - Thrust Belt, known for its chromite deposits. We highlighted the improvement of remote sensing geological classification by integrating geomorphic features and spatial information in the classification scheme. We performed a Maximum Likelihood (ML) classification method besides two Machine Learning Algorithms (MLA): Support Vector Machine (SVM) and Random Forest (RF) to allow the joint use of geomorphic features, Band Ratio (BR), Principal Component Analysis (PCA), spatial information (spatial coordinates) and multispectral data of the Advanced Space-borne Thermal Emission and Reflection radiometer (ASTER) satellite. The RF algorithm showed reliable results and discriminated serpentinite, talus and terrace deposits, red argillites with conglomerates and limestone, limy conglomerates and limestone conglomerates, tuffites interbedded with basic lavas, limestone and Metamorphosed limestone and reddish green shales. The best overall accuracy (∼80%) was achieved by Random Forest (RF) algorithms in the majority of the sixteen tested combination datasets.
Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies.
Mitra, Vikramjit; Nam, Hosung; Espy-Wilson, Carol Y; Saltzman, Elliot; Goldstein, Louis
2010-09-13
Many different studies have claimed that articulatory information can be used to improve the performance of automatic speech recognition systems. Unfortunately, such articulatory information is not readily available in typical speaker-listener situations. Consequently, such information has to be estimated from the acoustic signal in a process which is usually termed "speech-inversion." This study aims to propose and compare various machine learning strategies for speech inversion: Trajectory mixture density networks (TMDNs), feedforward artificial neural networks (FF-ANN), support vector regression (SVR), autoregressive artificial neural network (AR-ANN), and distal supervised learning (DSL). Further, using a database generated by the Haskins Laboratories speech production model, we test the claim that information regarding constrictions produced by the distinct organs of the vocal tract (vocal tract variables) is superior to flesh-point information (articulatory pellet trajectories) for the inversion process.
NASA Astrophysics Data System (ADS)
Gómez Puente, S. M.; van Eijck, M.; Jochems, W.
2013-11-01
Background: In research on design-based learning (DBL), inadequate attention is paid to the role the teacher plays in supervising students in gathering and applying knowledge to design artifacts, systems, and innovative solutions in higher education. Purpose: In this study, we examine whether teacher actions we previously identified in the DBL literature as important in facilitating learning processes and student supervision are present in current DBL engineering practices. Sample: The sample (N=16) consisted of teachers and supervisors in two engineering study programs at a university of technology: mechanical and electrical engineering. We selected randomly teachers from freshman and second-year bachelor DBL projects responsible for student supervision and assessment. Design and method: Interviews with teachers, and interviews and observations of supervisors were used to examine how supervision and facilitation actions are applied according to the DBL framework. Results: Major findings indicate that formulating questions is the most common practice seen in facilitating learning in open-ended engineering design environments. Furthermore, other DBL actions we expected to see based upon the literature were seldom observed in the coaching practices within these two programs. Conclusions: Professionalization of teachers in supervising students need to include methods to scaffold learning by supporting students in reflecting and in providing formative feedback.
MRI texture features as biomarkers to predict MGMT methylation status in glioblastomas
DOE Office of Scientific and Technical Information (OSTI.GOV)
Korfiatis, Panagiotis; Kline, Timothy L.; Erickson, Bradley J., E-mail: bje@mayo.edu
Purpose: Imaging biomarker research focuses on discovering relationships between radiological features and histological findings. In glioblastoma patients, methylation of the O{sup 6}-methylguanine methyltransferase (MGMT) gene promoter is positively correlated with an increased effectiveness of current standard of care. In this paper, the authors investigate texture features as potential imaging biomarkers for capturing the MGMT methylation status of glioblastoma multiforme (GBM) tumors when combined with supervised classification schemes. Methods: A retrospective study of 155 GBM patients with known MGMT methylation status was conducted. Co-occurrence and run length texture features were calculated, and both support vector machines (SVMs) and random forest classifiersmore » were used to predict MGMT methylation status. Results: The best classification system (an SVM-based classifier) had a maximum area under the receiver-operating characteristic (ROC) curve of 0.85 (95% CI: 0.78–0.91) using four texture features (correlation, energy, entropy, and local intensity) originating from the T2-weighted images, yielding at the optimal threshold of the ROC curve, a sensitivity of 0.803 and a specificity of 0.813. Conclusions: Results show that supervised machine learning of MRI texture features can predict MGMT methylation status in preoperative GBM tumors, thus providing a new noninvasive imaging biomarker.« less
Identifying well-formed biomedical phrases in MEDLINE® text.
Kim, Won; Yeganova, Lana; Comeau, Donald C; Wilbur, W John
2012-12-01
In the modern world people frequently interact with retrieval systems to satisfy their information needs. Humanly understandable well-formed phrases represent a crucial interface between humans and the web, and the ability to index and search with such phrases is beneficial for human-web interactions. In this paper we consider the problem of identifying humanly understandable, well formed, and high quality biomedical phrases in MEDLINE documents. The main approaches used previously for detecting such phrases are syntactic, statistical, and a hybrid approach combining these two. In this paper we propose a supervised learning approach for identifying high quality phrases. First we obtain a set of known well-formed useful phrases from an existing source and label these phrases as positive. We then extract from MEDLINE a large set of multiword strings that do not contain stop words or punctuation. We believe this unlabeled set contains many well-formed phrases. Our goal is to identify these additional high quality phrases. We examine various feature combinations and several machine learning strategies designed to solve this problem. A proper choice of machine learning methods and features identifies in the large collection strings that are likely to be high quality phrases. We evaluate our approach by making human judgments on multiword strings extracted from MEDLINE using our methods. We find that over 85% of such extracted phrase candidates are humanly judged to be of high quality. Published by Elsevier Inc.
CMU DeepLens: deep learning for automatic image-based galaxy-galaxy strong lens finding
NASA Astrophysics Data System (ADS)
Lanusse, François; Ma, Quanbin; Li, Nan; Collett, Thomas E.; Li, Chun-Liang; Ravanbakhsh, Siamak; Mandelbaum, Rachel; Póczos, Barnabás
2018-01-01
Galaxy-scale strong gravitational lensing can not only provide a valuable probe of the dark matter distribution of massive galaxies, but also provide valuable cosmological constraints, either by studying the population of strong lenses or by measuring time delays in lensed quasars. Due to the rarity of galaxy-scale strongly lensed systems, fast and reliable automated lens finding methods will be essential in the era of large surveys such as Large Synoptic Survey Telescope, Euclid and Wide-Field Infrared Survey Telescope. To tackle this challenge, we introduce CMU DeepLens, a new fully automated galaxy-galaxy lens finding method based on deep learning. This supervised machine learning approach does not require any tuning after the training step which only requires realistic image simulations of strongly lensed systems. We train and validate our model on a set of 20 000 LSST-like mock observations including a range of lensed systems of various sizes and signal-to-noise ratios (S/N). We find on our simulated data set that for a rejection rate of non-lenses of 99 per cent, a completeness of 90 per cent can be achieved for lenses with Einstein radii larger than 1.4 arcsec and S/N larger than 20 on individual g-band LSST exposures. Finally, we emphasize the importance of realistically complex simulations for training such machine learning methods by demonstrating that the performance of models of significantly different complexities cannot be distinguished on simpler simulations. We make our code publicly available at https://github.com/McWilliamsCenter/CMUDeepLens.
Wang, Hui; Zhang, Weide; Zeng, Qiang; Li, Zuofeng; Feng, Kaiyan; Liu, Lei
2014-04-01
Extracting information from unstructured clinical narratives is valuable for many clinical applications. Although natural Language Processing (NLP) methods have been profoundly studied in electronic medical records (EMR), few studies have explored NLP in extracting information from Chinese clinical narratives. In this study, we report the development and evaluation of extracting tumor-related information from operation notes of hepatic carcinomas which were written in Chinese. Using 86 operation notes manually annotated by physicians as the training set, we explored both rule-based and supervised machine-learning approaches. Evaluating on unseen 29 operation notes, our best approach yielded 69.6% in precision, 58.3% in recall and 63.5% F-score. Copyright © 2014 Elsevier Inc. All rights reserved.
fRMSDPred: Predicting Local RMSD Between Structural Fragments Using Sequence Information
2007-04-04
machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel
NASA Astrophysics Data System (ADS)
Hwong, Y. L.; Oliver, C.; Van Kranendonk, M. J.
2016-12-01
The rise of social media has transformed the way the public engages with scientists and science organisations. `Retweet', `Like', `Share' and `Comment' are a few ways users engage with messages on Twitter and Facebook, two of the most popular social media platforms. Despite the availability of big data from these digital footprints, research into social media science communication is scant. This paper presents the results of an empirical study into the processes and outcomes of space science related social media communications using machine learning. The study is divided into two main parts. The first part is dedicated to the use of supervised learning methods to investigate the features of highly engaging messages., e.g. highly retweeted tweets and shared Facebook posts. It is hypothesised that these messages contain certain psycholinguistic features that are unique to the field of space science. We built a predictive model to forecast the engagement levels of social media posts. By using four feature sets (n-grams, psycholinguistics, grammar and social media), we were able to achieve prediction accuracies in the vicinity of 90% using three supervised learning algorithms (Naive Bayes, linear classifier and decision tree). We conducted the same experiments on social media messages from three other fields (politics, business and non-profit) and discovered several features that are exclusive to space science communications: anger, authenticity, hashtags, visual descriptions and a tentative tone. The second part of the study focuses on the extraction of topics from a corpus of texts using topic modelling. This part of the study is exploratory in nature and uses an unsupervised method called Latent Dirichlet Allocation (LDA) to uncover previously unknown topics within a large body of documents. Preliminary results indicate a strong potential of topic model algorithms to automatically uncover themes hidden within social media chatters on space related issues, with keywords such as `exoplanet', `water' and `life' being clustered together forming a topic (i.e. 'Astrobiology'). Results also demonstrate the freewheeling nature of social media conversations, while providing evidence for the role of these platforms in facilitating meaningful exchanges among science audience.
Automatic classification of diseases from free-text death certificates for real-time surveillance.
Koopman, Bevan; Karimi, Sarvnaz; Nguyen, Anthony; McGuire, Rhydwyn; Muscatello, David; Kemp, Madonna; Truran, Donna; Zhang, Ming; Thackway, Sarah
2015-07-15
Death certificates provide an invaluable source for mortality statistics which can be used for surveillance and early warnings of increases in disease activity and to support the development and monitoring of prevention or response strategies. However, their value can be realised only if accurate, quantitative data can be extracted from death certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This study aims to develop a set of machine learning and rule-based methods to automatically classify death certificates according to four high impact diseases of interest: diabetes, influenza, pneumonia and HIV. Two classification methods are presented: i) a machine learning approach, where detailed features (terms, term n-grams and SNOMED CT concepts) are extracted from death certificates and used to train a set of supervised machine learning models (Support Vector Machines); and ii) a set of keyword-matching rules. These methods were used to identify the presence of diabetes, influenza, pneumonia and HIV in a death certificate. An empirical evaluation was conducted using 340,142 death certificates, divided between training and test sets, covering deaths from 2000-2007 in New South Wales, Australia. Precision and recall (positive predictive value and sensitivity) were used as evaluation measures, with F-measure providing a single, overall measure of effectiveness. A detailed error analysis was performed on classification errors. Classification of diabetes, influenza, pneumonia and HIV was highly accurate (F-measure 0.96). More fine-grained ICD-10 classification effectiveness was more variable but still high (F-measure 0.80). The error analysis revealed that word variations as well as certain word combinations adversely affected classification. In addition, anomalies in the ground truth likely led to an underestimation of the effectiveness. The high accuracy and low cost of the classification methods allow for an effective means for automatic and real-time surveillance of diabetes, influenza, pneumonia and HIV deaths. In addition, the methods are generally applicable to other diseases of interest and to other sources of medical free-text besides death certificates.
Munkhdalai, Tsendsuren; Liu, Feifan
2018-01-01
Background Medication and adverse drug event (ADE) information extracted from electronic health record (EHR) notes can be a rich resource for drug safety surveillance. Existing observational studies have mainly relied on structured EHR data to obtain ADE information; however, ADEs are often buried in the EHR narratives and not recorded in structured data. Objective To unlock ADE-related information from EHR narratives, there is a need to extract relevant entities and identify relations among them. In this study, we focus on relation identification. This study aimed to evaluate natural language processing and machine learning approaches using the expert-annotated medical entities and relations in the context of drug safety surveillance, and investigate how different learning approaches perform under different configurations. Methods We have manually annotated 791 EHR notes with 9 named entities (eg, medication, indication, severity, and ADEs) and 7 different types of relations (eg, medication-dosage, medication-ADE, and severity-ADE). Then, we explored 3 supervised machine learning systems for relation identification: (1) a support vector machines (SVM) system, (2) an end-to-end deep neural network system, and (3) a supervised descriptive rule induction baseline system. For the neural network system, we exploited the state-of-the-art recurrent neural network (RNN) and attention models. We report the performance by macro-averaged precision, recall, and F1-score across the relation types. Results Our results show that the SVM model achieved the best average F1-score of 89.1% on test data, outperforming the long short-term memory (LSTM) model with attention (F1-score of 65.72%) as well as the rule induction baseline system (F1-score of 7.47%) by a large margin. The bidirectional LSTM model with attention achieved the best performance among different RNN models. With the inclusion of additional features in the LSTM model, its performance can be boosted to an average F1-score of 77.35%. Conclusions It shows that classical learning models (SVM) remains advantageous over deep learning models (RNN variants) for clinical relation identification, especially for long-distance intersentential relations. However, RNNs demonstrate a great potential of significant improvement if more training data become available. Our work is an important step toward mining EHRs to improve the efficacy of drug safety surveillance. Most importantly, the annotated data used in this study will be made publicly available, which will further promote drug safety research in the community. PMID:29695376
Reduced electron exposure for energy-dispersive spectroscopy using dynamic sampling
Zhang, Yan; Godaliyadda, G. M. Dilshan; Ferrier, Nicola; ...
2017-10-23
Analytical electron microscopy and spectroscopy of biological specimens, polymers, and other beam sensitive materials has been a challenging area due to irradiation damage. There is a pressing need to develop novel imaging and spectroscopic imaging methods that will minimize such sample damage as well as reduce the data acquisition time. The latter is useful for high-throughput analysis of materials structure and chemistry. Here, in this work, we present a novel machine learning based method for dynamic sparse sampling of EDS data using a scanning electron microscope. Our method, based on the supervised learning approach for dynamic sampling algorithm and neuralmore » networks based classification of EDS data, allows a dramatic reduction in the total sampling of up to 90%, while maintaining the fidelity of the reconstructed elemental maps and spectroscopic data. In conclusion, we believe this approach will enable imaging and elemental mapping of materials that would otherwise be inaccessible to these analysis techniques.« less
Enhancing clinical concept extraction with distributional semantics
Cohen, Trevor; Wu, Stephen; Gonzalez, Graciela
2011-01-01
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology to unlock the knowledge within and support more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of the effectiveness of treatment. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, as the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns covered, which impacts the performance of supervised machine learning applications trained with these data. This paper proposes an approach to minimize this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text. The approach uses a sequential discriminative classifier (Conditional Random Fields) to extract the mentions of medical problems, treatments and tests from clinical narratives. It takes advantage of all Medline abstracts indexed as being of the publication type “clinical trials” to estimate the relatedness between words in the i2b2/VA training and testing corpora. In addition to the traditional features such as dictionary matching, pattern matching and part-of-speech tags, we also used as a feature words that appear in similar contexts to the word in question (that is, words that have a similar vector representation measured with the commonly used cosine metric, where vector representations are derived using methods of distributional semantics). To the best of our knowledge, this is the first effort exploring the use of distributional semantics, the semantics derived empirically from unannotated text often using vector space models, for a sequence classification task such as concept extraction. Therefore, we first experimented with different sliding window models and found the model with parameters that led to best performance in a preliminary sequence labeling task. The evaluation of this approach, performed against the i2b2/VA concept extraction corpus, showed that incorporating features based on the distribution of words across a large unannotated corpus significantly aids concept extraction. Compared to a supervised-only approach as a baseline, the micro-averaged f-measure for exact match increased from 80.3% to 82.3% and the micro-averaged f-measure based on inexact match increased from 89.7% to 91.3%. These improvements are highly significant according to the bootstrap resampling method and also considering the performance of other systems. Thus, distributional semantic features significantly improve the performance of concept extraction from clinical narratives by taking advantage of word distribution information obtained from unannotated data. PMID:22085698
Genome-wide prediction of cis-regulatory regions using supervised deep learning methods.
Li, Yifeng; Shi, Wenqiang; Wasserman, Wyeth W
2018-05-31
In the human genome, 98% of DNA sequences are non-protein-coding regions that were previously disregarded as junk DNA. In fact, non-coding regions host a variety of cis-regulatory regions which precisely control the expression of genes. Thus, Identifying active cis-regulatory regions in the human genome is critical for understanding gene regulation and assessing the impact of genetic variation on phenotype. The developments of high-throughput sequencing and machine learning technologies make it possible to predict cis-regulatory regions genome wide. Based on rich data resources such as the Encyclopedia of DNA Elements (ENCODE) and the Functional Annotation of the Mammalian Genome (FANTOM) projects, we introduce DECRES based on supervised deep learning approaches for the identification of enhancer and promoter regions in the human genome. Due to their ability to discover patterns in large and complex data, the introduction of deep learning methods enables a significant advance in our knowledge of the genomic locations of cis-regulatory regions. Using models for well-characterized cell lines, we identify key experimental features that contribute to the predictive performance. Applying DECRES, we delineate locations of 300,000 candidate enhancers genome wide (6.8% of the genome, of which 40,000 are supported by bidirectional transcription data), and 26,000 candidate promoters (0.6% of the genome). The predicted annotations of cis-regulatory regions will provide broad utility for genome interpretation from functional genomics to clinical applications. The DECRES model demonstrates potentials of deep learning technologies when combined with high-throughput sequencing data, and inspires the development of other advanced neural network models for further improvement of genome annotations.
Physical isolation with virtual support: Registrars' learning via remote supervision.
Wearne, Susan M; Teunissen, Pim W; Dornan, Tim; Skinner, Timothy
2014-08-26
Abstract Purpose: Changing the current geographical maldistribution of the medical workforce is important for global health. Research regarding programs that train doctors for work with disadvantaged, rural populations is needed. This paper explores one approach of remote supervision of registrars in isolated rural practice. Researching how learning occurs without on-site supervision may also reveal other key elements of postgraduate education. Methods: Thematic analysis of in-depth interviews exploring 11 respondents' experiences of learning via remote supervision. Results: Remote supervision created distinctive learning environments. Respondents' attributes interacted with external supports to influence whether and how their learning was promoted or impeded. Registrars with clinical and/or life experience, who were insightful and motivated to direct their learning, turned the challenges of isolated practice into opportunities that accelerated their professional development. Discussion: Remote supervision was not necessarily problematic but instead provided rich learning for doctors training in and for the context where they were needed. Registrars learnt through clinical responsibility for defined populations and longitudinal, supportive supervisory relationships. Responsibility and continuity may be as important as supervisory proximity for experienced registrars.
Souza, Roberto; Lucena, Oeslle; Garrafa, Julia; Gobbi, David; Saluzzi, Marina; Appenzeller, Simone; Rittner, Letícia; Frayne, Richard; Lotufo, Roberto
2018-04-15
This paper presents an open, multi-vendor, multi-field strength magnetic resonance (MR) T1-weighted volumetric brain imaging dataset, named Calgary-Campinas-359 (CC-359). The dataset is composed of images of older healthy adults (29-80 years) acquired on scanners from three vendors (Siemens, Philips and General Electric) at both 1.5 T and 3 T. CC-359 is comprised of 359 datasets, approximately 60 subjects per vendor and magnetic field strength. The dataset is approximately age and gender balanced, subject to the constraints of the available images. It provides consensus brain extraction masks for all volumes generated using supervised classification. Manual segmentation results for twelve randomly selected subjects performed by an expert are also provided. The CC-359 dataset allows investigation of 1) the influences of both vendor and magnetic field strength on quantitative analysis of brain MR; 2) parameter optimization for automatic segmentation methods; and potentially 3) machine learning classifiers with big data, specifically those based on deep learning methods, as these approaches require a large amount of data. To illustrate the utility of this dataset, we compared to the results of a supervised classifier, the results of eight publicly available skull stripping methods and one publicly available consensus algorithm. A linear mixed effects model analysis indicated that vendor (p-value<0.001) and magnetic field strength (p-value<0.001) have statistically significant impacts on skull stripping results. Copyright © 2017 Elsevier Inc. All rights reserved.
Machine learning vortices at the Kosterlitz-Thouless transition
NASA Astrophysics Data System (ADS)
Beach, Matthew J. S.; Golubeva, Anna; Melko, Roger G.
2018-01-01
Efficient and automated classification of phases from minimally processed data is one goal of machine learning in condensed-matter and statistical physics. Supervised algorithms trained on raw samples of microstates can successfully detect conventional phase transitions via learning a bulk feature such as an order parameter. In this paper, we investigate whether neural networks can learn to classify phases based on topological defects. We address this question on the two-dimensional classical XY model which exhibits a Kosterlitz-Thouless transition. We find significant feature engineering of the raw spin states is required to convincingly claim that features of the vortex configurations are responsible for learning the transition temperature. We further show a single-layer network does not correctly classify the phases of the XY model, while a convolutional network easily performs classification by learning the global magnetization. Finally, we design a deep network capable of learning vortices without feature engineering. We demonstrate the detection of vortices does not necessarily result in the best classification accuracy, especially for lattices of less than approximately 1000 spins. For larger systems, it remains a difficult task to learn vortices.
Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models.
AlDahoul, Nouar; Md Sabri, Aznul Qalid; Mansoor, Ali Mohammed
2018-01-01
Human detection in videos plays an important role in various real life applications. Most of traditional approaches depend on utilizing handcrafted features which are problem-dependent and optimal for specific tasks. Moreover, they are highly susceptible to dynamical events such as illumination changes, camera jitter, and variations in object sizes. On the other hand, the proposed feature learning approaches are cheaper and easier because highly abstract and discriminative features can be produced automatically without the need of expert knowledge. In this paper, we utilize automatic feature learning methods which combine optical flow and three different deep models (i.e., supervised convolutional neural network (S-CNN), pretrained CNN feature extractor, and hierarchical extreme learning machine) for human detection in videos captured using a nonstatic camera on an aerial platform with varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The comparison between these models in terms of training, testing accuracy, and learning speed is analyzed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrated that the proposed methods are successful for human detection task. Pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with soft-max and 91.7% with Support Vector Machines (SVM). H-ELM has an average accuracy of 95.9%. Using a normal Central Processing Unit (CPU), H-ELM's training time takes 445 seconds. Learning in S-CNN takes 770 seconds with a high performance Graphical Processing Unit (GPU).
Toward a Real-Time (Day) Dreamcatcher: Sensor-Free Detection of Mind Wandering during Online Reading
ERIC Educational Resources Information Center
Mills, Caitlin; D'Mello, Sidney
2015-01-01
This paper reports the results from a sensor-free detector of mind wandering during an online reading task. Features consisted of reading behaviors (e.g., reading time) and textual features (e.g., level of difficulty) extracted from self-paced reading log files. Supervised machine learning was applied to two datasets in order to predict if…
Applying Active Learning to Assertion Classification of Concepts in Clinical Text
Chen, Yukun; Mani, Subramani; Xu, Hua
2012-01-01
Supervised machine learning methods for clinical natural language processing (NLP) research require a large number of annotated samples, which are very expensive to build because of the involvement of physicians. Active learning, an approach that actively samples from a large pool, provides an alternative solution. Its major goal in classification is to reduce the annotation effort while maintaining the quality of the predictive model. However, few studies have investigated its uses in clinical NLP. This paper reports an application of active learning to a clinical text classification task: to determine the assertion status of clinical concepts. The annotated corpus for the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge was used in this study. We implemented several existing and newly developed active learning algorithms and assessed their uses. The outcome is reported in the global ALC score, based on the Area under the average Learning Curve of the AUC (Area Under the Curve) score. Results showed that when the same number of annotated samples was used, active learning strategies could generate better classification models (best ALC – 0.7715) than the passive learning method (random sampling) (ALC – 0.7411). Moreover, to achieve the same classification performance, active learning strategies required fewer samples than the random sampling method. For example, to achieve an AUC of 0.79, the random sampling method used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort. PMID:22127105
A machine learning system to improve heart failure patient assistance.
Guidi, Gabriele; Pettenati, Maria Chiara; Melillo, Paolo; Iadanza, Ernesto
2014-11-01
In this paper, we present a clinical decision support system (CDSS) for the analysis of heart failure (HF) patients, providing various outputs such as an HF severity evaluation, HF-type prediction, as well as a management interface that compares the different patients' follow-ups. The whole system is composed of a part of intelligent core and of an HF special-purpose management tool also providing the function to act as interface for the artificial intelligence training and use. To implement the smart intelligent functions, we adopted a machine learning approach. In this paper, we compare the performance of a neural network (NN), a support vector machine, a system with fuzzy rules genetically produced, and a classification and regression tree and its direct evolution, which is the random forest, in analyzing our database. Best performances in both HF severity evaluation and HF-type prediction functions are obtained by using the random forest algorithm. The management tool allows the cardiologist to populate a "supervised database" suitable for machine learning during his or her regular outpatient consultations. The idea comes from the fact that in literature there are a few databases of this type, and they are not scalable to our case.
Computational approaches for predicting biomedical research collaborations.
Zhang, Qing; Yu, Hong
2014-01-01
Biomedical research is increasingly collaborative, and successful collaborations often produce high impact work. Computational approaches can be developed for automatically predicting biomedical research collaborations. Previous works of collaboration prediction mainly explored the topological structures of research collaboration networks, leaving out rich semantic information from the publications themselves. In this paper, we propose supervised machine learning approaches to predict research collaborations in the biomedical field. We explored both the semantic features extracted from author research interest profile and the author network topological features. We found that the most informative semantic features for author collaborations are related to research interest, including similarity of out-citing citations, similarity of abstracts. Of the four supervised machine learning models (naïve Bayes, naïve Bayes multinomial, SVMs, and logistic regression), the best performing model is logistic regression with an ROC ranging from 0.766 to 0.980 on different datasets. To our knowledge we are the first to study in depth how research interest and productivities can be used for collaboration prediction. Our approach is computationally efficient, scalable and yet simple to implement. The datasets of this study are available at https://github.com/qingzhanggithub/medline-collaboration-datasets.
Recapitulation of Ayurveda constitution types by machine learning of phenotypic traits.
Tiwari, Pradeep; Kutum, Rintu; Sethi, Tavpritesh; Shrivastava, Ankita; Girase, Bhushan; Aggarwal, Shilpi; Patil, Rutuja; Agarwal, Dhiraj; Gautam, Pramod; Agrawal, Anurag; Dash, Debasis; Ghosh, Saurabh; Juvekar, Sanjay; Mukerji, Mitali; Prasher, Bhavana
2017-01-01
In Ayurveda system of medicine individuals are classified into seven constitution types, "Prakriti", for assessing disease susceptibility and drug responsiveness. Prakriti evaluation involves clinical examination including questions about physiological and behavioural traits. A need was felt to develop models for accurately predicting Prakriti classes that have been shown to exhibit molecular differences. The present study was carried out on data of phenotypic attributes in 147 healthy individuals of three extreme Prakriti types, from a genetically homogeneous population of Western India. Unsupervised and supervised machine learning approaches were used to infer inherent structure of the data, and for feature selection and building classification models for Prakriti respectively. These models were validated in a North Indian population. Unsupervised clustering led to emergence of three natural clusters corresponding to three extreme Prakriti classes. The supervised modelling approaches could classify individuals, with distinct Prakriti types, in the training and validation sets. This study is the first to demonstrate that Prakriti types are distinct verifiable clusters within a multidimensional space of multiple interrelated phenotypic traits. It also provides a computational framework for predicting Prakriti classes from phenotypic attributes. This approach may be useful in precision medicine for stratification of endophenotypes in healthy and diseased populations.
Learned filters for object detection in multi-object visual tracking
NASA Astrophysics Data System (ADS)
Stamatescu, Victor; Wong, Sebastien; McDonnell, Mark D.; Kearney, David
2016-05-01
We investigate the application of learned convolutional filters in multi-object visual tracking. The filters were learned in both a supervised and unsupervised manner from image data using artificial neural networks. This work follows recent results in the field of machine learning that demonstrate the use learned filters for enhanced object detection and classification. Here we employ a track-before-detect approach to multi-object tracking, where tracking guides the detection process. The object detection provides a probabilistic input image calculated by selecting from features obtained using banks of generative or discriminative learned filters. We present a systematic evaluation of these convolutional filters using a real-world data set that examines their performance as generic object detectors.
Predicting radiotherapy outcomes using statistical learning techniques
NASA Astrophysics Data System (ADS)
El Naqa, Issam; Bradley, Jeffrey D.; Lindsay, Patricia E.; Hope, Andrew J.; Deasy, Joseph O.
2009-09-01
Radiotherapy outcomes are determined by complex interactions between treatment, anatomical and patient-related variables. A common obstacle to building maximally predictive outcome models for clinical practice is the failure to capture potential complexity of heterogeneous variable interactions and applicability beyond institutional data. We describe a statistical learning methodology that can automatically screen for nonlinear relations among prognostic variables and generalize to unseen data before. In this work, several types of linear and nonlinear kernels to generate interaction terms and approximate the treatment-response function are evaluated. Examples of institutional datasets of esophagitis, pneumonitis and xerostomia endpoints were used. Furthermore, an independent RTOG dataset was used for 'generalizabilty' validation. We formulated the discrimination between risk groups as a supervised learning problem. The distribution of patient groups was initially analyzed using principle components analysis (PCA) to uncover potential nonlinear behavior. The performance of the different methods was evaluated using bivariate correlations and actuarial analysis. Over-fitting was controlled via cross-validation resampling. Our results suggest that a modified support vector machine (SVM) kernel method provided superior performance on leave-one-out testing compared to logistic regression and neural networks in cases where the data exhibited nonlinear behavior on PCA. For instance, in prediction of esophagitis and pneumonitis endpoints, which exhibited nonlinear behavior on PCA, the method provided 21% and 60% improvements, respectively. Furthermore, evaluation on the independent pneumonitis RTOG dataset demonstrated good generalizabilty beyond institutional data in contrast with other models. This indicates that the prediction of treatment response can be improved by utilizing nonlinear kernel methods for discovering important nonlinear interactions among model variables. These models have the capacity to predict on unseen data. Part of this work was first presented at the Seventh International Conference on Machine Learning and Applications, San Diego, CA, USA, 11-13 December 2008.
Taniguchi, Hidetaka; Sato, Hiroshi; Shirakawa, Tomohiro
2018-05-09
Human learners can generalize a new concept from a small number of samples. In contrast, conventional machine learning methods require large amounts of data to address the same types of problems. Humans have cognitive biases that promote fast learning. Here, we developed a method to reduce the gap between human beings and machines in this type of inference by utilizing cognitive biases. We implemented a human cognitive model into machine learning algorithms and compared their performance with the currently most popular methods, naïve Bayes, support vector machine, neural networks, logistic regression and random forests. We focused on the task of spam classification, which has been studied for a long time in the field of machine learning and often requires a large amount of data to obtain high accuracy. Our models achieved superior performance with small and biased samples in comparison with other representative machine learning methods.
Dolz, Jose; Laprie, Anne; Ken, Soléakhéna; Leroy, Henri-Arthur; Reyns, Nicolas; Massoptier, Laurent; Vermandel, Maximilien
2016-01-01
To constrain the risk of severe toxicity in radiotherapy and radiosurgery, precise volume delineation of organs at risk is required. This task is still manually performed, which is time-consuming and prone to observer variability. To address these issues, and as alternative to atlas-based segmentation methods, machine learning techniques, such as support vector machines (SVM), have been recently presented to segment subcortical structures on magnetic resonance images (MRI). SVM is proposed to segment the brainstem on MRI in multicenter brain cancer context. A dataset composed by 14 adult brain MRI scans is used to evaluate its performance. In addition to spatial and probabilistic information, five different image intensity values (IIVs) configurations are evaluated as features to train the SVM classifier. Segmentation accuracy is evaluated by computing the Dice similarity coefficient (DSC), absolute volumes difference (AVD) and percentage volume difference between automatic and manual contours. Mean DSC for all proposed IIVs configurations ranged from 0.89 to 0.90. Mean AVD values were below 1.5 cm(3), where the value for best performing IIVs configuration was 0.85 cm(3), representing an absolute mean difference of 3.99% with respect to the manual segmented volumes. Results suggest consistent volume estimation and high spatial similarity with respect to expert delineations. The proposed approach outperformed presented methods to segment the brainstem, not only in volume similarity metrics, but also in segmentation time. Preliminary results showed that the approach might be promising for adoption in clinical use.
Fluid Lensing based Machine Learning for Augmenting Earth Science Coral Datasets
NASA Astrophysics Data System (ADS)
Li, A.; Instrella, R.; Chirayath, V.
2016-12-01
Recently, there has been increased interest in monitoring the effects of climate change upon the world's marine ecosystems, particularly coral reefs. These delicate ecosystems are especially threatened due to their sensitivity to ocean warming and acidification, leading to unprecedented levels of coral bleaching and die-off in recent years. However, current global aquatic remote sensing datasets are unable to quantify changes in marine ecosystems at spatial and temporal scales relevant to their growth. In this project, we employ various supervised and unsupervised machine learning algorithms to augment existing datasets from NASA's Earth Observing System (EOS), using high resolution airborne imagery. This method utilizes NASA's ongoing airborne campaigns as well as its spaceborne assets to collect remote sensing data over these afflicted regions, and employs Fluid Lensing algorithms to resolve optical distortions caused by the fluid surface, producing cm-scale resolution imagery of these diverse ecosystems from airborne platforms. Support Vector Machines (SVMs) and K-mean clustering methods were applied to satellite imagery at 0.5m resolution, producing segmented maps classifying coral based on percent cover and morphology. Compared to a previous study using multidimensional maximum a posteriori (MAP) estimation to separate these features in high resolution airborne datasets, SVMs are able to achieve above 75% accuracy when augmented with existing MAP estimates, while unsupervised methods such as K-means achieve roughly 68% accuracy, verified by manually segmented reference data provided by a marine biologist. This effort thus has broad applications for coastal remote sensing, by helping marine biologists quantify behavioral trends spanning large areas and over longer timescales, and to assess the health of coral reefs worldwide.
Function approximation using combined unsupervised and supervised learning.
Andras, Peter
2014-03-01
Function approximation is one of the core tasks that are solved using neural networks in the context of many engineering problems. However, good approximation results need good sampling of the data space, which usually requires exponentially increasing volume of data as the dimensionality of the data increases. At the same time, often the high-dimensional data is arranged around a much lower dimensional manifold. Here we propose the breaking of the function approximation task for high-dimensional data into two steps: (1) the mapping of the high-dimensional data onto a lower dimensional space corresponding to the manifold on which the data resides and (2) the approximation of the function using the mapped lower dimensional data. We use over-complete self-organizing maps (SOMs) for the mapping through unsupervised learning, and single hidden layer neural networks for the function approximation through supervised learning. We also extend the two-step procedure by considering support vector machines and Bayesian SOMs for the determination of the best parameters for the nonlinear neurons in the hidden layer of the neural networks used for the function approximation. We compare the approximation performance of the proposed neural networks using a set of functions and show that indeed the neural networks using combined unsupervised and supervised learning outperform in most cases the neural networks that learn the function approximation using the original high-dimensional data.
Rahman, Md Mahmudur; Bhattacharya, Prabir; Desai, Bipin C
2007-01-01
A content-based image retrieval (CBIR) framework for diverse collection of medical images of different imaging modalities, anatomic regions with different orientations and biological systems is proposed. Organization of images in such a database (DB) is well defined with predefined semantic categories; hence, it can be useful for category-specific searching. The proposed framework consists of machine learning methods for image prefiltering, similarity matching using statistical distance measures, and a relevance feedback (RF) scheme. To narrow down the semantic gap and increase the retrieval efficiency, we investigate both supervised and unsupervised learning techniques to associate low-level global image features (e.g., color, texture, and edge) in the projected PCA-based eigenspace with their high-level semantic and visual categories. Specially, we explore the use of a probabilistic multiclass support vector machine (SVM) and fuzzy c-mean (FCM) clustering for categorization and prefiltering of images to reduce the search space. A category-specific statistical similarity matching is proposed in a finer level on the prefiltered images. To incorporate a better perception subjectivity, an RF mechanism is also added to update the query parameters dynamically and adjust the proposed matching functions. Experiments are based on a ground-truth DB consisting of 5000 diverse medical images of 20 predefined categories. Analysis of results based on cross-validation (CV) accuracy and precision-recall for image categorization and retrieval is reported. It demonstrates the improvement, effectiveness, and efficiency achieved by the proposed framework.
Arabic Supervised Learning Method Using N-Gram
ERIC Educational Resources Information Center
Sanan, Majed; Rammal, Mahmoud; Zreik, Khaldoun
2008-01-01
Purpose: Recently, classification of Arabic documents is a real problem for juridical centers. In this case, some of the Lebanese official journal documents are classified, and the center has to classify new documents based on these documents. This paper aims to study and explain the useful application of supervised learning method on Arabic texts…
Mehryary, Farrokh; Kaewphan, Suwisa; Hakala, Kai; Ginter, Filip
2016-01-01
Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature containing more than 40 million extracted events. The top most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classification produced by the unsupervised clustering method and manual annotation. The method is evaluated on the official test set of BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of the state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 of potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not solely bound to the EVEX resource and can be thus used to improve the quality of any event extraction system or database. The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.
Machine Learning Applications to Resting-State Functional MR Imaging Analysis.
Billings, John M; Eder, Maxwell; Flood, William C; Dhami, Devendra Singh; Natarajan, Sriraam; Whitlow, Christopher T
2017-11-01
Machine learning is one of the most exciting and rapidly expanding fields within computer science. Academic and commercial research entities are investing in machine learning methods, especially in personalized medicine via patient-level classification. There is great promise that machine learning methods combined with resting state functional MR imaging will aid in diagnosis of disease and guide potential treatment for conditions thought to be impossible to identify based on imaging alone, such as psychiatric disorders. We discuss machine learning methods and explore recent advances. Copyright © 2017 Elsevier Inc. All rights reserved.
Learning the Relationship between Galaxy Spectra and Star Formation Histories
NASA Astrophysics Data System (ADS)
Lovell, Christopher; Acquaviva, Viviana; Iyer, Kartheik; Gawiser, Eric
2018-01-01
We explore novel approaches to the problem of predicting a galaxy’s star formation history (SFH) from its Spectral Energy Distribution (SED). Traditional approaches to SED template fitting use constant or exponentially declining SFHs, and are known to incur significant bias in the inferred SFHs, which are typically skewed toward younger stellar populations. Machine learning approaches, including tree ensemble methods and convolutional neural networks, would not be affected by the same bias, and may work well in recovering unbiased and multi-episodic star formation histories. We use a supervised approach whereby models are trained using synthetic spectra, generated from three state of the art hydrodynamical simulations, including nebular emission. We explore how SED feature maps can be used to highlight areas of the spectrum with the highest predictive power and discuss the limitations of the approach when applied to real data.
El-Sayed, Hesham; Sankar, Sharmi; Daraghmi, Yousef-Awwad; Tiwari, Prayag; Rattagan, Ekarat; Mohanty, Manoranjan; Puthal, Deepak; Prasad, Mukesh
2018-05-24
Heterogeneous vehicular networks (HETVNETs) evolve from vehicular ad hoc networks (VANETs), which allow vehicles to always be connected so as to obtain safety services within intelligent transportation systems (ITSs). The services and data provided by HETVNETs should be neither interrupted nor delayed. Therefore, Quality of Service (QoS) improvement of HETVNETs is one of the topics attracting the attention of researchers and the manufacturing community. Several methodologies and frameworks have been devised by researchers to address QoS-prediction service issues. In this paper, to improve QoS, we evaluate various traffic characteristics of HETVNETs and propose a new supervised learning model to capture knowledge on all possible traffic patterns. This model is a refinement of support vector machine (SVM) kernels with a radial basis function (RBF). The proposed model produces better results than SVMs, and outperforms other prediction methods used in a traffic context, as it has lower computational complexity and higher prediction accuracy.
Active relearning for robust supervised classification of pulmonary emphysema
NASA Astrophysics Data System (ADS)
Raghunath, Sushravya; Rajagopalan, Srinivasan; Karwoski, Ronald A.; Bartholmai, Brian J.; Robb, Richard A.
2012-03-01
Radiologists are adept at recognizing the appearance of lung parenchymal abnormalities in CT scans. However, the inconsistent differential diagnosis, due to subjective aggregation, mandates supervised classification. Towards optimizing Emphysema classification, we introduce a physician-in-the-loop feedback approach in order to minimize uncertainty in the selected training samples. Using multi-view inductive learning with the training samples, an ensemble of Support Vector Machine (SVM) models, each based on a specific pair-wise dissimilarity metric, was constructed in less than six seconds. In the active relearning phase, the ensemble-expert label conflicts were resolved by an expert. This just-in-time feedback with unoptimized SVMs yielded 15% increase in classification accuracy and 25% reduction in the number of support vectors. The generality of relearning was assessed in the optimized parameter space of six different classifiers across seven dissimilarity metrics. The resultant average accuracy improved to 21%. The co-operative feedback method proposed here could enhance both diagnostic and staging throughput efficiency in chest radiology practice.
FROM2D to 3d Supervised Segmentation and Classification for Cultural Heritage Applications
NASA Astrophysics Data System (ADS)
Grilli, E.; Dininno, D.; Petrucci, G.; Remondino, F.
2018-05-01
The digital management of architectural heritage information is still a complex problem, as a heritage object requires an integrated representation of various types of information in order to develop appropriate restoration or conservation strategies. Currently, there is extensive research focused on automatic procedures of segmentation and classification of 3D point clouds or meshes, which can accelerate the study of a monument and integrate it with heterogeneous information and attributes, useful to characterize and describe the surveyed object. The aim of this study is to propose an optimal, repeatable and reliable procedure to manage various types of 3D surveying data and associate them with heterogeneous information and attributes to characterize and describe the surveyed object. In particular, this paper presents an approach for classifying 3D heritage models, starting from the segmentation of their textures based on supervised machine learning methods. Experimental results run on three different case studies demonstrate that the proposed approach is effective and with many further potentials.
Onder, Devrim; Sarioglu, Sulen; Karacali, Bilge
2013-04-01
Quasi-supervised learning is a statistical learning algorithm that contrasts two datasets by computing estimate for the posterior probability of each sample in either dataset. This method has not been applied to histopathological images before. The purpose of this study is to evaluate the performance of the method to identify colorectal tissues with or without adenocarcinoma. Light microscopic digital images from histopathological sections were obtained from 30 colorectal radical surgery materials including adenocarcinoma and non-neoplastic regions. The texture features were extracted by using local histograms and co-occurrence matrices. The quasi-supervised learning algorithm operates on two datasets, one containing samples of normal tissues labelled only indirectly, and the other containing an unlabeled collection of samples of both normal and cancer tissues. As such, the algorithm eliminates the need for manually labelled samples of normal and cancer tissues for conventional supervised learning and significantly reduces the expert intervention. Several texture feature vector datasets corresponding to different extraction parameters were tested within the proposed framework. The Independent Component Analysis dimensionality reduction approach was also identified as the one improving the labelling performance evaluated in this series. In this series, the proposed method was applied to the dataset of 22,080 vectors with reduced dimensionality 119 from 132. Regions containing cancer tissue could be identified accurately having false and true positive rates up to 19% and 88% respectively without using manually labelled ground-truth datasets in a quasi-supervised strategy. The resulting labelling performances were compared to that of a conventional powerful supervised classifier using manually labelled ground-truth data. The supervised classifier results were calculated as 3.5% and 95% for the same case. The results in this series in comparison with the benchmark classifier, suggest that quasi-supervised image texture labelling may be a useful method in the analysis and classification of pathological slides but further study is required to improve the results. Copyright © 2013 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Teranishi, Masaru; Omatu, Sigeru; Kosaka, Toshihisa
Fatigued monetary bills adversely affect the daily operation of automated teller machines (ATMs). In order to make the classification of fatigued bills more efficient, the development of an automatic fatigued monetary bill classification method is desirable. We propose a new method by which to estimate the fatigue level of monetary bills from the feature-selected frequency band acoustic energy pattern of banking machines. By using a supervised self-organizing map (SOM), we effectively estimate the fatigue level using only the feature-selected frequency band acoustic energy pattern. Furthermore, the feature-selected frequency band acoustic energy pattern improves the estimation accuracy of the fatigue level of monetary bills by adding frequency domain information to the acoustic energy pattern. The experimental results with real monetary bill samples reveal the effectiveness of the proposed method.
Learning With Mixed Hard/Soft Pointwise Constraints.
Gnecco, Giorgio; Gori, Marco; Melacci, Stefano; Sanguineti, Marcello
2015-09-01
A learning paradigm is proposed and investigated, in which the classical framework of learning from examples is enhanced by the introduction of hard pointwise constraints, i.e., constraints imposed on a finite set of examples that cannot be violated. Such constraints arise, e.g., when requiring coherent decisions of classifiers acting on different views of the same pattern. The classical examples of supervised learning, which can be violated at the cost of some penalization (quantified by the choice of a suitable loss function) play the role of soft pointwise constraints. Constrained variational calculus is exploited to derive a representer theorem that provides a description of the functional structure of the optimal solution to the proposed learning paradigm. It is shown that such an optimal solution can be represented in terms of a set of support constraints, which generalize the concept of support vectors and open the doors to a novel learning paradigm, called support constraint machines. The general theory is applied to derive the representation of the optimal solution to the problem of learning from hard linear pointwise constraints combined with soft pointwise constraints induced by supervised examples. In some cases, closed-form optimal solutions are obtained.
SU-F-J-72: A Clinical Usable Integrated Contouring Quality Evaluation Software for Radiotherapy
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jiang, S; Dolly, S; Cai, B
Purpose: To introduce the Auto Contour Evaluation (ACE) software, which is the clinical usable, user friendly, efficient and all-in-one toolbox for automatically identify common contouring errors in radiotherapy treatment planning using supervised machine learning techniques. Methods: ACE is developed with C# using Microsoft .Net framework and Windows Presentation Foundation (WPF) for elegant GUI design and smooth GUI transition animations through the integration of graphics engines and high dots per inch (DPI) settings on modern high resolution monitors. The industrial standard software design pattern, Model-View-ViewModel (MVVM) pattern, is chosen to be the major architecture of ACE for neat coding structure, deepmore » modularization, easy maintainability and seamless communication with other clinical software. ACE consists of 1) a patient data importing module integrated with clinical patient database server, 2) a 2D DICOM image and RT structure simultaneously displaying module, 3) a 3D RT structure visualization module using Visualization Toolkit or VTK library and 4) a contour evaluation module using supervised pattern recognition algorithms to detect contouring errors and display detection results. ACE relies on supervised learning algorithms to handle all image processing and data processing jobs. Implementations of related algorithms are powered by Accord.Net scientific computing library for better efficiency and effectiveness. Results: ACE can take patient’s CT images and RT structures from commercial treatment planning software via direct user input or from patients’ database. All functionalities including 2D and 3D image visualization and RT contours error detection have been demonstrated with real clinical patient cases. Conclusion: ACE implements supervised learning algorithms and combines image processing and graphical visualization modules for RT contours verification. ACE has great potential for automated radiotherapy contouring quality verification. Structured with MVVM pattern, it is highly maintainable and extensible, and support smooth connections with other clinical software tools.« less
Cell classification using big data analytics plus time stretch imaging (Conference Presentation)
NASA Astrophysics Data System (ADS)
Jalali, Bahram; Chen, Claire L.; Mahjoubfar, Ata
2016-09-01
We show that blood cells can be classified with high accuracy and high throughput by combining machine learning with time stretch quantitative phase imaging. Our diagnostic system captures quantitative phase images in a flow microscope at millions of frames per second and extracts multiple biophysical features from individual cells including morphological characteristics, light absorption and scattering parameters, and protein concentration. These parameters form a hyperdimensional feature space in which supervised learning and cell classification is performed. We show binary classification of T-cells against colon cancer cells, as well classification of algae cell strains with high and low lipid content. The label-free screening averts the negative impact of staining reagents on cellular viability or cell signaling. The combination of time stretch machine vision and learning offers unprecedented cell analysis capabilities for cancer diagnostics, drug development and liquid biopsy for personalized genomics.
A robust data-driven approach for gene ontology annotation.
Li, Yanpeng; Yu, Hong
2014-01-01
Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation became a major bottleneck of database curation. BioCreative IV GO annotation task aims to evaluate the performance of system that automatically assigns GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work in this task as well as the experimental results after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed our approaches were robust in both the two tasks. © The Author(s) 2014. Published by Oxford University Press.
Zeng, Xueqiang; Luo, Gang
2017-12-01
Machine learning is broadly used for clinical data analysis. Before training a model, a machine learning algorithm must be selected. Also, the values of one or more model parameters termed hyper-parameters must be set. Selecting algorithms and hyper-parameter values requires advanced machine learning knowledge and many labor-intensive manual iterations. To lower the bar to machine learning, miscellaneous automatic selection methods for algorithms and/or hyper-parameter values have been proposed. Existing automatic selection methods are inefficient on large data sets. This poses a challenge for using machine learning in the clinical big data era. To address the challenge, this paper presents progressive sampling-based Bayesian optimization, an efficient and automatic selection method for both algorithms and hyper-parameter values. We report an implementation of the method. We show that compared to a state of the art automatic selection method, our method can significantly reduce search time, classification error rate, and standard deviation of error rate due to randomization. This is major progress towards enabling fast turnaround in identifying high-quality solutions required by many machine learning-based clinical data analysis tasks.
Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi
2016-06-21
Genotype imputation is an important tool for prediction of unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available and can either employ universal machine learning methods, or deploy algorithms dedicated to infer missing genotypes. In this research the performance of eight machine learning methods: Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine, Radial Basis Function, Random Forest, AdaBoost, LogitBoost, and TotalBoost compared in terms of the imputation accuracy, computation time and the factors affecting imputation accuracy. The methods employed using real and simulated datasets to impute the un-typed SNPs in parent-offspring trios. The tested methods show that imputation of parent-offspring trios can be accurate. The Random Forest and Support Vector Machine were more accurate than the other machine learning methods. The TotalBoost performed slightly worse than the other methods.The running times were different between methods. The ELM was always most fast algorithm. In case of increasing the sample size, the RBF requires long imputation time.The tested methods in this research can be an alternative for imputation of un-typed SNPs in low missing rate of data. However, it is recommended that other machine learning methods to be used for imputation. Copyright © 2016 Elsevier Ltd. All rights reserved.
Robust evaluation of time series classification algorithms for structural health monitoring
NASA Astrophysics Data System (ADS)
Harvey, Dustin Y.; Worden, Keith; Todd, Michael D.
2014-03-01
Structural health monitoring (SHM) systems provide real-time damage and performance information for civil, aerospace, and mechanical infrastructure through analysis of structural response measurements. The supervised learning methodology for data-driven SHM involves computation of low-dimensional, damage-sensitive features from raw measurement data that are then used in conjunction with machine learning algorithms to detect, classify, and quantify damage states. However, these systems often suffer from performance degradation in real-world applications due to varying operational and environmental conditions. Probabilistic approaches to robust SHM system design suffer from incomplete knowledge of all conditions a system will experience over its lifetime. Info-gap decision theory enables nonprobabilistic evaluation of the robustness of competing models and systems in a variety of decision making applications. Previous work employed info-gap models to handle feature uncertainty when selecting various components of a supervised learning system, namely features from a pre-selected family and classifiers. In this work, the info-gap framework is extended to robust feature design and classifier selection for general time series classification through an efficient, interval arithmetic implementation of an info-gap data model. Experimental results are presented for a damage type classification problem on a ball bearing in a rotating machine. The info-gap framework in conjunction with an evolutionary feature design system allows for fully automated design of a time series classifier to meet performance requirements under maximum allowable uncertainty.
A Fast Reduced Kernel Extreme Learning Machine.
Deng, Wan-Yu; Ong, Yew-Soon; Zheng, Qing-Hua
2016-04-01
In this paper, we present a fast and accurate kernel-based supervised algorithm referred to as the Reduced Kernel Extreme Learning Machine (RKELM). In contrast to the work on Support Vector Machine (SVM) or Least Square SVM (LS-SVM), which identifies the support vectors or weight vectors iteratively, the proposed RKELM randomly selects a subset of the available data samples as support vectors (or mapping samples). By avoiding the iterative steps of SVM, significant cost savings in the training process can be readily attained, especially on Big datasets. RKELM is established based on the rigorous proof of universal learning involving reduced kernel-based SLFN. In particular, we prove that RKELM can approximate any nonlinear functions accurately under the condition of support vectors sufficiency. Experimental results on a wide variety of real world small instance size and large instance size applications in the context of binary classification, multi-class problem and regression are then reported to show that RKELM can perform at competitive level of generalized performance as the SVM/LS-SVM at only a fraction of the computational effort incurred. Copyright © 2015 Elsevier Ltd. All rights reserved.
Entity recognition in the biomedical domain using a hybrid approach.
Basaldella, Marco; Furrer, Lenz; Tasso, Carlo; Rinaldi, Fabio
2017-11-09
This article describes a high-recall, high-precision approach for the extraction of biomedical entities from scientific articles. The approach uses a two-stage pipeline, combining a dictionary-based entity recognizer with a machine-learning classifier. First, the OGER entity recognizer, which has a bias towards high recall, annotates the terms that appear in selected domain ontologies. Subsequently, the Distiller framework uses this information as a feature for a machine learning algorithm to select the relevant entities only. For this step, we compare two different supervised machine-learning algorithms: Conditional Random Fields and Neural Networks. In an in-domain evaluation using the CRAFT corpus, we test the performance of the combined systems when recognizing chemicals, cell types, cellular components, biological processes, molecular functions, organisms, proteins, and biological sequences. Our best system combines dictionary-based candidate generation with Neural-Network-based filtering. It achieves an overall precision of 86% at a recall of 60% on the named entity recognition task, and a precision of 51% at a recall of 49% on the concept recognition task. These results are to our knowledge the best reported so far in this particular task.
NASA Astrophysics Data System (ADS)
Zhang, Bin; Liu, Yueyan; Zhang, Zuyu; Shen, Yonglin
2017-10-01
A multifeature soft-probability cascading scheme to solve the problem of land use and land cover (LULC) classification using high-spatial-resolution images to map rural residential areas in China is proposed. The proposed method is used to build midlevel LULC features. Local features are frequently considered as low-level feature descriptors in a midlevel feature learning method. However, spectral and textural features, which are very effective low-level features, are neglected. The acquisition of the dictionary of sparse coding is unsupervised, and this phenomenon reduces the discriminative power of the midlevel feature. Thus, we propose to learn supervised features based on sparse coding, a support vector machine (SVM) classifier, and a conditional random field (CRF) model to utilize the different effective low-level features and improve the discriminability of midlevel feature descriptors. First, three kinds of typical low-level features, namely, dense scale-invariant feature transform, gray-level co-occurrence matrix, and spectral features, are extracted separately. Second, combined with sparse coding and the SVM classifier, the probabilities of the different LULC classes are inferred to build supervised feature descriptors. Finally, the CRF model, which consists of two parts: unary potential and pairwise potential, is employed to construct an LULC classification map. Experimental results show that the proposed classification scheme can achieve impressive performance when the total accuracy reached about 87%.
Vingron, Martin
2016-01-01
Non-methylated islands (NMIs) of DNA are genomic regions that are important for gene regulation and development. A recent study of genome-wide non-methylation data in vertebrates by Long et al. (eLife 2013;2:e00348) has shown that many experimentally identified non-methylated regions do not overlap with classically defined CpG islands which are computationally predicted using simple DNA sequence features. This is especially true in cold-blooded vertebrates such as Danio rerio (zebrafish). In order to investigate how predictive DNA sequence is of a region’s methylation status, we applied a supervised learning approach using a spectrum kernel support vector machine, to see if a more complex model and supervised learning can be used to improve non-methylated island prediction and to understand the sequence properties of these regions. We demonstrate that DNA sequence is highly predictive of methylation status, and that in contrast to existing CpG island prediction methods our method is able to provide more useful predictions of NMIs genome-wide in all vertebrate organisms that were studied. Our results also show that in cold-blooded vertebrates (Anolis carolinensis, Xenopus tropicalis and Danio rerio) where genome-wide classical CpG island predictions consist primarily of false positives, longer primarily AT-rich DNA sequence features are able to identify these regions much more accurately. PMID:27984582
A random forest learning assisted "divide and conquer" approach for peptide conformation search.
Chen, Xin; Yang, Bing; Lin, Zijing
2018-06-11
Computational determination of peptide conformations is challenging as it is a problem of finding minima in a high-dimensional space. The "divide and conquer" approach is promising for reliably reducing the search space size. A random forest learning model is proposed here to expand the scope of applicability of the "divide and conquer" approach. A random forest classification algorithm is used to characterize the distributions of the backbone φ-ψ units ("words"). A random forest supervised learning model is developed to analyze the combinations of the φ-ψ units ("grammar"). It is found that amino acid residues may be grouped as equivalent "words", while the φ-ψ combinations in low-energy peptide conformations follow a distinct "grammar". The finding of equivalent words empowers the "divide and conquer" method with the flexibility of fragment substitution. The learnt grammar is used to improve the efficiency of the "divide and conquer" method by removing unfavorable φ-ψ combinations without the need of dedicated human effort. The machine learning assisted search method is illustrated by efficiently searching the conformations of GGG/AAA/GGGG/AAAA/GGGGG through assembling the structures of GFG/GFGG. Moreover, the computational cost of the new method is shown to increase rather slowly with the peptide length.
NASA Astrophysics Data System (ADS)
van Hecke, Kevin; de Croon, Guido C. H. E.; Hennes, Daniel; Setterfield, Timothy P.; Saenz-Otero, Alvar; Izzo, Dario
2017-11-01
Although machine learning holds an enormous promise for autonomous space robots, it is currently not employed because of the inherent uncertain outcome of learning processes. In this article we investigate a learning mechanism, Self-Supervised Learning (SSL), which is very reliable and hence an important candidate for real-world deployment even on safety-critical systems such as space robots. To demonstrate this reliability, we introduce a novel SSL setup that allows a stereo vision equipped robot to cope with the failure of one of its cameras. The setup learns to estimate average depth using a monocular image, by using the stereo vision depths from the past as trusted ground truth. We present preliminary results from an experiment on the International Space Station (ISS) performed with the MIT/NASA SPHERES VERTIGO satellite. The presented experiments were performed on October 8th, 2015 on board the ISS. The main goals were (1) data gathering, and (2) navigation based on stereo vision. First the astronaut Kimiya Yui moved the satellite around the Japanese Experiment Module to gather stereo vision data for learning. Subsequently, the satellite freely explored the space in the module based on its (trusted) stereo vision system and a pre-programmed exploration behavior, while simultaneously performing the self-supervised learning of monocular depth estimation on board. The two main goals were successfully achieved, representing the first online learning robotic experiments in space. These results lay the groundwork for a follow-up experiment in which the satellite will use the learned single-camera depth estimation for autonomous exploration in the ISS, and are an advancement towards future space robots that continuously improve their navigation capabilities over time, even in harsh and completely unknown space environments.
A supervised learning rule for classification of spatiotemporal spike patterns.
Lilin Guo; Zhenzhong Wang; Adjouadi, Malek
2016-08-01
This study introduces a novel supervised algorithm for spiking neurons that take into consideration synapse delays and axonal delays associated with weights. It can be utilized for both classification and association and uses several biologically influenced properties, such as axonal and synaptic delays. This algorithm also takes into consideration spike-timing-dependent plasticity as in Remote Supervised Method (ReSuMe). This paper focuses on the classification aspect alone. Spiked neurons trained according to this proposed learning rule are capable of classifying different categories by the associated sequences of precisely timed spikes. Simulation results have shown that the proposed learning method greatly improves classification accuracy when compared to the Spike Pattern Association Neuron (SPAN) and the Tempotron learning rule.
Guo, Lilin; Wang, Zhenzhong; Cabrerizo, Mercedes; Adjouadi, Malek
2017-05-01
This study introduces a novel learning algorithm for spiking neurons, called CCDS, which is able to learn and reproduce arbitrary spike patterns in a supervised fashion allowing the processing of spatiotemporal information encoded in the precise timing of spikes. Unlike the Remote Supervised Method (ReSuMe), synapse delays and axonal delays in CCDS are variants which are modulated together with weights during learning. The CCDS rule is both biologically plausible and computationally efficient. The properties of this learning rule are investigated extensively through experimental evaluations in terms of reliability, adaptive learning performance, generality to different neuron models, learning in the presence of noise, effects of its learning parameters and classification performance. Results presented show that the CCDS learning method achieves learning accuracy and learning speed comparable with ReSuMe, but improves classification accuracy when compared to both the Spike Pattern Association Neuron (SPAN) learning rule and the Tempotron learning rule. The merit of CCDS rule is further validated on a practical example involving the automated detection of interictal spikes in EEG records of patients with epilepsy. Results again show that with proper encoding, the CCDS rule achieves good recognition performance.
Machine learning in heart failure: ready for prime time.
Awan, Saqib Ejaz; Sohel, Ferdous; Sanfilippo, Frank Mario; Bennamoun, Mohammed; Dwivedi, Girish
2018-03-01
The aim of this review is to present an up-to-date overview of the application of machine learning methods in heart failure including diagnosis, classification, readmissions and medication adherence. Recent studies have shown that the application of machine learning techniques may have the potential to improve heart failure outcomes and management, including cost savings by improving existing diagnostic and treatment support systems. Recently developed deep learning methods are expected to yield even better performance than traditional machine learning techniques in performing complex tasks by learning the intricate patterns hidden in big medical data. The review summarizes the recent developments in the application of machine and deep learning methods in heart failure management.
Process Recording in Supervision of Students Learning to Practice with Children
ERIC Educational Resources Information Center
Mullin, Walter J.; Canning, James J.
2007-01-01
This article addresses the use of process recordings in supervising social work students learning to practice with children. Although process recordings are a traditional method of teaching and learning social work practice, they have received little attention in the literature of social work practice and social work education. Process recordings…
Stone, James R; Wilde, Elisabeth A; Taylor, Brian A; Tate, David F; Levin, Harvey; Bigler, Erin D; Scheibel, Randall S; Newsome, Mary R; Mayer, Andrew R; Abildskov, Tracy; Black, Garrett M; Lennon, Michael J; York, Gerald E; Agarwal, Rajan; DeVillasante, Jorge; Ritter, John L; Walker, Peter B; Ahlers, Stephen T; Tustison, Nicholas J
2016-01-01
White matter hyperintensities (WMHs) are foci of abnormal signal intensity in white matter regions seen with magnetic resonance imaging (MRI). WMHs are associated with normal ageing and have shown prognostic value in neurological conditions such as traumatic brain injury (TBI). The impracticality of manually quantifying these lesions limits their clinical utility and motivates the utilization of machine learning techniques for automated segmentation workflows. This study develops a concatenated random forest framework with image features for segmenting WMHs in a TBI cohort. The framework is built upon the Advanced Normalization Tools (ANTs) and ANTsR toolkits. MR (3D FLAIR, T2- and T1-weighted) images from 24 service members and veterans scanned in the Chronic Effects of Neurotrauma Consortium's (CENC) observational study were acquired. Manual annotations were employed for both training and evaluation using a leave-one-out strategy. Performance measures include sensitivity, positive predictive value, [Formula: see text] score and relative volume difference. Final average results were: sensitivity = 0.68 ± 0.38, positive predictive value = 0.51 ± 0.40, [Formula: see text] = 0.52 ± 0.36, relative volume difference = 43 ± 26%. In addition, three lesion size ranges are selected to illustrate the variation in performance with lesion size. Paired with correlative outcome data, supervised learning methods may allow for identification of imaging features predictive of diagnosis and prognosis in individual TBI patients.
Automatic classification of time-variable X-ray sources
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lo, Kitty K.; Farrell, Sean; Murphy, Tara
2014-05-01
To maximize the discovery potential of future synoptic surveys, especially in the field of transient science, it will be necessary to use automatic classification to identify some of the astronomical sources. The data mining technique of supervised classification is suitable for this problem. Here, we present a supervised learning method to automatically classify variable X-ray sources in the Second XMM-Newton Serendipitous Source Catalog (2XMMi-DR2). Random Forest is our classifier of choice since it is one of the most accurate learning algorithms available. Our training set consists of 873 variable sources and their features are derived from time series, spectra, andmore » other multi-wavelength contextual information. The 10 fold cross validation accuracy of the training data is ∼97% on a 7 class data set. We applied the trained classification model to 411 unknown variable 2XMM sources to produce a probabilistically classified catalog. Using the classification margin and the Random Forest derived outlier measure, we identified 12 anomalous sources, of which 2XMM J180658.7–500250 appears to be the most unusual source in the sample. Its X-ray spectra is suggestive of a ultraluminous X-ray source but its variability makes it highly unusual. Machine-learned classification and anomaly detection will facilitate scientific discoveries in the era of all-sky surveys.« less
2017-01-01
Background Machine learning techniques may be an effective and efficient way to classify open-text reports on doctor’s activity for the purposes of quality assurance, safety, and continuing professional development. Objective The objective of the study was to evaluate the accuracy of machine learning algorithms trained to classify open-text reports of doctor performance and to assess the potential for classifications to identify significant differences in doctors’ professional performance in the United Kingdom. Methods We used 1636 open-text comments (34,283 words) relating to the performance of 548 doctors collected from a survey of clinicians’ colleagues using the General Medical Council Colleague Questionnaire (GMC-CQ). We coded 77.75% (1272/1636) of the comments into 5 global themes (innovation, interpersonal skills, popularity, professionalism, and respect) using a qualitative framework. We trained 8 machine learning algorithms to classify comments and assessed their performance using several training samples. We evaluated doctor performance using the GMC-CQ and compared scores between doctors with different classifications using t tests. Results Individual algorithm performance was high (range F score=.68 to .83). Interrater agreement between the algorithms and the human coder was highest for codes relating to “popular” (recall=.97), “innovator” (recall=.98), and “respected” (recall=.87) codes and was lower for the “interpersonal” (recall=.80) and “professional” (recall=.82) codes. A 10-fold cross-validation demonstrated similar performance in each analysis. When combined together into an ensemble of multiple algorithms, mean human-computer interrater agreement was .88. Comments that were classified as “respected,” “professional,” and “interpersonal” related to higher doctor scores on the GMC-CQ compared with comments that were not classified (P<.05). Scores did not vary between doctors who were rated as popular or innovative and those who were not rated at all (P>.05). Conclusions Machine learning algorithms can classify open-text feedback of doctor performance into multiple themes derived by human raters with high performance. Colleague open-text comments that signal respect, professionalism, and being interpersonal may be key indicators of doctor’s performance. PMID:28298265
Interpreting Medical Information Using Machine Learning and Individual Conditional Expectation.
Nohara, Yasunobu; Wakata, Yoshifumi; Nakashima, Naoki
2015-01-01
Recently, machine-learning techniques have spread many fields. However, machine-learning is still not popular in medical research field due to difficulty of interpreting. In this paper, we introduce a method of interpreting medical information using machine learning technique. The method gave new explanation of partial dependence plot and individual conditional expectation plot from medical research field.
Event Recognition Based on Deep Learning in Chinese Texts
Zhang, Yajun; Liu, Zongtian; Zhou, Wen
2016-01-01
Event recognition is the most fundamental and critical task in event-based natural language processing systems. Existing event recognition methods based on rules and shallow neural networks have certain limitations. For example, extracting features using methods based on rules is difficult; methods based on shallow neural networks converge too quickly to a local minimum, resulting in low recognition precision. To address these problems, we propose the Chinese emergency event recognition model based on deep learning (CEERM). Firstly, we use a word segmentation system to segment sentences. According to event elements labeled in the CEC 2.0 corpus, we classify words into five categories: trigger words, participants, objects, time and location. Each word is vectorized according to the following six feature layers: part of speech, dependency grammar, length, location, distance between trigger word and core word and trigger word frequency. We obtain deep semantic features of words by training a feature vector set using a deep belief network (DBN), then analyze those features in order to identify trigger words by means of a back propagation neural network. Extensive testing shows that the CEERM achieves excellent recognition performance, with a maximum F-measure value of 85.17%. Moreover, we propose the dynamic-supervised DBN, which adds supervised fine-tuning to a restricted Boltzmann machine layer by monitoring its training performance. Test analysis reveals that the new DBN improves recognition performance and effectively controls the training time. Although the F-measure increases to 88.11%, the training time increases by only 25.35%. PMID:27501231
Event Recognition Based on Deep Learning in Chinese Texts.
Zhang, Yajun; Liu, Zongtian; Zhou, Wen
2016-01-01
Event recognition is the most fundamental and critical task in event-based natural language processing systems. Existing event recognition methods based on rules and shallow neural networks have certain limitations. For example, extracting features using methods based on rules is difficult; methods based on shallow neural networks converge too quickly to a local minimum, resulting in low recognition precision. To address these problems, we propose the Chinese emergency event recognition model based on deep learning (CEERM). Firstly, we use a word segmentation system to segment sentences. According to event elements labeled in the CEC 2.0 corpus, we classify words into five categories: trigger words, participants, objects, time and location. Each word is vectorized according to the following six feature layers: part of speech, dependency grammar, length, location, distance between trigger word and core word and trigger word frequency. We obtain deep semantic features of words by training a feature vector set using a deep belief network (DBN), then analyze those features in order to identify trigger words by means of a back propagation neural network. Extensive testing shows that the CEERM achieves excellent recognition performance, with a maximum F-measure value of 85.17%. Moreover, we propose the dynamic-supervised DBN, which adds supervised fine-tuning to a restricted Boltzmann machine layer by monitoring its training performance. Test analysis reveals that the new DBN improves recognition performance and effectively controls the training time. Although the F-measure increases to 88.11%, the training time increases by only 25.35%.
Web Mining: Machine Learning for Web Applications.
ERIC Educational Resources Information Center
Chen, Hsinchun; Chau, Michael
2004-01-01
Presents an overview of machine learning research and reviews methods used for evaluating machine learning systems. Ways that machine-learning algorithms were used in traditional information retrieval systems in the "pre-Web" era are described, and the field of Web mining and how machine learning has been used in different Web mining…
Automated source classification of new transient sources
NASA Astrophysics Data System (ADS)
Oertel, M.; Kreikenbohm, A.; Wilms, J.; DeLuca, A.
2017-10-01
The EXTraS project harvests the hitherto unexplored temporal domain information buried in the serendipitous data collected by the European Photon Imaging Camera (EPIC) onboard the ESA XMM-Newton mission since its launch. This includes a search for fast transients, missed by standard image analysis, and a search and characterization of variability in hundreds of thousands of sources. We present an automated classification scheme for new transient sources in the EXTraS project. The method is as follows: source classification features of a training sample are used to train machine learning algorithms (performed in R; randomForest (Breiman, 2001) in supervised mode) which are then tested on a sample of known source classes and used for classification.
NASA Astrophysics Data System (ADS)
Lu, Xinguo; Chen, Dan
2017-08-01
Traditional supervised classifiers neglect a large amount of data which not have sufficient follow-up information, only work with labeled data. Consequently, the small sample size limits the advancement of design appropriate classifier. In this paper, a transductive learning method which combined with the filtering strategy in transductive framework and progressive labeling strategy is addressed. The progressive labeling strategy does not need to consider the distribution of labeled samples to evaluate the distribution of unlabeled samples, can effective solve the problem of evaluate the proportion of positive and negative samples in work set. Our experiment result demonstrate that the proposed technique have great potential in cancer prediction based on gene expression.
On-line Machine Learning and Event Detection in Petascale Data Streams
NASA Astrophysics Data System (ADS)
Thompson, David R.; Wagstaff, K. L.
2012-01-01
Traditional statistical data mining involves off-line analysis in which all data are available and equally accessible. However, petascale datasets have challenged this premise since it is often impossible to store, let alone analyze, the relevant observations. This has led the machine learning community to investigate adaptive processing chains where data mining is a continuous process. Here pattern recognition permits triage and followup decisions at multiple stages of a processing pipeline. Such techniques can also benefit new astronomical instruments such as the Large Synoptic Survey Telescope (LSST) and Square Kilometre Array (SKA) that will generate petascale data volumes. We summarize some machine learning perspectives on real time data mining, with representative cases of astronomical applications and event detection in high volume datastreams. The first is a "supervised classification" approach currently used for transient event detection at the Very Long Baseline Array (VLBA). It injects known signals of interest - faint single-pulse anomalies - and tunes system parameters to recover these events. This permits meaningful event detection for diverse instrument configurations and observing conditions whose noise cannot be well-characterized in advance. Second, "semi-supervised novelty detection" finds novel events based on statistical deviations from previous patterns. It detects outlier signals of interest while considering known examples of false alarm interference. Applied to data from the Parkes pulsar survey, the approach identifies anomalous "peryton" phenomena that do not match previous event models. Finally, we consider online light curve classification that can trigger adaptive followup measurements of candidate events. Classifier performance analyses suggest optimal survey strategies, and permit principled followup decisions from incomplete data. These examples trace a broad range of algorithm possibilities available for online astronomical data mining. This talk describes research performed at the Jet Propulsion Laboratory, California Institute of Technology. Copyright 2012, All Rights Reserved. U.S. Government support acknowledged.
Enhanced Data Representation by Kernel Metric Learning for Dementia Diagnosis
Cárdenas-Peña, David; Collazos-Huertas, Diego; Castellanos-Dominguez, German
2017-01-01
Alzheimer's disease (AD) is the kind of dementia that affects the most people around the world. Therefore, an early identification supporting effective treatments is required to increase the life quality of a wide number of patients. Recently, computer-aided diagnosis tools for dementia using Magnetic Resonance Imaging scans have been successfully proposed to discriminate between patients with AD, mild cognitive impairment, and healthy controls. Most of the attention has been given to the clinical data, provided by initiatives as the ADNI, supporting reliable researches on intervention, prevention, and treatments of AD. Therefore, there is a need for improving the performance of classification machines. In this paper, we propose a kernel framework for learning metrics that enhances conventional machines and supports the diagnosis of dementia. Our framework aims at building discriminative spaces through the maximization of center kernel alignment function, aiming at improving the discrimination of the three considered neurological classes. The proposed metric learning performance is evaluated on the widely-known ADNI database using three supervised classification machines (k-nn, SVM and NNs) for multi-class and bi-class scenarios from structural MRIs. Specifically, from ADNI collection 286 AD patients, 379 MCI patients and 231 healthy controls are used for development and validation of our proposed metric learning framework. For the experimental validation, we split the data into two subsets: 30% of subjects used like a blindfolded assessment and 70% employed for parameter tuning. Then, in the preprocessing stage, each structural MRI scan a total of 310 morphological measurements are automatically extracted from by FreeSurfer software package and concatenated to build an input feature matrix. Obtained test performance results, show that including a supervised metric learning improves the compared baseline classifiers in both scenarios. In the multi-class scenario, we achieve the best performance (accuracy 60.1%) for pretrained 1-layered NN, and we obtain measures over 90% in the average for HC vs. AD task. From the machine learning point of view, our proposal enhances the classifier performance by building spaces with a better class separability. From the clinical application, our enhancement results in a more balanced performance in each class than the compared approaches from the CADDementia challenge by increasing the sensitivity of pathological groups and the specificity of healthy controls. PMID:28798659
Liang, Yong; Chai, Hua; Liu, Xiao-Ying; Xu, Zong-Ben; Zhang, Hai; Leung, Kwong-Sak
2016-03-01
One of the most important objectives of the clinical cancer research is to diagnose cancer more accurately based on the patients' gene expression profiles. Both Cox proportional hazards model (Cox) and accelerated failure time model (AFT) have been widely adopted to the high risk and low risk classification or survival time prediction for the patients' clinical treatment. Nevertheless, two main dilemmas limit the accuracy of these prediction methods. One is that the small sample size and censored data remain a bottleneck for training robust and accurate Cox classification model. In addition to that, similar phenotype tumours and prognoses are actually completely different diseases at the genotype and molecular level. Thus, the utility of the AFT model for the survival time prediction is limited when such biological differences of the diseases have not been previously identified. To try to overcome these two main dilemmas, we proposed a novel semi-supervised learning method based on the Cox and AFT models to accurately predict the treatment risk and the survival time of the patients. Moreover, we adopted the efficient L1/2 regularization approach in the semi-supervised learning method to select the relevant genes, which are significantly associated with the disease. The results of the simulation experiments show that the semi-supervised learning model can significant improve the predictive performance of Cox and AFT models in survival analysis. The proposed procedures have been successfully applied to four real microarray gene expression and artificial evaluation datasets. The advantages of our proposed semi-supervised learning method include: 1) significantly increase the available training samples from censored data; 2) high capability for identifying the survival risk classes of patient in Cox model; 3) high predictive accuracy for patients' survival time in AFT model; 4) strong capability of the relevant biomarker selection. Consequently, our proposed semi-supervised learning model is one more appropriate tool for survival analysis in clinical cancer research.
Multi-Stage Convex Relaxation Methods for Machine Learning
2013-03-01
Many problems in machine learning can be naturally formulated as non-convex optimization problems. However, such direct nonconvex formulations have...original nonconvex formulation. We will develop theoretical properties of this method and algorithmic consequences. Related convex and nonconvex machine learning methods will also be investigated.
Automated classification of cell morphology by coherence-controlled holographic microscopy
NASA Astrophysics Data System (ADS)
Strbkova, Lenka; Zicha, Daniel; Vesely, Pavel; Chmelik, Radim
2017-08-01
In the last few years, classification of cells by machine learning has become frequently used in biology. However, most of the approaches are based on morphometric (MO) features, which are not quantitative in terms of cell mass. This may result in poor classification accuracy. Here, we study the potential contribution of coherence-controlled holographic microscopy enabling quantitative phase imaging for the classification of cell morphologies. We compare our approach with the commonly used method based on MO features. We tested both classification approaches in an experiment with nutritionally deprived cancer tissue cells, while employing several supervised machine learning algorithms. Most of the classifiers provided higher performance when quantitative phase features were employed. Based on the results, it can be concluded that the quantitative phase features played an important role in improving the performance of the classification. The methodology could be valuable help in refining the monitoring of live cells in an automated fashion. We believe that coherence-controlled holographic microscopy, as a tool for quantitative phase imaging, offers all preconditions for the accurate automated analysis of live cell behavior while enabling noninvasive label-free imaging with sufficient contrast and high-spatiotemporal phase sensitivity.
Automated classification of cell morphology by coherence-controlled holographic microscopy.
Strbkova, Lenka; Zicha, Daniel; Vesely, Pavel; Chmelik, Radim
2017-08-01
In the last few years, classification of cells by machine learning has become frequently used in biology. However, most of the approaches are based on morphometric (MO) features, which are not quantitative in terms of cell mass. This may result in poor classification accuracy. Here, we study the potential contribution of coherence-controlled holographic microscopy enabling quantitative phase imaging for the classification of cell morphologies. We compare our approach with the commonly used method based on MO features. We tested both classification approaches in an experiment with nutritionally deprived cancer tissue cells, while employing several supervised machine learning algorithms. Most of the classifiers provided higher performance when quantitative phase features were employed. Based on the results, it can be concluded that the quantitative phase features played an important role in improving the performance of the classification. The methodology could be valuable help in refining the monitoring of live cells in an automated fashion. We believe that coherence-controlled holographic microscopy, as a tool for quantitative phase imaging, offers all preconditions for the accurate automated analysis of live cell behavior while enabling noninvasive label-free imaging with sufficient contrast and high-spatiotemporal phase sensitivity. (2017) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
The operation of large computer-controlled manufacturing systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Upton, D.M.
1988-01-01
This work examines methods for operation of large computer-controlled manufacturing systems, with more than 50 or so disparate CNC machines in congregation. The central theme is the development of a distributed control system, which requires minimal central supervision, and allows manufacturing system re-configuration without extensive control software re-writes. Provision is made for machines to learn from their experience and provide estimates of the time necessary to effect various tasks. Routing is opportunistic, with varying degrees of myopia depending on the prevailing situation. Necessary curtailments of opportunism are built in to the system, in order to provide a society of machinesmore » that operate in unison rather than in chaos. Negotiation and contention resolution are carried out using a UHF radio communications network, along with processing capability on both pallets and tools. Graceful and robust error recovery is facilitated by ensuring adequate pessimistic consideration of failure modes at each stage in the scheme. Theoretical models are developed and an examination is made of fundamental characteristics of auction-based scheduling methods.« less
Using Support Vector Machine Ensembles for Target Audience Classification on Twitter
Lo, Siaw Ling; Chiong, Raymond; Cornforth, David
2015-01-01
The vast amount and diversity of the content shared on social media can pose a challenge for any business wanting to use it to identify potential customers. In this paper, our aim is to investigate the use of both unsupervised and supervised learning methods for target audience classification on Twitter with minimal annotation efforts. Topic domains were automatically discovered from contents shared by followers of an account owner using Twitter Latent Dirichlet Allocation (LDA). A Support Vector Machine (SVM) ensemble was then trained using contents from different account owners of the various topic domains identified by Twitter LDA. Experimental results show that the methods presented are able to successfully identify a target audience with high accuracy. In addition, we show that using a statistical inference approach such as bootstrapping in over-sampling, instead of using random sampling, to construct training datasets can achieve a better classifier in an SVM ensemble. We conclude that such an ensemble system can take advantage of data diversity, which enables real-world applications for differentiating prospective customers from the general audience, leading to business advantage in the crowded social media space. PMID:25874768
Using support vector machine ensembles for target audience classification on Twitter.
Lo, Siaw Ling; Chiong, Raymond; Cornforth, David
2015-01-01
The vast amount and diversity of the content shared on social media can pose a challenge for any business wanting to use it to identify potential customers. In this paper, our aim is to investigate the use of both unsupervised and supervised learning methods for target audience classification on Twitter with minimal annotation efforts. Topic domains were automatically discovered from contents shared by followers of an account owner using Twitter Latent Dirichlet Allocation (LDA). A Support Vector Machine (SVM) ensemble was then trained using contents from different account owners of the various topic domains identified by Twitter LDA. Experimental results show that the methods presented are able to successfully identify a target audience with high accuracy. In addition, we show that using a statistical inference approach such as bootstrapping in over-sampling, instead of using random sampling, to construct training datasets can achieve a better classifier in an SVM ensemble. We conclude that such an ensemble system can take advantage of data diversity, which enables real-world applications for differentiating prospective customers from the general audience, leading to business advantage in the crowded social media space.
Ma, Xiao H; Jia, Jia; Zhu, Feng; Xue, Ying; Li, Ze R; Chen, Yu Z
2009-05-01
Machine learning methods have been explored as ligand-based virtual screening tools for facilitating drug lead discovery. These methods predict compounds of specific pharmacodynamic, pharmacokinetic or toxicological properties based on their structure-derived structural and physicochemical properties. Increasing attention has been directed at these methods because of their capability in predicting compounds of diverse structures and complex structure-activity relationships without requiring the knowledge of target 3D structure. This article reviews current progresses in using machine learning methods for virtual screening of pharmacodynamically active compounds from large compound libraries, and analyzes and compares the reported performances of machine learning tools with those of structure-based and other ligand-based (such as pharmacophore and clustering) virtual screening methods. The feasibility to improve the performance of machine learning methods in screening large libraries is discussed.
Prediction and Validation of Disease Genes Using HeteSim Scores.
Zeng, Xiangxiang; Liao, Yuanlu; Liu, Yuansheng; Zou, Quan
2017-01-01
Deciphering the gene disease association is an important goal in biomedical research. In this paper, we use a novel relevance measure, called HeteSim, to prioritize candidate disease genes. Two methods based on heterogeneous networks constructed using protein-protein interaction, gene-phenotype associations, and phenotype-phenotype similarity, are presented. In HeteSim_MultiPath (HSMP), HeteSim scores of different paths are combined with a constant that dampens the contributions of longer paths. In HeteSim_SVM (HSSVM), HeteSim scores are combined with a machine learning method. The 3-fold experiments show that our non-machine learning method HSMP performs better than the existing non-machine learning methods, our machine learning method HSSVM obtains similar accuracy with the best existing machine learning method CATAPULT. From the analysis of the top 10 predicted genes for different diseases, we found that HSSVM avoid the disadvantage of the existing machine learning based methods, which always predict similar genes for different diseases. The data sets and Matlab code for the two methods are freely available for download at http://lab.malab.cn/data/HeteSim/index.jsp.
NASA Astrophysics Data System (ADS)
D'Amore, M.; Le Scaon, R.; Helbert, J.; Maturilli, A.
2017-12-01
Machine-learning achieved unprecedented results in high-dimensional data processing tasks with wide applications in various fields. Due to the growing number of complex nonlinear systems that have to be investigated in science and the bare raw size of data nowadays available, ML offers the unique ability to extract knowledge, regardless the specific application field. Examples are image segmentation, supervised/unsupervised/ semi-supervised classification, feature extraction, data dimensionality analysis/reduction.The MASCS instrument has mapped Mercury surface in the 400-1145 nm wavelength range during orbital observations by the MESSENGER spacecraft. We have conducted k-means unsupervised hierarchical clustering to identify and characterize spectral units from MASCS observations. The results display a dichotomy: a polar and equatorial units, possibly linked to compositional differences or weathering due to irradiation. To explore possible relations between composition and spectral behavior, we have compared the spectral provinces with elemental abundance maps derived from MESSENGER's X-Ray Spectrometer (XRS).For the Vesta application on DAWN Visible and infrared spectrometer (VIR) data, we explored several Machine Learning techniques: image segmentation method, stream algorithm and hierarchical clustering.The algorithm successfully separates the Olivine outcrops around two craters on Vesta's surface [1]. New maps summarizing the spectral and chemical signature of the surface could be automatically produced.We conclude that instead of hand digging in data, scientist could choose a subset of algorithms with well known feature (i.e. efficacy on the particular problem, speed, accuracy) and focus their effort in understanding what important characteristic of the groups found in the data mean. [1] E Ammannito et al. "Olivine in an unexpected location on Vesta's surface". In: Nature 504.7478 (2013), pp. 122-125.
Morabito, Francesco Carlo; Campolo, Maurizio; Mammone, Nadia; Versaci, Mario; Franceschetti, Silvana; Tagliavini, Fabrizio; Sofia, Vito; Fatuzzo, Daniela; Gambardella, Antonio; Labate, Angelo; Mumoli, Laura; Tripodi, Giovanbattista Gaspare; Gasparini, Sara; Cianci, Vittoria; Sueri, Chiara; Ferlazzo, Edoardo; Aguglia, Umberto
2017-03-01
A novel technique of quantitative EEG for differentiating patients with early-stage Creutzfeldt-Jakob disease (CJD) from other forms of rapidly progressive dementia (RPD) is proposed. The discrimination is based on the extraction of suitable features from the time-frequency representation of the EEG signals through continuous wavelet transform (CWT). An average measure of complexity of the EEG signal obtained by permutation entropy (PE) is also included. The dimensionality of the feature space is reduced through a multilayer processing system based on the recently emerged deep learning (DL) concept. The DL processor includes a stacked auto-encoder, trained by unsupervised learning techniques, and a classifier whose parameters are determined in a supervised way by associating the known category labels to the reduced vector of high-level features generated by the previous processing blocks. The supervised learning step is carried out by using either support vector machines (SVM) or multilayer neural networks (MLP-NN). A subset of EEG from patients suffering from Alzheimer's Disease (AD) and healthy controls (HC) is considered for differentiating CJD patients. When fine-tuning the parameters of the global processing system by a supervised learning procedure, the proposed system is able to achieve an average accuracy of 89%, an average sensitivity of 92%, and an average specificity of 89% in differentiating CJD from RPD. Similar results are obtained for CJD versus AD and CJD versus HC.
Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction mention extraction.
Gupta, Shashank; Pawar, Sachin; Ramrakhiyani, Nitin; Palshikar, Girish Keshav; Varma, Vasudeva
2018-06-13
Social media is a useful platform to share health-related information due to its vast reach. This makes it a good candidate for public-health monitoring tasks, specifically for pharmacovigilance. We study the problem of extraction of Adverse-Drug-Reaction (ADR) mentions from social media, particularly from Twitter. Medical information extraction from social media is challenging, mainly due to short and highly informal nature of text, as compared to more technical and formal medical reports. Current methods in ADR mention extraction rely on supervised learning methods, which suffer from labeled data scarcity problem. The state-of-the-art method uses deep neural networks, specifically a class of Recurrent Neural Network (RNN) which is Long-Short-Term-Memory network (LSTM). Deep neural networks, due to their large number of free parameters rely heavily on large annotated corpora for learning the end task. But in the real-world, it is hard to get large labeled data, mainly due to the heavy cost associated with the manual annotation. To this end, we propose a novel semi-supervised learning based RNN model, which can leverage unlabeled data also present in abundance on social media. Through experiments we demonstrate the effectiveness of our method, achieving state-of-the-art performance in ADR mention extraction. In this study, we tackle the problem of labeled data scarcity for Adverse Drug Reaction mention extraction from social media and propose a novel semi-supervised learning based method which can leverage large unlabeled corpus available in abundance on the web. Through empirical study, we demonstrate that our proposed method outperforms fully supervised learning based baseline which relies on large manually annotated corpus for a good performance.
OpenCL based machine learning labeling of biomedical datasets
NASA Astrophysics Data System (ADS)
Amoros, Oscar; Escalera, Sergio; Puig, Anna
2011-03-01
In this paper, we propose a two-stage labeling method of large biomedical datasets through a parallel approach in a single GPU. Diagnostic methods, structures volume measurements, and visualization systems are of major importance for surgery planning, intra-operative imaging and image-guided surgery. In all cases, to provide an automatic and interactive method to label or to tag different structures contained into input data becomes imperative. Several approaches to label or segment biomedical datasets has been proposed to discriminate different anatomical structures in an output tagged dataset. Among existing methods, supervised learning methods for segmentation have been devised to easily analyze biomedical datasets by a non-expert user. However, they still have some problems concerning practical application, such as slow learning and testing speeds. In addition, recent technological developments have led to widespread availability of multi-core CPUs and GPUs, as well as new software languages, such as NVIDIA's CUDA and OpenCL, allowing to apply parallel programming paradigms in conventional personal computers. Adaboost classifier is one of the most widely applied methods for labeling in the Machine Learning community. In a first stage, Adaboost trains a binary classifier from a set of pre-labeled samples described by a set of features. This binary classifier is defined as a weighted combination of weak classifiers. Each weak classifier is a simple decision function estimated on a single feature value. Then, at the testing stage, each weak classifier is independently applied on the features of a set of unlabeled samples. In this work, we propose an alternative representation of the Adaboost binary classifier. We use this proposed representation to define a new GPU-based parallelized Adaboost testing stage using OpenCL. We provide numerical experiments based on large available data sets and we compare our results to CPU-based strategies in terms of time and labeling speeds.
Nagasawa, Shinji; Al-Naamani, Eman; Saeki, Akinori
2018-05-17
Owing to the diverse chemical structures, organic photovoltaic (OPV) applications with a bulk heterojunction framework have greatly evolved over the last two decades, which has produced numerous organic semiconductors exhibiting improved power conversion efficiencies (PCEs). Despite the recent fast progress in materials informatics and data science, data-driven molecular design of OPV materials remains challenging. We report a screening of conjugated molecules for polymer-fullerene OPV applications by supervised learning methods (artificial neural network (ANN) and random forest (RF)). Approximately 1000 experimental parameters including PCE, molecular weight, and electronic properties are manually collected from the literature and subjected to machine learning with digitized chemical structures. Contrary to the low correlation coefficient in ANN, RF yields an acceptable accuracy, which is twice that of random classification. We demonstrate the application of RF screening for the design, synthesis, and characterization of a conjugated polymer, which facilitates a rapid development of optoelectronic materials.
Quantitative consensus of supervised learners for diffuse lung parenchymal HRCT patterns
NASA Astrophysics Data System (ADS)
Raghunath, Sushravya; Rajagopalan, Srinivasan; Karwoski, Ronald A.; Bartholmai, Brian J.; Robb, Richard A.
2013-03-01
Automated lung parenchymal classification usually relies on supervised learning of expert chosen regions representative of the visually differentiable HRCT patterns specific to different pathologies (eg. emphysema, ground glass, honey combing, reticular and normal). Considering the elusiveness of a single most discriminating similarity measure, a plurality of weak learners can be combined to improve the machine learnability. Though a number of quantitative combination strategies exist, their efficacy is data and domain dependent. In this paper, we investigate multiple (N=12) quantitative consensus approaches to combine the clusters obtained with multiple (n=33) probability density-based similarity measures. Our study shows that hypergraph based meta-clustering and probabilistic clustering provides optimal expert-metric agreement.
Wu, Mon-Ju; Mwangi, Benson; Bauer, Isabelle E; Passos, Ives C; Sanches, Marsal; Zunta-Soares, Giovana B; Meyer, Thomas D; Hasan, Khader M; Soares, Jair C
2017-01-15
Diagnosis, clinical management and research of psychiatric disorders remain subjective - largely guided by historically developed categories which may not effectively capture underlying pathophysiological mechanisms of dysfunction. Here, we report a novel approach of identifying and validating distinct and biologically meaningful clinical phenotypes of bipolar disorders using both unsupervised and supervised machine learning techniques. First, neurocognitive data were analyzed using an unsupervised machine learning approach and two distinct clinical phenotypes identified namely; phenotype I and phenotype II. Second, diffusion weighted imaging scans were pre-processed using the tract-based spatial statistics (TBSS) method and 'skeletonized' white matter fractional anisotropy (FA) and mean diffusivity (MD) maps extracted. The 'skeletonized' white matter FA and MD maps were entered into the Elastic Net machine learning algorithm to distinguish individual subjects' phenotypic labels (e.g. phenotype I vs. phenotype II). This calculation was performed to ascertain whether the identified clinical phenotypes were biologically distinct. Original neurocognitive measurements distinguished individual subjects' phenotypic labels with 94% accuracy (sensitivity=92%, specificity=97%). TBSS derived FA and MD measurements predicted individual subjects' phenotypic labels with 76% and 65% accuracy respectively. In addition, individual subjects belonging to phenotypes I and II were distinguished from healthy controls with 57% and 92% accuracy respectively. Neurocognitive task variables identified as most relevant in distinguishing phenotypic labels included; Affective Go/No-Go (AGN), Cambridge Gambling Task (CGT) coupled with inferior fronto-occipital fasciculus and callosal white matter pathways. These results suggest that there may exist two biologically distinct clinical phenotypes in bipolar disorders which can be identified from healthy controls with high accuracy and at an individual subject level. We suggest a strong clinical utility of the proposed approach in defining and validating biologically meaningful and less heterogeneous clinical sub-phenotypes of major psychiatric disorders. Copyright © 2016 Elsevier Inc. All rights reserved.
A blended supervision model in Australian general practice training.
Ingham, Gerard; Fry, Jennifer
2016-05-01
The Royal Australian College of General Practitioners' Standards for general practice training allow different models of registrar supervision, provided these models achieve the outcomes of facilitating registrars' learning and ensuring patient safety. In this article, we describe a model of supervision called 'blended supervision', and its initial implementation and evaluation. The blended supervision model integrates offsite supervision with available local supervision resources. It is a pragmatic alternative to traditional supervision. Further evaluation of the cost-effectiveness, safety and effectiveness of this model is required, as is the recruitment and training of remote supervisors. A framework of questions was developed to outline the training practice's supervision methods and explain how blended supervision is achieving supervision and teaching outcomes. The supervision and teaching framework can be used to understand the supervision methods of all practices, not just practices using blended supervision.
Task-specific image partitioning.
Kim, Sungwoong; Nowozin, Sebastian; Kohli, Pushmeet; Yoo, Chang D
2013-02-01
Image partitioning is an important preprocessing step for many of the state-of-the-art algorithms used for performing high-level computer vision tasks. Typically, partitioning is conducted without regard to the task in hand. We propose a task-specific image partitioning framework to produce a region-based image representation that will lead to a higher task performance than that reached using any task-oblivious partitioning framework and existing supervised partitioning framework, albeit few in number. The proposed method partitions the image by means of correlation clustering, maximizing a linear discriminant function defined over a superpixel graph. The parameters of the discriminant function that define task-specific similarity/dissimilarity among superpixels are estimated based on structured support vector machine (S-SVM) using task-specific training data. The S-SVM learning leads to a better generalization ability while the construction of the superpixel graph used to define the discriminant function allows a rich set of features to be incorporated to improve discriminability and robustness. We evaluate the learned task-aware partitioning algorithms on three benchmark datasets. Results show that task-aware partitioning leads to better labeling performance than the partitioning computed by the state-of-the-art general-purpose and supervised partitioning algorithms. We believe that the task-specific image partitioning paradigm is widely applicable to improving performance in high-level image understanding tasks.
MutSα's Multi-Domain Allosteric Response to Three DNA Damage Types Revealed by Machine Learning
NASA Astrophysics Data System (ADS)
Melvin, Ryan L.; Thompson, William G.; Godwin, Ryan C.; Gmeiner, William H.; Salsbury, Freddie R.
2017-03-01
MutSalpha is a key component in the mismatch repair (MMR) pathway. This protein is responsible for initiating the signaling pathways for DNA repair or cell death. Herein we investigate this heterodimer’s post-recognition, post-binding response to three types of DNA damage involving cytotoxic, anti-cancer agents - carboplatin, cisplatin, and FdU. Through a combination of supervised and unsupervised machine learning techniques along with more traditional structural and kinetic analysis applied to all-atom molecular dynamics (MD) calculations, we predict that MutSalpha has a distinct response to each of the three damage types. Via a binary classification tree (a supervised machine learning technique), we identify key hydrogen bond motifs unique to each type of damage and suggest residues for experimental mutation studies. Through a combination of a recently developed clustering (unsupervised learning) algorithm, RMSF calculations, PCA, and correlated motions we predict that each type of damage causes MutS↵to explore a specific region of conformation space. Detailed analysis suggests a short range effect for carboplatin - primarily altering the structures and kinetics of residues within 10 angstroms of the damaged DNA - and distinct longer-range effects for cisplatin and FdU. In our simulations, we also observe that a key phenylalanine residue - known to stack with a mismatched or unmatched bases in MMR - stacks with the base complementary to the damaged base in 88.61% of MD frames containing carboplatinated DNA. Similarly, this Phe71 stacks with the base complementary to damage in 91.73% of frames with cisplatinated DNA. This residue, however, stacks with the damaged base itself in 62.18% of trajectory frames with FdU-substituted DNA and has no stacking interaction at all in 30.72% of these frames. Each drug investigated here induces a unique perturbation in the MutS↵complex, indicating the possibility of a distinct signaling event and specific repair or death pathway (or set of pathways) for a given type of damage.
A survey on object detection in optical remote sensing images
NASA Astrophysics Data System (ADS)
Cheng, Gong; Han, Junwei
2016-07-01
Object detection in optical remote sensing images, being a fundamental but challenging problem in the field of aerial and satellite image analysis, plays an important role for a wide range of applications and is receiving significant attention in recent years. While enormous methods exist, a deep review of the literature concerning generic object detection is still lacking. This paper aims to provide a review of the recent progress in this field. Different from several previously published surveys that focus on a specific object class such as building and road, we concentrate on more generic object categories including, but are not limited to, road, building, tree, vehicle, ship, airport, urban-area. Covering about 270 publications we survey (1) template matching-based object detection methods, (2) knowledge-based object detection methods, (3) object-based image analysis (OBIA)-based object detection methods, (4) machine learning-based object detection methods, and (5) five publicly available datasets and three standard evaluation metrics. We also discuss the challenges of current studies and propose two promising research directions, namely deep learning-based feature representation and weakly supervised learning-based geospatial object detection. It is our hope that this survey will be beneficial for the researchers to have better understanding of this research field.
Guided Text Search Using Adaptive Visual Analytics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Steed, Chad A; Symons, Christopher T; Senter, James K
This research demonstrates the promise of augmenting interactive visualizations with semi- supervised machine learning techniques to improve the discovery of significant associations and insights in the search and analysis of textual information. More specifically, we have developed a system called Gryffin that hosts a unique collection of techniques that facilitate individualized investigative search pertaining to an ever-changing set of analytical questions over an indexed collection of open-source documents related to critical national infrastructure. The Gryffin client hosts dynamic displays of the search results via focus+context record listings, temporal timelines, term-frequency views, and multiple coordinate views. Furthermore, as the analyst interactsmore » with the display, the interactions are recorded and used to label the search records. These labeled records are then used to drive semi-supervised machine learning algorithms that re-rank the unlabeled search records such that potentially relevant records are moved to the top of the record listing. Gryffin is described in the context of the daily tasks encountered at the US Department of Homeland Security s Fusion Center, with whom we are collaborating in its development. The resulting system is capable of addressing the analysts information overload that can be directly attributed to the deluge of information that must be addressed in the search and investigative analysis of textual information.« less
Topic detection using paragraph vectors to support active learning in systematic reviews.
Hashimoto, Kazuma; Kontonatsios, Georgios; Miwa, Makoto; Ananiadou, Sophia
2016-08-01
Systematic reviews require expert reviewers to manually screen thousands of citations in order to identify all relevant articles to the review. Active learning text classification is a supervised machine learning approach that has been shown to significantly reduce the manual annotation workload by semi-automating the citation screening process of systematic reviews. In this paper, we present a new topic detection method that induces an informative representation of studies, to improve the performance of the underlying active learner. Our proposed topic detection method uses a neural network-based vector space model to capture semantic similarities between documents. We firstly represent documents within the vector space, and cluster the documents into a predefined number of clusters. The centroids of the clusters are treated as latent topics. We then represent each document as a mixture of latent topics. For evaluation purposes, we employ the active learning strategy using both our novel topic detection method and a baseline topic model (i.e., Latent Dirichlet Allocation). Results obtained demonstrate that our method is able to achieve a high sensitivity of eligible studies and a significantly reduced manual annotation cost when compared to the baseline method. This observation is consistent across two clinical and three public health reviews. The tool introduced in this work is available from https://nactem.ac.uk/pvtopic/. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
Taxonomy-aware feature engineering for microbiome classification.
Oudah, Mai; Henschel, Andreas
2018-06-15
What is a healthy microbiome? The pursuit of this and many related questions, especially in light of the recently recognized microbial component in a wide range of diseases has sparked a surge in metagenomic studies. They are often not simply attributable to a single pathogen but rather are the result of complex ecological processes. Relatedly, the increasing DNA sequencing depth and number of samples in metagenomic case-control studies enabled the applicability of powerful statistical methods, e.g. Machine Learning approaches. For the latter, the feature space is typically shaped by the relative abundances of operational taxonomic units, as determined by cost-effective phylogenetic marker gene profiles. While a substantial body of microbiome/microbiota research involves unsupervised and supervised Machine Learning, very little attention has been put on feature selection and engineering. We here propose the first algorithm to exploit phylogenetic hierarchy (i.e. an all-encompassing taxonomy) in feature engineering for microbiota classification. The rationale is to exploit the often mono- or oligophyletic distribution of relevant (but hidden) traits by virtue of taxonomic abstraction. The algorithm is embedded in a comprehensive microbiota classification pipeline, which we applied to a diverse range of datasets, distinguishing healthy from diseased microbiota samples. We demonstrate substantial improvements over the state-of-the-art microbiota classification tools in terms of classification accuracy, regardless of the actual Machine Learning technique while using drastically reduced feature spaces. Moreover, generalized features bear great explanatory value: they provide a concise description of conditions and thus help to provide pathophysiological insights. Indeed, the automatically and reproducibly derived features are consistent with previously published domain expert analyses.
Self-Taught Low-Rank Coding for Visual Learning.
Li, Sheng; Li, Kang; Fu, Yun
2018-03-01
The lack of labeled data presents a common challenge in many computer vision and machine learning tasks. Semisupervised learning and transfer learning methods have been developed to tackle this challenge by utilizing auxiliary samples from the same domain or from a different domain, respectively. Self-taught learning, which is a special type of transfer learning, has fewer restrictions on the choice of auxiliary data. It has shown promising performance in visual learning. However, existing self-taught learning methods usually ignore the structure information in data. In this paper, we focus on building a self-taught coding framework, which can effectively utilize the rich low-level pattern information abstracted from the auxiliary domain, in order to characterize the high-level structural information in the target domain. By leveraging a high quality dictionary learned across auxiliary and target domains, the proposed approach learns expressive codings for the samples in the target domain. Since many types of visual data have been proven to contain subspace structures, a low-rank constraint is introduced into the coding objective to better characterize the structure of the given target set. The proposed representation learning framework is called self-taught low-rank (S-Low) coding, which can be formulated as a nonconvex rank-minimization and dictionary learning problem. We devise an efficient majorization-minimization augmented Lagrange multiplier algorithm to solve it. Based on the proposed S-Low coding mechanism, both unsupervised and supervised visual learning algorithms are derived. Extensive experiments on five benchmark data sets demonstrate the effectiveness of our approach.
Purification of Training Samples Based on Spectral Feature and Superpixel Segmentation
NASA Astrophysics Data System (ADS)
Guan, X.; Qi, W.; He, J.; Wen, Q.; Chen, T.; Wang, Z.
2018-04-01
Remote sensing image classification is an effective way to extract information from large volumes of high-spatial resolution remote sensing images. Generally, supervised image classification relies on abundant and high-precision training data, which is often manually interpreted by human experts to provide ground truth for training and evaluating the performance of the classifier. Remote sensing enterprises accumulated lots of manually interpreted products from early lower-spatial resolution remote sensing images by executing their routine research and business programs. However, these manually interpreted products may not match the very high resolution (VHR) image properly because of different dates or spatial resolution of both data, thus, hindering suitability of manually interpreted products in training classification models, or small coverage area of these manually interpreted products. We also face similar problems in our laboratory in 21st Century Aerospace Technology Co. Ltd (short for 21AT). In this work, we propose a method to purify the interpreted product to match newly available VHRI data and provide the best training data for supervised image classifiers in VHR image classification. And results indicate that our proposed method can efficiently purify the input data for future machine learning use.
Tan, Lirong; Holland, Scott K; Deshpande, Aniruddha K; Chen, Ye; Choo, Daniel I; Lu, Long J
2015-12-01
We developed a machine learning model to predict whether or not a cochlear implant (CI) candidate will develop effective language skills within 2 years after the CI surgery by using the pre-implant brain fMRI data from the candidate. The language performance was measured 2 years after the CI surgery by the Clinical Evaluation of Language Fundamentals-Preschool, Second Edition (CELF-P2). Based on the CELF-P2 scores, the CI recipients were designated as either effective or ineffective CI users. For feature extraction from the fMRI data, we constructed contrast maps using the general linear model, and then utilized the Bag-of-Words (BoW) approach that we previously published to convert the contrast maps into feature vectors. We trained both supervised models and semi-supervised models to classify CI users as effective or ineffective. Compared with the conventional feature extraction approach, which used each single voxel as a feature, our BoW approach gave rise to much better performance for the classification of effective versus ineffective CI users. The semi-supervised model with the feature set extracted by the BoW approach from the contrast of speech versus silence achieved a leave-one-out cross-validation AUC as high as 0.97. Recursive feature elimination unexpectedly revealed that two features were sufficient to provide highly accurate classification of effective versus ineffective CI users based on our current dataset. We have validated the hypothesis that pre-implant cortical activation patterns revealed by fMRI during infancy correlate with language performance 2 years after cochlear implantation. The two brain regions highlighted by our classifier are potential biomarkers for the prediction of CI outcomes. Our study also demonstrated the superiority of the semi-supervised model over the supervised model. It is always worthwhile to try a semi-supervised model when unlabeled data are available.
Xiao, Jiajie; Melvin, Ryan L; Salsbury, Freddie R
2018-03-02
Thrombin is a key component for chemotherapeutic and antithrombotic therapy development. As the physiologic and pathologic roles of the light chain still remain vague, here, we continue previous efforts to understand the impacts of the disease-associated single deletion of LYS9 in the light chain. By combining supervised and unsupervised machine learning methodologies and more traditional structural analyses on data from 10 μs molecular dynamics simulations, we show that the conformational ensemble of the ΔK9 mutant is significantly perturbed. Our analyses consistently indicate that LYS9 deletion destabilizes both the catalytic cleft and regulatory functional regions and result in some conformational changes that occur in tens to hundreds of nanosecond scaled motions. We also reveal that the two forms of thrombin each prefer a distinct binding mode of a Na + ion. We expand our understanding of previous experimental observations and shed light on the mechanisms of the LYS9 deletion associated bleeding disorder by providing consistent but more quantitative and detailed structural analyses than early studies in literature. With a novel application of supervised learning, i.e. the decision tree learning on the hydrogen bonding features in the wild-type and ΔK9 mutant forms of thrombin, we predict that seven pairs of critical hydrogen bonding interactions are significant for establishing distinct behaviors of wild-type thrombin and its ΔK9 mutant form. Our calculations indicate the LYS9 in the light chain has both localized and long-range allosteric effects on thrombin, supporting the opinion that light chain has an important role as an allosteric effector.
Deep generative learning for automated EHR diagnosis of traditional Chinese medicine.
Liang, Zhaohui; Liu, Jun; Ou, Aihua; Zhang, Honglai; Li, Ziping; Huang, Jimmy Xiangji
2018-05-04
Computer-aided medical decision-making (CAMDM) is the method to utilize massive EMR data as both empirical and evidence support for the decision procedure of healthcare activities. Well-developed information infrastructure, such as hospital information systems and disease surveillance systems, provides abundant data for CAMDM. However, the complexity of EMR data with abstract medical knowledge makes the conventional model incompetent for the analysis. Thus a deep belief networks (DBN) based model is proposed to simulate the information analysis and decision-making procedure in medical practice. The purpose of this paper is to evaluate a deep learning architecture as an effective solution for CAMDM. A two-step model is applied in our study. At the first step, an optimized seven-layer deep belief network (DBN) is applied as an unsupervised learning algorithm to perform model training to acquire feature representation. Then a support vector machine model is adopted to DBN at the second step of the supervised learning. There are two data sets used in the experiments. One is a plain text data set indexed by medical experts. The other is a structured dataset on primary hypertension. The data are randomly divided to generate the training set for the unsupervised learning and the testing set for the supervised learning. The model performance is evaluated by the statistics of mean and variance, the average precision and coverage on the data sets. Two conventional shallow models (support vector machine / SVM and decision tree / DT) are applied as the comparisons to show the superiority of our proposed approach. The deep learning (DBN + SVM) model outperforms simple SVM and DT on two data sets in terms of all the evaluation measures, which confirms our motivation that the deep model is good at capturing the key features with less dependence when the index is built up by manpower. Our study shows the two-step deep learning model achieves high performance for medical information retrieval over the conventional shallow models. It is able to capture the features of both plain text and the highly-structured database of EMR data. The performance of the deep model is superior to the conventional shallow learning models such as SVM and DT. It is an appropriate knowledge-learning model for information retrieval of EMR system. Therefore, deep learning provides a good solution to improve the performance of CAMDM systems. Copyright © 2018. Published by Elsevier B.V.
A machine learning approach to triaging patients with chronic obstructive pulmonary disease
Qirko, Klajdi; Smith, Ted; Corcoran, Ethan; Wysham, Nicholas G.; Bazaz, Gaurav; Kappel, George; Gerber, Anthony N.
2017-01-01
COPD patients are burdened with a daily risk of acute exacerbation and loss of control, which could be mitigated by effective, on-demand decision support tools. In this study, we present a machine learning-based strategy for early detection of exacerbations and subsequent triage. Our application uses physician opinion in a statistically and clinically comprehensive set of patient cases to train a supervised prediction algorithm. The accuracy of the model is assessed against a panel of physicians each triaging identical cases in a representative patient validation set. Our results show that algorithm accuracy and safety indicators surpass all individual pulmonologists in both identifying exacerbations and predicting the consensus triage in a 101 case validation set. The algorithm is also the top performer in sensitivity, specificity, and ppv when predicting a patient’s need for emergency care. PMID:29166411
Support Vector Machines for Hyperspectral Remote Sensing Classification
NASA Technical Reports Server (NTRS)
Gualtieri, J. Anthony; Cromp, R. F.
1998-01-01
The Support Vector Machine provides a new way to design classification algorithms which learn from examples (supervised learning) and generalize when applied to new data. We demonstrate its success on a difficult classification problem from hyperspectral remote sensing, where we obtain performances of 96%, and 87% correct for a 4 class problem, and a 16 class problem respectively. These results are somewhat better than other recent results on the same data. A key feature of this classifier is its ability to use high-dimensional data without the usual recourse to a feature selection step to reduce the dimensionality of the data. For this application, this is important, as hyperspectral data consists of several hundred contiguous spectral channels for each exemplar. We provide an introduction to this new approach, and demonstrate its application to classification of an agriculture scene.
Ethoscopes: An open platform for high-throughput ethomics.
Geissmann, Quentin; Garcia Rodriguez, Luis; Beckwith, Esteban J; French, Alice S; Jamasb, Arian R; Gilestro, Giorgio F
2017-10-01
Here, we present the use of ethoscopes, which are machines for high-throughput analysis of behavior in Drosophila and other animals. Ethoscopes provide a software and hardware solution that is reproducible and easily scalable. They perform, in real-time, tracking and profiling of behavior by using a supervised machine learning algorithm, are able to deliver behaviorally triggered stimuli to flies in a feedback-loop mode, and are highly customizable and open source. Ethoscopes can be built easily by using 3D printing technology and rely on Raspberry Pi microcomputers and Arduino boards to provide affordable and flexible hardware. All software and construction specifications are available at http://lab.gilest.ro/ethoscope.
Utilizing Chinese Admission Records for MACE Prediction of Acute Coronary Syndrome
Hu, Danqing; Huang, Zhengxing; Chan, Tak-Ming; Dong, Wei; Lu, Xudong; Duan, Huilong
2016-01-01
Background: Clinical major adverse cardiovascular event (MACE) prediction of acute coronary syndrome (ACS) is important for a number of applications including physician decision support, quality of care assessment, and efficient healthcare service delivery on ACS patients. Admission records, as typical media to contain clinical information of patients at the early stage of their hospitalizations, provide significant potential to be explored for MACE prediction in a proactive manner. Methods: We propose a hybrid approach for MACE prediction by utilizing a large volume of admission records. Firstly, both a rule-based medical language processing method and a machine learning method (i.e., Conditional Random Fields (CRFs)) are developed to extract essential patient features from unstructured admission records. After that, state-of-the-art supervised machine learning algorithms are applied to construct MACE prediction models from data. Results: We comparatively evaluate the performance of the proposed approach on a real clinical dataset consisting of 2930 ACS patient samples collected from a Chinese hospital. Our best model achieved 72% AUC in MACE prediction. In comparison of the performance between our models and two well-known ACS risk score tools, i.e., GRACE and TIMI, our learned models obtain better performances with a significant margin. Conclusions: Experimental results reveal that our approach can obtain competitive performance in MACE prediction. The comparison of classifiers indicates the proposed approach has a competitive generality with datasets extracted by different feature extraction methods. Furthermore, our MACE prediction model obtained a significant improvement by comparison with both GRACE and TIMI. It indicates that using admission records can effectively provide MACE prediction service for ACS patients at the early stage of their hospitalizations. PMID:27649220
Predictive analysis and data mining among the employment of fresh graduate students in HEI
NASA Astrophysics Data System (ADS)
Rahman, Nor Azziaty Abdul; Tan, Kian Lam; Lim, Chen Kim
2017-10-01
Management of higher education have a problem in producing 100% of graduates who can meet the needs of industry while industry is also facing the problem of finding skilled graduates who suit their needs partly due to the lack of an effective method in assessing problem solving skills as well as weaknesses in the assessment of problem-solving skills. The purpose of this paper is to propose a suitable classification model that can be used in making prediction and assessment of the attributes of the student's dataset to meet the selection criteria of work demanded by the industry of the graduates in the academic field. Supervised and unsupervised Machine Learning Algorithms were used in this research where; K-Nearest Neighbor, Naïve Bayes, Decision Tree, Neural Network, Logistic Regression and Support Vector Machine. The proposed model will help the university management to make a better long-term plans for producing graduates who are skilled, knowledgeable and fulfill the industry needs as well.
Classification Algorithms for Big Data Analysis, a Map Reduce Approach
NASA Astrophysics Data System (ADS)
Ayma, V. A.; Ferreira, R. S.; Happ, P.; Oliveira, D.; Feitosa, R.; Costa, G.; Plaza, A.; Gamba, P.
2015-03-01
Since many years ago, the scientific community is concerned about how to increase the accuracy of different classification methods, and major achievements have been made so far. Besides this issue, the increasing amount of data that is being generated every day by remote sensors raises more challenges to be overcome. In this work, a tool within the scope of InterIMAGE Cloud Platform (ICP), which is an open-source, distributed framework for automatic image interpretation, is presented. The tool, named ICP: Data Mining Package, is able to perform supervised classification procedures on huge amounts of data, usually referred as big data, on a distributed infrastructure using Hadoop MapReduce. The tool has four classification algorithms implemented, taken from WEKA's machine learning library, namely: Decision Trees, Naïve Bayes, Random Forest and Support Vector Machines (SVM). The results of an experimental analysis using a SVM classifier on data sets of different sizes for different cluster configurations demonstrates the potential of the tool, as well as aspects that affect its performance.
An active role for machine learning in drug development
Murphy, Robert F.
2014-01-01
Due to the complexity of biological systems, cutting-edge machine-learning methods will be critical for future drug development. In particular, machine-vision methods to extract detailed information from imaging assays and active-learning methods to guide experimentation will be required to overcome the dimensionality problem in drug development. PMID:21587249
Machine Learning for Medical Imaging
Korfiatis, Panagiotis; Akkus, Zeynettin; Kline, Timothy L.
2017-01-01
Machine learning is a technique for recognizing patterns that can be applied to medical images. Although it is a powerful tool that can help in rendering medical diagnoses, it can be misapplied. Machine learning typically begins with the machine learning algorithm system computing the image features that are believed to be of importance in making the prediction or diagnosis of interest. The machine learning algorithm system then identifies the best combination of these image features for classifying the image or computing some metric for the given image region. There are several methods that can be used, each with different strengths and weaknesses. There are open-source versions of most of these machine learning methods that make them easy to try and apply to images. Several metrics for measuring the performance of an algorithm exist; however, one must be aware of the possible associated pitfalls that can result in misleading metrics. More recently, deep learning has started to be used; this method has the benefit that it does not require image feature identification and calculation as a first step; rather, features are identified as part of the learning process. Machine learning has been used in medical imaging and will have a greater influence in the future. Those working in medical imaging must be aware of how machine learning works. ©RSNA, 2017 PMID:28212054
Machine Learning for Medical Imaging.
Erickson, Bradley J; Korfiatis, Panagiotis; Akkus, Zeynettin; Kline, Timothy L
2017-01-01
Machine learning is a technique for recognizing patterns that can be applied to medical images. Although it is a powerful tool that can help in rendering medical diagnoses, it can be misapplied. Machine learning typically begins with the machine learning algorithm system computing the image features that are believed to be of importance in making the prediction or diagnosis of interest. The machine learning algorithm system then identifies the best combination of these image features for classifying the image or computing some metric for the given image region. There are several methods that can be used, each with different strengths and weaknesses. There are open-source versions of most of these machine learning methods that make them easy to try and apply to images. Several metrics for measuring the performance of an algorithm exist; however, one must be aware of the possible associated pitfalls that can result in misleading metrics. More recently, deep learning has started to be used; this method has the benefit that it does not require image feature identification and calculation as a first step; rather, features are identified as part of the learning process. Machine learning has been used in medical imaging and will have a greater influence in the future. Those working in medical imaging must be aware of how machine learning works. © RSNA, 2017.
Patient-specific semi-supervised learning for postoperative brain tumor segmentation.
Meier, Raphael; Bauer, Stefan; Slotboom, Johannes; Wiest, Roland; Reyes, Mauricio
2014-01-01
In contrast to preoperative brain tumor segmentation, the problem of postoperative brain tumor segmentation has been rarely approached so far. We present a fully-automatic segmentation method using multimodal magnetic resonance image data and patient-specific semi-supervised learning. The idea behind our semi-supervised approach is to effectively fuse information from both pre- and postoperative image data of the same patient to improve segmentation of the postoperative image. We pose image segmentation as a classification problem and solve it by adopting a semi-supervised decision forest. The method is evaluated on a cohort of 10 high-grade glioma patients, with segmentation performance and computation time comparable or superior to a state-of-the-art brain tumor segmentation method. Moreover, our results confirm that the inclusion of preoperative MR images lead to a better performance regarding postoperative brain tumor segmentation.
Machine Learning and Inverse Problem in Geodynamics
NASA Astrophysics Data System (ADS)
Shahnas, M. H.; Yuen, D. A.; Pysklywec, R.
2017-12-01
During the past few decades numerical modeling and traditional HPC have been widely deployed in many diverse fields for problem solutions. However, in recent years the rapid emergence of machine learning (ML), a subfield of the artificial intelligence (AI), in many fields of sciences, engineering, and finance seems to mark a turning point in the replacement of traditional modeling procedures with artificial intelligence-based techniques. The study of the circulation in the interior of Earth relies on the study of high pressure mineral physics, geochemistry, and petrology where the number of the mantle parameters is large and the thermoelastic parameters are highly pressure- and temperature-dependent. More complexity arises from the fact that many of these parameters that are incorporated in the numerical models as input parameters are not yet well established. In such complex systems the application of machine learning algorithms can play a valuable role. Our focus in this study is the application of supervised machine learning (SML) algorithms in predicting mantle properties with the emphasis on SML techniques in solving the inverse problem. As a sample problem we focus on the spin transition in ferropericlase and perovskite that may cause slab and plume stagnation at mid-mantle depths. The degree of the stagnation depends on the degree of negative density anomaly at the spin transition zone. The training and testing samples for the machine learning models are produced by the numerical convection models with known magnitudes of density anomaly (as the class labels of the samples). The volume fractions of the stagnated slabs and plumes which can be considered as measures for the degree of stagnation are assigned as sample features. The machine learning models can determine the magnitude of the spin transition-induced density anomalies that can cause flow stagnation at mid-mantle depths. Employing support vector machine (SVM) algorithms we show that SML techniques can successfully predict the magnitude of the mantle density anomalies and can also be used in characterizing mantle flow patterns. The technique can be extended to more complex problems in mantle dynamics by employing deep learning algorithms for estimation of mantle properties such as viscosity, elastic parameters, and thermal and chemical anomalies.
Dropout Prediction in E-Learning Courses through the Combination of Machine Learning Techniques
ERIC Educational Resources Information Center
Lykourentzou, Ioanna; Giannoukos, Ioannis; Nikolopoulos, Vassilis; Mpardis, George; Loumos, Vassili
2009-01-01
In this paper, a dropout prediction method for e-learning courses, based on three popular machine learning techniques and detailed student data, is proposed. The machine learning techniques used are feed-forward neural networks, support vector machines and probabilistic ensemble simplified fuzzy ARTMAP. Since a single technique may fail to…
Closing the loop: from paper to protein annotation using supervised Gene Ontology classification.
Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick
2014-01-01
Gene function curation of the literature with Gene Ontology (GO) concepts is one particularly time-consuming task in genomics, and the help from bioinformatics is highly requested to keep up with the flow of publications. In 2004, the first BioCreative challenge already designed a task of automatic GO concepts assignment from a full text. At this time, results were judged far from reaching the performances required by real curation workflows. In particular, supervised approaches produced the most disappointing results because of lack of training data. Ten years later, the available curation data have massively grown. In 2013, the BioCreative IV GO task revisited the automatic GO assignment task. For this issue, we investigated the power of our supervised classifier, GOCat. GOCat computes similarities between an input text and already curated instances contained in a knowledge base to infer GO concepts. The subtask A consisted in selecting GO evidence sentences for a relevant gene in a full text. For this, we designed a state-of-the-art supervised statistical approach, using a naïve Bayes classifier and the official training set, and obtained fair results. The subtask B consisted in predicting GO concepts from the previous output. For this, we applied GOCat and reached leading results, up to 65% for hierarchical recall in the top 20 outputted concepts. Contrary to previous competitions, machine learning has this time outperformed standard dictionary-based approaches. Thanks to BioCreative IV, we were able to design a complete workflow for curation: given a gene name and a full text, this system is able to select evidence sentences for curation and to deliver highly relevant GO concepts. Contrary to previous competitions, machine learning this time outperformed dictionary-based systems. Observed performances are sufficient for being used in a real semiautomatic curation workflow. GOCat is available at http://eagl.unige.ch/GOCat/. http://eagl.unige.ch/GOCat4FT/. © The Author(s) 2014. Published by Oxford University Press.
Jet-images — deep learning edition
de Oliveira, Luke; Kagan, Michael; Mackey, Lester; ...
2016-07-13
Building on the notion of a particle physics detector as a camera and the collimated streams of high energy particles, or jets, it measures as an image, we investigate the potential of machine learning techniques based on deep learning architectures to identify highly boosted W bosons. Modern deep learning algorithms trained on jet images can out-perform standard physically-motivated feature driven approaches to jet tagging. We develop techniques for visualizing how these features are learned by the network and what additional information is used to improve performance. Finally, this interplay between physically-motivated feature driven tools and supervised learning algorithms is generalmore » and can be used to significantly increase the sensitivity to discover new particles and new forces, and gain a deeper understanding of the physics within jets.« less
Jet-images — deep learning edition
DOE Office of Scientific and Technical Information (OSTI.GOV)
de Oliveira, Luke; Kagan, Michael; Mackey, Lester
Building on the notion of a particle physics detector as a camera and the collimated streams of high energy particles, or jets, it measures as an image, we investigate the potential of machine learning techniques based on deep learning architectures to identify highly boosted W bosons. Modern deep learning algorithms trained on jet images can out-perform standard physically-motivated feature driven approaches to jet tagging. We develop techniques for visualizing how these features are learned by the network and what additional information is used to improve performance. Finally, this interplay between physically-motivated feature driven tools and supervised learning algorithms is generalmore » and can be used to significantly increase the sensitivity to discover new particles and new forces, and gain a deeper understanding of the physics within jets.« less
Using deep learning in image hyper spectral segmentation, classification, and detection
NASA Astrophysics Data System (ADS)
Zhao, Xiuying; Su, Zhenyu
2018-02-01
Recent years have shown that deep learning neural networks are a valuable tool in the field of computer vision. Deep learning method can be used in applications like remote sensing such as Land cover Classification, Detection of Vehicle in Satellite Images, Hyper spectral Image classification. This paper addresses the use of the deep learning artificial neural network in Satellite image segmentation. Image segmentation plays an important role in image processing. The hue of the remote sensing image often has a large hue difference, which will result in the poor display of the images in the VR environment. Image segmentation is a pre processing technique applied to the original images and splits the image into many parts which have different hue to unify the color. Several computational models based on supervised, unsupervised, parametric, probabilistic region based image segmentation techniques have been proposed. Recently, one of the machine learning technique known as, deep learning with convolution neural network has been widely used for development of efficient and automatic image segmentation models. In this paper, we focus on study of deep neural convolution network and its variants for automatic image segmentation rather than traditional image segmentation strategies.
Mwangi, Benson; Soares, Jair C; Hasan, Khader M
2014-10-30
Neuroimaging machine learning studies have largely utilized supervised algorithms - meaning they require both neuroimaging scan data and corresponding target variables (e.g. healthy vs. diseased) to be successfully 'trained' for a prediction task. Noticeably, this approach may not be optimal or possible when the global structure of the data is not well known and the researcher does not have an a priori model to fit the data. We set out to investigate the utility of an unsupervised machine learning technique; t-distributed stochastic neighbour embedding (t-SNE) in identifying 'unseen' sample population patterns that may exist in high-dimensional neuroimaging data. Multimodal neuroimaging scans from 92 healthy subjects were pre-processed using atlas-based methods, integrated and input into the t-SNE algorithm. Patterns and clusters discovered by the algorithm were visualized using a 2D scatter plot and further analyzed using the K-means clustering algorithm. t-SNE was evaluated against classical principal component analysis. Remarkably, based on unlabelled multimodal scan data, t-SNE separated study subjects into two very distinct clusters which corresponded to subjects' gender labels (cluster silhouette index value=0.79). The resulting clusters were used to develop an unsupervised minimum distance clustering model which identified 93.5% of subjects' gender. Notably, from a neuropsychiatric perspective this method may allow discovery of data-driven disease phenotypes or sub-types of treatment responders. Copyright © 2014 Elsevier B.V. All rights reserved.
Ithapu, Vamsi; Singh, Vikas; Lindner, Christopher; Austin, Benjamin P; Hinrichs, Chris; Carlsson, Cynthia M; Bendlin, Barbara B; Johnson, Sterling C
2014-08-01
Precise detection and quantification of white matter hyperintensities (WMH) observed in T2-weighted Fluid Attenuated Inversion Recovery (FLAIR) Magnetic Resonance Images (MRI) is of substantial interest in aging, and age-related neurological disorders such as Alzheimer's disease (AD). This is mainly because WMH may reflect co-morbid neural injury or cerebral vascular disease burden. WMH in the older population may be small, diffuse, and irregular in shape, and sufficiently heterogeneous within and across subjects. Here, we pose hyperintensity detection as a supervised inference problem and adapt two learning models, specifically, Support Vector Machines and Random Forests, for this task. Using texture features engineered by texton filter banks, we provide a suite of effective segmentation methods for this problem. Through extensive evaluations on healthy middle-aged and older adults who vary in AD risk, we show that our methods are reliable and robust in segmenting hyperintense regions. A measure of hyperintensity accumulation, referred to as normalized effective WMH volume, is shown to be associated with dementia in older adults and parental family history in cognitively normal subjects. We provide an open source library for hyperintensity detection and accumulation (interfaced with existing neuroimaging tools), that can be adapted for segmentation problems in other neuroimaging studies. Copyright © 2014 Wiley Periodicals, Inc.
Approaches to Machine Learning.
1984-02-16
The field of machine learning strives to develop methods and techniques to automatic the acquisition of new information, new skills, and new ways of organizing existing information. In this article, we review the major approaches to machine learning in symbolic domains, covering the tasks of learning concepts from examples, learning search methods, conceptual clustering, and language acquisition. We illustrate each of the basic approaches with paradigmatic examples. (Author)
a New Object-Based Framework to Detect Shodows in High-Resolution Satellite Imagery Over Urban Areas
NASA Astrophysics Data System (ADS)
Tatar, N.; Saadatseresht, M.; Arefi, H.; Hadavand, A.
2015-12-01
In this paper a new object-based framework to detect shadow areas in high resolution satellite images is proposed. To produce shadow map in pixel level state of the art supervised machine learning algorithms are employed. Automatic ground truth generation based on Otsu thresholding on shadow and non-shadow indices is used to train the classifiers. It is followed by segmenting the image scene and create image objects. To detect shadow objects, a majority voting on pixel-based shadow detection result is designed. GeoEye-1 multi-spectral image over an urban area in Qom city of Iran is used in the experiments. Results shows the superiority of our proposed method over traditional pixel-based, visually and quantitatively.
An online semi-supervised brain-computer interface.
Gu, Zhenghui; Yu, Zhuliang; Shen, Zhifang; Li, Yuanqing
2013-09-01
Practical brain-computer interface (BCI) systems should require only low training effort for the user, and the algorithms used to classify the intent of the user should be computationally efficient. However, due to inter- and intra-subject variations in EEG signal, intermittent training/calibration is often unavoidable. In this paper, we present an online semi-supervised P300 BCI speller system. After a short initial training (around or less than 1 min in our experiments), the system is switched to a mode where the user can input characters through selective attention. In this mode, a self-training least squares support vector machine (LS-SVM) classifier is gradually enhanced in back end with the unlabeled EEG data collected online after every character input. In this way, the classifier is gradually enhanced. Even though the user may experience some errors in input at the beginning due to the small initial training dataset, the accuracy approaches that of fully supervised method in a few minutes. The algorithm based on LS-SVM and its sequential update has low computational complexity; thus, it is suitable for online applications. The effectiveness of the algorithm has been validated through data analysis on BCI Competition III dataset II (P300 speller BCI data). The performance of the online system was evaluated through experimental results on eight healthy subjects, where all of them achieved the spelling accuracy of 85 % or above within an average online semi-supervised learning time of around 3 min.
Lin, Tong; Liu, Tiebing; Lin, Yucheng; Yan, Lailai; Chen, Zhongxue; Wang, Jingyu
2017-09-01
The etiology and pathophysiology of schizophrenia (SCZ) remain obscure. This study explored the associations between SCZ risk and serum levels of 39 macro and trace elements (MTE). A 1:1 matched case-control study was conducted among 114 schizophrenia patients and 114 healthy controls matched by age, sex and region. Blood samples were collected to determine the concentrations of 39 MTE by ICP-AES and ICP-MS. Both supervised learning methods and classical statistical testing were used to uncover the difference of MTE levels between cases and controls. The best prediction accuracies were 99.21% achieved by support vector machines in the original feature space (without dimensionality reduction), and 98.82% achieved by Naive Bayes with dimensionality reduction. More than half of MTE were found to be significantly different between SCZ patients and the controls. The presented investigation showed that there existed remarkable differences in concentrations of MTE between SCZ patients and healthy controls. The results of this study might be useful to diagnosis and prognosis of SCZ; they also indicated other promising applications in pharmacy and nutrition. However, the results should be interpreted with caution due to limited sample size and the lack of potential confounding factors, such as alcohol, smoking, body mass index (BMI), use of antipsychotics and dietary intakes. In the future the application of the analyses will be useful in designs that have larger sample sizes. Copyright © 2017 Elsevier GmbH. All rights reserved.
Machine learning in infrared object classification - an all-sky selection of YSO candidates
NASA Astrophysics Data System (ADS)
Marton, Gabor; Zahorecz, Sarolta; Toth, L. Viktor; Magnus McGehee, Peregrine; Kun, Maria
2015-08-01
Object classification is a fundamental and challenging problem in the era of big data. I will discuss up-to-date methods and their application to classify infrared point sources.We analysed the ALLWISE catalogue, the most recent public source catalogue of the Wide-field Infrared Survey Explorer (WISE) to compile a reliable list of Young Stellar Object (YSO) candidates. We tested and compared classical and up-to-date statistical methods as well, to discriminate source types like extragalactic objects, evolved stars, main sequence stars, objects related to the interstellar medium and YSO candidates by using their mid-IR WISE properties and associated near-IR 2MASS data.In the particular classification problem the Support Vector Machines (SVM), a class of supervised learning algorithm turned out to be the best tool. As a result we classify Class I and II YSOs with >90% accuracy while the fraction of contaminating extragalactic objects remains well below 1%, based on the number of known objects listed in the SIMBAD and VizieR databases. We compare our results to other classification schemes from the literature and show that the SVM outperforms methods that apply linear cuts on the colour-colour and colour-magnitude space. Our homogenous YSO candidate catalog can serve as an excellent pathfinder for future detailed observations of individual objects and a starting point of statistical studies that aim to add pieces to the big picture of star formation theory.
Respiratory Artefact Removal in Forced Oscillation Measurements: A Machine Learning Approach.
Pham, Thuy T; Thamrin, Cindy; Robinson, Paul D; McEwan, Alistair L; Leong, Philip H W
2017-08-01
Respiratory artefact removal for the forced oscillation technique can be treated as an anomaly detection problem. Manual removal is currently considered the gold standard, but this approach is laborious and subjective. Most existing automated techniques used simple statistics and/or rejected anomalous data points. Unfortunately, simple statistics are insensitive to numerous artefacts, leading to low reproducibility of results. Furthermore, rejecting anomalous data points causes an imbalance between the inspiratory and expiratory contributions. From a machine learning perspective, such methods are unsupervised and can be considered simple feature extraction. We hypothesize that supervised techniques can be used to find improved features that are more discriminative and more highly correlated with the desired output. Features thus found are then used for anomaly detection by applying quartile thresholding, which rejects complete breaths if one of its features is out of range. The thresholds are determined by both saliency and performance metrics rather than qualitative assumptions as in previous works. Feature ranking indicates that our new landmark features are among the highest scoring candidates regardless of age across saliency criteria. F1-scores, receiver operating characteristic, and variability of the mean resistance metrics show that the proposed scheme outperforms previous simple feature extraction approaches. Our subject-independent detector, 1IQR-SU, demonstrated approval rates of 80.6% for adults and 98% for children, higher than existing methods. Our new features are more relevant. Our removal is objective and comparable to the manual method. This is a critical work to automate forced oscillation technique quality control.
Player Modeling for Intelligent Difficulty Adjustment
NASA Astrophysics Data System (ADS)
Missura, Olana; Gärtner, Thomas
In this paper we aim at automatically adjusting the difficulty of computer games by clustering players into different types and supervised prediction of the type from short traces of gameplay. An important ingredient of video games is to challenge players by providing them with tasks of appropriate and increasing difficulty. How this difficulty should be chosen and increase over time strongly depends on the ability, experience, perception and learning curve of each individual player. It is a subjective parameter that is very difficult to set. Wrong choices can easily lead to players stopping to play the game as they get bored (if underburdened) or frustrated (if overburdened). An ideal game should be able to adjust its difficulty dynamically governed by the player’s performance. Modern video games utilise a game-testing process to investigate among other factors the perceived difficulty for a multitude of players. In this paper, we investigate how machine learning techniques can be used for automatic difficulty adjustment. Our experiments confirm the potential of machine learning in this application.
Taming Many-Parameter BSM Models with Bayesian Neural Networks
NASA Astrophysics Data System (ADS)
Kuchera, M. P.; Karbo, A.; Prosper, H. B.; Sanchez, A.; Taylor, J. Z.
2017-09-01
The search for physics Beyond the Standard Model (BSM) is a major focus of large-scale high energy physics experiments. One method is to look for specific deviations from the Standard Model that are predicted by BSM models. In cases where the model has a large number of free parameters, standard search methods become intractable due to computation time. This talk presents results using Bayesian Neural Networks, a supervised machine learning method, to enable the study of higher-dimensional models. The popular phenomenological Minimal Supersymmetric Standard Model was studied as an example of the feasibility and usefulness of this method. Graphics Processing Units (GPUs) are used to expedite the calculations. Cross-section predictions for 13 TeV proton collisions will be presented. My participation in the Conference Experience for Undergraduates (CEU) in 2004-2006 exposed me to the national and global significance of cutting-edge research. At the 2005 CEU, I presented work from the previous summer's SULI internship at Lawrence Berkeley Laboratory, where I learned to program while working on the Majorana Project. That work inspired me to follow a similar research path, which led me to my current work on computational methods applied to BSM physics.
2016-01-01
Background As more and more researchers are turning to big data for new opportunities of biomedical discoveries, machine learning models, as the backbone of big data analysis, are mentioned more often in biomedical journals. However, owing to the inherent complexity of machine learning methods, they are prone to misuse. Because of the flexibility in specifying machine learning models, the results are often insufficiently reported in research articles, hindering reliable assessment of model validity and consistent interpretation of model outputs. Objective To attain a set of guidelines on the use of machine learning predictive models within clinical settings to make sure the models are correctly applied and sufficiently reported so that true discoveries can be distinguished from random coincidence. Methods A multidisciplinary panel of machine learning experts, clinicians, and traditional statisticians were interviewed, using an iterative process in accordance with the Delphi method. Results The process produced a set of guidelines that consists of (1) a list of reporting items to be included in a research article and (2) a set of practical sequential steps for developing predictive models. Conclusions A set of guidelines was generated to enable correct application of machine learning models and consistent reporting of model specifications and results in biomedical research. We believe that such guidelines will accelerate the adoption of big data analysis, particularly with machine learning methods, in the biomedical research community. PMID:27986644
Protein classification using modified n-grams and skip-grams.
Islam, S M Ashiqul; Heil, Benjamin J; Kearney, Christopher Michel; Baker, Erich J
2018-05-01
Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of attributes used to classify the proteins. Lack of this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG). A meta-comparison of cross-validation accuracy with twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy of m-NGSG when compared to contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and increase the accessibility of protein characteristic prediction to a broader range of scientists. m-NGSG is freely available at Bitbucket: https://bitbucket.org/sm_islam/mngsg/src. A web server is available at watson.ecs.baylor.edu/ngsg. erich_baker@baylor.edu. Supplementary data are available at Bioinformatics online.
NASA Astrophysics Data System (ADS)
Fijani, Elham; Nadiri, Ata Allah; Asghari Moghaddam, Asghar; Tsai, Frank T.-C.; Dixon, Barnali
2013-10-01
Contamination of wells with nitrate-N (NO3-N) poses various threats to human health. Contamination of groundwater is a complex process and full of uncertainty in regional scale. Development of an integrative vulnerability assessment methodology can be useful to effectively manage (including prioritization of limited resource allocation to monitor high risk areas) and protect this valuable freshwater source. This study introduces a supervised committee machine with artificial intelligence (SCMAI) model to improve the DRASTIC method for groundwater vulnerability assessment for the Maragheh-Bonab plain aquifer in Iran. Four different AI models are considered in the SCMAI model, whose input is the DRASTIC parameters. The SCMAI model improves the committee machine artificial intelligence (CMAI) model by replacing the linear combination in the CMAI with a nonlinear supervised ANN framework. To calibrate the AI models, NO3-N concentration data are divided in two datasets for the training and validation purposes. The target value of the AI models in the training step is the corrected vulnerability indices that relate to the first NO3-N concentration dataset. After model training, the AI models are verified by the second NO3-N concentration dataset. The results show that the four AI models are able to improve the DRASTIC method. Since the best AI model performance is not dominant, the SCMAI model is considered to combine the advantages of individual AI models to achieve the optimal performance. The SCMAI method re-predicts the groundwater vulnerability based on the different AI model prediction values. The results show that the SCMAI outperforms individual AI models and committee machine with artificial intelligence (CMAI) model. The SCMAI model ensures that no water well with high NO3-N levels would be classified as low risk and vice versa. The study concludes that the SCMAI model is an effective model to improve the DRASTIC model and provides a confident estimate of the pollution risk.
The clinical learning environment and supervision by staff nurses: developing the instrument.
Saarikoski, Mikko; Leino-Kilpi, Helena
2002-03-01
The aims of this study were (1) to describe students' perceptions of the clinical learning environment and clinical supervision and (2) to develop an evaluation scale by using the empirical results of this study. The data were collected using the Clinical Learning Environment and Supervision instrument (CLES). The instrument was based on the literature review of earlier studies. The derived instrument was tested empirically in a study involving nurse students (N=416) from four nursing colleges in Finland. The results demonstrated that the method of supervision, the number of separate supervision sessions and the psychological content of supervisory contact within a positive ward atmosphere are the most important variables in the students' clinical learning. The results also suggest that ward managers can create the conditions of a positive ward culture and a positive attitude towards students and their learning needs. The construct validity of the instrument was analysed by using exploratory factor analysis. The analysis indicated that the most important factor in the students' clinical learning is the supervisory relationship. The two most important factors constituting a 'good' clinical learning environment are the management style of the ward manager and the premises of nursing on the ward. The results of the factor analysis support the theoretical construction of the clinical learning environment modelled by earlier empirical studies.
Studying depression using imaging and machine learning methods.
Patel, Meenal J; Khalaf, Alexander; Aizenstein, Howard J
2016-01-01
Depression is a complex clinical entity that can pose challenges for clinicians regarding both accurate diagnosis and effective timely treatment. These challenges have prompted the development of multiple machine learning methods to help improve the management of this disease. These methods utilize anatomical and physiological data acquired from neuroimaging to create models that can identify depressed patients vs. non-depressed patients and predict treatment outcomes. This article (1) presents a background on depression, imaging, and machine learning methodologies; (2) reviews methodologies of past studies that have used imaging and machine learning to study depression; and (3) suggests directions for future depression-related studies.
Learning time series for intelligent monitoring
NASA Technical Reports Server (NTRS)
Manganaris, Stefanos; Fisher, Doug
1994-01-01
We address the problem of classifying time series according to their morphological features in the time domain. In a supervised machine-learning framework, we induce a classification procedure from a set of preclassified examples. For each class, we infer a model that captures its morphological features using Bayesian model induction and the minimum message length approach to assign priors. In the performance task, we classify a time series in one of the learned classes when there is enough evidence to support that decision. Time series with sufficiently novel features, belonging to classes not present in the training set, are recognized as such. We report results from experiments in a monitoring domain of interest to NASA.
Assessing the Readability of Medical Documents: A Ranking Approach.
Zheng, Jiaping; Yu, Hong
2018-03-23
The use of electronic health record (EHR) systems with patient engagement capabilities, including viewing, downloading, and transmitting health information, has recently grown tremendously. However, using these resources to engage patients in managing their own health remains challenging due to the complex and technical nature of the EHR narratives. Our objective was to develop a machine learning-based system to assess readability levels of complex documents such as EHR notes. We collected difficulty ratings of EHR notes and Wikipedia articles using crowdsourcing from 90 readers. We built a supervised model to assess readability based on relative orders of text difficulty using both surface text features and word embeddings. We evaluated system performance using the Kendall coefficient of concordance against human ratings. Our system achieved significantly higher concordance (.734) with human annotators than did a baseline using the Flesch-Kincaid Grade Level, a widely adopted readability formula (.531). The improvement was also consistent across different disease topics. This method's concordance with an individual human user's ratings was also higher than the concordance between different human annotators (.658). We explored methods to automatically assess the readability levels of clinical narratives. Our ranking-based system using simple textual features and easy-to-learn word embeddings outperformed a widely used readability formula. Our ranking-based method can predict relative difficulties of medical documents. It is not constrained to a predefined set of readability levels, a common design in many machine learning-based systems. Furthermore, the feature set does not rely on complex processing of the documents. One potential application of our readability ranking is personalization, allowing patients to better accommodate their own background knowledge. ©Jiaping Zheng, Hong Yu. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 23.03.2018.
Unresolved Galaxy Classifier for ESA/Gaia mission: Support Vector Machines approach
NASA Astrophysics Data System (ADS)
Bellas-Velidis, Ioannis; Kontizas, Mary; Dapergolas, Anastasios; Livanou, Evdokia; Kontizas, Evangelos; Karampelas, Antonios
A software package Unresolved Galaxy Classifier (UGC) is being developed for the ground-based pipeline of ESA's Gaia mission. It aims to provide an automated taxonomic classification and specific parameters estimation analyzing Gaia BP/RP instrument low-dispersion spectra of unresolved galaxies. The UGC algorithm is based on a supervised learning technique, the Support Vector Machines (SVM). The software is implemented in Java as two separate modules. An offline learning module provides functions for SVM-models training. Once trained, the set of models can be repeatedly applied to unknown galaxy spectra by the pipeline's application module. A library of galaxy models synthetic spectra, simulated for the BP/RP instrument, is used to train and test the modules. Science tests show a very good classification performance of UGC and relatively good regression performance, except for some of the parameters. Possible approaches to improve the performance are discussed.
Classification of hospital admissions into emergency and elective care: a machine learning approach.
Krämer, Jonas; Schreyögg, Jonas; Busse, Reinhard
2017-11-25
Rising admissions from emergency departments (EDs) to hospitals are a primary concern for many healthcare systems. The issue of how to differentiate urgent admissions from non-urgent or even elective admissions is crucial. We aim to develop a model for classifying inpatient admissions based on a patient's primary diagnosis as either emergency care or elective care and predicting urgency as a numerical value. We use supervised machine learning techniques and train the model with physician-expert judgments. Our model is accurate (96%) and has a high area under the ROC curve (>.99). We provide the first comprehensive classification and urgency categorization for inpatient emergency and elective care. This model assigns urgency values to every relevant diagnosis in the ICD catalog, and these values are easily applicable to existing hospital data. Our findings may provide a basis for policy makers to create incentives for hospitals to reduce the number of inappropriate ED admissions.
NASA Astrophysics Data System (ADS)
Bright, Ido; Lin, Guang; Kutz, J. Nathan
2013-12-01
Compressive sensing is used to determine the flow characteristics around a cylinder (Reynolds number and pressure/flow field) from a sparse number of pressure measurements on the cylinder. Using a supervised machine learning strategy, library elements encoding the dimensionally reduced dynamics are computed for various Reynolds numbers. Convex L1 optimization is then used with a limited number of pressure measurements on the cylinder to reconstruct, or decode, the full pressure field and the resulting flow field around the cylinder. Aside from the highly turbulent regime (large Reynolds number) where only the Reynolds number can be identified, accurate reconstruction of the pressure field and Reynolds number is achieved. The proposed data-driven strategy thus achieves encoding of the fluid dynamics using the L2 norm, and robust decoding (flow field reconstruction) using the sparsity promoting L1 norm.
Hathout, Rania M; Metwally, Abdelkader A
2016-11-01
This study represents one of the series applying computer-oriented processes and tools in digging for information, analysing data and finally extracting correlations and meaningful outcomes. In this context, binding energies could be used to model and predict the mass of loaded drugs in solid lipid nanoparticles after molecular docking of literature-gathered drugs using MOE® software package on molecularly simulated tripalmitin matrices using GROMACS®. Consequently, Gaussian processes as a supervised machine learning artificial intelligence technique were used to correlate the drugs' descriptors (e.g. M.W., xLogP, TPSA and fragment complexity) with their molecular docking binding energies. Lower percentage bias was obtained compared to previous studies which allows the accurate estimation of the loaded mass of any drug in the investigated solid lipid nanoparticles by just projecting its chemical structure to its main features (descriptors). Copyright © 2016 Elsevier B.V. All rights reserved.
Machine learning in cardiovascular medicine: are we there yet?
Shameer, Khader; Johnson, Kipp W; Glicksberg, Benjamin S; Dudley, Joel T; Sengupta, Partho P
2018-01-19
Artificial intelligence (AI) broadly refers to analytical algorithms that iteratively learn from data, allowing computers to find hidden insights without being explicitly programmed where to look. These include a family of operations encompassing several terms like machine learning, cognitive learning, deep learning and reinforcement learning-based methods that can be used to integrate and interpret complex biomedical and healthcare data in scenarios where traditional statistical methods may not be able to perform. In this review article, we discuss the basics of machine learning algorithms and what potential data sources exist; evaluate the need for machine learning; and examine the potential limitations and challenges of implementing machine in the context of cardiovascular medicine. The most promising avenues for AI in medicine are the development of automated risk prediction algorithms which can be used to guide clinical care; use of unsupervised learning techniques to more precisely phenotype complex disease; and the implementation of reinforcement learning algorithms to intelligently augment healthcare providers. The utility of a machine learning-based predictive model will depend on factors including data heterogeneity, data depth, data breadth, nature of modelling task, choice of machine learning and feature selection algorithms, and orthogonal evidence. A critical understanding of the strength and limitations of various methods and tasks amenable to machine learning is vital. By leveraging the growing corpus of big data in medicine, we detail pathways by which machine learning may facilitate optimal development of patient-specific models for improving diagnoses, intervention and outcome in cardiovascular medicine. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Computational Intelligence in Early Diabetes Diagnosis: A Review
Shankaracharya; Odedra, Devang; Samanta, Subir; Vidyarthi, Ambarish S.
2010-01-01
The development of an effective diabetes diagnosis system by taking advantage of computational intelligence is regarded as a primary goal nowadays. Many approaches based on artificial network and machine learning algorithms have been developed and tested against diabetes datasets, which were mostly related to individuals of Pima Indian origin. Yet, despite high accuracies of up to 99% in predicting the correct diabetes diagnosis, none of these approaches have reached clinical application so far. One reason for this failure may be that diabetologists or clinical investigators are sparsely informed about, or trained in the use of, computational diagnosis tools. Therefore, this article aims at sketching out an outline of the wide range of options, recent developments, and potentials in machine learning algorithms as diabetes diagnosis tools. One focus is on supervised and unsupervised methods, which have made significant impacts in the detection and diagnosis of diabetes at primary and advanced stages. Particular attention is paid to algorithms that show promise in improving diabetes diagnosis. A key advance has been the development of a more in-depth understanding and theoretical analysis of critical issues related to algorithmic construction and learning theory. These include trade-offs for maximizing generalization performance, use of physically realistic constraints, and incorporation of prior knowledge and uncertainty. The review presents and explains the most accurate algorithms, and discusses advantages and pitfalls of methodologies. This should provide a good resource for researchers from all backgrounds interested in computational intelligence-based diabetes diagnosis methods, and allows them to extend their knowledge into this kind of research. PMID:21713313
Computational intelligence in early diabetes diagnosis: a review.
Shankaracharya; Odedra, Devang; Samanta, Subir; Vidyarthi, Ambarish S
2010-01-01
The development of an effective diabetes diagnosis system by taking advantage of computational intelligence is regarded as a primary goal nowadays. Many approaches based on artificial network and machine learning algorithms have been developed and tested against diabetes datasets, which were mostly related to individuals of Pima Indian origin. Yet, despite high accuracies of up to 99% in predicting the correct diabetes diagnosis, none of these approaches have reached clinical application so far. One reason for this failure may be that diabetologists or clinical investigators are sparsely informed about, or trained in the use of, computational diagnosis tools. Therefore, this article aims at sketching out an outline of the wide range of options, recent developments, and potentials in machine learning algorithms as diabetes diagnosis tools. One focus is on supervised and unsupervised methods, which have made significant impacts in the detection and diagnosis of diabetes at primary and advanced stages. Particular attention is paid to algorithms that show promise in improving diabetes diagnosis. A key advance has been the development of a more in-depth understanding and theoretical analysis of critical issues related to algorithmic construction and learning theory. These include trade-offs for maximizing generalization performance, use of physically realistic constraints, and incorporation of prior knowledge and uncertainty. The review presents and explains the most accurate algorithms, and discusses advantages and pitfalls of methodologies. This should provide a good resource for researchers from all backgrounds interested in computational intelligence-based diabetes diagnosis methods, and allows them to extend their knowledge into this kind of research.
NASA Astrophysics Data System (ADS)
Ceylan Koydemir, Hatice; Feng, Steve; Liang, Kyle; Nadkarni, Rohan; Tseng, Derek; Benien, Parul; Ozcan, Aydogan
2017-03-01
Giardia lamblia causes a disease known as giardiasis, which results in diarrhea, abdominal cramps, and bloating. Although conventional pathogen detection methods used in water analysis laboratories offer high sensitivity and specificity, they are time consuming, and need experts to operate bulky equipment and analyze the samples. Here we present a field-portable and cost-effective smartphone-based waterborne pathogen detection platform that can automatically classify Giardia cysts using machine learning. Our platform enables the detection and quantification of Giardia cysts in one hour, including sample collection, labeling, filtration, and automated counting steps. We evaluated the performance of three prototypes using Giardia-spiked water samples from different sources (e.g., reagent-grade, tap, non-potable, and pond water samples). We populated a training database with >30,000 cysts and estimated our detection sensitivity and specificity using 20 different classifier models, including decision trees, nearest neighbor classifiers, support vector machines (SVMs), and ensemble classifiers, and compared their speed of training and classification, as well as predicted accuracies. Among them, cubic SVM, medium Gaussian SVM, and bagged-trees were the most promising classifier types with accuracies of 94.1%, 94.2%, and 95%, respectively; we selected the latter as our preferred classifier for the detection and enumeration of Giardia cysts that are imaged using our mobile-phone fluorescence microscope. Without the need for any experts or microbiologists, this field-portable pathogen detection platform can present a useful tool for water quality monitoring in resource-limited-settings.
A Machine Learning Approach to Predict Gene Regulatory Networks in Seed Development in Arabidopsis
Ni, Ying; Aghamirzaie, Delasa; Elmarakeby, Haitham; Collakova, Eva; Li, Song; Grene, Ruth; Heath, Lenwood S.
2016-01-01
Gene regulatory networks (GRNs) provide a representation of relationships between regulators and their target genes. Several methods for GRN inference, both unsupervised and supervised, have been developed to date. Because regulatory relationships consistently reprogram in diverse tissues or under different conditions, GRNs inferred without specific biological contexts are of limited applicability. In this report, a machine learning approach is presented to predict GRNs specific to developing Arabidopsis thaliana embryos. We developed the Beacon GRN inference tool to predict GRNs occurring during seed development in Arabidopsis based on a support vector machine (SVM) model. We developed both global and local inference models and compared their performance, demonstrating that local models are generally superior for our application. Using both the expression levels of the genes expressed in developing embryos and prior known regulatory relationships, GRNs were predicted for specific embryonic developmental stages. The targets that are strongly positively correlated with their regulators are mostly expressed at the beginning of seed development. Potential direct targets were identified based on a match between the promoter regions of these inferred targets and the cis elements recognized by specific regulators. Our analysis also provides evidence for previously unknown inhibitory effects of three positive regulators of gene expression. The Beacon GRN inference tool provides a valuable model system for context-specific GRN inference and is freely available at https://github.com/BeaconProjectAtVirginiaTech/beacon_network_inference.git. PMID:28066488
Deep Logic Networks: Inserting and Extracting Knowledge From Deep Belief Networks.
Tran, Son N; d'Avila Garcez, Artur S
2018-02-01
Developments in deep learning have seen the use of layerwise unsupervised learning combined with supervised learning for fine-tuning. With this layerwise approach, a deep network can be seen as a more modular system that lends itself well to learning representations. In this paper, we investigate whether such modularity can be useful to the insertion of background knowledge into deep networks, whether it can improve learning performance when it is available, and to the extraction of knowledge from trained deep networks, and whether it can offer a better understanding of the representations learned by such networks. To this end, we use a simple symbolic language-a set of logical rules that we call confidence rules-and show that it is suitable for the representation of quantitative reasoning in deep networks. We show by knowledge extraction that confidence rules can offer a low-cost representation for layerwise networks (or restricted Boltzmann machines). We also show that layerwise extraction can produce an improvement in the accuracy of deep belief networks. Furthermore, the proposed symbolic characterization of deep networks provides a novel method for the insertion of prior knowledge and training of deep networks. With the use of this method, a deep neural-symbolic system is proposed and evaluated, with the experimental results indicating that modularity through the use of confidence rules and knowledge insertion can be beneficial to network performance.
Semi-supervised Learning for Phenotyping Tasks.
Dligach, Dmitriy; Miller, Timothy; Savova, Guergana K
2015-01-01
Supervised learning is the dominant approach to automatic electronic health records-based phenotyping, but it is expensive due to the cost of manual chart review. Semi-supervised learning takes advantage of both scarce labeled and plentiful unlabeled data. In this work, we study a family of semi-supervised learning algorithms based on Expectation Maximization (EM) in the context of several phenotyping tasks. We first experiment with the basic EM algorithm. When the modeling assumptions are violated, basic EM leads to inaccurate parameter estimation. Augmented EM attenuates this shortcoming by introducing a weighting factor that downweights the unlabeled data. Cross-validation does not always lead to the best setting of the weighting factor and other heuristic methods may be preferred. We show that accurate phenotyping models can be trained with only a few hundred labeled (and a large number of unlabeled) examples, potentially providing substantial savings in the amount of the required manual chart review.
Alanazi, Hamdan O; Abdullah, Abdul Hanan; Qureshi, Kashif Naseer
2017-04-01
Recently, Artificial Intelligence (AI) has been used widely in medicine and health care sector. In machine learning, the classification or prediction is a major field of AI. Today, the study of existing predictive models based on machine learning methods is extremely active. Doctors need accurate predictions for the outcomes of their patients' diseases. In addition, for accurate predictions, timing is another significant factor that influences treatment decisions. In this paper, existing predictive models in medicine and health care have critically reviewed. Furthermore, the most famous machine learning methods have explained, and the confusion between a statistical approach and machine learning has clarified. A review of related literature reveals that the predictions of existing predictive models differ even when the same dataset is used. Therefore, existing predictive models are essential, and current methods must be improved.
Taylor, Jonathan Christopher; Fenner, John Wesley
2017-11-29
Semi-quantification methods are well established in the clinic for assisted reporting of (I123) Ioflupane images. Arguably, these are limited diagnostic tools. Recent research has demonstrated the potential for improved classification performance offered by machine learning algorithms. A direct comparison between methods is required to establish whether a move towards widespread clinical adoption of machine learning algorithms is justified. This study compared three machine learning algorithms with that of a range of semi-quantification methods, using the Parkinson's Progression Markers Initiative (PPMI) research database and a locally derived clinical database for validation. Machine learning algorithms were based on support vector machine classifiers with three different sets of features: Voxel intensities Principal components of image voxel intensities Striatal binding radios from the putamen and caudate. Semi-quantification methods were based on striatal binding ratios (SBRs) from both putamina, with and without consideration of the caudates. Normal limits for the SBRs were defined through four different methods: Minimum of age-matched controls Mean minus 1/1.5/2 standard deviations from age-matched controls Linear regression of normal patient data against age (minus 1/1.5/2 standard errors) Selection of the optimum operating point on the receiver operator characteristic curve from normal and abnormal training data Each machine learning and semi-quantification technique was evaluated with stratified, nested 10-fold cross-validation, repeated 10 times. The mean accuracy of the semi-quantitative methods for classification of local data into Parkinsonian and non-Parkinsonian groups varied from 0.78 to 0.87, contrasting with 0.89 to 0.95 for classifying PPMI data into healthy controls and Parkinson's disease groups. The machine learning algorithms gave mean accuracies between 0.88 to 0.92 and 0.95 to 0.97 for local and PPMI data respectively. Classification performance was lower for the local database than the research database for both semi-quantitative and machine learning algorithms. However, for both databases, the machine learning methods generated equal or higher mean accuracies (with lower variance) than any of the semi-quantification approaches. The gain in performance from using machine learning algorithms as compared to semi-quantification was relatively small and may be insufficient, when considered in isolation, to offer significant advantages in the clinical context.
Revisit of Machine Learning Supported Biological and Biomedical Studies.
Yu, Xiang-Tian; Wang, Lu; Zeng, Tao
2018-01-01
Generally, machine learning includes many in silico methods to transform the principles underlying natural phenomenon to human understanding information, which aim to save human labor, to assist human judge, and to create human knowledge. It should have wide application potential in biological and biomedical studies, especially in the era of big biological data. To look through the application of machine learning along with biological development, this review provides wide cases to introduce the selection of machine learning methods in different practice scenarios involved in the whole biological and biomedical study cycle and further discusses the machine learning strategies for analyzing omics data in some cutting-edge biological studies. Finally, the notes on new challenges for machine learning due to small-sample high-dimension are summarized from the key points of sample unbalance, white box, and causality.
Ethoscopes: An open platform for high-throughput ethomics
Geissmann, Quentin; Garcia Rodriguez, Luis; Beckwith, Esteban J.; French, Alice S.; Jamasb, Arian R.
2017-01-01
Here, we present the use of ethoscopes, which are machines for high-throughput analysis of behavior in Drosophila and other animals. Ethoscopes provide a software and hardware solution that is reproducible and easily scalable. They perform, in real-time, tracking and profiling of behavior by using a supervised machine learning algorithm, are able to deliver behaviorally triggered stimuli to flies in a feedback-loop mode, and are highly customizable and open source. Ethoscopes can be built easily by using 3D printing technology and rely on Raspberry Pi microcomputers and Arduino boards to provide affordable and flexible hardware. All software and construction specifications are available at http://lab.gilest.ro/ethoscope. PMID:29049280
DL-ReSuMe: A Delay Learning-Based Remote Supervised Method for Spiking Neurons.
Taherkhani, Aboozar; Belatreche, Ammar; Li, Yuhua; Maguire, Liam P
2015-12-01
Recent research has shown the potential capability of spiking neural networks (SNNs) to model complex information processing in the brain. There is biological evidence to prove the use of the precise timing of spikes for information coding. However, the exact learning mechanism in which the neuron is trained to fire at precise times remains an open problem. The majority of the existing learning methods for SNNs are based on weight adjustment. However, there is also biological evidence that the synaptic delay is not constant. In this paper, a learning method for spiking neurons, called delay learning remote supervised method (DL-ReSuMe), is proposed to merge the delay shift approach and ReSuMe-based weight adjustment to enhance the learning performance. DL-ReSuMe uses more biologically plausible properties, such as delay learning, and needs less weight adjustment than ReSuMe. Simulation results have shown that the proposed DL-ReSuMe approach achieves learning accuracy and learning speed improvements compared with ReSuMe.
An online supervised learning method based on gradient descent for spiking neurons.
Xu, Yan; Yang, Jing; Zhong, Shuiming
2017-09-01
The purpose of supervised learning with temporal encoding for spiking neurons is to make the neurons emit a specific spike train encoded by precise firing times of spikes. The gradient-descent-based (GDB) learning methods are widely used and verified in the current research. Although the existing GDB multi-spike learning (or spike sequence learning) methods have good performance, they work in an offline manner and still have some limitations. This paper proposes an online GDB spike sequence learning method for spiking neurons that is based on the online adjustment mechanism of real biological neuron synapses. The method constructs error function and calculates the adjustment of synaptic weights as soon as the neurons emit a spike during their running process. We analyze and synthesize desired and actual output spikes to select appropriate input spikes in the calculation of weight adjustment in this paper. The experimental results show that our method obviously improves learning performance compared with the offline learning manner and has certain advantage on learning accuracy compared with other learning methods. Stronger learning ability determines that the method has large pattern storage capacity. Copyright © 2017 Elsevier Ltd. All rights reserved.
Large-scale weakly supervised object localization via latent category learning.
Chong Wang; Kaiqi Huang; Weiqiang Ren; Junge Zhang; Maybank, Steve
2015-04-01
Localizing objects in cluttered backgrounds is challenging under large-scale weakly supervised conditions. Due to the cluttered image condition, objects usually have large ambiguity with backgrounds. Besides, there is also a lack of effective algorithm for large-scale weakly supervised localization in cluttered backgrounds. However, backgrounds contain useful latent information, e.g., the sky in the aeroplane class. If this latent information can be learned, object-background ambiguity can be largely reduced and background can be suppressed effectively. In this paper, we propose the latent category learning (LCL) in large-scale cluttered conditions. LCL is an unsupervised learning method which requires only image-level class labels. First, we use the latent semantic analysis with semantic object representation to learn the latent categories, which represent objects, object parts or backgrounds. Second, to determine which category contains the target object, we propose a category selection strategy by evaluating each category's discrimination. Finally, we propose the online LCL for use in large-scale conditions. Evaluation on the challenging PASCAL Visual Object Class (VOC) 2007 and the large-scale imagenet large-scale visual recognition challenge 2013 detection data sets shows that the method can improve the annotation precision by 10% over previous methods. More importantly, we achieve the detection precision which outperforms previous results by a large margin and can be competitive to the supervised deformable part model 5.0 baseline on both data sets.
Korjus, Kristjan; Hebart, Martin N.; Vicente, Raul
2016-01-01
Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. For finding the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by application of the classifier with optimized parameters to a separate test set for estimating the classifier’s generalization performance. With limited data, this separation of test data creates a difficult trade-off between having more statistical power in estimating generalization performance versus choosing better parameters and fitting a better model. We propose a novel approach that we term “Cross-validation and cross-testing” improving this trade-off by re-using test data without biasing classifier performance. The novel approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches which may be particularly useful in cases where model weights do not require interpretation, but parameters do. PMID:27564393
Korjus, Kristjan; Hebart, Martin N; Vicente, Raul
2016-01-01
Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. For finding the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by application of the classifier with optimized parameters to a separate test set for estimating the classifier's generalization performance. With limited data, this separation of test data creates a difficult trade-off between having more statistical power in estimating generalization performance versus choosing better parameters and fitting a better model. We propose a novel approach that we term "Cross-validation and cross-testing" improving this trade-off by re-using test data without biasing classifier performance. The novel approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches which may be particularly useful in cases where model weights do not require interpretation, but parameters do.
Applications of Machine Learning and Rule Induction,
1995-02-15
An important area of application for machine learning is in automating the acquisition of knowledge bases required for expert systems. In this paper...we review the major paradigms for machine learning , including neural networks, instance-based methods, genetic learning, rule induction, and analytic
Zhang, Jian; Lockhart, Thurmon E.; Soangra, Rahul
2013-01-01
Fatigue in lower extremity musculature is associated with decline in postural stability, motor performance and alters normal walking patterns in human subjects. Automated recognition of lower extremity muscle fatigue condition may be advantageous in early detection of fall and injury risks. Supervised machine learning methods such as Support Vector Machines (SVM) have been previously used for classifying healthy and pathological gait patterns and also for separating old and young gait patterns. In this study we explore the classification potential of SVM in recognition of gait patterns utilizing an inertial measurement unit associated with lower extremity muscular fatigue. Both kinematic and kinetic gait patterns of 17 participants (29±11 years) were recorded and analyzed in normal and fatigued state of walking. Lower extremities were fatigued by performance of a squatting exercise until the participants reached 60% of their baseline maximal voluntary exertion level. Feature selection methods were used to classify fatigue and no-fatigue conditions based on temporal and frequency information of the signals. Additionally, influences of three different kernel schemes (i.e., linear, polynomial, and radial basis function) were investigated for SVM classification. The results indicated that lower extremity muscle fatigue condition influenced gait and loading responses. In terms of the SVM classification results, an accuracy of 96% was reached in distinguishing the two gait patterns (fatigue and no-fatigue) within the same subject using the kinematic, time and frequency domain features. It is also found that linear kernel and RBF kernel were equally good to identify intra-individual fatigue characteristics. These results suggest that intra-subject fatigue classification using gait patterns from an inertial sensor holds considerable potential in identifying “at-risk” gait due to muscle fatigue. PMID:24081829
McAllister, Patrick; Zheng, Huiru; Bond, Raymond; Moorhead, Anne
2018-04-01
Obesity is increasing worldwide and can cause many chronic conditions such as type-2 diabetes, heart disease, sleep apnea, and some cancers. Monitoring dietary intake through food logging is a key method to maintain a healthy lifestyle to prevent and manage obesity. Computer vision methods have been applied to food logging to automate image classification for monitoring dietary intake. In this work we applied pretrained ResNet-152 and GoogleNet convolutional neural networks (CNNs), initially trained using ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset with MatConvNet package, to extract features from food image datasets; Food 5K, Food-11, RawFooT-DB, and Food-101. Deep features were extracted from CNNs and used to train machine learning classifiers including artificial neural network (ANN), support vector machine (SVM), Random Forest, and Naive Bayes. Results show that using ResNet-152 deep features with SVM with RBF kernel can accurately detect food items with 99.4% accuracy using Food-5K validation food image dataset and 98.8% with Food-5K evaluation dataset using ANN, SVM-RBF, and Random Forest classifiers. Trained with ResNet-152 features, ANN can achieve 91.34%, 99.28% when applied to Food-11 and RawFooT-DB food image datasets respectively and SVM with RBF kernel can achieve 64.98% with Food-101 image dataset. From this research it is clear that using deep CNN features can be used efficiently for diverse food item image classification. The work presented in this research shows that pretrained ResNet-152 features provide sufficient generalisation power when applied to a range of food image classification tasks. Copyright © 2018 Elsevier Ltd. All rights reserved.
Machine Learning Methods for Analysis of Metabolic Data and Metabolic Pathway Modeling
Cuperlovic-Culf, Miroslava
2018-01-01
Machine learning uses experimental data to optimize clustering or classification of samples or features, or to develop, augment or verify models that can be used to predict behavior or properties of systems. It is expected that machine learning will help provide actionable knowledge from a variety of big data including metabolomics data, as well as results of metabolism models. A variety of machine learning methods has been applied in bioinformatics and metabolism analyses including self-organizing maps, support vector machines, the kernel machine, Bayesian networks or fuzzy logic. To a lesser extent, machine learning has also been utilized to take advantage of the increasing availability of genomics and metabolomics data for the optimization of metabolic network models and their analysis. In this context, machine learning has aided the development of metabolic networks, the calculation of parameters for stoichiometric and kinetic models, as well as the analysis of major features in the model for the optimal application of bioreactors. Examples of this very interesting, albeit highly complex, application of machine learning for metabolism modeling will be the primary focus of this review presenting several different types of applications for model optimization, parameter determination or system analysis using models, as well as the utilization of several different types of machine learning technologies. PMID:29324649
Machine Learning Methods for Analysis of Metabolic Data and Metabolic Pathway Modeling.
Cuperlovic-Culf, Miroslava
2018-01-11
Machine learning uses experimental data to optimize clustering or classification of samples or features, or to develop, augment or verify models that can be used to predict behavior or properties of systems. It is expected that machine learning will help provide actionable knowledge from a variety of big data including metabolomics data, as well as results of metabolism models. A variety of machine learning methods has been applied in bioinformatics and metabolism analyses including self-organizing maps, support vector machines, the kernel machine, Bayesian networks or fuzzy logic. To a lesser extent, machine learning has also been utilized to take advantage of the increasing availability of genomics and metabolomics data for the optimization of metabolic network models and their analysis. In this context, machine learning has aided the development of metabolic networks, the calculation of parameters for stoichiometric and kinetic models, as well as the analysis of major features in the model for the optimal application of bioreactors. Examples of this very interesting, albeit highly complex, application of machine learning for metabolism modeling will be the primary focus of this review presenting several different types of applications for model optimization, parameter determination or system analysis using models, as well as the utilization of several different types of machine learning technologies.
Bae, Sangwon; Chung, Tammy; Ferreira, Denzil; Dey, Anind K; Suffoletto, Brian
2018-08-01
Real-time detection of drinking could improve timely delivery of interventions aimed at reducing alcohol consumption and alcohol-related injury, but existing detection methods are burdensome or impractical. To evaluate whether phone sensor data and machine learning models are useful to detect alcohol use events, and to discuss implications of these results for just-in-time mobile interventions. 38 non-treatment seeking young adult heavy drinkers downloaded AWARE app (which continuously collected mobile phone sensor data), and reported alcohol consumption (number of drinks, start/end time of prior day's drinking) for 28days. We tested various machine learning models using the 20 most informative sensor features to classify time periods as non-drinking, low-risk (1 to 3/4 drinks per occasion for women/men), and high-risk drinking (>4/5 drinks per occasion for women/men). Among 30 participants in the analyses, 207 non-drinking, 41 low-risk, and 45 high-risk drinking episodes were reported. A Random Forest model using 30-min windows with 1day of historical data performed best for detecting high-risk drinking, correctly classifying high-risk drinking windows 90.9% of the time. The most informative sensor features were related to time (i.e., day of week, time of day), movement (e.g., change in activities), device usage (e.g., screen duration), and communication (e.g., call duration, typing speed). Preliminary evidence suggests that sensor data captured from mobile phones of young adults is useful in building accurate models to detect periods of high-risk drinking. Interventions using mobile phone sensor features could trigger delivery of a range of interventions to potentially improve effectiveness. Copyright © 2017 Elsevier Ltd. All rights reserved.
Li, Dingcheng; Sohn, Sunghwan; Wu, Stephen Tze-Inn; Wagholikar, Kavishwar; Torii, Manabu; Liu, Hongfang
2012-01-01
Objective This paper describes the coreference resolution system submitted by Mayo Clinic for the 2011 i2b2/VA/Cincinnati shared task Track 1C. The goal of the task was to construct a system that links the markables corresponding to the same entity. Materials and methods The task organizers provided progress notes and discharge summaries that were annotated with the markables of treatment, problem, test, person, and pronoun. We used a multi-pass sieve algorithm that applies deterministic rules in the order of preciseness and simultaneously gathers information about the entities in the documents. Our system, MedCoref, also uses a state-of-the-art machine learning framework as an alternative to the final, rule-based pronoun resolution sieve. Results The best system that uses a multi-pass sieve has an overall score of 0.836 (average of B3, MUC, Blanc, and CEAF F score) for the training set and 0.843 for the test set. Discussion A supervised machine learning system that typically uses a single function to find coreferents cannot accommodate irregularities encountered in data especially given the insufficient number of examples. On the other hand, a completely deterministic system could lead to a decrease in recall (sensitivity) when the rules are not exhaustive. The sieve-based framework allows one to combine reliable machine learning components with rules designed by experts. Conclusion Using relatively simple rules, part-of-speech information, and semantic type properties, an effective coreference resolution system could be designed. The source code of the system described is available at https://sourceforge.net/projects/ohnlp/files/MedCoref. PMID:22707745
Vineyard water status assessment using on-the-go thermal imaging and machine learning.
Gutiérrez, Salvador; Diago, María P; Fernández-Novales, Juan; Tardaguila, Javier
2018-01-01
The high impact of irrigation in crop quality and yield in grapevine makes the development of plant water status monitoring systems an essential issue in the context of sustainable viticulture. This study presents an on-the-go approach for the estimation of vineyard water status using thermal imaging and machine learning. The experiments were conducted during seven different weeks from July to September in season 2016. A thermal camera was embedded on an all-terrain vehicle moving at 5 km/h to take on-the-go thermal images of the vineyard canopy at 1.2 m of distance and 1.0 m from the ground. The two sides of the canopy were measured for the development of side-specific and global models. Stem water potential was acquired and used as reference method. Additionally, reference temperatures Tdry and Twet were determined for the calculation of two thermal indices: the crop water stress index (CWSI) and the Jones index (Ig). Prediction models were built with and without considering the reference temperatures as input of the training algorithms. When using the reference temperatures, the best models casted determination coefficients R2 of 0.61 and 0.58 for cross validation and prediction (RMSE values of 0.190 MPa and 0.204 MPa), respectively. Nevertheless, when the reference temperatures were not considered in the training of the models, their performance statistics responded in the same way, returning R2 values up to 0.62 and 0.65 for cross validation and prediction (RMSE values of 0.190 MPa and 0.184 MPa), respectively. The outcomes provided by the machine learning algorithms support the use of thermal imaging for fast, reliable estimation of a vineyard water status, even suppressing the necessity of supervised acquisition of reference temperatures. The new developed on-the-go method can be very useful in the grape and wine industry for assessing and mapping vineyard water status.
Vineyard water status assessment using on-the-go thermal imaging and machine learning
Gutiérrez, Salvador; Diago, María P.; Fernández-Novales, Juan
2018-01-01
The high impact of irrigation in crop quality and yield in grapevine makes the development of plant water status monitoring systems an essential issue in the context of sustainable viticulture. This study presents an on-the-go approach for the estimation of vineyard water status using thermal imaging and machine learning. The experiments were conducted during seven different weeks from July to September in season 2016. A thermal camera was embedded on an all-terrain vehicle moving at 5 km/h to take on-the-go thermal images of the vineyard canopy at 1.2 m of distance and 1.0 m from the ground. The two sides of the canopy were measured for the development of side-specific and global models. Stem water potential was acquired and used as reference method. Additionally, reference temperatures Tdry and Twet were determined for the calculation of two thermal indices: the crop water stress index (CWSI) and the Jones index (Ig). Prediction models were built with and without considering the reference temperatures as input of the training algorithms. When using the reference temperatures, the best models casted determination coefficients R2 of 0.61 and 0.58 for cross validation and prediction (RMSE values of 0.190 MPa and 0.204 MPa), respectively. Nevertheless, when the reference temperatures were not considered in the training of the models, their performance statistics responded in the same way, returning R2 values up to 0.62 and 0.65 for cross validation and prediction (RMSE values of 0.190 MPa and 0.184 MPa), respectively. The outcomes provided by the machine learning algorithms support the use of thermal imaging for fast, reliable estimation of a vineyard water status, even suppressing the necessity of supervised acquisition of reference temperatures. The new developed on-the-go method can be very useful in the grape and wine industry for assessing and mapping vineyard water status. PMID:29389982
Applications of machine learning in cancer prediction and prognosis.
Cruz, Joseph A; Wishart, David S
2007-02-11
Machine learning is a branch of artificial intelligence that employs a variety of statistical, probabilistic and optimization techniques that allows computers to "learn" from past examples and to detect hard-to-discern patterns from large, noisy or complex data sets. This capability is particularly well-suited to medical applications, especially those that depend on complex proteomic and genomic measurements. As a result, machine learning is frequently used in cancer diagnosis and detection. More recently machine learning has been applied to cancer prognosis and prediction. This latter approach is particularly interesting as it is part of a growing trend towards personalized, predictive medicine. In assembling this review we conducted a broad survey of the different types of machine learning methods being used, the types of data being integrated and the performance of these methods in cancer prediction and prognosis. A number of trends are noted, including a growing dependence on protein biomarkers and microarray data, a strong bias towards applications in prostate and breast cancer, and a heavy reliance on "older" technologies such artificial neural networks (ANNs) instead of more recently developed or more easily interpretable machine learning methods. A number of published studies also appear to lack an appropriate level of validation or testing. Among the better designed and validated studies it is clear that machine learning methods can be used to substantially (15-25%) improve the accuracy of predicting cancer susceptibility, recurrence and mortality. At a more fundamental level, it is also evident that machine learning is also helping to improve our basic understanding of cancer development and progression.
Yao, Chen; Zhu, Xiaojin; Weigel, Kent A
2016-11-07
Genomic prediction for novel traits, which can be costly and labor-intensive to measure, is often hampered by low accuracy due to the limited size of the reference population. As an option to improve prediction accuracy, we introduced a semi-supervised learning strategy known as the self-training model, and applied this method to genomic prediction of residual feed intake (RFI) in dairy cattle. We describe a self-training model that is wrapped around a support vector machine (SVM) algorithm, which enables it to use data from animals with and without measured phenotypes. Initially, a SVM model was trained using data from 792 animals with measured RFI phenotypes. Then, the resulting SVM was used to generate self-trained phenotypes for 3000 animals for which RFI measurements were not available. Finally, the SVM model was re-trained using data from up to 3792 animals, including those with measured and self-trained RFI phenotypes. Incorporation of additional animals with self-trained phenotypes enhanced the accuracy of genomic predictions compared to that of predictions that were derived from the subset of animals with measured phenotypes. The optimal ratio of animals with self-trained phenotypes to animals with measured phenotypes (2.5, 2.0, and 1.8) and the maximum increase achieved in prediction accuracy measured as the correlation between predicted and actual RFI phenotypes (5.9, 4.1, and 2.4%) decreased as the size of the initial training set (300, 400, and 500 animals with measured phenotypes) increased. The optimal number of animals with self-trained phenotypes may be smaller when prediction accuracy is measured as the mean squared error rather than the correlation between predicted and actual RFI phenotypes. Our results demonstrate that semi-supervised learning models that incorporate self-trained phenotypes can achieve genomic prediction accuracies that are comparable to those obtained with models using larger training sets that include only animals with measured phenotypes. Semi-supervised learning can be helpful for genomic prediction of novel traits, such as RFI, for which the size of reference population is limited, in particular, when the animals to be predicted and the animals in the reference population originate from the same herd-environment.
Comparative Analysis of River Flow Modelling by Using Supervised Learning Technique
NASA Astrophysics Data System (ADS)
Ismail, Shuhaida; Mohamad Pandiahi, Siraj; Shabri, Ani; Mustapha, Aida
2018-04-01
The goal of this research is to investigate the efficiency of three supervised learning algorithms for forecasting monthly river flow of the Indus River in Pakistan, spread over 550 square miles or 1800 square kilometres. The algorithms include the Least Square Support Vector Machine (LSSVM), Artificial Neural Network (ANN) and Wavelet Regression (WR). The forecasting models predict the monthly river flow obtained from the three models individually for river flow data and the accuracy of the all models were then compared against each other. The monthly river flow of the said river has been forecasted using these three models. The obtained results were compared and statistically analysed. Then, the results of this analytical comparison showed that LSSVM model is more precise in the monthly river flow forecasting. It was found that LSSVM has he higher r with the value of 0.934 compared to other models. This indicate that LSSVM is more accurate and efficient as compared to the ANN and WR model.
New developments in technology-assisted supervision and training: a practical overview.
Rousmaniere, Tony; Abbass, Allan; Frederickson, Jon
2014-11-01
Clinical supervision and training are now widely available online. In this article, three of the most accessible and widely adopted new developments in clinical supervision and training technology are described: Videoconference supervision, cloud-based file sharing software, and clinical outcome tracking software. Partial transcripts from two online supervision sessions are provided as examples of videoconference-based supervision. The benefits and limitations of technology in supervision and training are discussed, with an emphasis on supervision process, ethics, privacy, and security. Recommendations for supervision practice are made, including methods to enhance experiential learning, the supervisory working alliance, and online security. © 2014 Wiley Periodicals, Inc.
Li, Yanpeng; Hu, Xiaohua; Lin, Hongfei; Yang, Zhihao
2011-01-01
Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs) from original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is: EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB unlabeled PubMed abstracts. The experimental results on BioCreative 2, AIMED corpus, and TREC 2005 Genomics Track show that 1) FCG can utilize well the sparse features ignored by supervised learning. 2) It improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the tree tasks. 3) Our methods achieve 89.1, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets.
Mujtaba, Ghulam; Shuib, Liyana; Raj, Ram Gopal; Rajandram, Retnagowri; Shaikh, Khairunisa; Al-Garadi, Mohammed Ali
2017-01-01
Widespread implementation of electronic databases has improved the accessibility of plaintext clinical information for supplementary use. Numerous machine learning techniques, such as supervised machine learning approaches or ontology-based approaches, have been employed to obtain useful information from plaintext clinical data. This study proposes an automatic multi-class classification system to predict accident-related causes of death from plaintext autopsy reports through expert-driven feature selection with supervised automatic text classification decision models. Accident-related autopsy reports were obtained from one of the largest hospital in Kuala Lumpur. These reports belong to nine different accident-related causes of death. Master feature vector was prepared by extracting features from the collected autopsy reports by using unigram with lexical categorization. This master feature vector was used to detect cause of death [according to internal classification of disease version 10 (ICD-10) classification system] through five automated feature selection schemes, proposed expert-driven approach, five subset sizes of features, and five machine learning classifiers. Model performance was evaluated using precisionM, recallM, F-measureM, accuracy, and area under ROC curve. Four baselines were used to compare the results with the proposed system. Random forest and J48 decision models parameterized using expert-driven feature selection yielded the highest evaluation measure approaching (85% to 90%) for most metrics by using a feature subset size of 30. The proposed system also showed approximately 14% to 16% improvement in the overall accuracy compared with the existing techniques and four baselines. The proposed system is feasible and practical to use for automatic classification of ICD-10-related cause of death from autopsy reports. The proposed system assists pathologists to accurately and rapidly determine underlying cause of death based on autopsy findings. Furthermore, the proposed expert-driven feature selection approach and the findings are generally applicable to other kinds of plaintext clinical reports.
NASA Astrophysics Data System (ADS)
Tu, Shu-Ju; Wang, Chih-Wei; Pan, Kuang-Tse; Wu, Yi-Cheng; Wu, Chen-Te
2018-03-01
Lung cancer screening aims to detect small pulmonary nodules and decrease the mortality rate of those affected. However, studies from large-scale clinical trials of lung cancer screening have shown that the false-positive rate is high and positive predictive value is low. To address these problems, a technical approach is greatly needed for accurate malignancy differentiation among these early-detected nodules. We studied the clinical feasibility of an additional protocol of localized thin-section CT for further assessment on recalled patients from lung cancer screening tests. Our approach of localized thin-section CT was integrated with radiomics features extraction and machine learning classification which was supervised by pathological diagnosis. Localized thin-section CT images of 122 nodules were retrospectively reviewed and 374 radiomics features were extracted. In this study, 48 nodules were benign and 74 malignant. There were nine patients with multiple nodules and four with synchronous multiple malignant nodules. Different machine learning classifiers with a stratified ten-fold cross-validation were used and repeated 100 times to evaluate classification accuracy. Of the image features extracted from the thin-section CT images, 238 (64%) were useful in differentiating between benign and malignant nodules. These useful features include CT density (p = 0.002 518), sigma (p = 0.002 781), uniformity (p = 0.032 41), and entropy (p = 0.006 685). The highest classification accuracy was 79% by the logistic classifier. The performance metrics of this logistic classification model was 0.80 for the positive predictive value, 0.36 for the false-positive rate, and 0.80 for the area under the receiver operating characteristic curve. Our approach of direct risk classification supervised by the pathological diagnosis with localized thin-section CT and radiomics feature extraction may support clinical physicians in determining truly malignant nodules and therefore reduce problems in lung cancer screening.
Using Machine Learning for Advanced Anomaly Detection and Classification
NASA Astrophysics Data System (ADS)
Lane, B.; Poole, M.; Camp, M.; Murray-Krezan, J.
2016-09-01
Machine Learning (ML) techniques have successfully been used in a wide variety of applications to automatically detect and potentially classify changes in activity, or a series of activities by utilizing large amounts data, sometimes even seemingly-unrelated data. The amount of data being collected, processed, and stored in the Space Situational Awareness (SSA) domain has grown at an exponential rate and is now better suited for ML. This paper describes development of advanced algorithms to deliver significant improvements in characterization of deep space objects and indication and warning (I&W) using a global network of telescopes that are collecting photometric data on a multitude of space-based objects. The Phase II Air Force Research Laboratory (AFRL) Small Business Innovative Research (SBIR) project Autonomous Characterization Algorithms for Change Detection and Characterization (ACDC), contracted to ExoAnalytic Solutions Inc. is providing the ability to detect and identify photometric signature changes due to potential space object changes (e.g. stability, tumble rate, aspect ratio), and correlate observed changes to potential behavioral changes using a variety of techniques, including supervised learning. Furthermore, these algorithms run in real-time on data being collected and processed by the ExoAnalytic Space Operations Center (EspOC), providing timely alerts and warnings while dynamically creating collection requirements to the EspOC for the algorithms that generate higher fidelity I&W. This paper will discuss the recently implemented ACDC algorithms, including the general design approach and results to date. The usage of supervised algorithms, such as Support Vector Machines, Neural Networks, k-Nearest Neighbors, etc., and unsupervised algorithms, for example k-means, Principle Component Analysis, Hierarchical Clustering, etc., and the implementations of these algorithms is explored. Results of applying these algorithms to EspOC data both in an off-line "pattern of life" analysis as well as using the algorithms on-line in real-time, meaning as data is collected, will be presented. Finally, future work in applying ML for SSA will be discussed.
Tu, Shu-Ju; Wang, Chih-Wei; Pan, Kuang-Tse; Wu, Yi-Cheng; Wu, Chen-Te
2018-03-14
Lung cancer screening aims to detect small pulmonary nodules and decrease the mortality rate of those affected. However, studies from large-scale clinical trials of lung cancer screening have shown that the false-positive rate is high and positive predictive value is low. To address these problems, a technical approach is greatly needed for accurate malignancy differentiation among these early-detected nodules. We studied the clinical feasibility of an additional protocol of localized thin-section CT for further assessment on recalled patients from lung cancer screening tests. Our approach of localized thin-section CT was integrated with radiomics features extraction and machine learning classification which was supervised by pathological diagnosis. Localized thin-section CT images of 122 nodules were retrospectively reviewed and 374 radiomics features were extracted. In this study, 48 nodules were benign and 74 malignant. There were nine patients with multiple nodules and four with synchronous multiple malignant nodules. Different machine learning classifiers with a stratified ten-fold cross-validation were used and repeated 100 times to evaluate classification accuracy. Of the image features extracted from the thin-section CT images, 238 (64%) were useful in differentiating between benign and malignant nodules. These useful features include CT density (p = 0.002 518), sigma (p = 0.002 781), uniformity (p = 0.032 41), and entropy (p = 0.006 685). The highest classification accuracy was 79% by the logistic classifier. The performance metrics of this logistic classification model was 0.80 for the positive predictive value, 0.36 for the false-positive rate, and 0.80 for the area under the receiver operating characteristic curve. Our approach of direct risk classification supervised by the pathological diagnosis with localized thin-section CT and radiomics feature extraction may support clinical physicians in determining truly malignant nodules and therefore reduce problems in lung cancer screening.
Adaptive distance metric learning for diffusion tensor image segmentation.
Kong, Youyong; Wang, Defeng; Shi, Lin; Hui, Steve C N; Chu, Winnie C W
2014-01-01
High quality segmentation of diffusion tensor images (DTI) is of key interest in biomedical research and clinical application. In previous studies, most efforts have been made to construct predefined metrics for different DTI segmentation tasks. These methods require adequate prior knowledge and tuning parameters. To overcome these disadvantages, we proposed to automatically learn an adaptive distance metric by a graph based semi-supervised learning model for DTI segmentation. An original discriminative distance vector was first formulated by combining both geometry and orientation distances derived from diffusion tensors. The kernel metric over the original distance and labels of all voxels were then simultaneously optimized in a graph based semi-supervised learning approach. Finally, the optimization task was efficiently solved with an iterative gradient descent method to achieve the optimal solution. With our approach, an adaptive distance metric could be available for each specific segmentation task. Experiments on synthetic and real brain DTI datasets were performed to demonstrate the effectiveness and robustness of the proposed distance metric learning approach. The performance of our approach was compared with three classical metrics in the graph based semi-supervised learning framework.
Adaptive Distance Metric Learning for Diffusion Tensor Image Segmentation
Kong, Youyong; Wang, Defeng; Shi, Lin; Hui, Steve C. N.; Chu, Winnie C. W.
2014-01-01
High quality segmentation of diffusion tensor images (DTI) is of key interest in biomedical research and clinical application. In previous studies, most efforts have been made to construct predefined metrics for different DTI segmentation tasks. These methods require adequate prior knowledge and tuning parameters. To overcome these disadvantages, we proposed to automatically learn an adaptive distance metric by a graph based semi-supervised learning model for DTI segmentation. An original discriminative distance vector was first formulated by combining both geometry and orientation distances derived from diffusion tensors. The kernel metric over the original distance and labels of all voxels were then simultaneously optimized in a graph based semi-supervised learning approach. Finally, the optimization task was efficiently solved with an iterative gradient descent method to achieve the optimal solution. With our approach, an adaptive distance metric could be available for each specific segmentation task. Experiments on synthetic and real brain DTI datasets were performed to demonstrate the effectiveness and robustness of the proposed distance metric learning approach. The performance of our approach was compared with three classical metrics in the graph based semi-supervised learning framework. PMID:24651858
2011-01-01
Background Machine learning has a vast range of applications. In particular, advanced machine learning methods are routinely and increasingly used in quantitative structure activity relationship (QSAR) modeling. QSAR data sets often encompass tens of thousands of compounds and the size of proprietary, as well as public data sets, is rapidly growing. Hence, there is a demand for computationally efficient machine learning algorithms, easily available to researchers without extensive machine learning knowledge. In granting the scientific principles of transparency and reproducibility, Open Source solutions are increasingly acknowledged by regulatory authorities. Thus, an Open Source state-of-the-art high performance machine learning platform, interfacing multiple, customized machine learning algorithms for both graphical programming and scripting, to be used for large scale development of QSAR models of regulatory quality, is of great value to the QSAR community. Results This paper describes the implementation of the Open Source machine learning package AZOrange. AZOrange is specially developed to support batch generation of QSAR models in providing the full work flow of QSAR modeling, from descriptor calculation to automated model building, validation and selection. The automated work flow relies upon the customization of the machine learning algorithms and a generalized, automated model hyper-parameter selection process. Several high performance machine learning algorithms are interfaced for efficient data set specific selection of the statistical method, promoting model accuracy. Using the high performance machine learning algorithms of AZOrange does not require programming knowledge as flexible applications can be created, not only at a scripting level, but also in a graphical programming environment. Conclusions AZOrange is a step towards meeting the needs for an Open Source high performance machine learning platform, supporting the efficient development of highly accurate QSAR models fulfilling regulatory requirements. PMID:21798025
Stålring, Jonna C; Carlsson, Lars A; Almeida, Pedro; Boyer, Scott
2011-07-28
Machine learning has a vast range of applications. In particular, advanced machine learning methods are routinely and increasingly used in quantitative structure activity relationship (QSAR) modeling. QSAR data sets often encompass tens of thousands of compounds and the size of proprietary, as well as public data sets, is rapidly growing. Hence, there is a demand for computationally efficient machine learning algorithms, easily available to researchers without extensive machine learning knowledge. In granting the scientific principles of transparency and reproducibility, Open Source solutions are increasingly acknowledged by regulatory authorities. Thus, an Open Source state-of-the-art high performance machine learning platform, interfacing multiple, customized machine learning algorithms for both graphical programming and scripting, to be used for large scale development of QSAR models of regulatory quality, is of great value to the QSAR community. This paper describes the implementation of the Open Source machine learning package AZOrange. AZOrange is specially developed to support batch generation of QSAR models in providing the full work flow of QSAR modeling, from descriptor calculation to automated model building, validation and selection. The automated work flow relies upon the customization of the machine learning algorithms and a generalized, automated model hyper-parameter selection process. Several high performance machine learning algorithms are interfaced for efficient data set specific selection of the statistical method, promoting model accuracy. Using the high performance machine learning algorithms of AZOrange does not require programming knowledge as flexible applications can be created, not only at a scripting level, but also in a graphical programming environment. AZOrange is a step towards meeting the needs for an Open Source high performance machine learning platform, supporting the efficient development of highly accurate QSAR models fulfilling regulatory requirements.
A comparison of machine learning and Bayesian modelling for molecular serotyping.
Newton, Richard; Wernisch, Lorenz
2017-08-11
Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality. Identifying the pneumococcal serotype is an important step in monitoring the impact of vaccines used to protect against disease. Genomic microarrays provide an effective method for molecular serotyping. Previously we developed an empirical Bayesian model for the classification of serotypes from a molecular serotyping array. With only few samples available, a model driven approach was the only option. In the meanwhile, several thousand samples have been made available to us, providing an opportunity to investigate serotype classification by machine learning methods, which could complement the Bayesian model. We compare the performance of the original Bayesian model with two machine learning algorithms: Gradient Boosting Machines and Random Forests. We present our results as an example of a generic strategy whereby a preliminary probabilistic model is complemented or replaced by a machine learning classifier once enough data are available. Despite the availability of thousands of serotyping arrays, a problem encountered when applying machine learning methods is the lack of training data containing mixtures of serotypes; due to the large number of possible combinations. Most of the available training data comprises samples with only a single serotype. To overcome the lack of training data we implemented an iterative analysis, creating artificial training data of serotype mixtures by combining raw data from single serotype arrays. With the enhanced training set the machine learning algorithms out perform the original Bayesian model. However, for serotypes currently lacking sufficient training data the best performing implementation was a combination of the results of the Bayesian Model and the Gradient Boosting Machine. As well as being an effective method for classifying biological data, machine learning can also be used as an efficient method for revealing subtle biological insights, which we illustrate with an example.
Transient classification in LIGO data using difference boosting neural network
NASA Astrophysics Data System (ADS)
Mukund, N.; Abraham, S.; Kandhasamy, S.; Mitra, S.; Philip, N. S.
2017-05-01
Detection and classification of transients in data from gravitational wave detectors are crucial for efficient searches for true astrophysical events and identification of noise sources. We present a hybrid method for classification of short duration transients seen in gravitational wave data using both supervised and unsupervised machine learning techniques. To train the classifiers, we use the relative wavelet energy and the corresponding entropy obtained by applying one-dimensional wavelet decomposition on the data. The prediction accuracy of the trained classifier on nine simulated classes of gravitational wave transients and also LIGO's sixth science run hardware injections are reported. Targeted searches for a couple of known classes of nonastrophysical signals in the first observational run of Advanced LIGO data are also presented. The ability to accurately identify transient classes using minimal training samples makes the proposed method a useful tool for LIGO detector characterization as well as searches for short duration gravitational wave signals.
Stylianou, Neophytos; Akbarov, Artur; Kontopantelis, Evangelos; Buchan, Iain; Dunn, Ken W
2015-08-01
Predicting mortality from burn injury has traditionally employed logistic regression models. Alternative machine learning methods have been introduced in some areas of clinical prediction as the necessary software and computational facilities have become accessible. Here we compare logistic regression and machine learning predictions of mortality from burn. An established logistic mortality model was compared to machine learning methods (artificial neural network, support vector machine, random forests and naïve Bayes) using a population-based (England & Wales) case-cohort registry. Predictive evaluation used: area under the receiver operating characteristic curve; sensitivity; specificity; positive predictive value and Youden's index. All methods had comparable discriminatory abilities, similar sensitivities, specificities and positive predictive values. Although some machine learning methods performed marginally better than logistic regression the differences were seldom statistically significant and clinically insubstantial. Random forests were marginally better for high positive predictive value and reasonable sensitivity. Neural networks yielded slightly better prediction overall. Logistic regression gives an optimal mix of performance and interpretability. The established logistic regression model of burn mortality performs well against more complex alternatives. Clinical prediction with a small set of strong, stable, independent predictors is unlikely to gain much from machine learning outside specialist research contexts. Copyright © 2015 Elsevier Ltd and ISBI. All rights reserved.
Chai, Hua; Li, Zi-Na; Meng, De-Yu; Xia, Liang-Yong; Liang, Yong
2017-10-12
Gene selection is an attractive and important task in cancer survival analysis. Most existing supervised learning methods can only use the labeled biological data, while the censored data (weakly labeled data) far more than the labeled data are ignored in model building. Trying to utilize such information in the censored data, a semi-supervised learning framework (Cox-AFT model) combined with Cox proportional hazard (Cox) and accelerated failure time (AFT) model was used in cancer research, which has better performance than the single Cox or AFT model. This method, however, is easily affected by noise. To alleviate this problem, in this paper we combine the Cox-AFT model with self-paced learning (SPL) method to more effectively employ the information in the censored data in a self-learning way. SPL is a kind of reliable and stable learning mechanism, which is recently proposed for simulating the human learning process to help the AFT model automatically identify and include samples of high confidence into training, minimizing interference from high noise. Utilizing the SPL method produces two direct advantages: (1) The utilization of censored data is further promoted; (2) the noise delivered to the model is greatly decreased. The experimental results demonstrate the effectiveness of the proposed model compared to the traditional Cox-AFT model.
Classification without labels: learning from mixed samples in high energy physics
NASA Astrophysics Data System (ADS)
Metodiev, Eric M.; Nachman, Benjamin; Thaler, Jesse
2017-10-01
Modern machine learning techniques can be used to construct powerful models for difficult collider physics problems. In many applications, however, these models are trained on imperfect simulations due to a lack of truth-level information in the data, which risks the model learning artifacts of the simulation. In this paper, we introduce the paradigm of classification without labels (CWoLa) in which a classifier is trained to distinguish statistical mixtures of classes, which are common in collider physics. Crucially, neither individual labels nor class proportions are required, yet we prove that the optimal classifier in the CWoLa paradigm is also the optimal classifier in the traditional fully-supervised case where all label information is available. After demonstrating the power of this method in an analytical toy example, we consider a realistic benchmark for collider physics: distinguishing quark- versus gluon-initiated jets using mixed quark/gluon training samples. More generally, CWoLa can be applied to any classification problem where labels or class proportions are unknown or simulations are unreliable, but statistical mixtures of the classes are available.