Science.gov

Sample records for machine learning methods

  1. Machine learning methods in chemoinformatics

    PubMed Central

    Mitchell, John B O

    2014-01-01

    Machine learning algorithms are generally developed in computer science or adjacent disciplines and find their way into chemical modeling by a process of diffusion. Though particular machine learning methods are popular in chemoinformatics and quantitative structure–activity relationships (QSAR), many others exist in the technical literature. This discussion is methods-based and focused on some algorithms that chemoinformatics researchers frequently use. It makes no claim to be exhaustive. We concentrate on methods for supervised learning, predicting the unknown property values of a test set of instances, usually molecules, based on the known values for a training set. Particularly relevant approaches include Artificial Neural Networks, Random Forest, Support Vector Machine, k-Nearest Neighbors and naïve Bayes classifiers. WIREs Comput Mol Sci 2014, 4:468–481. doi:10.1002/wcms.1183 PMID:25285160
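
    A minimal illustrative sketch (not code from the review) of the supervised workflow described above: four of the named classifiers are fitted on a training set and evaluated on held-out molecules. It assumes scikit-learn is available; the descriptor matrix X and activity labels y are random placeholders standing in for computed molecular descriptors and measured activity classes.

    ```python
    # Sketch only: X would normally hold molecular descriptors/fingerprints,
    # y the known activity classes; here they are random placeholders.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))       # 200 molecules x 50 descriptors (placeholder)
    y = rng.integers(0, 2, size=200)     # active / inactive labels (placeholder)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    models = {
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "svm": SVC(kernel="rbf", C=1.0),
        "knn": KNeighborsClassifier(n_neighbors=5),
        "naive_bayes": GaussianNB(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, "test accuracy:", model.score(X_test, y_test))
    ```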

  2. Machine learning methods for predictive proteomics.

    PubMed

    Barla, Annalisa; Jurman, Giuseppe; Riccadonna, Samantha; Merler, Stefano; Chierici, Marco; Furlanello, Cesare

    2008-03-01

    The search for predictive biomarkers of disease from high-throughput mass spectrometry (MS) data requires a complex analysis path. Preprocessing and machine-learning modules are pipelined, starting from raw spectra, to set up a predictive classifier based on a shortlist of candidate features. As a machine-learning problem, proteomic profiling on MS data needs caution like the microarray case. The risk of overfitting and of selection bias effects is pervasive: not only potential features easily outnumber samples by 10(3) times, but it is easy to neglect information-leakage effects during preprocessing from spectra to peaks. The aim of this review is to explain how to build a general purpose design analysis protocol (DAP) for predictive proteomic profiling: we show how to limit leakage due to parameter tuning and how to organize classification and ranking on large numbers of replicate versions of the original data to avoid selection bias. The DAP can be used with alternative components, i.e. with different preprocessing methods (peak clustering or wavelet based), classifiers e.g. Support Vector Machine (SVM) or feature ranking methods (recursive feature elimination or I-Relief). A procedure for assessing stability and predictive value of the resulting biomarkers' list is also provided. The approach is exemplified with experiments on synthetic datasets (from the Cromwell MS simulator) and with publicly available datasets from cancer studies. PMID:18310105
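
    A hedged sketch of the leakage-control idea discussed above, not the authors' DAP itself: preprocessing, feature selection, and classifier tuning are wrapped in a pipeline so they are refit inside each training fold only, with an outer cross-validation loop estimating performance. The components (SelectKBest, linear SVM) and the placeholder peak-feature matrix are assumptions for illustration.

    ```python
    # Sketch of a leakage-aware analysis: feature ranking and SVM tuning happen
    # inside each training fold only (placeholder spectra-derived features).
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(80, 1000))   # 80 samples, 1000 candidate peak features (placeholder)
    y = rng.integers(0, 2, size=80)   # case / control labels (placeholder)

    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=20)),
        ("clf", SVC(kernel="linear")),
    ])
    inner = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]},
                         cv=StratifiedKFold(5, shuffle=True, random_state=1))
    outer_scores = cross_val_score(inner, X, y,
                                   cv=StratifiedKFold(5, shuffle=True, random_state=2))
    print("nested-CV accuracy: %.2f +/- %.2f" % (outer_scores.mean(), outer_scores.std()))
    ```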

  3. Studying depression using imaging and machine learning methods

    PubMed Central

    Patel, Meenal J.; Khalaf, Alexander; Aizenstein, Howard J.

    2015-01-01

    Depression is a complex clinical entity that can pose challenges for clinicians regarding both accurate diagnosis and effective timely treatment. These challenges have prompted the development of multiple machine learning methods to help improve the management of this disease. These methods utilize anatomical and physiological data acquired from neuroimaging to create models that can identify depressed patients vs. non-depressed patients and predict treatment outcomes. This article (1) presents a background on depression, imaging, and machine learning methodologies; (2) reviews methodologies of past studies that have used imaging and machine learning to study depression; and (3) suggests directions for future depression-related studies. PMID:26759786

  4. In silico machine learning methods in drug development.

    PubMed

    Dobchev, Dimitar A; Pillai, Girinath G; Karelson, Mati

    2014-01-01

    Machine learning (ML) computational methods for predicting compounds with pharmacological activity, specific pharmacodynamic and ADMET (absorption, distribution, metabolism, excretion and toxicity) properties are being increasingly applied in drug discovery and evaluation. Recently, machine learning techniques such as artificial neural networks, support vector machines and genetic programming have been explored for predicting inhibitors, antagonists, blockers, agonists, activators and substrates of proteins related to specific therapeutic targets. These methods are particularly useful for screening compound libraries of diverse chemical structures, "noisy" and high-dimensional data to complement QSAR methods, and in cases of unavailable receptor 3D structure to complement structure-based methods. A variety of studies have demonstrated the potential of machine-learning methods for predicting compounds as potential drug candidates. The present review is intended to give an overview of the strategies and current progress in using machine learning methods for drug design and the potential of the respective model development tools. We also consider a number of applications of machine learning algorithms to common classes of diseases. PMID:25262800

  5. Risk prediction with machine learning and regression methods.

    PubMed

    Steyerberg, Ewout W; van der Ploeg, Tjeerd; Van Calster, Ben

    2014-07-01

    This is a discussion of issues in risk prediction based on the following papers: "Probability estimation with machine learning methods for dichotomous and multicategory outcome: Theory" by Jochen Kruppa, Yufeng Liu, Gérard Biau, Michael Kohler, Inke R. König, James D. Malley, and Andreas Ziegler; and "Probability estimation with machine learning methods for dichotomous and multicategory outcome: Applications" by Jochen Kruppa, Yufeng Liu, Hans-Christian Diener, Theresa Holste, Christian Weimar, Inke R. König, and Andreas Ziegler. PMID:24615859

  6. Machine Learning Methods for Articulatory Data

    ERIC Educational Resources Information Center

    Berry, Jeffrey James

    2012-01-01

    Humans make use of more than just the audio signal to perceive speech. Behavioral and neurological research has shown that a person's knowledge of how speech is produced influences what is perceived. With methods for collecting articulatory data becoming more ubiquitous, methods for extracting useful information are needed to make this data…

  7. Machine Learning Methods for Attack Detection in the Smart Grid.

    PubMed

    Ozay, Mete; Esnaola, Inaki; Yarman Vural, Fatos Tunay; Kulkarni, Sanjeev R; Poor, H Vincent

    2016-08-01

    Attack detection problems in the smart grid are posed as statistical learning problems for different attack scenarios in which the measurements are observed in batch or online settings. In this approach, machine learning algorithms are used to classify measurements as being either secure or attacked. An attack detection framework is provided to exploit any available prior knowledge about the system and surmount constraints arising from the sparse structure of the problem in the proposed approach. Well-known batch and online learning algorithms (supervised and semisupervised) are employed with decision- and feature-level fusion to model the attack detection problem. The relationships between statistical and geometric properties of attack vectors employed in the attack scenarios and learning algorithms are analyzed to detect unobservable attacks using statistical learning methods. The proposed algorithms are examined on various IEEE test systems. Experimental analyses show that machine learning algorithms can detect attacks with performances higher than attack detection algorithms that employ state vector estimation methods in the proposed attack detection framework. PMID:25807571

  8. A survey of machine learning methods for secondary and supersecondary protein structure prediction.

    PubMed

    Ho, Hui Kian; Zhang, Lei; Ramamohanarao, Kotagiri; Martin, Shawn

    2013-01-01

    In this chapter we provide a survey of protein secondary and supersecondary structure prediction using methods from machine learning. Our focus is on machine learning methods applicable to β-hairpin and β-sheet prediction, but we also discuss methods for more general supersecondary structure prediction. We provide background on the secondary and supersecondary structures that we discuss, the features used to describe them, and the basic theory behind the machine learning methods used. We survey the machine learning methods available for secondary and supersecondary structure prediction and compare them where possible. PMID:22987348

  9. Introduction to machine learning.

    PubMed

    Baştanlar, Yalin; Ozuysal, Mustafa

    2014-01-01

    The machine learning field, which can be briefly defined as enabling computers to make successful predictions using past experiences, has exhibited impressive development recently with the help of the rapid increase in the storage capacity and processing power of computers. Together with many other disciplines, machine learning methods have been widely employed in bioinformatics. The difficulties and cost of biological analyses have led to the development of sophisticated machine learning approaches for this application area. In this chapter, we first review the fundamental concepts of machine learning such as feature assessment, unsupervised versus supervised learning and types of classification. Then, we point out the main issues of designing machine learning experiments and their performance evaluation. Finally, we introduce some supervised learning methods. PMID:24272434

  10. A review for detecting gene-gene interactions using machine learning methods in genetic epidemiology.

    PubMed

    Koo, Ching Lee; Liew, Mei Jing; Mohamad, Mohd Saberi; Salleh, Abdul Hakim Mohamed

    2013-01-01

    Recently, the greatest statistical and computational challenge in genetic epidemiology has been to identify and characterize the genes that interact with other genes and with environmental factors to influence complex multifactorial diseases. These gene-gene interactions, also known as epistasis, cannot be resolved by traditional statistical methods because of the high dimensionality of the data and the presence of multiple polymorphisms. Hence, several machine learning methods, including neural networks (NNs), support vector machines (SVMs), and random forests (RFs), have been applied to identify susceptibility genes in such common multifactorial diseases. This paper gives an overview of these machine learning methods, describing the methodology of each and its application in detecting gene-gene and gene-environment interactions. Lastly, it discusses each machine learning method and presents its strengths and weaknesses for detecting gene-gene interactions in complex human diseases. PMID:24228248

  11. A Review for Detecting Gene-Gene Interactions Using Machine Learning Methods in Genetic Epidemiology

    PubMed Central

    Koo, Ching Lee; Liew, Mei Jing; Mohamad, Mohd Saberi

    2013-01-01

    Recently, the greatest statistical and computational challenge in genetic epidemiology has been to identify and characterize the genes that interact with other genes and with environmental factors to influence complex multifactorial diseases. These gene-gene interactions, also known as epistasis, cannot be resolved by traditional statistical methods because of the high dimensionality of the data and the presence of multiple polymorphisms. Hence, several machine learning methods, including neural networks (NNs), support vector machines (SVMs), and random forests (RFs), have been applied to identify susceptibility genes in such common multifactorial diseases. This paper gives an overview of these machine learning methods, describing the methodology of each and its application in detecting gene-gene and gene-environment interactions. Lastly, it discusses each machine learning method and presents its strengths and weaknesses for detecting gene-gene interactions in complex human diseases. PMID:24228248
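
    A small illustrative simulation (not from either paper) of the random-forest approach mentioned above: a disease label is generated mainly by the interaction of two SNPs, and the fitted forest's feature importances flag the interacting pair. The genotype matrix, SNP indices, and effect sizes are all assumed for the sketch.

    ```python
    # Illustrative sketch: a random forest picks up an interacting SNP pair
    # from simulated genotypes (coded 0/1/2) via its feature importances.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(2)
    n, p = 1000, 50
    snps = rng.integers(0, 3, size=(n, p))                 # placeholder genotypes
    # disease risk driven mainly by the joint effect of SNP 3 and SNP 7
    logit = 1.5 * (snps[:, 3] * snps[:, 7] > 0).astype(float) - 0.8
    y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(snps, y)
    top = np.argsort(rf.feature_importances_)[::-1][:5]
    print("top-ranked SNP indices:", top)                  # SNPs 3 and 7 should rank highly
    ```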

  12. A novel virtual viewpoint merging method based on machine learning

    NASA Astrophysics Data System (ADS)

    Zheng, Di; Peng, Zongju; Wang, Hui; Jiang, Gangyi; Chen, Fen

    2014-11-01

    In multi-view video systems, multi-view video plus depth is the main data format for 3D scene representation. Continuous virtual views can be generated using the depth image based rendering (DIBR) technique. The DIBR process includes geometric mapping, hole filling and merging. Unique weights, inversely proportional to the distance between the virtual and real cameras, are conventionally used to merge the virtual views. However, these weights might not be optimal in terms of virtual view quality. In this paper, a novel virtual view merging algorithm is proposed in which a machine learning method is used to establish an optimal weight model. The model takes color, depth, color gradient and sequence parameters into consideration. First, we render the same virtual view from the left and right views and select training samples using a threshold. Then, the eigenvalues of the samples are extracted and the optimal merging weights are calculated as training labels. Finally, a support vector classifier (SVC) is adopted to establish the model, which is used to guide virtual view rendering. Experimental results show that the proposed method can improve the quality of virtual views for most sequences, and it is especially effective when the distance between the virtual and real cameras is large. Compared with the original virtual view synthesis method, the proposed method obtains a gain of more than 0.1 dB for some sequences.

  13. Detecting Abbreviations in Discharge Summaries using Machine Learning Methods

    PubMed Central

    Wu, Yonghui; Rosenbloom, S. Trent; Denny, Joshua C.; Miller, Randolph A.; Mani, Subramani; Giuse, Dario A.; Xu, Hua

    2011-01-01

    Recognition and identification of abbreviations is an important, challenging task in clinical natural language processing (NLP). A comprehensive lexical resource comprised of all common, useful clinical abbreviations would have great applicability. The authors present a corpus-based method to create a lexical resource of clinical abbreviations using machine-learning (ML) methods, and tested its ability to automatically detect abbreviations from hospital discharge summaries. Domain experts manually annotated abbreviations in seventy discharge summaries, which were randomly broken into a training set (40 documents) and a test set (30 documents). We implemented and evaluated several ML algorithms using the training set and a list of pre-defined features. The subsequent evaluation using the test set showed that the Random Forest classifier had the highest F-measure of 94.8% (precision 98.8% and recall of 91.2%). When a voting scheme was used to combine output from various ML classifiers, the system achieved the highest F-measure of 95.7%. PMID:22195219

  14. Detecting abbreviations in discharge summaries using machine learning methods.

    PubMed

    Wu, Yonghui; Rosenbloom, S Trent; Denny, Joshua C; Miller, Randolph A; Mani, Subramani; Giuse, Dario A; Xu, Hua

    2011-01-01

    Recognition and identification of abbreviations is an important, challenging task in clinical natural language processing (NLP). A comprehensive lexical resource comprised of all common, useful clinical abbreviations would have great applicability. The authors present a corpus-based method to create a lexical resource of clinical abbreviations using machine-learning (ML) methods, and tested its ability to automatically detect abbreviations from hospital discharge summaries. Domain experts manually annotated abbreviations in seventy discharge summaries, which were randomly broken into a training set (40 documents) and a test set (30 documents). We implemented and evaluated several ML algorithms using the training set and a list of pre-defined features. The subsequent evaluation using the test set showed that the Random Forest classifier had the highest F-measure of 94.8% (precision 98.8% and recall of 91.2%). When a voting scheme was used to combine output from various ML classifiers, the system achieved the highest F-measure of 95.7%. PMID:22195219
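
    A hedged sketch of the voting scheme reported above, not the authors' system: several classifiers over per-token features are combined with majority voting and scored with precision, recall, and F-measure. The feature matrix, labels, and the particular member classifiers are placeholders.

    ```python
    # Sketch of the voting idea: combine several classifiers over token features
    # and score with precision/recall/F-measure (features here are placeholders).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_recall_fscore_support

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 20))    # per-token features (word shape, length, ...) placeholder
    y = rng.integers(0, 2, size=500)  # 1 = token is an abbreviation (placeholder)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    vote = VotingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000)),
                    ("nb", GaussianNB())],
        voting="hard")
    vote.fit(X_tr, y_tr)
    p, r, f, _ = precision_recall_fscore_support(y_te, vote.predict(X_te), average="binary")
    print("precision %.3f  recall %.3f  F-measure %.3f" % (p, r, f))
    ```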

  15. Machine learning methods for quantitative analysis of Raman spectroscopy data

    NASA Astrophysics Data System (ADS)

    Madden, Michael G.; Ryder, Alan G.

    2003-03-01

    The automated identification and quantification of illicit materials using Raman spectroscopy is of significant importance for law enforcement agencies. This paper explores the use of Machine Learning (ML) methods in comparison with standard statistical regression techniques for developing automated identification methods. In this work, the ML task is broken into two sub-tasks, data reduction and prediction. In well-conditioned data, the number of samples should be much larger than the number of attributes per sample, to limit the degrees of freedom in predictive models. In this spectroscopy data, the opposite is normally true. Predictive models based on such data have a high number of degrees of freedom, which increases the risk of models over-fitting to the sample data and having poor predictive power. In the work described here, an approach to data reduction based on Genetic Algorithms is described. For the prediction sub-task, the objective is to estimate the concentration of a component in a mixture, based on its Raman spectrum and the known concentrations of previously seen mixtures. Here, Neural Networks and k-Nearest Neighbours are used for prediction. Preliminary results are presented for the problem of estimating the concentration of cocaine in solid mixtures, and compared with previously published results in which statistical analysis of the same dataset was performed. Finally, this paper demonstrates how more accurate results may be achieved by using an ensemble of prediction techniques.
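
    A minimal sketch of the prediction sub-task only, under stated assumptions: the Genetic-Algorithm data reduction step is taken as already done, the reduced spectral features and concentrations are synthetic placeholders, and k-nearest-neighbours regression stands in for the prediction component.

    ```python
    # Sketch of the prediction sub-task: k-NN regression of a component's
    # concentration from (already reduced) spectral features (synthetic data).
    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import cross_val_predict, KFold
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(4)
    spectra = rng.normal(size=(60, 30))      # 60 mixtures x 30 selected spectral features
    conc = np.clip(spectra[:, :3].sum(axis=1) * 5 + 20
                   + rng.normal(scale=1.0, size=60), 0, None)

    knn = KNeighborsRegressor(n_neighbors=3, weights="distance")
    pred = cross_val_predict(knn, spectra, conc, cv=KFold(5, shuffle=True, random_state=0))
    print("mean absolute error (a.u.): %.2f" % mean_absolute_error(conc, pred))
    ```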

  16. Survey of Machine Learning Methods for Database Security

    NASA Astrophysics Data System (ADS)

    Kamra, Ashish; Bertino, Elisa

    Application of machine learning techniques to database security is an emerging area of research. In this chapter, we present a survey of various approaches that use machine learning/data mining techniques to enhance the traditional security mechanisms of databases. There are two key database security areas in which these techniques have found applications, namely, detection of SQL Injection attacks and anomaly detection for defending against insider threats. Apart from the research prototypes and tools, various third-party commercial products are also available that provide database activity monitoring solutions by profiling database users and applications. We present a survey of such products. We end the chapter with a primer on mechanisms for responding to database anomalies.

  17. Paradigms for machine learning

    NASA Technical Reports Server (NTRS)

    Schlimmer, Jeffrey C.; Langley, Pat

    1991-01-01

    Five paradigms are described for machine learning: connectionist (neural network) methods, genetic algorithms and classifier systems, empirical methods for inducing rules and decision trees, analytic learning methods, and case-based approaches. Some dimensions along which these paradigms vary in their approach to learning are considered, and the basic methods used within each framework are reviewed, together with open research issues. It is argued that the similarities among the paradigms are more important than their differences, and that future work should attempt to bridge the existing boundaries. Finally, some recent developments in the field of machine learning are discussed, and their impact on both research and applications is examined.

  18. Assessing and comparison of different machine learning methods in parent-offspring trios for genotype imputation.

    PubMed

    Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi

    2016-06-21

    Genotype imputation is an important tool for predicting unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available and can either employ general-purpose machine learning methods or deploy algorithms dedicated to inferring missing genotypes. In this research, the performance of eight machine learning methods (Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine, Radial Basis Function, Random Forest, AdaBoost, LogitBoost, and TotalBoost) was compared in terms of imputation accuracy, computation time and the factors affecting imputation accuracy. The methods were applied to real and simulated datasets to impute the untyped SNPs in parent-offspring trios. The results show that imputation of parent-offspring trios can be accurate. Random Forest and Support Vector Machine were more accurate than the other machine learning methods, while TotalBoost performed slightly worse than the others. The running times differed between methods: the ELM was always the fastest algorithm, whereas the RBF required long imputation times as the sample size increased. The tested methods can serve as an alternative for imputation of untyped SNPs when the rate of missing data is low. However, it is recommended that other machine learning methods also be evaluated for imputation. PMID:27049046
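
    A hedged sketch of the comparison idea, not the study's protocol: one "untyped" SNP is imputed from flanking SNPs with several learners and the cross-validated accuracies are compared. Only four of the eight methods are represented here (ELM, RBF, LogitBoost, and TotalBoost have no scikit-learn equivalents), and the genotype data are toy placeholders.

    ```python
    # Sketch: impute one untyped SNP from flanking SNPs with several learners
    # and compare cross-validated accuracy (toy genotypes).
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(5)
    flank = rng.integers(0, 3, size=(300, 10))             # flanking genotypes (placeholder)
    target = (flank[:, 4] + flank[:, 5] > 2).astype(int)   # "untyped" SNP tied to neighbours

    learners = {
        "RandomForest": RandomForestClassifier(n_estimators=300, random_state=0),
        "SVM": SVC(),
        "kNN": KNeighborsClassifier(5),
        "AdaBoost": AdaBoostClassifier(random_state=0),
    }
    for name, clf in learners.items():
        acc = cross_val_score(clf, flank, target, cv=5).mean()
        print("%-12s imputation accuracy: %.3f" % (name, acc))
    ```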

  19. Floor-Fractured Craters through Machine Learning Methods

    NASA Astrophysics Data System (ADS)

    Thorey, C.

    2015-12-01

    Floor-fractured craters are impact craters that have undergone post-impact deformations. They are characterized by shallow floors with a plate-like or convex appearance, wide floor moats, and radial, concentric, and polygonal floor-fractures. While the origin of these deformations has long been debated, it is now generally accepted that they are the result of the emplacement of shallow magmatic intrusions below their floor. These craters thus constitute an efficient tool to probe the importance of intrusive magmatism from the lunar surface. The most recent catalog of lunar floor-fractured craters references about 200 of them, mainly located around the lunar maria. Herein, we will discuss the possibility of using machine learning algorithms to try to detect new floor-fractured craters on the Moon among the 60000 craters referenced in the most recent catalogs. In particular, we will use the gravity field provided by the Gravity Recovery and Interior Laboratory (GRAIL) mission, and the topographic dataset obtained from the Lunar Orbiter Laser Altimeter (LOLA) instrument to design a set of representative features for each crater. We will then discuss the possibility of designing a binary supervised classifier, based on these features, to discriminate between the presence or absence of a crater-centered intrusion below a specific crater. First predictions from different classifiers, in terms of their accuracy and uncertainty, will be presented.

  20. Predicting Coronal Mass Ejections Using Machine Learning Methods

    NASA Astrophysics Data System (ADS)

    Bobra, M. G.; Ilonidis, S.

    2016-04-01

    Of all the activity observed on the Sun, two of the most energetic events are flares and coronal mass ejections (CMEs). Usually, solar active regions that produce large flares will also produce a CME, but this is not always true. Despite advances in numerical modeling, it is still unclear which circumstances will produce a CME. Therefore, it is worthwhile to empirically determine which features distinguish flares associated with CMEs from flares that are not. At this time, no extensive study has used physically meaningful features of active regions to distinguish between these two populations. As such, we attempt to do so by using features derived from (1) photospheric vector magnetic field data taken by the Solar Dynamics Observatory’s Helioseismic and Magnetic Imager instrument and (2) X-ray flux data from the Geostationary Operational Environmental Satellite’s X-ray Flux instrument. We build a catalog of active regions that either produced both a flare and a CME (the positive class) or simply a flare (the negative class). We then use machine-learning algorithms to (1) determine which features distinguish these two populations, and (2) forecast whether an active region that produces an M- or X-class flare will also produce a CME. We compute the True Skill Statistic, a forecast verification metric, and find that it is a relatively high value of ∼0.8 ± 0.2. We conclude that a combination of six parameters, which are all intensive in nature, will capture most of the relevant information contained in the photospheric magnetic field.
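
    The True Skill Statistic quoted above is defined from the binary confusion matrix as TSS = TP/(TP+FN) − FP/(FP+TN), i.e. hit rate minus false-alarm rate. A small helper illustrating the computation is sketched below; the toy forecasts are placeholders, not the study's predictions.

    ```python
    # Helper for the forecast verification metric mentioned above:
    # TSS = hit rate - false-alarm rate.
    import numpy as np
    from sklearn.metrics import confusion_matrix

    def true_skill_statistic(y_true, y_pred):
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return tp / (tp + fn) - fp / (fp + tn)

    # toy forecasts: 1 = flare accompanied by a CME, 0 = flare without a CME
    y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])
    y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0, 0])
    print("TSS = %.2f" % true_skill_statistic(y_true, y_pred))
    ```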

  1. Recent progresses in the exploration of machine learning methods as in-silico ADME prediction tools.

    PubMed

    Tao, L; Zhang, P; Qin, C; Chen, S Y; Zhang, C; Chen, Z; Zhu, F; Yang, S Y; Wei, Y Q; Chen, Y Z

    2015-06-23

    In-silico methods have been explored as potential tools for assessing ADME and ADME regulatory properties particularly in early drug discovery stages. Machine learning methods, with their ability in classifying diverse structures and complex mechanisms, are well suited for predicting ADME and ADME regulatory properties. Recent efforts have been directed at the broadening of application scopes and the improvement of predictive performance with particular focuses on the coverage of ADME properties, and exploration of more diversified training data, appropriate molecular features, and consensus modeling. Moreover, several online machine learning ADME prediction servers have emerged. Here we review these progresses and discuss the performances, application prospects and challenges of exploring machine learning methods as useful tools in predicting ADME and ADME regulatory properties. PMID:26037068

  2. Web Mining: Machine Learning for Web Applications.

    ERIC Educational Resources Information Center

    Chen, Hsinchun; Chau, Michael

    2004-01-01

    Presents an overview of machine learning research and reviews methods used for evaluating machine learning systems. Ways that machine-learning algorithms were used in traditional information retrieval systems in the "pre-Web" era are described, and the field of Web mining and how machine learning has been used in different Web mining applications…

  3. Concrete Condition Assessment Using Impact-Echo Method and Extreme Learning Machines

    PubMed Central

    Zhang, Jing-Kui; Yan, Weizhong; Cui, De-Mi

    2016-01-01

    The impact-echo (IE) method is a popular non-destructive testing (NDT) technique widely used for measuring the thickness of plate-like structures and for detecting certain defects inside concrete elements or structures. However, the IE method is not effective for full condition assessment (i.e., defect detection, defect diagnosis, defect sizing and location), because the simple frequency spectrum analysis involved in the existing IE method is not sufficient to capture the IE signal patterns associated with different conditions. In this paper, we attempt to enhance the IE technique and enable it for full condition assessment of concrete elements by introducing advanced machine learning techniques for performing comprehensive analysis and pattern recognition of IE signals. Specifically, we use wavelet decomposition for extracting signatures or features out of the raw IE signals and apply extreme learning machine, one of the recently developed machine learning techniques, as classification models for full condition assessment. To validate the capabilities of the proposed method, we build a number of specimens with various types, sizes, and locations of defects and perform IE testing on these specimens in a lab environment. Based on analysis of the collected IE signals using the proposed machine learning based IE method, we demonstrate that the proposed method is effective in performing full condition assessment of concrete elements or structures. PMID:27023563

  4. Concrete Condition Assessment Using Impact-Echo Method and Extreme Learning Machines.

    PubMed

    Zhang, Jing-Kui; Yan, Weizhong; Cui, De-Mi

    2016-01-01

    The impact-echo (IE) method is a popular non-destructive testing (NDT) technique widely used for measuring the thickness of plate-like structures and for detecting certain defects inside concrete elements or structures. However, the IE method is not effective for full condition assessment (i.e., defect detection, defect diagnosis, defect sizing and location), because the simple frequency spectrum analysis involved in the existing IE method is not sufficient to capture the IE signal patterns associated with different conditions. In this paper, we attempt to enhance the IE technique and enable it for full condition assessment of concrete elements by introducing advanced machine learning techniques for performing comprehensive analysis and pattern recognition of IE signals. Specifically, we use wavelet decomposition for extracting signatures or features out of the raw IE signals and apply extreme learning machine, one of the recently developed machine learning techniques, as classification models for full condition assessment. To validate the capabilities of the proposed method, we build a number of specimens with various types, sizes, and locations of defects and perform IE testing on these specimens in a lab environment. Based on analysis of the collected IE signals using the proposed machine learning based IE method, we demonstrate that the proposed method is effective in performing full condition assessment of concrete elements or structures. PMID:27023563
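
    A minimal sketch of the two stages described above, under stated assumptions: PyWavelets (pywt) is assumed to be installed for the wavelet energies, the IE waveforms are replaced by synthetic signals with different dominant frequencies, and the extreme learning machine is a tiny generic implementation (random hidden layer plus least-squares readout), not the authors' classifier.

    ```python
    # Sketch: wavelet sub-band energies as features + a minimal extreme
    # learning machine as the classifier (synthetic stand-in signals).
    import numpy as np
    import pywt

    rng = np.random.default_rng(6)
    t = np.linspace(0, 1, 512)

    def wavelet_energy_features(signal, wavelet="db4", level=4):
        # one energy value per approximation/detail sub-band
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        return np.array([np.sum(c ** 2) for c in coeffs])

    def waveform(freq):
        # stand-in for an IE waveform; dominant frequency mimics "sound" vs "defective"
        return np.sin(2 * np.pi * freq * t) + 0.3 * rng.normal(size=t.size)

    freqs = rng.choice([20.0, 60.0], size=200)
    X = np.array([wavelet_energy_features(waveform(f)) for f in freqs])
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize the wavelet energies
    y = (freqs == 60.0).astype(float)

    class ELM:
        """Minimal extreme learning machine: random hidden layer, least-squares readout."""
        def __init__(self, n_hidden=40, seed=0):
            self.n_hidden = n_hidden
            self.rng = np.random.default_rng(seed)
        def fit(self, X, y):
            self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
            self.b = self.rng.normal(size=self.n_hidden)
            H = np.tanh(X @ self.W + self.b)
            self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)
            return self
        def predict(self, X):
            return (np.tanh(X @ self.W + self.b) @ self.beta > 0.5).astype(int)

    elm = ELM().fit(X[:150], y[:150])
    print("hold-out accuracy: %.2f" % (elm.predict(X[150:]) == y[150:]).mean())
    ```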

  5. Machine Methods for Acquiring, Learning, and Applying Knowledge.

    ERIC Educational Resources Information Center

    Hayes-Roth, Frederick; And Others

    A research plan for identifying and acting upon constraints that impede the development of knowledge-based intelligent systems is described. The two primary problems identified are knowledge programming, the task of which is to create an intelligent system that does what an expert says it should, and learning, the problem requiring the criticizing…

  6. Can Machine Learning Methods Predict Extubation Outcome in Premature Infants as well as Clinicians?

    PubMed Central

    Mueller, Martina; Almeida, Jonas S.; Stanislaus, Romesh; Wagner, Carol L.

    2014-01-01

    Rationale: Though treatment of the prematurely born infant breathing with the assistance of a mechanical ventilator has advanced greatly in the past decades, predicting extubation outcome at a given point in time remains challenging. Numerous studies have been conducted to identify predictors of extubation outcome; however, the rate of infants failing extubation attempts has not declined. Objective: To develop a decision-support tool for the prediction of extubation outcome in premature infants using a set of machine learning algorithms. Methods: A dataset assembled from 486 premature infants on mechanical ventilation was used to develop predictive models using machine learning algorithms such as artificial neural networks (ANN), support vector machine (SVM), naïve Bayesian classifier (NBC), boosted decision trees (BDT), and multivariable logistic regression (MLR). Performance of all models was evaluated using the area under the curve (AUC). Results: For some of the models (ANN, MLR and NBC) results were satisfactory (AUC: 0.63–0.76); however, two algorithms (SVM and BDT) showed poor performance with AUCs of ~0.5. Conclusion: Clinicians' predictions still outperform machine learning due to the complexity of the data and contextual information that may not be captured in the clinical data used as input for the development of the machine learning algorithms. Inclusion of preprocessing steps in future studies may improve the performance of prediction models. PMID:25419493

  7. Solar Flare Predictions Using Time Series of SDO/HMI Observations and Machine Learning Methods

    NASA Astrophysics Data System (ADS)

    Ilonidis, Stathis; Bobra, Monica; Couvidat, Sebastien

    2015-08-01

    Solar active regions are dynamic systems that can rapidly evolve in time and produce flare eruptions. The temporal evolution of an active region can provide important information about its potential to produce major flares. In this study, we build a flare forecasting model using supervised machine learning methods and time series of SDO/HMI data for all the flaring regions with magnitude M1.0 or higher that have been observed with HMI and several thousand non-flaring regions. We define and compute hundreds of features that characterize the temporal evolution of physical properties related to the size, non-potentiality, and complexity of the active region, as well as its flaring history, for several days before the flare eruption. Using these features, we implement and test the performance of several machine learning algorithms, including support vector machines, neural networks, decision trees, discriminant analysis, and others. We also apply feature selection algorithms that aim to discard features with low predictive power and improve the performance of the machine learning methods. Our results show that support vector machines provide the best forecasts for the next 24 hours, achieving a True Skill Statistic of 0.923, an accuracy of 0.985, and a Heidke skill score of 0.861, which improve the scores obtained by Bobra and Couvidat (2015). The results of this study contribute to the development of a more reliable and fully automated data-driven flare forecasting system.
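
    A hedged sketch of the selection-then-classify workflow described above, with stand-in components: recursive feature elimination with a linear SVM is used in place of the study's (unnamed) feature selection algorithms, the downstream RBF SVM mirrors the best-performing classifier, and the matrix of time-series-derived features is a random placeholder.

    ```python
    # Sketch: discard low-predictive-power features, then classify with an SVM
    # (placeholder features derived from active-region time series).
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC, SVC
    from sklearn.model_selection import cross_val_score, StratifiedKFold

    rng = np.random.default_rng(7)
    X = rng.normal(size=(400, 300))   # temporal-evolution features per region (placeholder)
    y = rng.integers(0, 2, size=400)  # 1 = M/X-class flare within 24 h (placeholder)

    model = make_pipeline(
        StandardScaler(),
        RFE(LinearSVC(dual=False), n_features_to_select=25, step=0.2),  # prune weak features
        SVC(kernel="rbf", class_weight="balanced"),
    )
    scores = cross_val_score(model, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
    print("cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
    ```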

  8. e-Learning Application for Machine Maintenance Process using Iterative Method in XYZ Company

    NASA Astrophysics Data System (ADS)

    Nurunisa, Suaidah; Kurniawati, Amelia; Pramuditya Soesanto, Rayinda; Yunan Kurnia Septo Hediyanto, Umar

    2016-02-01

    XYZ Company is a manufacturer of airplane parts, and one of the machines categorized as a key facility in the company is the Millac 5H6P. As a key facility, the machine should be kept working well and in peak condition, so periodic maintenance is needed. Data gathering revealed a lack of competency among maintenance staff in maintaining machine types other than those assigned by their supervisor, indicating that the knowledge possessed by the maintenance staff is uneven. The purpose of this research is to create a knowledge-based e-learning application as a realization of the externalization step of the knowledge transfer process for machine maintenance. The application features are tailored to maintenance needs using an e-learning framework for the maintenance process, and the content supports multimedia for learning purposes. QFD is used in this research to capture user needs. The application is built with Moodle, using an iterative method for the software development cycle and UML diagrams. The result of this research is an e-learning application that serves as a knowledge-sharing medium for maintenance staff in the company. Testing showed that the application makes it easier for maintenance staff to understand the required competencies.

  9. A Distributed Learning Method for ℓ1-Regularized Kernel Machine over Wireless Sensor Networks.

    PubMed

    Ji, Xinrong; Hou, Cuiqin; Hou, Yibin; Gao, Fang; Wang, Shulong

    2016-01-01

    In wireless sensor networks, centralized learning methods have very high communication costs and energy consumption. These are caused by the need to transmit scattered training examples from various sensor nodes to the central fusion center where a classifier or a regression machine is trained. To reduce the communication cost, a distributed learning method for a kernel machine that incorporates ℓ1 norm regularization (ℓ1-regularized) is investigated, and a novel distributed learning algorithm for the ℓ1-regularized kernel minimum mean squared error (KMSE) machine is proposed. The proposed algorithm relies on in-network processing and a collaboration that transmits the sparse model only between single-hop neighboring nodes. This paper evaluates the proposed algorithm with respect to the prediction accuracy, the sparse rate of model, the communication cost and the number of iterations on synthetic and real datasets. The simulation results show that the proposed algorithm can obtain approximately the same prediction accuracy as that obtained by the batch learning method. Moreover, it is significantly superior in terms of the sparse rate of model and communication cost, and it can converge with fewer iterations. Finally, an experiment conducted on a wireless sensor network (WSN) test platform further shows the advantages of the proposed algorithm with respect to communication cost. PMID:27376298

  10. A Distributed Learning Method for ℓ1-Regularized Kernel Machine over Wireless Sensor Networks

    PubMed Central

    Ji, Xinrong; Hou, Cuiqin; Hou, Yibin; Gao, Fang; Wang, Shulong

    2016-01-01

    In wireless sensor networks, centralized learning methods have very high communication costs and energy consumption. These are caused by the need to transmit scattered training examples from various sensor nodes to the central fusion center where a classifier or a regression machine is trained. To reduce the communication cost, a distributed learning method for a kernel machine that incorporates ℓ1 norm regularization (ℓ1-regularized) is investigated, and a novel distributed learning algorithm for the ℓ1-regularized kernel minimum mean squared error (KMSE) machine is proposed. The proposed algorithm relies on in-network processing and a collaboration that transmits the sparse model only between single-hop neighboring nodes. This paper evaluates the proposed algorithm with respect to the prediction accuracy, the sparse rate of model, the communication cost and the number of iterations on synthetic and real datasets. The simulation results show that the proposed algorithm can obtain approximately the same prediction accuracy as that obtained by the batch learning method. Moreover, it is significantly superior in terms of the sparse rate of model and communication cost, and it can converge with fewer iterations. Finally, an experiment conducted on a wireless sensor network (WSN) test platform further shows the advantages of the proposed algorithm with respect to communication cost. PMID:27376298
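
    A centralized sketch of the ℓ1-regularized kernel machine idea only; the papers' distributed, in-network training between single-hop neighbours is not reproduced here. Targets are regressed on RBF Gram-matrix columns with an ℓ1 penalty so that only a sparse subset of training examples enters the model. The data, kernel width, and penalty are illustrative assumptions.

    ```python
    # Centralized sketch of an l1-regularized kernel regression: Lasso over
    # Gram-matrix columns yields a sparse kernel expansion (synthetic data).
    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.default_rng(8)
    X = rng.uniform(-3, 3, size=(200, 2))            # sensor readings (placeholder)
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

    K = rbf_kernel(X, X, gamma=0.5)                  # one kernel column per training example
    model = Lasso(alpha=1e-3, max_iter=50000).fit(K, y)
    support = np.flatnonzero(model.coef_)
    print("non-zero kernel expansion terms: %d of %d" % (support.size, len(X)))

    # prediction for new points uses only the sparse support set
    X_new = rng.uniform(-3, 3, size=(5, 2))
    y_hat = rbf_kernel(X_new, X, gamma=0.5) @ model.coef_ + model.intercept_
    print(np.round(y_hat, 3))
    ```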

  11. Comparisons of likelihood and machine learning methods of individual classification

    USGS Publications Warehouse

    Guinand, B.; Topchy, A.; Page, K.S.; Burnham-Curtis, M. K.; Punch, W.F.; Scribner, K.T.

    2002-01-01

    “Assignment tests” are designed to determine population membership for individuals. One particular application based on a likelihood estimate (LE) was introduced by Paetkau et al. (1995; see also Vásquez-Domínguez et al. 2001) to assign an individual to the population of origin on the basis of multilocus genotype and expectations of observing this genotype in each potential source population. The LE approach can be implemented statistically in a Bayesian framework as a convenient way to evaluate hypotheses of plausible genealogical relationships (e.g., that an individual possesses an ancestor in another population) (Dawson and Belkhir 2001; Pritchard et al. 2000; Rannala and Mountain 1997). Other studies have evaluated the confidence of the assignment (Almudevar 2000) and characteristics of genotypic data (e.g., degree of population divergence, number of loci, number of individuals, number of alleles) that lead to greater population assignment (Bernatchez and Duchesne 2000; Cornuet et al. 1999; Haig et al. 1997; Shriver et al. 1997; Smouse and Chevillon 1998). Main statistical and conceptual differences between methods leading to the use of an assignment test are given in, for example, Cornuet et al. (1999) and Rosenberg et al. (2001). Howeve

  12. Detection of Periodic Leg Movements by Machine Learning Methods Using Polysomnographic Parameters Other Than Leg Electromyography

    PubMed Central

    Umut, İlhan; Çentik, Güven

    2016-01-01

    The number of channels used for polysomnographic recording frequently causes difficulties for patients because of the many cables connected; it also increases the risk of problems during the recording process and increases the storage volume. In this study, we aimed to detect periodic leg movements (PLM) in sleep using the polysomnography (PSG) channels other than leg electromyography (EMG), by analysing PSG data with digital signal processing (DSP) and machine learning methods. PSG records of 153 patients of different ages and genders with a PLM disorder diagnosis were examined retrospectively. Novel software was developed for the analysis of the PSG records; it utilizes machine learning algorithms, statistical methods, and DSP methods. To classify PLM, popular machine learning methods (multilayer perceptron, K-nearest neighbour, and random forests) and logistic regression were used. Comparison of the classification results showed that the K-nearest neighbour algorithm had the highest average classification rate (91.87%) and the lowest average classification error (RMSE = 0.2850), whereas the multilayer perceptron algorithm had the lowest average classification rate (83.29%) and the highest average classification error (RMSE = 0.3705). The results show that PLM can be classified with high accuracy (91.87%) without a leg EMG recording being present. PMID:27213008

  13. Detection of Periodic Leg Movements by Machine Learning Methods Using Polysomnographic Parameters Other Than Leg Electromyography.

    PubMed

    Umut, İlhan; Çentik, Güven

    2016-01-01

    The number of channels used for polysomnographic recording frequently causes difficulties for patients because of the many cables connected; it also increases the risk of problems during the recording process and increases the storage volume. In this study, we aimed to detect periodic leg movements (PLM) in sleep using the polysomnography (PSG) channels other than leg electromyography (EMG), by analysing PSG data with digital signal processing (DSP) and machine learning methods. PSG records of 153 patients of different ages and genders with a PLM disorder diagnosis were examined retrospectively. Novel software was developed for the analysis of the PSG records; it utilizes machine learning algorithms, statistical methods, and DSP methods. To classify PLM, popular machine learning methods (multilayer perceptron, K-nearest neighbour, and random forests) and logistic regression were used. Comparison of the classification results showed that the K-nearest neighbour algorithm had the highest average classification rate (91.87%) and the lowest average classification error (RMSE = 0.2850), whereas the multilayer perceptron algorithm had the lowest average classification rate (83.29%) and the highest average classification error (RMSE = 0.3705). The results show that PLM can be classified with high accuracy (91.87%) without a leg EMG recording being present. PMID:27213008

  14. Machine Learning and Radiology

    PubMed Central

    Wang, Shijun; Summers, Ronald M.

    2012-01-01

    In this paper, we give a short introduction to machine learning and survey its applications in radiology. We focus on six categories of applications in radiology: medical image segmentation, registration, computer aided detection and diagnosis, brain function or activity analysis and neurological disease diagnosis from fMR images, content-based image retrieval systems for CT or MRI images, and text analysis of radiology reports using natural language processing (NLP) and natural language understanding (NLU). This survey shows that machine learning plays a key role in many radiology applications. Machine learning identifies complex patterns automatically and helps radiologists make intelligent decisions on radiology data such as conventional radiographs, CT, MRI, and PET images and radiology reports. In many applications, the performance of machine learning-based automatic detection and diagnosis systems has been shown to be comparable to that of a well-trained and experienced radiologist. Technology development in machine learning and radiology will benefit from each other in the long run. Key contributions and common characteristics of machine learning techniques in radiology are discussed. We also discuss the problem of translating machine learning applications to the radiology clinical setting, including advantages and potential barriers. PMID:22465077

  15. Machine learning methods enable predictive modeling of antibody feature:function relationships in RV144 vaccinees.

    PubMed

    Choi, Ickwon; Chung, Amy W; Suscovich, Todd J; Rerks-Ngarm, Supachai; Pitisuttithum, Punnee; Nitayaphan, Sorachai; Kaewkungwal, Jaranit; O'Connell, Robert J; Francis, Donald; Robb, Merlin L; Michael, Nelson L; Kim, Jerome H; Alter, Galit; Ackerman, Margaret E; Bailey-Kellogg, Chris

    2015-04-01

    The adaptive immune response to vaccination or infection can lead to the production of specific antibodies to neutralize the pathogen or recruit innate immune effector cells for help. The non-neutralizing role of antibodies in stimulating effector cell responses may have been a key mechanism of the protection observed in the RV144 HIV vaccine trial. In an extensive investigation of a rich set of data collected from RV144 vaccine recipients, we here employ machine learning methods to identify and model associations between antibody features (IgG subclass and antigen specificity) and effector function activities (antibody dependent cellular phagocytosis, cellular cytotoxicity, and cytokine release). We demonstrate via cross-validation that classification and regression approaches can effectively use the antibody features to robustly predict qualitative and quantitative functional outcomes. This integration of antibody feature and function data within a machine learning framework provides a new, objective approach to discovering and assessing multivariate immune correlates. PMID:25874406

  16. Machine Learning methods in fitting first-principles total energies for substitutionally disordered solid

    NASA Astrophysics Data System (ADS)

    Gao, Qin; Yao, Sanxi; Widom, Michael

    2015-03-01

    Density functional theory (DFT) provides an accurate, first-principles description of solid structures and total energies. However, it is highly time-consuming to calculate structures with hundreds of atoms in the unit cell, and calculations with thousands of atoms are nearly impossible. We apply and adapt machine learning algorithms, including compressive sensing, support vector regression and artificial neural networks, to fit the DFT total energies of substitutionally disordered boron carbide. A nonparametric kernel method is also included in our models. Our fitted total energy model reproduces the DFT energies with a prediction error of around 1 meV/atom. The assumptions of these machine learning models and applications of the fitted total energies will also be discussed. Financial support from the McWilliams Fellowship and the ONR-MURI under Grant No. N00014-11-1-0678 is gratefully acknowledged.
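
    A hedged sketch of one of the named fits (kernel-based support vector regression) in the spirit of the abstract; the site-occupancy descriptors, energies, and hyperparameters below are random placeholders, not DFT data for boron carbide.

    ```python
    # Sketch: support vector regression of total energies from occupancy-style
    # descriptors, with the error reported in meV/atom (synthetic data).
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(9)
    X = rng.integers(0, 2, size=(300, 45)).astype(float)   # site occupancies (placeholder)
    w = rng.normal(scale=0.01, size=45)
    E = X @ w + 0.002 * (X[:, 0] * X[:, 1]) + rng.normal(scale=0.001, size=300)  # eV/atom scale

    X_tr, X_te, E_tr, E_te = train_test_split(X, E, random_state=0)
    svr = SVR(kernel="rbf", C=10.0, epsilon=1e-4, gamma="scale").fit(X_tr, E_tr)
    mae = mean_absolute_error(E_te, svr.predict(X_te))
    print("prediction error: %.2f meV/atom" % (1000 * mae))
    ```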

  17. Model-based machine learning

    PubMed Central

    Bishop, Christopher M.

    2013-01-01

    Several decades of research in the field of machine learning have resulted in a multitude of different algorithms for solving a broad range of problems. To tackle a new application, a researcher typically tries to map their problem onto one of these existing methods, often influenced by their familiarity with specific algorithms and by the availability of corresponding software implementations. In this study, we describe an alternative methodology for applying machine learning, in which a bespoke solution is formulated for each new application. The solution is expressed through a compact modelling language, and the corresponding custom machine learning code is then generated automatically. This model-based approach offers several major advantages, including the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models. Furthermore, newcomers to the field of machine learning do not have to learn about the huge range of traditional methods, but instead can focus their attention on understanding a single modelling environment. In this study, we show how probabilistic graphical models, coupled with efficient inference algorithms, provide a very flexible foundation for model-based machine learning, and we outline a large-scale commercial application of this framework involving tens of millions of users. We also describe the concept of probabilistic programming as a powerful software environment for model-based machine learning, and we discuss a specific probabilistic programming language called Infer.NET, which has been widely used in practical applications. PMID:23277612

  18. Model-based machine learning.

    PubMed

    Bishop, Christopher M

    2013-02-13

    Several decades of research in the field of machine learning have resulted in a multitude of different algorithms for solving a broad range of problems. To tackle a new application, a researcher typically tries to map their problem onto one of these existing methods, often influenced by their familiarity with specific algorithms and by the availability of corresponding software implementations. In this study, we describe an alternative methodology for applying machine learning, in which a bespoke solution is formulated for each new application. The solution is expressed through a compact modelling language, and the corresponding custom machine learning code is then generated automatically. This model-based approach offers several major advantages, including the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models. Furthermore, newcomers to the field of machine learning do not have to learn about the huge range of traditional methods, but instead can focus their attention on understanding a single modelling environment. In this study, we show how probabilistic graphical models, coupled with efficient inference algorithms, provide a very flexible foundation for model-based machine learning, and we outline a large-scale commercial application of this framework involving tens of millions of users. We also describe the concept of probabilistic programming as a powerful software environment for model-based machine learning, and we discuss a specific probabilistic programming language called Infer.NET, which has been widely used in practical applications. PMID:23277612

  19. An iterative learning control method with application for CNC machine tools

    SciTech Connect

    Kim, D.I.; Kim, S.

    1996-01-01

    A proportional, integral, and derivative (PID) type iterative learning controller is proposed for precise tracking control of industrial robots and computer numerical controller (CNC) machine tools performing repetitive tasks. The convergence of the output error by the proposed learning controller is guaranteed under a certain condition even when the system parameters are not known exactly and unknown external disturbances exist. As the proposed learning controller is repeatedly applied to the industrial robot or the CNC machine tool with the path-dependent repetitive task, the distance difference between the desired path and the actual tracked or machined path, which is one of the most significant factors in the evaluation of control performance, is progressively reduced. The experimental results demonstrate that the proposed learning controller can improve machining accuracy when the CNC machine tool performs repetitive machining tasks.
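
    A minimal sketch of a PID-type iterative learning update in the spirit of the abstract, not the controller in the paper: after each repetition of the task, the feedforward input is corrected using proportional, integral, and derivative terms of that repetition's tracking error. The plant model and gains are illustrative assumptions.

    ```python
    # Sketch of PID-type iterative learning control on a toy repetitive task.
    import numpy as np

    dt, T = 0.01, 200
    t = np.arange(T) * dt
    y_ref = np.sin(2 * np.pi * t)                  # repetitive reference trajectory

    def plant(u):
        """Toy first-order discrete plant: y[k+1] = 0.9*y[k] + 0.1*u[k]."""
        y = np.zeros_like(u)
        for k in range(len(u) - 1):
            y[k + 1] = 0.9 * y[k] + 0.1 * u[k]
        return y

    Kp, Ki, Kd = 0.5, 0.1, 0.02                    # illustrative learning gains
    u = np.zeros(T)                                # feedforward input, refined each repetition
    for j in range(30):
        e = y_ref - plant(u)                       # tracking error of repetition j
        # PID-type learning law: u_{j+1} = u_j + Kp*e_j + Ki*integral(e_j) + Kd*d(e_j)/dt
        u = u + Kp * e + Ki * np.cumsum(e) * dt + Kd * np.gradient(e, dt)
        if j % 10 == 0 or j == 29:
            print("repetition %2d: max |tracking error| = %.4f" % (j + 1, np.abs(e).max()))
    ```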

  20. Applications of Machine Learning in Information Retrieval.

    ERIC Educational Resources Information Center

    Cunningham, Sally Jo; Witten, Ian H.; Littin, James

    1999-01-01

    Introduces the basic ideas that underpin applications of machine learning to information retrieval. Describes applications of machine learning to text categorization. Considers how machine learning can be applied to the query-formulation process. Examines methods of document filtering, where the user specifies a query that is to be applied to an…

  1. Probability estimation with machine learning methods for dichotomous and multicategory outcome: applications.

    PubMed

    Kruppa, Jochen; Liu, Yufeng; Diener, Hans-Christian; Holste, Theresa; Weimar, Christian; König, Inke R; Ziegler, Andreas

    2014-07-01

    Machine learning methods are applied to three different large datasets, all dealing with probability estimation problems for dichotomous or multicategory data. Specifically, we investigate k-nearest neighbors, bagged nearest neighbors, random forests for probability estimation trees, and support vector machines with the kernels of Bessel, linear, Laplacian, and radial basis type. Comparisons are made with logistic regression. The dataset from the German Stroke Study Collaboration with dichotomous and three-category outcome variables allows, in particular, for temporal and external validation. The other two datasets are freely available from the UCI learning repository and provide dichotomous outcome variables. One of them, the Cleveland Clinic Foundation Heart Disease dataset, uses data from one clinic for training and from three clinics for external validation, while the other, the thyroid disease dataset, allows for temporal validation by separating data into training and test data by date of recruitment into study. For dichotomous outcome variables, we use receiver operating characteristics, areas under the curve values with bootstrapped 95% confidence intervals, and Hosmer-Lemeshow-type figures as comparison criteria. For dichotomous and multicategory outcomes, we calculated bootstrap Brier scores with 95% confidence intervals and also compared them through bootstrapping. In a supplement, we provide R code for performing the analyses and for random forest analyses in Random Jungle, version 2.1.0. The learning machines show promising performance over all constructed models. They are simple to apply and serve as an alternative approach to logistic or multinomial logistic regression analysis. PMID:24989843
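
    A small sketch of one comparison criterion mentioned above: the Brier score of predicted probabilities with a bootstrap 95% confidence interval. The outcome labels and predicted probabilities are toy placeholders, and the bootstrap here is a plain nonparametric resampling, not the paper's exact procedure or its Random Jungle analyses.

    ```python
    # Brier score of probability estimates with a bootstrap 95% CI (toy data).
    import numpy as np
    from sklearn.metrics import brier_score_loss

    rng = np.random.default_rng(10)
    y_true = rng.integers(0, 2, size=300)
    # toy predicted probabilities, mildly informative about y_true
    p_hat = np.clip(0.5 + 0.3 * (y_true - 0.5) + 0.2 * rng.normal(size=300), 0.01, 0.99)

    point = brier_score_loss(y_true, p_hat)
    boot = []
    for _ in range(2000):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        boot.append(brier_score_loss(y_true[idx], p_hat[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print("Brier score %.3f (bootstrap 95%% CI %.3f-%.3f)" % (point, lo, hi))
    ```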

  2. [Quantitative retrieval of chlorophyll a concentration in Taihu Lake using machine learning methods].

    PubMed

    Zhang, Yu-Chao; Qian, Xin; Qian, Yu; Liu, Jian-Ping; Kong, Fan-Xiang

    2009-05-15

    We evaluated the performance of two machine learning methods, artificial neural networks (ANN) and support vector machines (SVM), for the estimation of chlorophyll a in Taihu Lake from remote sensing data. A theoretical analysis of the basic principles and learning targets of the two methods is presented first. Two empirical algorithms were then developed to relate MODIS reflectance to in situ concentrations of chlorophyll a. The performance of ANN and SVM is comparatively analysed in terms of validation, stability and robustness, and of the chlorophyll a distributions of Taihu Lake retrieved by the two algorithms. For the SVM retrieval model, the root mean square error (RMSE) and average relative error (ARE) on the validation data are only 5.85 and 26.5%, whereas for the ANN model they are 13.04 and 46.8%. The stability and robustness assessment suggests that SVM performs better than ANN. The retrieval results show that the chlorophyll a distributions over the whole lake obtained from the two algorithms are similar; however, the chlorophyll a concentrations in the eastern and central regions of Taihu Lake are distorted by the ANN model because of limitations such as the setting of the learning target and over-fitting during network construction. PMID:19558096

  3. Briefing in Application of Machine Learning Methods in Ion Channel Prediction

    PubMed Central

    2015-01-01

    In cells, ion channels are one of the most important classes of membrane proteins, allowing inorganic ions to move across the membrane. A wide range of biological processes are mediated and regulated by the opening and closing of ion channels. Ion channels can be classified into numerous classes, and different types of ion channels exhibit different functions. Thus, the correct identification of ion channels and their types using computational methods will provide in-depth insights into their functions in various biological processes. In this review, we briefly introduce and discuss recent progress in ion channel prediction using machine learning methods. PMID:25961077

  4. Briefing in application of machine learning methods in ion channel prediction.

    PubMed

    Lin, Hao; Chen, Wei

    2015-01-01

    In cells, ion channels are one of the most important classes of membrane proteins, allowing inorganic ions to move across the membrane. A wide range of biological processes involve and are regulated by the opening and closing of ion channels. Ion channels can be classified into numerous types, and different types of ion channels exhibit different functions. Thus, the correct identification of ion channels and their types using computational methods will provide in-depth insights into their function in various biological processes. In this review, we briefly introduce and discuss recent progress in ion channel prediction using machine learning methods. PMID:25961077

  5. Acceleration of ensemble machine learning methods using many-core devices

    NASA Astrophysics Data System (ADS)

    Tamerus, A.; Washbrook, A.; Wyeth, D.

    2015-12-01

    We present a case study into the acceleration of ensemble machine learning methods using many-core devices, carried out in collaboration with Toshiba Medical Visualisation Systems Europe (TMVSE). The adoption of GPUs to execute a key algorithm in the classification of medical image data was shown to significantly reduce overall processing time. Using a representative dataset and pre-trained decision trees as input, we demonstrate how the decision forest classification method can be mapped onto the GPU data processing model. A GPU-based version of the decision forest method achieved a speed-up of more than 138 times over a single-threaded CPU implementation, with further improvements possible. The same GPU-based software was then applied directly to a suitably formed dataset to benefit supervised learning techniques used in High Energy Physics (HEP), with similar improvements in performance.

  6. Similarity-based machine learning methods for predicting drug-target interactions: a brief review.

    PubMed

    Ding, Hao; Takigawa, Ichigaku; Mamitsuka, Hiroshi; Zhu, Shanfeng

    2014-09-01

    Computationally predicting drug-target interactions is useful for selecting possible drug (or target) candidates for further biochemical verification. We focus on machine learning-based approaches, particularly similarity-based methods that use drug and target similarities, which reflect relationships among drugs and among targets, respectively. These two similarities represent two emerging concepts, the chemical space and the genomic space. Typically, the methods combine these two types of similarities to generate models for predicting new drug-target interactions. This process is also closely related to much work in pharmacogenomics and chemical biology that attempts to understand the relationships between the chemical and genomic spaces. This background makes the similarity-based approaches attractive and promising. This article reviews state-of-the-art similarity-based machine learning methods for predicting drug-target interactions, which have aroused great interest in bioinformatics. We describe each of these methods briefly and empirically compare them under a uniform experimental setting to explore their advantages and limitations. PMID:23933754
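
    One common way to realize a similarity-based predictor of the kind reviewed above is to represent each (drug, target) pair by the drug's similarity profile concatenated with the target's similarity profile and to train a standard classifier on known interactions. The sketch below, assuming scikit-learn, uses synthetic similarity and interaction matrices; it is an illustration of the general idea, not any specific method from the review.

```python
# Minimal sketch of a similarity-based drug-target interaction predictor:
# each (drug, target) pair is described by the drug's row of the drug-drug
# similarity matrix concatenated with the target's row of the target-target
# similarity matrix; a classifier is trained on the known interaction labels.
# All matrices below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_drugs, n_targets = 40, 25
S_drug = rng.uniform(0, 1, (n_drugs, n_drugs))        # chemical-space similarity
S_target = rng.uniform(0, 1, (n_targets, n_targets))  # genomic-space similarity
Y = rng.integers(0, 2, (n_drugs, n_targets))          # known interaction matrix

X_pairs, y_pairs = [], []
for i in range(n_drugs):
    for j in range(n_targets):
        X_pairs.append(np.concatenate([S_drug[i], S_target[j]]))
        y_pairs.append(Y[i, j])
X_pairs, y_pairs = np.array(X_pairs), np.array(y_pairs)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("pairwise AUC:",
      cross_val_score(clf, X_pairs, y_pairs, cv=5, scoring="roc_auc").mean())
```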

  7. Machine learning methods for credibility assessment of interviewees based on posturographic data.

    PubMed

    Saripalle, Sashi K; Vemulapalli, Spandana; King, Gregory W; Burgoon, Judee K; Derakhshani, Reza

    2015-01-01

    This paper discusses the advantages of using posturographic signals from force plates for non-invasive credibility assessment. The contributions of our work are twofold: first, the proposed method is highly efficient and non-invasive; second, the feasibility of creating an autonomous credibility assessment system using machine-learning algorithms is studied. This study employs an interview paradigm in which subjects respond with truthful and deceptive intent while their center of pressure (COP) signal is recorded. Classification models utilizing sets of COP features for deceptive responses are derived, and a best accuracy of 93.5% on the test interval is reported. PMID:26737832

  8. Drug name recognition in biomedical texts: a machine-learning-based method.

    PubMed

    He, Linna; Yang, Zhihao; Lin, Hongfei; Li, Yanpeng

    2014-05-01

    Currently, there is an urgent need to develop technology for automatically extracting drug information from biomedical texts, and drug name recognition is an essential prerequisite for extracting drug information. This article presents a machine-learning-based approach to recognizing drug names in biomedical texts. In this approach, a drug name dictionary is first constructed from the external resources DrugBank and PubMed. Then a semi-supervised learning method, feature coupling generalization, is used to filter this dictionary. Finally, dictionary look-up and the conditional random field method are combined to recognize drug names. Experimental results show that our approach achieves an F-score of 92.54% on the test set of DDIExtraction2011. PMID:24140287

  9. An Evaluation of Machine Learning Methods to Detect Malicious SCADA Communications

    SciTech Connect

    Beaver, Justin M; Borges, Raymond Charles; Buckner, Mark A

    2013-01-01

    Critical infrastructure Supervisory Control and Data Acquisition (SCADA) systems were designed to operate on closed, proprietary networks where a malicious insider posed the greatest threat. The centralization of control and the movement towards open systems and standards have improved the efficiency of industrial control, but have also exposed legacy SCADA systems to security threats that they were not designed to mitigate. This work explores the viability of machine learning methods in detecting the new threat scenarios of command and data injection. Similar to network intrusion detection systems in the cyber security domain, the command and control communications in a critical infrastructure setting are monitored and vetted against examples of benign and malicious command traffic in order to identify potential attack events. Multiple learning methods are evaluated using a dataset of Remote Terminal Unit communications, which included both normal operations and instances of command and data injection attack scenarios.

  10. Machine Learning in Medicine.

    PubMed

    Deo, Rahul C

    2015-11-17

    Spurred by advances in processing power, memory, storage, and an unprecedented wealth of data, computers are being asked to tackle increasingly complex learning tasks, often with astonishing success. Computers have now mastered a popular variant of poker, learned the laws of physics from experimental data, and become experts in video games - tasks that would have been deemed impossible not too long ago. In parallel, the number of companies centered on applying complex data analysis to varying industries has exploded, and it is thus unsurprising that some analytic companies are turning attention to problems in health care. The purpose of this review is to explore what problems in medicine might benefit from such learning approaches and use examples from the literature to introduce basic concepts in machine learning. It is important to note that seemingly large enough medical data sets and adequate learning algorithms have been available for many decades, and yet, although there are thousands of papers applying machine learning algorithms to medical data, very few have contributed meaningfully to clinical care. This lack of impact stands in stark contrast to the enormous relevance of machine learning to many other industries. Thus, part of my effort will be to identify what obstacles there may be to changing the practice of medicine through statistical learning approaches, and discuss how these might be overcome. PMID:26572668

  11. A machine learning method for the prediction of receptor activation in the simulation of synapses.

    PubMed

    Montes, Jesus; Gomez, Elena; Merchán-Pérez, Angel; Defelipe, Javier; Peña, Jose-Maria

    2013-01-01

    Chemical synaptic transmission involves the release of a neurotransmitter that diffuses in the extracellular space and interacts with specific receptors located on the postsynaptic membrane. Computer simulation approaches provide fundamental tools for exploring various aspects of the synaptic transmission under different conditions. In particular, Monte Carlo methods can track the stochastic movements of neurotransmitter molecules and their interactions with other discrete molecules, the receptors. However, these methods are computationally expensive, even when used with simplified models, preventing their use in large-scale and multi-scale simulations of complex neuronal systems that may involve large numbers of synaptic connections. We have developed a machine-learning based method that can accurately predict relevant aspects of the behavior of synapses, such as the percentage of open synaptic receptors as a function of time since the release of the neurotransmitter, with considerably lower computational cost compared with the conventional Monte Carlo alternative. The method is designed to learn patterns and general principles from a corpus of previously generated Monte Carlo simulations of synapses covering a wide range of structural and functional characteristics. These patterns are later used as a predictive model of the behavior of synapses under different conditions without the need for additional computationally expensive Monte Carlo simulations. This is performed in five stages: data sampling, fold creation, machine learning, validation and curve fitting. The resulting procedure is accurate, automatic, and it is general enough to predict synapse behavior under experimental conditions that are different to the ones it has been trained on. Since our method efficiently reproduces the results that can be obtained with Monte Carlo simulations at a considerably lower computational cost, it is suitable for the simulation of high numbers of synapses and it is

  12. A Machine Learning Method for the Prediction of Receptor Activation in the Simulation of Synapses

    PubMed Central

    Montes, Jesus; Gomez, Elena; Merchán-Pérez, Angel; DeFelipe, Javier; Peña, Jose-Maria

    2013-01-01

    Chemical synaptic transmission involves the release of a neurotransmitter that diffuses in the extracellular space and interacts with specific receptors located on the postsynaptic membrane. Computer simulation approaches provide fundamental tools for exploring various aspects of the synaptic transmission under different conditions. In particular, Monte Carlo methods can track the stochastic movements of neurotransmitter molecules and their interactions with other discrete molecules, the receptors. However, these methods are computationally expensive, even when used with simplified models, preventing their use in large-scale and multi-scale simulations of complex neuronal systems that may involve large numbers of synaptic connections. We have developed a machine-learning based method that can accurately predict relevant aspects of the behavior of synapses, such as the percentage of open synaptic receptors as a function of time since the release of the neurotransmitter, with considerably lower computational cost compared with the conventional Monte Carlo alternative. The method is designed to learn patterns and general principles from a corpus of previously generated Monte Carlo simulations of synapses covering a wide range of structural and functional characteristics. These patterns are later used as a predictive model of the behavior of synapses under different conditions without the need for additional computationally expensive Monte Carlo simulations. This is performed in five stages: data sampling, fold creation, machine learning, validation and curve fitting. The resulting procedure is accurate, automatic, and it is general enough to predict synapse behavior under experimental conditions that are different to the ones it has been trained on. Since our method efficiently reproduces the results that can be obtained with Monte Carlo simulations at a considerably lower computational cost, it is suitable for the simulation of high numbers of synapses and it is

  13. Gaussian processes for machine learning.

    PubMed

    Seeger, Matthias

    2004-04-01

    Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countable or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level, with special emphasis on characteristics relevant to machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated. Gaussian process models are routinely used to solve hard machine learning problems. They are attractive because of their flexible non-parametric nature and computational simplicity. Treated within a Bayesian framework, very powerful statistical methods can be implemented which offer valid estimates of uncertainties in our predictions and generic model selection procedures cast as nonlinear optimization problems. Their main drawback of heavy computational scaling has recently been alleviated by the introduction of generic sparse approximations [13, 78, 31]. The mathematical literature on GPs is large and often uses deep concepts which are not required to fully understand most machine learning applications. In this tutorial paper, we aim to present characteristics of GPs relevant to machine learning and to point out precise connections to other "kernel machines" popular in the community. Our focus is on a simple presentation, but references to more detailed sources are provided. PMID:15112367
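
    A minimal sketch of Gaussian process regression with an RBF-plus-noise kernel, assuming scikit-learn, illustrating the predictive mean together with the uncertainty estimates mentioned above; the data are synthetic.

```python
# Minimal sketch of Gaussian process regression: fit an RBF + white-noise
# kernel to noisy samples of a smooth function and report mean +/- 2 std.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 10, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 40)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)   # predictive mean and std. dev.
for x, m, s in zip(X_new.ravel(), mean, std):
    print(f"x={x:.1f}: {m:.2f} +/- {2 * s:.2f}")
```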

  14. Estimating the complexity of 3D structural models using machine learning methods

    NASA Astrophysics Data System (ADS)

    Mejía-Herrera, Pablo; Kakurina, Maria; Royer, Jean-Jacques

    2016-04-01

    Quantifying the complexity of 3D geological structural models can play a major role in natural resources exploration surveys, in predicting environmental hazards, and in forecasting fossil resources. This paper proposes a structural complexity index which can be used to help define the degree of effort necessary to build a 3D model for a given degree of confidence, and also to identify locations where additional effort is required to meet a given acceptable risk of uncertainty. In this work, we consider that the structural complexity index can be estimated using machine learning methods on raw geo-data. More precisely, the metric for measuring complexity can be approximated as the degree of difficulty associated with predicting the distribution of geological objects from partial information on the actual structural distribution of materials. The proposed methodology is tested on a set of 3D synthetic structural models for which the degree of effort during their construction is assessed using various parameters (such as the number of faults, the number of parts in a surface object, the number of borders, ...), the rank of geological elements contained in each model, and, finally, their level of deformation (folding and faulting). The results show how the estimated complexity of a 3D model can be approximated by the quantity of partial data necessary to reproduce the actual 3D model, at a given precision and without error, using machine learning algorithms.

  15. Peak Detection Method Evaluation for Ion Mobility Spectrometry by Using Machine Learning Approaches

    PubMed Central

    Hauschild, Anne-Christin; Kopczynski, Dominik; D’Addario, Marianna; Baumbach, Jörg Ingo; Rahmann, Sven; Baumbach, Jan

    2013-01-01

    Ion mobility spectrometry with pre-separation by multi-capillary columns (MCC/IMS) has become an established, inexpensive, non-invasive bioanalytics technology for detecting volatile organic compounds (VOCs), with various metabolomics applications in medical research. To pave the way for this technology towards daily use in medical practice, several steps still have to be taken. With respect to modern biomarker research, one of the most important tasks is the automatic classification of patient-specific data sets into different groups, healthy or not, for instance. Although sophisticated machine learning methods exist, an inevitable preprocessing step is reliable and robust peak detection without manual intervention. In this work we evaluate four state-of-the-art approaches for automated IMS-based peak detection: local maxima search, watershed transformation with IPHEx, region-merging with VisualNow, and peak model estimation (PME). We manually generated a gold standard with the aid of a domain expert (manual) and compare the performance of the four peak calling methods with respect to two distinct criteria. We first utilize established machine learning methods and systematically study their classification performance based on the four peak detectors' results. Second, we investigate the classification variance and robustness regarding perturbation and overfitting. Our main finding is that the classification accuracy is almost equally good for all methods, the manually created gold standard as well as the four automatic peak finding methods. In addition, we note that all tools, manual and automatic, are similarly robust against perturbations. However, the classification performance is more robust against overfitting when the PME is used as the peak calling preprocessor. In summary, we conclude that all methods, though small differences exist, are largely reliable and enable a wide spectrum of real-world biomedical applications. PMID:24957992

  16. Benchmark of Machine Learning Methods for Classification of a SENTINEL-2 Image

    NASA Astrophysics Data System (ADS)

    Pirotti, F.; Sunar, F.; Piragnolo, M.

    2016-06-01

    Thanks mainly to ESA and USGS, a large volume of free Earth observation imagery is readily available nowadays. One of the main goals of remote sensing is to label images according to a set of semantic categories, i.e. image classification. This is a very challenging task since the land cover of a specific class may present large spatial and spectral variability, and objects may appear at different scales and orientations. In this study, we report the results of benchmarking 9 machine learning algorithms tested for accuracy and speed in training and classification of land-cover classes in a Sentinel-2 dataset. The following machine learning methods (MLM) have been tested: linear discriminant analysis, k-nearest neighbour, random forests, support vector machines, multi-layer perceptron, multi-layer perceptron ensemble, ctree, boosting, and logarithmic regression. The validation is carried out using a control dataset which consists of an independent classification into 11 land-cover classes of an area of about 60 km2, obtained by manual visual interpretation of high-resolution images (20 cm ground sampling distance) by experts. In this study five of the eleven classes are used, since the others have too few samples (pixels) for the testing and validation subsets. The classes used are the following: (i) urban, (ii) sowable areas, (iii) water, (iv) tree plantations, and (v) grasslands. Validation is carried out using three different approaches: (i) using pixels from the training dataset (train), (ii) using pixels from the training dataset and applying cross-validation with the k-fold method (kfold), and (iii) using all pixels from the control dataset. Five accuracy indices are calculated for the comparison between the values predicted with each model and control values over three sets of data: the training dataset (train), the whole control dataset (full) and with k-fold cross-validation (kfold) with ten folds. Results from validation of predictions of the whole dataset (full) show the random
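
    A minimal sketch of the benchmarking idea above: several classifiers scored with 10-fold cross-validation on a pixel feature matrix, assuming scikit-learn. The band values and land-cover labels here are synthetic placeholders rather than Sentinel-2 data.

```python
# Minimal sketch of benchmarking several classifiers with k-fold CV on a
# pixel-by-feature matrix; X and y below are synthetic placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 10))            # 10 spectral features per pixel
y = rng.integers(0, 5, 1000)               # 5 land-cover classes

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "kNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```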

  17. Machine learning and statistical methods for the prediction of maximal oxygen uptake: recent advances.

    PubMed

    Abut, Fatih; Akay, Mehmet Fatih

    2015-01-01

    Maximal oxygen uptake (VO2max) indicates how many milliliters of oxygen the body can consume per minute during intense exercise. VO2max plays an important role in both sport and medical sciences for different purposes, such as indicating the endurance capacity of athletes or serving as a metric in estimating a person's disease risk. In general, the direct measurement of VO2max provides the most accurate assessment of aerobic power. However, despite this high level of accuracy, practical limitations associated with the direct measurement of VO2max, such as the requirement for expensive and sophisticated laboratory equipment or trained staff, have led to the development of various regression models for predicting VO2max. Consequently, many studies have been conducted in recent years to predict the VO2max of various target audiences, ranging from soccer athletes, nonexpert swimmers and cross-country skiers to healthy-fit adults, teenagers, and children. Numerous prediction models have been developed using different sets of predictor variables and a variety of machine learning and statistical methods, including support vector machine, multilayer perceptron, general regression neural network, and multiple linear regression. The purpose of this study is to give a detailed overview of the data-driven modeling studies for the prediction of VO2max conducted in recent years and to compare the performance of the various VO2max prediction models reported in the related literature in terms of two well-known metrics, namely, the multiple correlation coefficient (R) and the standard error of estimate. The survey results reveal that, with respect to the regression methods used to develop prediction models, support vector machine in general shows better performance than other methods, whereas multiple linear regression exhibits the worst performance. PMID:26346869
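
    A minimal sketch of computing the two survey metrics named above, the multiple correlation coefficient (R) and the standard error of estimate (SEE), for hypothetical VO2max prediction models; the predictor matrix and VO2max values are synthetic, and the evaluation is in-sample purely for illustration.

```python
# Minimal sketch: fit two regression models and report R (correlation between
# observed and predicted values) and SEE. X and y are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))                 # e.g., age, heart rate, test score
y = 45 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 2, 200)   # synthetic VO2max

def r_and_see(model, X, y, n_params):
    pred = model.fit(X, y).predict(X)                  # in-sample, for illustration
    r = np.corrcoef(y, pred)[0, 1]                     # multiple correlation
    see = np.sqrt(np.sum((y - pred) ** 2) / (len(y) - n_params - 1))
    return r, see

for name, model in [("MLR", LinearRegression()), ("SVR", SVR(kernel="rbf", C=10))]:
    r, see = r_and_see(model, X, y, n_params=X.shape[1])
    print(f"{name}: R = {r:.3f}, SEE = {see:.2f} mL/kg/min")
```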

  18. Comparison of machine learning methods for data infilling in hydrological forecasting

    NASA Astrophysics Data System (ADS)

    Chacon Hurtado, Juan Carlos; Alfonso, Leonardo; Solomatine, Dimitri

    2014-05-01

    The continuous measurement of hydrological variables requires sensors that must be deployed in the field, increasing the risk of failure due to natural or anthropic conditions inherent to their deployment. The failure of these sensors will interrupt the data stream, which in operational hydrological systems might lead to unsatisfactory performance of forecasting models, or to biases due to lack of information in simulation models. To mitigate this, various techniques to fill these missing values can be used, varying from simple regression techniques to more complex machine learning methods. This research aims at exploring the performance of the latter, considering particular properties of the measurements, the length of the missing data series and the properties of the missing variable. The study is carried out in two European catchments which differ in geographical conditions, mechanisms and monitoring frequency.

  19. Classifying Force Spectroscopy of DNA Pulling Measurements Using Supervised and Unsupervised Machine Learning Methods.

    PubMed

    Karatay, Durmus U; Zhang, Jie; Harrison, Jeffrey S; Ginger, David S

    2016-04-25

    Dynamic force spectroscopy (DFS) measurements on biomolecules typically require classifying thousands of repeated force spectra prior to data analysis. Here, we study classification of atomic force microscope-based DFS measurements using machine-learning algorithms in order to automate selection of successful force curves. Notably, we collect a data set that has a testable positive signal using photoswitch-modified DNA before and after illumination with UV (365 nm) light. We generate a feature set consisting of six properties of force-distance curves to train supervised models and use principal component analysis (PCA) for an unsupervised model. For supervised classification, we train random forest models for binary and multiclass classification of force-distance curves. Random forest models predict successful pulls with an accuracy of 94% and classify them into five classes with an accuracy of 90%. The unsupervised method using Gaussian mixture models (GMM) reaches an accuracy of approximately 80% for binary classification. PMID:27010122
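
    A minimal sketch of the two routes described above: a supervised random forest on force-curve features, and an unsupervised PCA-plus-Gaussian-mixture grouping, assuming scikit-learn; the six-feature matrix and labels are synthetic placeholders.

```python
# Minimal sketch: supervised (random forest) and unsupervised (PCA + 2-component
# Gaussian mixture) classification of force-curve feature vectors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 6))            # 6 features per force-distance curve
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 2000) > 0).astype(int)

# Supervised route: random forest binary classification
rf = RandomForestClassifier(n_estimators=300, random_state=0)
print("RF accuracy:", cross_val_score(rf, X, y, cv=5).mean())

# Unsupervised route: PCA followed by a two-component Gaussian mixture model
Z = PCA(n_components=2).fit_transform(X)
clusters = GaussianMixture(n_components=2, random_state=0).fit_predict(Z)
agreement = max((clusters == y).mean(), (clusters != y).mean())  # label-agnostic
print("GMM agreement with labels:", round(agreement, 3))
```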

  20. Classification of P-glycoprotein-interacting compounds using machine learning methods

    PubMed Central

    Prachayasittikul, Veda; Worachartcheewan, Apilak; Shoombuatong, Watshara; Prachayasittikul, Virapong; Nantasenamat, Chanin

    2015-01-01

    P-glycoprotein (Pgp) is a drug transporter that plays important roles in multidrug resistance and drug pharmacokinetics. The inhibition of Pgp has become a notable strategy for combating multidrug-resistant cancers and improving therapeutic outcomes. However, the polyspecific nature of Pgp, together with inconsistent results in experimental assays, renders the determination of endpoints for Pgp-interacting compounds a great challenge. In this study, the classification of a large set of 2,477 Pgp-interacting compounds (i.e., 1341 inhibitors, 913 non-inhibitors, 197 substrates and 26 non-substrates) was performed using several machine learning methods (i.e., decision tree induction, artificial neural network modelling and support vector machine) as a function of their physicochemical properties. The models provided good predictive performance, producing MCC values in the range of 0.739-1 for internal cross-validation and 0.665-1 for external validation. The study provided simple and interpretable models for important properties that influence the activity of Pgp-interacting compounds, which are potentially beneficial for screening and rational design of Pgp inhibitors that are of clinical importance. PMID:26862321
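
    A minimal sketch of scoring such a Pgp inhibitor/non-inhibitor classifier with the Matthews correlation coefficient (MCC) reported above, assuming scikit-learn; the physicochemical descriptor matrix and labels are synthetic placeholders.

```python
# Minimal sketch: cross-validated predictions from an SVM classifier scored
# with the Matthews correlation coefficient (MCC).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(13)
X = rng.normal(size=(600, 8))                 # physicochemical descriptors
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.7, 600) > 0).astype(int)

pred = cross_val_predict(SVC(), X, y, cv=10)
print("MCC:", round(matthews_corrcoef(y, pred), 3))
```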

  1. Time and spectral analysis methods with machine learning for the authentication of digital audio recordings.

    PubMed

    Korycki, Rafal

    2013-07-10

    This paper addresses the problem of tampering detection and discusses new methods that can be used for authenticity analysis of digital audio recordings. Nowadays, the only method for digital audio files commonly approved by forensic experts is the ENF criterion. It consists of fluctuation analysis of the mains frequency induced in the electronic circuits of recording devices. Its effectiveness is therefore strictly dependent on the presence of the mains signal in the recording, which is a rare occurrence. This article presents the existing methods of time and spectral analysis, along with modifications proposed by the author involving spectral analysis of the residual signal enhanced by machine learning algorithms. The effectiveness of the tampering detection methods described in this paper is tested on a predefined music database. The results are compared graphically using ROC-like curves. Furthermore, time-frequency plots are presented and enhanced by the reassignment method for the purpose of visual inspection of modified recordings. This solution enables the analysis of minimal changes in background sounds, which may indicate tampering. PMID:23481673

  2. Comparison of Machine Learning methods for incipient motion in gravel bed rivers

    NASA Astrophysics Data System (ADS)

    Valyrakis, Manousos

    2013-04-01

    Soil erosion and sediment transport in natural gravel bed streams are important processes which affect both the morphology and the ecology of the earth's surface. For gravel bed rivers at near-incipient flow conditions, particle entrainment dynamics are highly intermittent. This contribution reviews the use of modern Machine Learning (ML) methods implemented for short-term prediction of entrainment instances of individual grains exposed in fully developed near-boundary turbulent flows. Results obtained by network architectures of variable complexity based on two different ML methods, namely the Artificial Neural Network (ANN) and the Adaptive Neuro-Fuzzy Inference System (ANFIS), are compared in terms of different error and performance indices, computational efficiency and complexity, as well as predictive accuracy and forecast ability. Different model architectures are trained and tested with experimental time series obtained from mobile particle flume experiments. The experimental setup consists of a Laser Doppler Velocimeter (LDV) and a laser optics system, which synchronously acquire data for the instantaneous flow and the particle response, respectively. The former is used to record the flow velocity components directly upstream of the test particle, while the latter tracks the particle's displacements. The lengthy experimental data sets (millions of data points) are split into training and validation subsets used to perform the corresponding learning and testing of the models. It is demonstrated that the ANFIS hybrid model, which is based on neural learning and fuzzy inference principles, better predicts the critical flow conditions above which sediment transport is initiated. In addition, it is illustrated that empirical knowledge can be extracted, validating the theoretical assumption that particle ejections occur due to energetic turbulent flow events. Such a tool may find application in management and regulation of stream flows downstream of dams for stream

  3. Assessment of machine learning reliability methods for quantifying the applicability domain of QSAR regression models.

    PubMed

    Toplak, Marko; Močnik, Rok; Polajnar, Matija; Bosnić, Zoran; Carlsson, Lars; Hasselgren, Catrin; Demšar, Janez; Boyer, Scott; Zupan, Blaž; Stålring, Jonna

    2014-02-24

    The vastness of chemical space and the relatively small coverage by experimental data recording molecular properties require us to identify subspaces, or domains, for which we can confidently apply QSAR models. The prediction of QSAR models in these domains is reliable, and potential subsequent investigations of such compounds would find that the predictions closely match the experimental values. Standard approaches in QSAR assume that predictions are more reliable for compounds that are "similar" to those in subspaces with denser experimental data. Here, we report on a study of an alternative set of techniques recently proposed in the machine learning community. These methods quantify prediction confidence through estimation of the prediction error at the point of interest. Our study includes 20 public QSAR data sets with continuous response and assesses the quality of 10 reliability scoring methods by observing their correlation with prediction error. We show that these new alternative approaches can outperform standard reliability scores that rely only on similarity to compounds in the training set. The results also indicate that the quality of reliability scoring methods is sensitive to data set characteristics and to the regression method used in QSAR. We demonstrate that at the cost of increased computational complexity these dependencies can be leveraged by integration of scores from various reliability estimation approaches. The reliability estimation techniques described in this paper have been implemented in an open source add-on package ( https://bitbucket.org/biolab/orange-reliability ) to the Orange data mining suite. PMID:24490838

  4. Machine learning plus optical flow: a simple and sensitive method to detect cardioactive drugs

    NASA Astrophysics Data System (ADS)

    Lee, Eugene K.; Kurokawa, Yosuke K.; Tu, Robin; George, Steven C.; Khine, Michelle

    2015-07-01

    Current preclinical screening methods do not adequately detect cardiotoxicity. Using human induced pluripotent stem cell-derived cardiomyocytes (iPS-CMs), more physiologically relevant preclinical or patient-specific screening to detect potential cardiotoxic effects of drug candidates may be possible. However, one of the persistent challenges for developing a high-throughput drug screening platform using iPS-CMs is the need to develop a simple and reliable method to measure key electrophysiological and contractile parameters. To address this need, we have developed a platform that combines machine learning paired with brightfield optical flow as a simple and robust tool that can automate the detection of cardiomyocyte drug effects. Using three cardioactive drugs of different mechanisms, including those with primarily electrophysiological effects, we demonstrate the general applicability of this screening method to detect subtle changes in cardiomyocyte contraction. Requiring only brightfield images of cardiomyocyte contractions, we detect changes in cardiomyocyte contraction comparable to - and even superior to - fluorescence readouts. This automated method serves as a widely applicable screening tool to characterize the effects of drugs on cardiomyocyte function.
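
    A minimal sketch of this "optical flow plus machine learning" idea: dense Farneback optical flow between consecutive brightfield frames yields a motion-magnitude trace per video, and simple features of that trace feed a classifier. It assumes OpenCV (cv2) and scikit-learn are available; the frames and labels below are synthetic placeholders, not the authors' pipeline.

```python
# Minimal sketch: per-video motion trace from dense optical flow, summarized
# into a few features and fed to a random forest. Frames/labels are synthetic.
import numpy as np
import cv2
from sklearn.ensemble import RandomForestClassifier

def motion_trace(frames):
    """Mean optical-flow magnitude between consecutive grayscale frames."""
    trace = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        trace.append(np.linalg.norm(flow, axis=2).mean())
    return np.array(trace)

rng = np.random.default_rng(7)
videos = [rng.integers(0, 255, (30, 64, 64), dtype=np.uint8) for _ in range(20)]
labels = rng.integers(0, 2, 20)               # e.g., drug-treated vs. control

# Summarize each trace with a few features (amplitude, variability, rate proxy)
features = []
for frames in videos:
    t = motion_trace(list(frames))
    features.append([t.max(), t.std(), (np.diff(t) > 0).mean()])

clf = RandomForestClassifier(random_state=0).fit(features, labels)
```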

  5. Machine learning plus optical flow: a simple and sensitive method to detect cardioactive drugs.

    PubMed

    Lee, Eugene K; Kurokawa, Yosuke K; Tu, Robin; George, Steven C; Khine, Michelle

    2015-01-01

    Current preclinical screening methods do not adequately detect cardiotoxicity. Using human induced pluripotent stem cell-derived cardiomyocytes (iPS-CMs), more physiologically relevant preclinical or patient-specific screening to detect potential cardiotoxic effects of drug candidates may be possible. However, one of the persistent challenges for developing a high-throughput drug screening platform using iPS-CMs is the need to develop a simple and reliable method to measure key electrophysiological and contractile parameters. To address this need, we have developed a platform that combines machine learning paired with brightfield optical flow as a simple and robust tool that can automate the detection of cardiomyocyte drug effects. Using three cardioactive drugs of different mechanisms, including those with primarily electrophysiological effects, we demonstrate the general applicability of this screening method to detect subtle changes in cardiomyocyte contraction. Requiring only brightfield images of cardiomyocyte contractions, we detect changes in cardiomyocyte contraction comparable to - and even superior to - fluorescence readouts. This automated method serves as a widely applicable screening tool to characterize the effects of drugs on cardiomyocyte function. PMID:26139150

  6. Machine learning plus optical flow: a simple and sensitive method to detect cardioactive drugs

    PubMed Central

    Lee, Eugene K.; Kurokawa, Yosuke K.; Tu, Robin; George, Steven C.; Khine, Michelle

    2015-01-01

    Current preclinical screening methods do not adequately detect cardiotoxicity. Using human induced pluripotent stem cell-derived cardiomyocytes (iPS-CMs), more physiologically relevant preclinical or patient-specific screening to detect potential cardiotoxic effects of drug candidates may be possible. However, one of the persistent challenges for developing a high-throughput drug screening platform using iPS-CMs is the need to develop a simple and reliable method to measure key electrophysiological and contractile parameters. To address this need, we have developed a platform that combines machine learning paired with brightfield optical flow as a simple and robust tool that can automate the detection of cardiomyocyte drug effects. Using three cardioactive drugs of different mechanisms, including those with primarily electrophysiological effects, we demonstrate the general applicability of this screening method to detect subtle changes in cardiomyocyte contraction. Requiring only brightfield images of cardiomyocyte contractions, we detect changes in cardiomyocyte contraction comparable to – and even superior to – fluorescence readouts. This automated method serves as a widely applicable screening tool to characterize the effects of drugs on cardiomyocyte function. PMID:26139150

  7. A Study of Applications of Machine Learning Based Classification Methods for Virtual Screening of Lead Molecules.

    PubMed

    Vyas, Renu; Bapat, Sanket; Jain, Esha; Tambe, Sanjeev S; Karthikeyan, Muthukumarasamy; Kulkarni, Bhaskar D

    2015-01-01

    The ligand-based virtual screening of combinatorial libraries employs a number of statistical modeling and machine learning methods. A comprehensive analysis of the application of these methods to diversity-oriented virtual screening of biological targets/drug classes is presented here. A number of classification models have been built using three types of inputs, namely structure-based descriptors, molecular fingerprints and therapeutic category, for performing virtual screening. The activity and affinity descriptors of a set of inhibitors of four target classes, DHFR, COX, LOX and NMDA, have been utilized to train a total of six classifiers, viz. Artificial Neural Network (ANN), k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT) and Random Forest (RF). Among these classifiers, the ANN was found to be the best classifier, with an AUC of 0.9 irrespective of the target. New molecular fingerprints based on pharmacophore, toxicophore and chemophore (PTC) were used to build the ANN models for each dataset. A good accuracy of 87.27% was obtained using 296 chemophoric binary fingerprints for the COX-LOX inhibitors, compared to pharmacophoric (67.82%) and toxicophoric (70.64%) fingerprints. The methodology was validated on the classical Ames mutagenicity dataset of 4337 molecules. To evaluate it further, the selectivity and promiscuity of molecules from five drug classes, viz. anti-anginal, anti-convulsant, anti-depressant, anti-arrhythmic and anti-diabetic, were studied. The PTC fingerprints computed for each category were able to capture the drug-class-specific features using the k-NN classifier. These models can be useful for selecting optimal molecules for drug design. PMID:26138573

  8. Comparison of Machine Learning Methods for the Purpose Of Human Fall Detection

    NASA Astrophysics Data System (ADS)

    Strémy, Maximilián; Peterková, Andrea

    2014-12-01

    According to several studies, the European population has been aging rapidly over recent years. It is therefore important to ensure that the aging population is able to live independently without the support of the working-age population. According to these studies, falls are the most dangerous and frequent accidents in the everyday life of the aging population. In our paper, we present a system to detect human falls visually, i.e. using no wearable equipment. For this purpose, we used a Kinect sensor, which provides the human body position in Cartesian coordinates. The Kinect sensor can capture the human body directly because it has both a depth camera and an infrared camera. The first step in our research was to detect postures and classify the fall accident. We experimented with and compared selected machine learning methods, including Naive Bayes, decision trees and SVM, to compare their performance in recognizing human postures (standing, sitting and lying). The highest classification accuracy, of over 93.3%, was achieved by the decision tree method.

  9. On Plant Detection of Intact Tomato Fruits Using Image Analysis and Machine Learning Methods

    PubMed Central

    Yamamoto, Kyosuke; Guo, Wei; Yoshioka, Yosuke; Ninomiya, Seishi

    2014-01-01

    Fully automated yield estimation of intact fruits prior to harvesting provides various benefits to farmers. Until now, several studies have been conducted to estimate fruit yield using image-processing technologies. However, most of these techniques require thresholds for features such as color, shape and size. In addition, their performance strongly depends on the thresholds used, although optimal thresholds tend to vary with images. Furthermore, most of these techniques have attempted to detect only mature and immature fruits, although the number of young fruits is more important for the prediction of long-term fluctuations in yield. In this study, we aimed to develop a method to accurately detect individual intact tomato fruits including mature, immature and young fruits on a plant using a conventional RGB digital camera in conjunction with machine learning approaches. The developed method did not require an adjustment of threshold values for fruit detection from each image because image segmentation was conducted based on classification models generated in accordance with the color, shape, texture and size of the images. The results of fruit detection in the test images showed that the developed method achieved a recall of 0.80, while the precision was 0.88. The recall values of mature, immature and young fruits were 1.00, 0.80 and 0.78, respectively. PMID:25010694

  10. On plant detection of intact tomato fruits using image analysis and machine learning methods.

    PubMed

    Yamamoto, Kyosuke; Guo, Wei; Yoshioka, Yosuke; Ninomiya, Seishi

    2014-01-01

    Fully automated yield estimation of intact fruits prior to harvesting provides various benefits to farmers. Until now, several studies have been conducted to estimate fruit yield using image-processing technologies. However, most of these techniques require thresholds for features such as color, shape and size. In addition, their performance strongly depends on the thresholds used, although optimal thresholds tend to vary with images. Furthermore, most of these techniques have attempted to detect only mature and immature fruits, although the number of young fruits is more important for the prediction of long-term fluctuations in yield. In this study, we aimed to develop a method to accurately detect individual intact tomato fruits including mature, immature and young fruits on a plant using a conventional RGB digital camera in conjunction with machine learning approaches. The developed method did not require an adjustment of threshold values for fruit detection from each image because image segmentation was conducted based on classification models generated in accordance with the color, shape, texture and size of the images. The results of fruit detection in the test images showed that the developed method achieved a recall of 0.80, while the precision was 0.88. The recall values of mature, immature and young fruits were 1.00, 0.80 and 0.78, respectively. PMID:25010694

  11. Unsupervised nonlinear dimensionality reduction machine learning methods applied to multiparametric MRI in cerebral ischemia: preliminary results

    NASA Astrophysics Data System (ADS)

    Parekh, Vishwa S.; Jacobs, Jeremy R.; Jacobs, Michael A.

    2014-03-01

    The evaluation and treatment of acute cerebral ischemia requires a technique that can determine the total area of tissue at risk for infarction using diagnostic magnetic resonance imaging (MRI) sequences. Typical MRI data sets consist of T1- and T2-weighted imaging (T1WI, T2WI) along with the advanced MRI parameters of diffusion-weighted imaging (DWI) and perfusion-weighted imaging (PWI) methods. Each of these parameters has a distinct radiological-pathological meaning. For example, DWI interrogates the movement of water in the tissue and PWI gives an estimate of the blood flow; both are critical measures during the evolution of stroke. In order to integrate these data and give an estimate of the tissue at risk or damaged, we have developed advanced machine learning methods based on unsupervised non-linear dimensionality reduction (NLDR) techniques. NLDR methods are a class of algorithms that use mathematically defined manifolds for statistical sampling of multidimensional classes to generate a discrimination rule of guaranteed statistical accuracy, and they can generate a two- or three-dimensional map which represents the prominent structures of the data and provides an embedded image of meaningful low-dimensional structures hidden in their high-dimensional observations. In this manuscript, we develop NLDR methods on high-dimensional MRI data sets of preclinical animals and clinical patients with stroke. On analyzing the performance of these methods, we observed a high degree of similarity between the multiparametric embedded images from the NLDR methods and the ADC and perfusion maps. It was also observed that the embedded scattergram of abnormal (infarcted or at-risk) tissue can be visualized, providing a mechanism for automatic methods to delineate potential stroke volumes and early tissue at risk.
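
    A minimal sketch of unsupervised nonlinear dimensionality reduction on multiparametric voxel data, producing a two-dimensional embedding of the kind discussed above, assuming scikit-learn; the voxel feature matrix is a synthetic placeholder and Isomap stands in for whichever NLDR method is used.

```python
# Minimal sketch: embed multiparametric voxel features (e.g., T1, T2, ADC,
# perfusion values per voxel) into two dimensions with an NLDR method.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import Isomap

rng = np.random.default_rng(8)
# rows = voxels, columns = MRI parameters (synthetic "normal" and "at risk")
voxels = np.vstack([
    rng.normal(loc=[1.0, 1.0, 1.0, 1.0], scale=0.1, size=(500, 4)),
    rng.normal(loc=[1.2, 1.5, 0.6, 0.4], scale=0.1, size=(100, 4)),
])

embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(
    StandardScaler().fit_transform(voxels))
print(embedding.shape)        # (600, 2) low-dimensional map of the voxels
```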

  12. A Multi-Label Learning Based Kernel Automatic Recommendation Method for Support Vector Machine

    PubMed Central

    Zhang, Xueying; Song, Qinbao

    2015-01-01

    Choosing an appropriate kernel is very important and critical when classifying a new problem with a Support Vector Machine. So far, more attention has been paid to constructing new kernels and choosing suitable parameter values for a specific kernel function than to kernel selection. Furthermore, most current kernel selection methods focus on seeking the best kernel with the highest classification accuracy via cross-validation; they are time-consuming and ignore the differences in the number of support vectors and the CPU time of SVMs with different kernels. Considering the trade-off between classification success ratio and CPU time, there may be multiple kernel functions performing equally well on the same classification problem. Aiming to automatically select appropriate kernel functions for a given data set, we propose a multi-label learning based kernel recommendation method built on data characteristics. For each data set, a meta-knowledge data base is first created by extracting the feature vector of data characteristics and identifying the corresponding applicable kernel set. Then the kernel recommendation model is constructed on the generated meta-knowledge data base with a multi-label classification method. Finally, appropriate kernel functions are recommended for a new data set by the recommendation model according to the characteristics of the new data set. Extensive experiments over 132 UCI benchmark data sets, with five different types of data set characteristics, eleven typical kernels (Linear, Polynomial, Radial Basis Function, Sigmoidal function, Laplace, Multiquadric, Rational Quadratic, Spherical, Spline, Wave and Circular), and five multi-label classification methods demonstrate that, compared with existing kernel selection methods and the most widely used RBF kernel function, SVM with the kernel function recommended by our proposed method achieved the highest classification performance. PMID:25893896

  13. Machine learning algorithms for predicting protein folding rates and stability of mutant proteins: comparison with statistical methods.

    PubMed

    Gromiha, M Michael; Huang, Liang-Tsung

    2011-09-01

    Machine learning algorithms have a wide range of applications in bioinformatics and computational biology, such as the prediction of protein secondary structures, solvent accessibility, binding site residues in protein complexes, protein folding rates, and stability of mutant proteins, and the discrimination of proteins based on their structure and function. In this work, we focus on two aspects of prediction: (i) protein folding rates and (ii) stability of proteins upon mutation. We briefly introduce the concepts of protein folding rates and stability along with available databases, features for prediction methods and measures of prediction performance. Subsequently, the development of structure-based parameters and their relationship with protein folding rates will be outlined. The structure-based parameters are helpful for understanding the physical basis of protein folding and stability. Further, the basic principles of major machine learning techniques will be mentioned and their applications for predicting protein folding rates and stability of mutant proteins will be illustrated. The machine learning techniques can achieve the highest accuracy in predicting protein folding rates and stability. In essence, statistical methods and machine learning algorithms complement each other in understanding and predicting protein folding rates and the stability of protein mutants. The available online resources on protein folding rates and stability will be listed. PMID:21787301

  14. Prediction of core cancer genes using a hybrid of feature selection and machine learning methods.

    PubMed

    Liu, Y X; Zhang, N N; He, Y; Lun, L J

    2015-01-01

    Machine learning techniques are of great importance in the analysis of microarray expression data and provide a systematic and promising way to predict core cancer genes. In this study, a hybrid strategy based on machine learning techniques was introduced to select a small set of informative genes, which leads to improved classification accuracy. First, feature filtering algorithms were applied to select a set of top-ranked genes, and then hierarchical clustering and the collapsing of dense clusters were used to select core cancer genes. Through empirical study, our approach is shown to be capable of selecting relatively few core cancer genes while making high-accuracy predictions. The biological significance of these genes was evaluated using systems biology analysis. Extensive functional pathway and network analyses have confirmed findings from previous studies and can bring new insights into common cancer mechanisms. PMID:26345818

  15. Identifying relatively high-risk group of coronary artery calcification based on progression rate: statistical and machine learning methods.

    PubMed

    Kim, Ha-Young; Yoo, Sanghyun; Lee, Jihyun; Kam, Hye Jin; Woo, Kyoung-Gu; Choi, Yoon-Ho; Sung, Jidong; Kang, Mira

    2012-01-01

    Coronary artery calcification (CAC) score is an important predictor of coronary artery disease (CAD), which is the primary cause of death in advanced countries. Early identification of individuals at high risk of CAC, based on its progression rate, enables people to prevent CAD from developing into severe symptoms and disease. In this study, we developed various classifiers to identify patients at high risk of CAC using statistical and machine learning methods, and compared their predictive accuracy. For the statistical approaches, a linear regression based classifier and a logistic regression model were developed. For the machine learning approaches, we suggest three kinds of ensemble-based classifiers (best, top-k, and voting) to deal with the imbalanced distribution of our data set. The ensemble voting method outperformed all other methods, including the regression methods, with an AUC of 0.781. PMID:23366360
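
    A minimal sketch of the ensemble-voting idea above, combining several base classifiers with soft voting and scoring by AUC, assuming scikit-learn; the clinical features and the binary high-progression label are synthetic placeholders.

```python
# Minimal sketch: a soft-voting ensemble of logistic regression, a decision
# tree and a random forest, scored by cross-validated AUC on synthetic data.
import numpy as np
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
X = rng.normal(size=(800, 8))                        # clinical/lab features
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 1, 800) > 1.0).astype(int)

voting = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
                ("rf", RandomForestClassifier(n_estimators=200, random_state=0))],
    voting="soft")
print("Ensemble AUC:",
      cross_val_score(voting, X, y, cv=5, scoring="roc_auc").mean())
```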

  16. Estimating Corn Yield in the United States with MODIS EVI and Machine Learning Methods

    NASA Astrophysics Data System (ADS)

    Kuwata, K.; Shibasaki, R.

    2016-06-01

    Satellite remote sensing is commonly used to monitor crop yield in wide areas. Because many parameters are necessary for crop yield estimation, modelling the relationships between parameters and crop yield is generally complicated. Several methodologies using machine learning have been proposed to solve this issue, but the accuracy of county-level estimation remains to be improved. In addition, estimating county-level crop yield across an entire country has not yet been achieved. In this study, we applied a deep neural network (DNN) to estimate corn yield. We evaluated the estimation accuracy of the DNN model by comparing it with other models trained by different machine learning algorithms. We also prepared two time-series datasets differing in duration and confirmed the feature extraction performance of models by inputting each dataset. As a result, the DNN estimated county-level corn yield for the entire area of the United States with a determination coefficient (R2) of 0.780 and a root mean square error (RMSE) of 18.2 bushels/acre. In addition, our results showed that estimation models that were trained by a neural network extracted features from the input data better than an existing machine learning algorithm.
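
    A minimal sketch of regressing yield on time-series vegetation-index features with a neural network and reporting R2 and RMSE as above; the EVI features and county yields are synthetic placeholders, and scikit-learn's MLPRegressor stands in for the deep neural network used in the study.

```python
# Minimal sketch: neural-network regression of county yield on vegetation-index
# time-series features, scored by R2 and RMSE on a held-out split.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(14)
X = rng.uniform(0, 1, size=(500, 20))         # 20 EVI composites per county
y = 120 + 60 * X[:, 8] + 40 * X[:, 12] + rng.normal(0, 10, 500)  # bushels/acre

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(64, 32),
                                   max_iter=5000, random_state=0))
pred = model.fit(X_tr, y_tr).predict(X_te)
print("R2:", round(r2_score(y_te, pred), 3),
      "RMSE:", round(mean_squared_error(y_te, pred) ** 0.5, 1), "bushels/acre")
```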

  17. Integrating Symbolic and Statistical Methods for Testing Intelligent Systems Applications to Machine Learning and Computer Vision

    SciTech Connect

    Jha, Sumit Kumar; Pullum, Laura L; Ramanathan, Arvind

    2016-01-01

    Embedded intelligent systems, ranging from tiny implantable biomedical devices to large swarms of autonomous unmanned aerial systems, are becoming pervasive in our daily lives. While we depend on the flawless functioning of such intelligent systems, and often take their behavioral correctness and safety for granted, it is notoriously difficult to generate test cases that expose subtle errors in the implementations of machine learning algorithms. Hence, the validation of intelligent systems is usually achieved by studying their behavior on representative data sets, using methods such as cross-validation and bootstrapping. In this paper, we present a new testing methodology for studying the correctness of intelligent systems. Our approach uses symbolic decision procedures coupled with statistical hypothesis testing. We also use our algorithm to analyze the robustness of a human detection algorithm built using the OpenCV open-source computer vision library. We show that the human detection implementation can fail to detect humans in perturbed video frames even when the perturbations are so small that the corresponding frames look identical to the naked eye.

  18. Daily streamflow forecasting by machine learning methods with weather and climate inputs

    NASA Astrophysics Data System (ADS)

    Rasouli, Kabir; Hsieh, William W.; Cannon, Alex J.

    2012-01-01

    Weather forecast data generated by the NOAA Global Forecasting System (GFS) model, climate indices, and local meteo-hydrologic observations were used to forecast daily streamflows for a small watershed in British Columbia, Canada, at lead times of 1-7 days. Three machine learning methods - Bayesian neural network (BNN), support vector regression (SVR) and Gaussian process (GP) - were used and compared with multiple linear regression (MLR). The nonlinear models generally outperformed MLR, and BNN tended to slightly outperform the other nonlinear models. Among various combinations of predictors, local observations plus the GFS output were generally best at shorter lead times, while local observations plus climate indices were best at longer lead times. The climate indices selected include the sea surface temperature in the Niño 3.4 region, the Pacific-North American teleconnection (PNA), the Arctic Oscillation (AO) and the North Atlantic Oscillation (NAO). In the binary forecasts for extreme (high) streamflow events, the best predictors to use were the local observations plus GFS output. Interestingly, climate indices contribute to daily streamflow forecast scores during longer lead times of 5-7 days, but not to forecast scores for extreme streamflow events for all lead times studied (1-7 days).
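
    A minimal sketch comparing nonlinear regressors with multiple linear regression for lead-time forecasting from lagged predictors, in the spirit of the study above; the precipitation and streamflow series are synthetic, and scikit-learn's SVR and GaussianProcessRegressor stand in for the SVR and GP models (no Bayesian neural network is included).

```python
# Minimal sketch: 3-day-ahead streamflow forecasting from lagged predictors,
# comparing MLR, SVR and a Gaussian process by test RMSE on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(10)
n, lead = 1000, 3                                  # 3-day-ahead forecast
precip = rng.gamma(2.0, 2.0, n)
flow = np.convolve(precip, [0.5, 0.3, 0.2], mode="same") + rng.normal(0, 0.3, n)

# Predictors: today's precipitation and flow; target: flow "lead" days later
X = np.column_stack([precip[:-lead], flow[:-lead]])
y = flow[lead:]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, shuffle=False)

for name, model in [("MLR", LinearRegression()),
                    ("SVR", SVR(C=10.0)),
                    ("GP", GaussianProcessRegressor())]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: RMSE = {mean_squared_error(y_te, pred) ** 0.5:.3f}")
```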

  19. Stacked Extreme Learning Machines.

    PubMed

    Zhou, Hongming; Huang, Guang-Bin; Lin, Zhiping; Wang, Han; Soh, Yeng Chai

    2015-09-01

    Extreme learning machine (ELM) has recently attracted many researchers' interest due to its very fast learning speed, good generalization ability, and ease of implementation. It provides a unified solution that can be used directly to solve regression, binary, and multiclass classification problems. In this paper, we propose a stacked ELMs (S-ELMs) that is specially designed for solving large and complex data problems. The S-ELMs divides a single large ELM network into multiple stacked small ELMs which are serially connected. The S-ELMs can approximate a very large ELM network with small memory requirement. To further improve the testing accuracy on big data problems, the ELM autoencoder can be implemented during each iteration of the S-ELMs algorithm. The simulation results show that the S-ELMs even with random hidden nodes can achieve similar testing accuracy to support vector machine (SVM) while having low memory requirements. With the help of ELM autoencoder, the S-ELMs can achieve much better testing accuracy than SVM and slightly better accuracy than deep belief network (DBN) with much faster training speed. PMID:25361517
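
    A minimal sketch of a single basic ELM building block as described above, with random fixed hidden weights and a least-squares solve for the output weights; this toy NumPy example illustrates the principle only and is not the stacked S-ELMs architecture itself.

```python
# Minimal sketch of a basic extreme learning machine: random input weights and
# biases are fixed, and only the output weights are solved in closed form.
import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(float)          # binary target as 0/1

n_hidden = 100
W = rng.normal(size=(X.shape[1], n_hidden))        # random input weights (fixed)
b = rng.normal(size=n_hidden)                      # random biases (fixed)

def hidden(X):
    return np.tanh(X @ W + b)                      # fixed random feature map

H = hidden(X)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)       # closed-form output weights

X_new = rng.normal(size=(5, 10))
pred = (hidden(X_new) @ beta > 0.5).astype(int)    # threshold for class labels
print(pred)
```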

  20. Machine learning for medical images analysis.

    PubMed

    Criminisi, A

    2016-10-01

    This article discusses the application of machine learning for the analysis of medical images. Specifically: (i) We show how a special type of learning models can be thought of as automatically optimized, hierarchically-structured, rule-based algorithms, and (ii) We discuss how the issue of collecting large labelled datasets applies to both conventional algorithms as well as machine learning techniques. The size of the training database is a function of model complexity rather than a characteristic of machine learning methods. PMID:27374127

  1. On the Use of Machine Learning Methods for Characterization of Contaminant Source Zone Architecture

    NASA Astrophysics Data System (ADS)

    Zhang, H.; Mendoza-Sanchez, I.; Christ, J.; Miller, E. L.; Abriola, L. M.

    2011-12-01

    Recent research has identified the importance of DNAPL mass distribution in the evolution of down-gradient contaminant plumes and the control of source zone remediation effectiveness. Advances in the management of sites containing DNAPL source zones, however, are currently limited by the difficulty associated with characterizing subsurface DNAPL source zone 'architecture'. Specifically, knowledge of the ganglia-to-pool ratio (GTP) has been demonstrated to be useful in the assessment and prediction of system behavior. In this paper, we present an approach to the estimation of a quantity related to GTP, the pool fraction (PF), defined as the percentage of the source zone volume occupied by pools, based on observations of plume concentrations. Here we discuss the development and initial validation of an approach for PF estimation based on machine learning methods. The algorithm is constructed so that, when given new concentration data, it predicts the PF of the associated source zone. An ideal solution would make use of the concentration signals to estimate a single value for PF. Unfortunately, this problem is not well-posed given the data at our disposal. Thus, we relax the regression approach to one of classification. We quantize the pool fraction (i.e., the interval between zero and one) into a number of intervals and employ machine learning methods to use the concentration data to determine the interval containing the PF for a given set of data. This approach is predicated on the assumption that quantities (i.e., features) derived from the concentration data of evolving plumes with similar source zone PFs will in fact be similar to one another. Thus, within the training process we must determine a suitable collection of features and build methods for evaluating and optimizing similarity in feature space that result in high accuracy in terms of predicting the correct PF interval. Moreover, the number and boundaries of these intervals must also be
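
    As an illustrative aside, the relaxation from regression to classification described above (quantize the pool fraction into intervals, then predict the interval) might look like the following sketch; the features and pool fractions are synthetic stand-ins, not the study's data or model.

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import cross_val_score

      # Synthetic stand-in: each row is a feature vector derived from a plume's
      # concentration signal; pf is the (unknown in practice) pool fraction.
      rng = np.random.default_rng(0)
      n = 400
      features = rng.normal(size=(n, 10))
      pf = 1.0 / (1.0 + np.exp(-features[:, 0] - 0.5 * features[:, 1]))  # in (0, 1)

      # Quantize the pool fraction into intervals and classify the interval.
      bins = np.array([0.25, 0.5, 0.75])
      labels = np.digitize(pf, bins)          # 4 classes: [0,.25), [.25,.5), ...

      clf = RandomForestClassifier(n_estimators=200, random_state=0)
      print("CV accuracy:", cross_val_score(clf, features, labels, cv=5).mean())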

  2. A MACHINE-LEARNING METHOD TO INFER FUNDAMENTAL STELLAR PARAMETERS FROM PHOTOMETRIC LIGHT CURVES

    SciTech Connect

    Miller, A. A.; Bloom, J. S.; Richards, J. W.; Starr, D. L.; Lee, Y. S.; Butler, N. R.; Tokarz, S.; Smith, N.; Eisner, J. A.

    2015-01-10

    A fundamental challenge for wide-field imaging surveys is obtaining follow-up spectroscopic observations: there are >10^9 photometrically cataloged sources, yet modern spectroscopic surveys are limited to ~a few × 10^6 targets. As we approach the Large Synoptic Survey Telescope era, new algorithmic solutions are required to cope with the data deluge. Here we report the development of a machine-learning framework capable of inferring fundamental stellar parameters (T_eff, log g, and [Fe/H]) using photometric-brightness variations and color alone. A training set is constructed from a systematic spectroscopic survey of variables with Hectospec/Multi-Mirror Telescope. In sum, the training set includes ~9000 spectra, for which stellar parameters are measured using the SEGUE Stellar Parameters Pipeline (SSPP). We employed the random forest algorithm to perform a non-parametric regression that predicts T_eff, log g, and [Fe/H] from photometric time-domain observations. Our final optimized model produces a cross-validated rms error (RMSE) of 165 K, 0.39 dex, and 0.33 dex for T_eff, log g, and [Fe/H], respectively. Examining the subset of sources for which the SSPP measurements are most reliable, the RMSE reduces to 125 K, 0.37 dex, and 0.27 dex, respectively, comparable to what is achievable via low-resolution spectroscopy. For variable stars this represents a ≈12%-20% improvement in RMSE relative to models trained with single-epoch photometric colors. As an application of our method, we estimate stellar parameters for ~54,000 known variables. We argue that this method may convert photometric time-domain surveys into pseudo-spectrographic engines, enabling the construction of extremely detailed maps of the Milky Way, its structure, and history.

  3. A Machine-learning Method to Infer Fundamental Stellar Parameters from Photometric Light Curves

    NASA Astrophysics Data System (ADS)

    Miller, A. A.; Bloom, J. S.; Richards, J. W.; Lee, Y. S.; Starr, D. L.; Butler, N. R.; Tokarz, S.; Smith, N.; Eisner, J. A.

    2015-01-01

    A fundamental challenge for wide-field imaging surveys is obtaining follow-up spectroscopic observations: there are >10^9 photometrically cataloged sources, yet modern spectroscopic surveys are limited to ~a few × 10^6 targets. As we approach the Large Synoptic Survey Telescope era, new algorithmic solutions are required to cope with the data deluge. Here we report the development of a machine-learning framework capable of inferring fundamental stellar parameters (T_eff, log g, and [Fe/H]) using photometric-brightness variations and color alone. A training set is constructed from a systematic spectroscopic survey of variables with Hectospec/Multi-Mirror Telescope. In sum, the training set includes ~9000 spectra, for which stellar parameters are measured using the SEGUE Stellar Parameters Pipeline (SSPP). We employed the random forest algorithm to perform a non-parametric regression that predicts T_eff, log g, and [Fe/H] from photometric time-domain observations. Our final optimized model produces a cross-validated rms error (RMSE) of 165 K, 0.39 dex, and 0.33 dex for T_eff, log g, and [Fe/H], respectively. Examining the subset of sources for which the SSPP measurements are most reliable, the RMSE reduces to 125 K, 0.37 dex, and 0.27 dex, respectively, comparable to what is achievable via low-resolution spectroscopy. For variable stars this represents a ≈12%-20% improvement in RMSE relative to models trained with single-epoch photometric colors. As an application of our method, we estimate stellar parameters for ~54,000 known variables. We argue that this method may convert photometric time-domain surveys into pseudo-spectrographic engines, enabling the construction of extremely detailed maps of the Milky Way, its structure, and history.
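
    As an illustrative aside, a multi-output random forest regression of the kind described in the two records above can be sketched with scikit-learn on synthetic stand-in features; the feature construction and noise levels below are assumptions, not the study's pipeline.

      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_squared_error

      # Synthetic stand-in for light-curve features (amplitudes, periods, colors)
      # and three stellar parameters (T_eff, log g, [Fe/H]).
      rng = np.random.default_rng(0)
      n = 2000
      X = rng.normal(size=(n, 12))
      y = np.column_stack([
          5500 + 800 * X[:, 0] + 50 * rng.normal(size=n),     # T_eff [K]
          4.0 + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n),     # log g [dex]
          -0.5 + 0.4 * X[:, 2] + 0.1 * rng.normal(size=n),    # [Fe/H] [dex]
      ])

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
      rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
      rmse = np.sqrt(mean_squared_error(y_te, rf.predict(X_te), multioutput="raw_values"))
      print("RMSE (T_eff, log g, [Fe/H]):", rmse)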

  4. Classification of lung cancer using ensemble-based feature selection and machine learning methods.

    PubMed

    Cai, Zhihua; Xu, Dong; Zhang, Qing; Zhang, Jiexia; Ngai, Sai-Ming; Shao, Jianlin

    2015-03-01

    Lung cancer is one of the leading causes of death worldwide. There are three major types of lung cancer: non-small cell lung cancer (NSCLC), small cell lung cancer (SCLC) and carcinoid. NSCLC is further classified into lung adenocarcinoma (LADC), squamous cell lung cancer (SQCLC) and large cell lung cancer. Many previous studies have demonstrated that DNA methylation markers have emerged as potential lung cancer-specific biomarkers. However, whether there exists a set of DNA methylation markers that simultaneously distinguishes these three types of lung cancer remains elusive. In the present study, ROC (Receiver Operating Characteristic) analysis, RFs (Random Forests) and mRMR (Minimum Redundancy Maximum Relevance) were used to capture unbiased, informative and compact molecular signatures, followed by machine learning methods to classify LADC, SQCLC and SCLC. As a result, a panel of 16 DNA methylation markers exhibits strong classification power, with accuracies of 86.54% and 84.6% and recalls of 84.37% and 85.5% in the leave-one-out cross-validation (LOOCV) and independent data set test experiments, respectively. In addition, comparison results indicate that ensemble-based feature selection methods outperform individual ones when combined with the incremental feature selection (IFS) strategy in terms of the informativeness and compactness of the selected features. Taken together, the results suggest the effectiveness of the ensemble-based feature selection approach and the possible existence of a common panel of DNA methylation markers among these three types of lung cancer tissue, which would facilitate clinical diagnosis and treatment. PMID:25512221
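
    As an illustrative aside, an incremental feature selection (IFS) loop of the general kind described above can be sketched as follows; a univariate mutual-information ranking stands in for the ensemble ranking used in the study, and the data are synthetic.

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.feature_selection import mutual_info_classif
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import SVC

      # Rank candidate markers, then add them one at a time (IFS) and keep
      # the smallest panel that maximizes cross-validated accuracy.
      X, y = make_classification(n_samples=300, n_features=100, n_informative=15,
                                 n_classes=3, random_state=0)
      order = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

      best_k, best_acc = 1, 0.0
      for k in range(1, 31):
          acc = cross_val_score(SVC(), X[:, order[:k]], y, cv=5).mean()
          if acc > best_acc:
              best_k, best_acc = k, acc
      print("best panel size:", best_k, "accuracy:", round(best_acc, 3))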

  5. Extensions and applications of ensemble-of-trees methods in machine learning

    NASA Astrophysics Data System (ADS)

    Bleich, Justin

    Ensemble-of-trees algorithms have risen to the forefront of machine learning due to their ability to generate high forecasting accuracy for a wide array of regression and classification problems. Classic ensemble methodologies such as random forests (RF) and stochastic gradient boosting (SGB) rely on algorithmic procedures to generate fits to data. In contrast, more recent ensemble techniques such as Bayesian Additive Regression Trees (BART) and Dynamic Trees (DT) rely on an underlying Bayesian probability model to generate the fits. These new probability model-based approaches show much promise relative to their algorithmic counterparts, but also offer substantial room for improvement. The first part of this thesis focuses on methodological advances for ensemble-of-trees techniques with an emphasis on the more recent Bayesian approaches. In particular, we focus on extensions of BART in four distinct ways. First, we develop a more robust implementation of BART for both research and application. We then develop a principled approach to variable selection for BART as well as the ability to naturally incorporate prior information on important covariates into the algorithm. Next, we propose a method for handling missing data that relies on the recursive structure of decision trees and does not require imputation. Last, we relax the assumption of homoskedasticity in the BART model to allow for parametric modeling of heteroskedasticity. The second part of this thesis returns to the classic algorithmic approaches in the context of classification problems with asymmetric costs of forecasting errors. First we consider the performance of RF and SGB more broadly and demonstrate their superiority to logistic regression for applications in criminology with asymmetric costs. Next, we use RF to forecast unplanned hospital readmissions upon patient discharge with asymmetric costs taken into account. Finally, we explore the construction of stable decision trees for forecasts of

  6. Forecasting Urban Water Demand via Machine Learning Methods Coupled with a Bootstrap Rank-Ordered Conditional Mutual Information Input Variable Selection Method

    NASA Astrophysics Data System (ADS)

    Adamowski, J. F.; Quilty, J.; Khalil, B.; Rathinasamy, M.

    2014-12-01

    This paper explores forecasting short-term urban water demand (UWD) (using only historical records) through a variety of machine learning techniques coupled with a novel input variable selection (IVS) procedure. The proposed IVS technique, termed bootstrap rank-ordered conditional mutual information for real-valued signals (brCMIr), is multivariate, nonlinear, nonparametric, and probabilistic. The brCMIr method was tested in a case study using water demand time series for two urban water supply system pressure zones in Ottawa, Canada, to select the most important historical records for use with each machine learning technique in order to generate forecasts of average and peak UWD for the respective pressure zones at lead times of 1, 3, and 7 days ahead. All lead time forecasts are computed using Artificial Neural Networks (ANN) as the base model, and are compared with Least Squares Support Vector Regression (LSSVR), as well as a novel machine learning method for UWD forecasting: the Extreme Learning Machine (ELM). Results from one-way analysis of variance (ANOVA) and Tukey Honest Significant Difference (HSD) tests indicate that the LSSVR and ELM models are the best machine learning techniques to pair with brCMIr. However, ELM has significant computational advantages over LSSVR (and ANN) and provides a new and promising technique to explore in UWD forecasting.

  7. Multipolar electrostatics based on the Kriging machine learning method: an application to serine.

    PubMed

    Yuan, Yongna; Mills, Matthew J L; Popelier, Paul L A

    2014-04-01

    A multipolar, polarizable electrostatic method for future use in a novel force field is described. Quantum Chemical Topology (QCT) is used to partition the electron density of a chemical system into atoms, then the machine learning method Kriging is used to build models that relate the multipole moments of the atoms to the positions of their surrounding nuclei. The pilot system serine is used to study both the influence of the level of theory and the set of data generator methods used. The latter consists of: (i) sampling of protein structures deposited in the Protein Data Bank (PDB), or (ii) normal mode distortion along either (a) Cartesian coordinates, or (b) redundant internal coordinates. Wavefunctions for the sampled geometries were obtained at the HF/6-31G(d,p), B3LYP/apc-1, and MP2/cc-pVDZ levels of theory, prior to calculation of the atomic multipole moments by volume integration. The average absolute error (over an independent test set of conformations) in the total atom-atom electrostatic interaction energy of serine, using Kriging models built with the three data generator methods is 11.3 kJ mol⁻¹ (PDB), 8.2 kJ mol⁻¹ (Cartesian distortion), and 10.1 kJ mol⁻¹ (redundant internal distortion) at the HF/6-31G(d,p) level. At the B3LYP/apc-1 level, the respective errors are 7.7 kJ mol⁻¹, 6.7 kJ mol⁻¹, and 4.9 kJ mol⁻¹, while at the MP2/cc-pVDZ level they are 6.5 kJ mol⁻¹, 5.3 kJ mol⁻¹, and 4.0 kJ mol⁻¹. The ranges of geometries generated by the redundant internal coordinate distortion and by extraction from the PDB are much wider than the range generated by Cartesian distortion. The atomic multipole moment and electrostatic interaction energy predictions for the B3LYP/apc-1 and MP2/cc-pVDZ levels are similar, and both are better than the corresponding predictions at the HF/6-31G(d,p) level. PMID:24633774
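
    As an illustrative aside, Kriging is closely related to Gaussian process regression, and a minimal sketch of mapping sampled geometries to a target property might look like the following; the inputs and target below are synthetic stand-ins, not QCT multipole moments.

      import numpy as np
      from sklearn.gaussian_process import GaussianProcessRegressor
      from sklearn.gaussian_process.kernels import RBF, WhiteKernel
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_absolute_error

      # Synthetic stand-in: inputs are flattened coordinates of sampled
      # conformers; the target is one scalar property per conformer.
      rng = np.random.default_rng(0)
      n = 300
      geom = rng.normal(size=(n, 9))                    # 3 neighbouring nuclei x 3 coords
      moment = np.sin(geom[:, 0]) + 0.3 * geom[:, 4] + 0.05 * rng.normal(size=n)

      X_tr, X_te, y_tr, y_te = train_test_split(geom, moment, random_state=0)
      gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
      gpr.fit(X_tr, y_tr)
      print("test MAE:", mean_absolute_error(y_te, gpr.predict(X_te)))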

  8. Computer-Aided Diagnosis for Breast Ultrasound Using Computerized BI-RADS Features and Machine Learning Methods.

    PubMed

    Shan, Juan; Alam, S Kaisar; Garra, Brian; Zhang, Yingtao; Ahmed, Tahira

    2016-04-01

    This work identifies effective computable features from the Breast Imaging Reporting and Data System (BI-RADS) to develop a computer-aided diagnosis (CAD) system for breast ultrasound. Computerized features corresponding to ultrasound BI-RADS categories were designed and tested using a database of 283 pathology-proven benign and malignant lesions. Features were selected based on classification performance using a "bottom-up" approach for different machine learning methods, including decision tree, artificial neural network, random forest and support vector machine. Using 10-fold cross-validation on the database of 283 cases, the highest area under the receiver operating characteristic (ROC) curve (AUC) was 0.84, from a support vector machine with 77.7% overall accuracy; the highest overall accuracy, 78.5%, was from a random forest with an AUC of 0.83. Lesion margin and orientation were optimal features common to all of the different machine learning methods. These features can be used in CAD systems to help distinguish benign from worrisome lesions. PMID:26806441
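
    As an illustrative aside, a 10-fold cross-validated AUC comparison between an SVM and a random forest, of the kind reported above, can be sketched as follows on synthetic stand-in features.

      from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      # Synthetic stand-in for computed BI-RADS-style lesion features with a
      # benign/malignant label (the study used 283 pathology-proven cases).
      X, y = make_classification(n_samples=283, n_features=12, n_informative=6,
                                 weights=[0.6, 0.4], random_state=0)

      svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
      rf = RandomForestClassifier(n_estimators=300, random_state=0)

      for name, model in [("SVM", svm), ("RF", rf)]:
          auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
          print(name, "10-fold AUC:", round(auc, 3))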

  9. Machine learning applications in genetics and genomics.

    PubMed

    Libbrecht, Maxwell W; Noble, William Stafford

    2015-06-01

    The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets. PMID:25948244

  10. A Machine Learning Method for Power Prediction on the Mobile Devices.

    PubMed

    Chen, Da-Ren; Chen, You-Shyang; Chen, Lin-Chih; Hsu, Ming-Yang; Chiang, Kai-Feng

    2015-10-01

    Energy profiling and estimation have been popular areas of research in multicore mobile architectures. While short sequences of system calls have been recognized by machine learning as pattern descriptions for anomaly detection, the power consumption of running processes with respect to system-call patterns is not well studied. In this paper, we propose a fuzzy neural network (FNN) for training and analyzing process execution behaviour with respect to series of system calls, their parameters, and their power consumption. On the basis of the patterns of a series of system calls, we develop a power estimation daemon (PED) to analyze and predict the energy consumption of the running process. In the initial stage, PED categorizes sequences of system calls into functional groups and predicts their energy consumption using the FNN. In the operational stage, PED is applied to identify the predefined sequences of system calls invoked by running processes and estimate their energy consumption. PMID:26306877

  11. Machine Learning in Systems Biology

    PubMed Central

    d'Alché-Buc, Florence; Wehenkel, Louis

    2008-01-01

    This supplement contains extended versions of a selected subset of papers presented at the workshop MLSB 2007, Machine Learning in Systems Biology, Evry, France, from September 24 to 25, 2007. PMID:19091048

  12. Machine learning in systems biology.

    PubMed

    d'Alché-Buc, Florence; Wehenkel, Louis

    2008-01-01

    This supplement contains extended versions of a selected subset of papers presented at the workshop MLSB 2007, Machine Learning in Systems Biology, Evry, France, from September 24 to 25, 2007. PMID:19091048

  13. Machine learning methods for the classification of extreme rainfall and hail events

    NASA Astrophysics Data System (ADS)

    Teschl, Reinhard; Süsser-Rechberger, Barbara; Paulitsch, Helmut

    2015-04-01

    In this study, an analysis of a meteorological data set with machine learning tools is presented. The aim was to identify characteristic patterns in different sources of remote sensing data that are associated with hazards such as extreme rainfall and hail. The data set originates from a project that was started in 2007 with the goal of documenting and mitigating hail events in the province of Styria, Austria. It consists of three-dimensional weather radar data from a C-band Doppler radar, cloud top temperature information from infrared channels of a weather satellite, as well as the height of the 0° C isotherm from the forecast of the national weather service. The 3D radar dataset has a spatial resolution of 1 km x 1 km x 1 km, up to a height of 16 km above mean sea level, and a temporal resolution of 5 minutes. The infrared satellite image resolution is about 3 km x 3 km, and the images are updated every 30 minutes. The study area covers approximately 16,000 square kilometers. So far, different criteria for the occurrence of hail (and its discrimination from heavy rain) have been found and are documented in the literature. When applying these criteria to our data and contrasting them with damage reports from an insurance company, a need for adaptation was identified. Here we use supervised learning paradigms to find tailored relationships for the study area, validated with a sub-dataset that was not involved in the training process.

  14. Machine learning-based method for personalized and cost-effective detection of Alzheimer's disease.

    PubMed

    Escudero, Javier; Ifeachor, Emmanuel; Zajicek, John P; Green, Colin; Shearer, James; Pearson, Stephen

    2013-01-01

    Diagnosis of Alzheimer's disease (AD) is often difficult, especially early in the disease process at the stage of mild cognitive impairment (MCI). Yet, it is at this stage that treatment is most likely to be effective, so there would be great advantages in improving the diagnosis process. We describe and test a machine learning approach for personalized and cost-effective diagnosis of AD. It uses locally weighted learning to tailor a classifier model to each patient and computes the sequence of biomarkers most informative or cost-effective to diagnose patients. Using ADNI data, we classified AD versus controls and MCI patients who progressed to AD within a year, against those who did not. The approach performed similarly to considering all data at once, while significantly reducing the number (and cost) of the biomarkers needed to achieve a confident diagnosis for each patient. Thus, it may contribute to a personalized and effective detection of AD, and may prove useful in clinical settings. PMID:22893371

  15. Assessing Scientific Practices Using Machine-Learning Methods: How Closely Do They Match Clinical Interview Performance?

    NASA Astrophysics Data System (ADS)

    Beggrow, Elizabeth P.; Ha, Minsu; Nehm, Ross H.; Pearl, Dennis; Boone, William J.

    2013-07-01

    The landscape of science education is being transformed by the new Framework for Science Education (National Research Council, A framework for K-12 science education: practices, crosscutting concepts, and core ideas. The National Academies Press, Washington, DC, 2012), which emphasizes the centrality of scientific practices—such as explanation, argumentation, and communication—in science teaching, learning, and assessment. A major challenge facing the field of science education is developing assessment tools that are capable of validly and efficiently evaluating these practices. Our study examined the efficacy of a free, open-source machine-learning tool for evaluating the quality of students' written explanations of the causes of evolutionary change relative to three other approaches: (1) human-scored written explanations, (2) a multiple-choice test, and (3) clinical oral interviews. A large sample of undergraduates (n = 104) exposed to varying amounts of evolution content completed all three assessments: a clinical oral interview, a written open-response assessment, and a multiple-choice test. Rasch analysis was used to compute linear person measures and linear item measures on a single logit scale. We found that the multiple-choice test displayed poor person and item fit (mean square outfit >1.3), while both oral interview measures and computer-generated written response measures exhibited acceptable fit (average mean square outfit for interview: person 0.97, item 0.97; computer: person 1.03, item 1.06). Multiple-choice test measures were more weakly associated with interview measures (r = 0.35) than the computer-scored explanation measures (r = 0.63). Overall, Rasch analysis indicated that computer-scored written explanation measures (1) have the strongest correspondence to oral interview measures; (2) are capable of capturing students' normative scientific and naive ideas as accurately as human-scored explanations, and (3) more validly detect understanding

  16. A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli

    PubMed Central

    2014-01-01

    Background Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both the biopharmaceutical and research arenas in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low levels of production and frequent failures for unclear reasons. The problem is accentuated because there is no generic solution available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. Under given experimental circumstances, including temperature, expression host, etc., protein solubility is a feature ultimately defined by the protein's sequence. Numerous machine learning based methods have been proposed to predict the solubility of a protein merely from its amino acid sequence, yet despite 20 years of research on the matter, no comprehensive review of the published methods is available. Results This paper presents an extensive review of the existing models to predict protein solubility in the Escherichia coli recombinant protein overexpression system. The models are investigated and compared regarding the datasets used, features, feature selection methods, machine learning techniques and prediction accuracy. A discussion of the models is provided at the end. Conclusions This study aims to investigate extensively the machine learning based methods to predict recombinant protein solubility, so as to offer a general as well as a detailed understanding for researchers in the field. Some of the models present acceptable prediction performance and convenient user interfaces. These models can be considered valuable tools to predict recombinant protein overexpression results before performing real laboratory experiments, thus saving

  17. An introduction to quantum machine learning

    NASA Astrophysics Data System (ADS)

    Schuld, Maria; Sinayskiy, Ilya; Petruccione, Francesco

    2015-04-01

    Machine learning algorithms learn a desired input-output relation from examples in order to interpret new inputs. This is important for tasks such as image and speech recognition or strategy optimisation, with growing applications in the IT industry. In the last couple of years, researchers investigated if quantum computing can help to improve classical machine learning algorithms. Ideas range from running computationally costly algorithms or their subroutines efficiently on a quantum computer to the translation of stochastic methods into the language of quantum theory. This contribution gives a systematic overview of the emerging field of quantum machine learning. It presents the approaches as well as technical details in an accessible way, and discusses the potential of a future theory of quantum learning.

  18. Missing-Data Estimation for Daily Rainfall in Everglades Florida Using Machine Learning Methods

    NASA Astrophysics Data System (ADS)

    Lima, C.; Lall, U.; Landot, T.; Pathak, C.

    2008-05-01

    In the present study we derive a novel model to fill in gaps in daily rainfall data from 43 rainfall stations in South Florida. The filling-in process consists of two stages: prediction of rainfall occurrence and prediction of rainfall amounts. In the first stage we identify the stations with available daily rainfall data and assign 1 for wet states and -1 for dry states. A Support Vector Machine (SVM) is then applied to derive an optimal spatial boundary that defines the spatial pattern of wet and dry states. The missing-data station is classified, based on its spatial location, into the wet or dry class. In the second stage we use historical data of available stations as predictors for the rainfall amounts at the missing-data gauge. We evaluate three different models to predict rainfall amounts: linear regression, local regression (locfit) and Support Vector Machines. We compare these models with common methods used in the literature, namely ordinary Kriging and nearest neighbor methods. The results show that the methodology proposed here yields accurate estimates of daily rainfall.
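
    As an illustrative aside, the two-stage scheme described above (occurrence classification followed by amount regression) can be sketched as follows; the neighbouring-station data are synthetic stand-ins.

      import numpy as np
      from sklearn.svm import SVC, SVR

      # Two-stage infilling sketch: classify the missing gauge as wet/dry from
      # neighbouring stations, then regress the amount using only wet days.
      rng = np.random.default_rng(0)
      n = 1500
      neighbours = rng.gamma(0.6, 8.0, size=(n, 5)) * (rng.random((n, 5)) < 0.4)
      target = np.clip(0.5 * neighbours.mean(axis=1) + rng.normal(0, 0.3, n), 0, None)
      wet = (target > 0.1).astype(int)

      occ = SVC().fit(neighbours[:1000], wet[:1000])           # stage 1: occurrence
      amt = SVR().fit(neighbours[:1000][wet[:1000] == 1],      # stage 2: amounts
                      target[:1000][wet[:1000] == 1])

      pred_wet = occ.predict(neighbours[1000:])
      pred = np.where(pred_wet == 1, amt.predict(neighbours[1000:]), 0.0)
      print("mean absolute error:", np.abs(pred - target[1000:]).mean())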

  19. Machine Shop. Student Learning Guide.

    ERIC Educational Resources Information Center

    Palm Beach County Board of Public Instruction, West Palm Beach, FL.

    This student learning guide contains eight modules for completing a course in machine shop. It is designed especially for use in Palm Beach County, Florida. Each module covers one task, and consists of a purpose, performance objective, enabling objectives, learning activities and resources, information sheets, student self-check with answer key,…

  20. Machine Learning Methods for the Sampling of Chemical Space From First Principles

    NASA Astrophysics Data System (ADS)

    von Lilienfeld, Anatole

    2015-03-01

    Computational brute force high-throughput screening of compounds is beyond any capacity for all but the most restricted systems due to the combinatorial nature of chemical space, i.e. all the compositional, constitutional, and conformational isomers. Efficient computational materials design algorithms must therefore make good trade-offs between the accuracy of the applied model and computational speed. Overall, rapid convergence in terms of the number of compounds visited is highly desirable. In this talk, I will describe recent contributions in this field based on statistical approaches that can serve as inexpensive surrogate models to reduce the computational load of quantum mechanical calculations. Such surrogate machine learning (ML) models infer quantum mechanical observables of novel materials, rather than solving approximate variants of Schroedinger's equation. We developed accurate ML models for the rapid prediction of atomization energies and enthalpies, cohesive energies, and electronic properties that conventionally can only be predicted using quantum mechanics. All our ML models have been trained using large databases containing properties of thousands of chemical compounds and materials. I will exemplify our approach for the prediction of properties from scratch for out-of-sample compounds. These predictions reach quantum chemical accuracy and are basically instantaneous, i.e. at a computational cost reduced by several orders of magnitude.

  1. Application of Geostatistical Methods and Machine Learning for spatio-temporal Earthquake Cluster Analysis

    NASA Astrophysics Data System (ADS)

    Schaefer, A. M.; Daniell, J. E.; Wenzel, F.

    2014-12-01

    Earthquake clustering is an increasingly important part of general earthquake research, especially in terms of seismic hazard assessment and earthquake forecasting and prediction approaches. The distinct identification and definition of foreshocks, aftershocks, mainshocks and secondary mainshocks is carried out using a point-based spatio-temporal clustering algorithm originating from the field of classic machine learning. This can be further applied for declustering purposes, i.e. to separate background seismicity from triggered seismicity. The results are interpreted and processed to assemble 3D (x, y, t) earthquake clustering maps based on smoothed seismicity records in space and time. In addition, multi-dimensional Gaussian functions are used to capture clustering parameters for spatial distribution and dominant orientations. Clusters are further processed using methodologies originating from geostatistics, which have mostly been applied and developed in mining projects over the last decades. A 2.5D variogram analysis is applied to identify spatio-temporal homogeneity in terms of earthquake density and energy output. The results are interpolated using Kriging to provide an accurate mapping solution for clustering features. As a case study, seismic data from New Zealand and the United States are used, covering events since the 1950s, from which an earthquake cluster catalogue is assembled for most of the major events, including a detailed analysis of the Landers and Christchurch sequences.
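
    As an illustrative aside, a point-based spatio-temporal clustering of earthquake events can be sketched with DBSCAN (used here only as a stand-in for the algorithm described above) on synthetic (x, y, t) points; the space-time scaling is an assumption for illustration.

      import numpy as np
      from sklearn.cluster import DBSCAN

      # Events are points in (x, y, t); time is rescaled so that one "epsilon"
      # covers both kilometres and days.
      rng = np.random.default_rng(0)
      background = np.column_stack([rng.uniform(0, 500, 400),     # x [km]
                                    rng.uniform(0, 500, 400),     # y [km]
                                    rng.uniform(0, 3650, 400)])   # t [days]
      sequence = np.column_stack([rng.normal(250, 5, 120),
                                  rng.normal(250, 5, 120),
                                  rng.normal(1800, 10, 120)])     # aftershock-like burst
      events = np.vstack([background, sequence])

      scaled = events / np.array([10.0, 10.0, 30.0])   # 10 km ~ 30 days per unit
      labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(scaled)
      print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
      print("events flagged as background (-1):", (labels == -1).sum())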

  2. Comparative analysis of expert and machine-learning methods for classification of body cavity effusions in companion animals.

    PubMed

    Hotz, Christine S; Templeton, Steven J; Christopher, Mary M

    2005-03-01

    A rule-based expert system using the CLIPS programming language was created to classify body cavity effusions as transudates, modified transudates, exudates, chylous effusions, and hemorrhagic effusions. The diagnostic accuracy of the rule-based system was compared with that produced by 2 machine-learning methods: Rosetta, a rough-sets algorithm, and RIPPER, a rule-induction method. Results of 508 body cavity fluid analyses (canine, feline, equine) obtained from the University of California-Davis Veterinary Medical Teaching Hospital computerized patient database were used to test CLIPS and to test and train RIPPER and Rosetta. The CLIPS system, using 17 rules, achieved an accuracy of 93.5% compared with pathologist consensus diagnoses. Rosetta accurately classified 91% of effusions using 5,479 rules. RIPPER achieved the greatest accuracy (95.5%) using only 10 rules. When the original rules of the CLIPS application were replaced with those of RIPPER, the accuracy rates were identical. These results suggest that both rule-based expert systems and machine-learning methods hold promise for the preliminary classification of body fluids in the clinical laboratory. PMID:15825497

  3. Data Processing And Machine Learning Methods For Multi-Modal Operator State Classification Systems

    NASA Technical Reports Server (NTRS)

    Hearn, Tristan A.

    2015-01-01

    This document is intended as an introduction to a set of common signal processing learning methods that may be used in the software portion of a functional crew state monitoring system. This includes overviews of both the theory of the methods involved, as well as examples of implementation. Practical considerations are discussed for implementing modular, flexible, and scalable processing and classification software for a multi-modal, multi-channel monitoring system. Example source code is also given for all of the discussed processing and classification methods.

  4. Game-powered machine learning.

    PubMed

    Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert

    2012-04-24

    Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the "wisdom of the crowds." Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., "funky jazz with saxophone," "spooky electronica," etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data. PMID:22460786

  5. Game-powered machine learning

    PubMed Central

    Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert

    2012-01-01

    Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the “wisdom of the crowds.” Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., “funky jazz with saxophone,” “spooky electronica,” etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data. PMID:22460786

  6. Recent Advances in Predictive (Machine) Learning

    SciTech Connect

    Friedman, J

    2004-01-24

    Prediction involves estimating the unknown value of an attribute of a system under study given the values of other measured attributes. In predictive (machine) learning the prediction rule is derived from data consisting of previously solved cases. Most methods for predictive learning originated many years ago, at the dawn of the computer age. Recently two new techniques have emerged that have revitalized the field: support vector machines and boosted decision trees. This paper provides an introduction to these two new methods, tracing their respective ancestral roots to standard kernel methods and ordinary decision trees.

  7. Machine learning: Trends, perspectives, and prospects.

    PubMed

    Jordan, M I; Mitchell, T M

    2015-07-17

    Machine learning addresses the question of how to build computers that improve automatically through experience. It is one of today's most rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science. Recent progress in machine learning has been driven both by the development of new learning algorithms and theory and by the ongoing explosion in the availability of online data and low-cost computation. The adoption of data-intensive machine-learning methods can be found throughout science, technology and commerce, leading to more evidence-based decision-making across many walks of life, including health care, manufacturing, education, financial modeling, policing, and marketing. PMID:26185243

  8. Machine learning in motion control

    NASA Technical Reports Server (NTRS)

    Su, Renjeng; Kermiche, Noureddine

    1989-01-01

    The existing methodologies for robot programming originate primarily from robotic applications to manufacturing, where uncertainties of the robots and their task environment may be minimized by repeated off-line modeling and identification. In space applications of robots, however, a higher degree of automation is required for robot programming because of the desire to minimize human intervention. We discuss a new paradigm of robotic programming which is based on the concept of machine learning. The goal is to let robots practice tasks by themselves, with the operational data used to automatically improve their motion performance. The underlying mathematical problem is to solve the dynamical inverse problem by iterative methods. One of the key questions is how to ensure the convergence of the iterative process. There have been a few small steps taken toward this important approach to robot programming. We give a representative result on the convergence problem.

  9. Machine learning in sedimentation modelling.

    PubMed

    Bhattacharya, B; Solomatine, D P

    2006-03-01

    The paper presents machine learning (ML) models that predict sedimentation in the harbour basin of the Port of Rotterdam. The important factors affecting the sedimentation process such as waves, wind, tides, surge, river discharge, etc. are studied, the corresponding time series data is analysed, missing values are estimated and the most important variables behind the process are chosen as the inputs. Two ML methods are used: MLP ANN and M5 model tree. The latter is a collection of piece-wise linear regression models, each being an expert for a particular region of the input space. The models are trained on the data collected during 1992-1998 and tested by the data of 1999-2000. The predictive accuracy of the models is found to be adequate for the potential use in the operational decision making. PMID:16530383

  10. Blind steganalysis method for JPEG steganography combined with the semisupervised learning and soft margin support vector machine

    NASA Astrophysics Data System (ADS)

    Dong, Yu; Zhang, Tao; Xi, Ling

    2015-01-01

    Stego images embedded by unknown steganographic algorithms currently may not be detected by steganalysis detectors based on binary classifiers. However, it is difficult to obtain high detection accuracy using universal steganalysis based on a one-class classifier. To solve this problem, a blind detection method for JPEG steganography was proposed from the perspective of information theory. The proposed method combines semisupervised learning and a soft-margin support vector machine with a steganalysis detector based on a one-class classifier, so as to exploit the information in the test data to improve detection performance. Reliable blind detection of JPEG steganography is realized using only cover images for training. The experimental results show that the proposed method can improve the detection accuracy of one-class steganalysis detectors and has good robustness under different source mismatch conditions.
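
    As an illustrative aside, a one-class detector trained only on cover-image features, in the spirit of the approach above, can be sketched as follows; the features are synthetic stand-ins rather than real JPEG steganalysis features, and the semisupervised refinement is not included.

      import numpy as np
      from sklearn.svm import OneClassSVM
      from sklearn.preprocessing import StandardScaler

      # Train only on cover-image features, then flag test images whose
      # features fall outside the learned region as suspicious.
      rng = np.random.default_rng(0)
      cover_train = rng.normal(0, 1, size=(500, 30))        # stand-in feature vectors
      cover_test = rng.normal(0, 1, size=(100, 30))
      stego_test = rng.normal(0.4, 1.1, size=(100, 30))     # slightly shifted by embedding

      scaler = StandardScaler().fit(cover_train)
      detector = OneClassSVM(kernel="rbf", nu=0.05).fit(scaler.transform(cover_train))

      print("cover flagged:", (detector.predict(scaler.transform(cover_test)) == -1).mean())
      print("stego flagged:", (detector.predict(scaler.transform(stego_test)) == -1).mean())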

  11. Prediction of Backbreak in Open-Pit Blasting Operations Using the Machine Learning Method

    NASA Astrophysics Data System (ADS)

    Khandelwal, Manoj; Monjezi, M.

    2013-03-01

    Backbreak is an undesirable phenomenon in blasting operations. It can cause instability of mine walls, falling down of machinery, improper fragmentation, reduced efficiency of drilling, etc. The existence of various effective parameters and their unknown relationships are the main reasons for inaccuracy of the empirical models. Presently, the application of new approaches such as artificial intelligence is highly recommended. In this paper, an attempt has been made to predict backbreak in blasting operations of Soungun iron mine, Iran, incorporating rock properties and blast design parameters using the support vector machine (SVM) method. To investigate the suitability of this approach, the predictions by SVM have been compared with multivariate regression analysis (MVRA). The coefficient of determination (CoD) and the mean absolute error (MAE) were taken as performance measures. It was found that the CoD between measured and predicted backbreak was 0.987 and 0.89 by SVM and MVRA, respectively, whereas the MAE was 0.29 and 1.07 by SVM and MVRA, respectively.

  12. Remotely controlling of mobile robots using gesture captured by the Kinect and recognized by machine learning method

    NASA Astrophysics Data System (ADS)

    Hsu, Roy CHaoming; Jian, Jhih-Wei; Lin, Chih-Chuan; Lai, Chien-Hung; Liu, Cheng-Ting

    2013-01-01

    The main purpose of this paper is to use a machine learning method together with the Kinect and its body sensing technology to design a simple, convenient, yet effective robot remote control system. In this study, a Kinect sensor is used to capture the human body skeleton with depth information, and a gesture training and identification method is designed using a back-propagation neural network to remotely command a mobile robot to perform certain actions via Bluetooth. The experimental results show that the designed mobile robot remote control system can achieve, on average, more than 96% accuracy in identifying 7 types of gestures and can effectively control a real e-puck robot with the designed commands.

  13. Machine learning methods for empirical streamflow simulation: a comparison of model accuracy, interpretability, and uncertainty in seasonal watersheds

    NASA Astrophysics Data System (ADS)

    Shortridge, Julie E.; Guikema, Seth D.; Zaitchik, Benjamin F.

    2016-07-01

    In the past decade, machine learning methods for empirical rainfall-runoff modeling have seen extensive development and been proposed as a useful complement to physical hydrologic models, particularly in basins where data to support process-based models are limited. However, the majority of research has focused on a small number of methods, such as artificial neural networks, despite the development of multiple other approaches for non-parametric regression in recent years. Furthermore, this work has often evaluated model performance based on predictive accuracy alone, while not considering broader objectives, such as model interpretability and uncertainty, that are important if such methods are to be used for planning and management decisions. In this paper, we use multiple regression and machine learning approaches (including generalized additive models, multivariate adaptive regression splines, artificial neural networks, random forests, and M5 cubist models) to simulate monthly streamflow in five highly seasonal rivers in the highlands of Ethiopia and compare their performance in terms of predictive accuracy, error structure and bias, model interpretability, and uncertainty when faced with extreme climate conditions. While the relative predictive performance of models differed across basins, data-driven approaches were able to achieve reduced errors when compared to physical models developed for the region. Methods such as random forests and generalized additive models may have advantages in terms of visualization and interpretation of model structure, which can be useful in providing insights into physical watershed function. However, the uncertainty associated with model predictions under extreme climate conditions should be carefully evaluated, since certain models (especially generalized additive models and multivariate adaptive regression splines) become highly variable when faced with high temperatures.

  14. A general procedure to generate models for urban environmental-noise pollution using feature selection and machine learning methods.

    PubMed

    Torija, Antonio J; Ruiz, Diego P

    2015-02-01

    The prediction of environmental noise in urban environments requires the solution of a complex and non-linear problem, since there are complex relationships among the multitude of variables involved in the characterization and modelling of environmental noise and environmental-noise magnitudes. Moreover, the inclusion of the great spatial heterogeneity characteristic of urban environments seems to be essential in order to achieve an accurate environmental-noise prediction in cities. This problem is addressed in this paper, where a procedure based on feature-selection techniques and machine-learning regression methods is proposed and applied to this environmental problem. Three machine-learning regression methods, which are considered very robust in solving non-linear problems, are used to estimate the energy-equivalent sound-pressure level descriptor (LAeq). These three methods are: (i) multilayer perceptron (MLP), (ii) sequential minimal optimisation (SMO), and (iii) Gaussian processes for regression (GPR). In addition, because of the high number of input variables involved in environmental-noise modelling and estimation in urban environments, which make LAeq prediction models quite complex and costly in terms of time and resources for application to real situations, three different techniques are used to approach feature selection or data reduction. The feature-selection techniques used are: (i) correlation-based feature-subset selection (CFS), (ii) wrapper for feature-subset selection (WFS), and the data reduction technique is principal-component analysis (PCA). The subsequent analysis leads to a proposal of different schemes, depending on the needs regarding data collection and accuracy. The use of WFS as the feature-selection technique with the implementation of SMO or GPR as regression algorithm provides the best LAeq estimation (R(2)=0.94 and mean absolute error (MAE)=1.14-1.16 dB(A)). PMID:25461071
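
    As an illustrative aside, the data-reduction-plus-regression scheme described above can be sketched with a PCA + Gaussian-process pipeline; the descriptors and LAeq values below are synthetic stand-ins, and PCA stands in for the CFS/WFS alternatives.

      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.gaussian_process import GaussianProcessRegressor
      from sklearn.gaussian_process.kernels import RBF, WhiteKernel
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      # Compress many noisy urban descriptors with PCA, then regress LAeq
      # with a Gaussian process.
      rng = np.random.default_rng(0)
      n, p = 400, 25
      X = rng.normal(size=(n, p))                                  # traffic, geometry, etc.
      laeq = 65 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)  # dB(A)

      model = make_pipeline(StandardScaler(), PCA(n_components=8),
                            GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                                     normalize_y=True))
      mae = -cross_val_score(model, X, laeq, cv=5,
                             scoring="neg_mean_absolute_error").mean()
      print("cross-validated MAE [dB(A)]:", round(mae, 2))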

  15. New Learning Method of a Lecture of ‘Machine Fabrication’ by Self-study with Investigation and Presentation Incorporated

    NASA Astrophysics Data System (ADS)

    Kasuga, Yukio

    A new teaching method was developed for the undergraduate lecture course ‘machine fabrication’. It consists of a few lectures, grouping of students, selection of the industrial products each group wants to investigate, investigation using library books and the internet, arrangement of data on the products' characteristics, materials employed and processing methods, a presentation, discussion, and revision followed by a second presentation. The new method is derived from Finland's approach to primary school education, which is believed to have lifted that country to the top ranking in the OECD PISA tests. After the new way of learning was introduced, students reported fresh impressions of the lesson, especially regarding self-study, the investigation process, collaborative work and presentation. After four years of implementation, some improvements have also been made, including less reliance on the internet and advance determination of the products and fabrication methods to be investigated. As a result, students' lecture assessments show further encouraging results.

  16. A machine learning method for extracting symbolic knowledge from recurrent neural networks.

    PubMed

    Vahed, A; Omlin, C W

    2004-01-01

    Neural networks do not readily provide an explanation of the knowledge stored in their weights as part of their information processing. Until recently, neural networks were considered to be black boxes, with the knowledge stored in their weights not readily accessible. Since then, research has resulted in a number of algorithms for extracting knowledge in symbolic form from trained neural networks. This article addresses the extraction of knowledge in symbolic form from recurrent neural networks trained to behave like deterministic finite-state automata (DFAs). To date, methods used to extract knowledge from such networks have relied on the hypothesis that networks' states tend to cluster and that clusters of network states correspond to DFA states. The computational complexity of such a cluster analysis has led to heuristics that either limit the number of clusters that may form during training or limit the exploration of the space of hidden recurrent state neurons. These limitations, while necessary, may lead to decreased fidelity, in which the extracted knowledge may not model the true behavior of a trained network, perhaps not even for the training set. The method proposed here uses a polynomial time, symbolic learning algorithm to infer DFAs solely from the observation of a trained network's input-output behavior. Thus, this method has the potential to increase the fidelity of the extracted knowledge. PMID:15006023

  17. Dropout Prediction in E-Learning Courses through the Combination of Machine Learning Techniques

    ERIC Educational Resources Information Center

    Lykourentzou, Ioanna; Giannoukos, Ioannis; Nikolopoulos, Vassilis; Mpardis, George; Loumos, Vassili

    2009-01-01

    In this paper, a dropout prediction method for e-learning courses, based on three popular machine learning techniques and detailed student data, is proposed. The machine learning techniques used are feed-forward neural networks, support vector machines and probabilistic ensemble simplified fuzzy ARTMAP. Since a single technique may fail to…

  18. Entanglement-Based Machine Learning on a Quantum Computer

    NASA Astrophysics Data System (ADS)

    Cai, X.-D.; Wu, D.; Su, Z.-E.; Chen, M.-C.; Wang, X.-L.; Li, Li; Liu, N.-L.; Lu, C.-Y.; Pan, J.-W.

    2015-03-01

    Machine learning, a branch of artificial intelligence, learns from previous experience to optimize performance, which is ubiquitous in various fields such as computer sciences, financial analysis, robotics, and bioinformatics. A challenge is that machine learning with the rapidly growing "big data" could become intractable for classical computers. Recently, quantum machine learning algorithms [Lloyd, Mohseni, and Rebentrost, arXiv.1307.0411] were proposed which could offer an exponential speedup over classical algorithms. Here, we report the first experimental entanglement-based classification of two-, four-, and eight-dimensional vectors to different clusters using a small-scale photonic quantum computer, which are then used to implement supervised and unsupervised machine learning. The results demonstrate the working principle of using quantum computers to manipulate and classify high-dimensional vectors, the core mathematical routine in machine learning. The method can, in principle, be scaled to larger numbers of qubits, and may provide a new route to accelerate machine learning.

  19. Entanglement-based machine learning on a quantum computer.

    PubMed

    Cai, X-D; Wu, D; Su, Z-E; Chen, M-C; Wang, X-L; Li, Li; Liu, N-L; Lu, C-Y; Pan, J-W

    2015-03-20

    Machine learning, a branch of artificial intelligence, learns from previous experience to optimize performance, which is ubiquitous in various fields such as computer sciences, financial analysis, robotics, and bioinformatics. A challenge is that machine learning with the rapidly growing "big data" could become intractable for classical computers. Recently, quantum machine learning algorithms [Lloyd, Mohseni, and Rebentrost, arXiv.1307.0411] were proposed which could offer an exponential speedup over classical algorithms. Here, we report the first experimental entanglement-based classification of two-, four-, and eight-dimensional vectors to different clusters using a small-scale photonic quantum computer, which are then used to implement supervised and unsupervised machine learning. The results demonstrate the working principle of using quantum computers to manipulate and classify high-dimensional vectors, the core mathematical routine in machine learning. The method can, in principle, be scaled to larger numbers of qubits, and may provide a new route to accelerate machine learning. PMID:25839250

  20. Vitrification: Machines learn to recognize glasses

    NASA Astrophysics Data System (ADS)

    Ceriotti, Michele; Vitelli, Vincenzo

    2016-05-01

    The dynamics of a viscous liquid undergo a dramatic slowdown when it is cooled to form a solid glass. Recognizing the structural changes across such a transition remains a major challenge. Machine-learning methods, similar to those Facebook uses to recognize groups of friends, have now been applied to this problem.

  1. A Comparison of Methods for Classifying Clinical Samples Based on Proteomics Data: A Case Study for Statistical and Machine Learning Approaches

    PubMed Central

    Sampson, Dayle L.; Parker, Tony J.; Upton, Zee; Hurst, Cameron P.

    2011-01-01

    The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for elucidating information from multiple biological variables is the so-called “omics” disciplines of the biological sciences. Such variability is uncovered by the implementation of multivariable data mining techniques, which fall under two primary categories: machine learning strategies and statistical based approaches. Typically proteomic studies can produce hundreds or thousands of variables, p, per observation, n, depending on the analytical platform or method employed to generate the data. Many classification methods are limited by an n≪p constraint, and as such, require pre-treatment to reduce the dimensionality prior to classification. Recently machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model allowing meaningful interpretation of results in terms of the features used for classification. This is a problem that might be solved using a statistical model-based approach where not only is the importance of the individual proteins explicit, but they are also combined into a readily interpretable classification rule without relying on a black box approach. Here we incorporate the statistical dimension reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA), followed by both statistical and machine learning classification methods, and compare them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems. PMID:21969867
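
    As an illustrative aside, the n ≪ p workflow described above (supervised dimension reduction followed by classification, compared against an SVM) can be sketched as follows on synthetic data; the component counts and classifiers are arbitrary illustrative choices.

      from sklearn.cross_decomposition import PLSRegression
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      # Few samples, many "peak intensity" features: reduce dimension with PLS
      # (fit on training data only), classify the scores, compare with a linear SVM.
      X, y = make_classification(n_samples=80, n_features=500, n_informative=20,
                                 random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
      scaler = StandardScaler().fit(X_tr)
      X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

      pls = PLSRegression(n_components=5).fit(X_tr_s, y_tr)
      clf = LogisticRegression(max_iter=1000).fit(pls.transform(X_tr_s), y_tr)
      print("PLS + logistic accuracy:", clf.score(pls.transform(X_te_s), y_te))

      svm = SVC(kernel="linear").fit(X_tr_s, y_tr)
      print("linear SVM accuracy:   ", svm.score(X_te_s, y_te))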

  2. Machine learning phases of matter

    NASA Astrophysics Data System (ADS)

    Carrasquilla, Juan; Stoudenmire, Miles; Melko, Roger

    We show how the technology that allows automatic teller machines to read hand-written digits on cheques can be used to encode and recognize phases of matter and phase transitions in many-body systems. In particular, we analyze the (quasi-)order-disorder transitions in the classical Ising and XY models. Furthermore, we successfully use machine learning to study classical Z2 gauge theories that have important technological applications in the coming wave of quantum information technologies and whose phase transitions have no conventional order parameter.

  3. Learning Extended Finite State Machines

    NASA Technical Reports Server (NTRS)

    Cassel, Sofia; Howar, Falk; Jonsson, Bengt; Steffen, Bernhard

    2014-01-01

    We present an active learning algorithm for inferring extended finite state machines (EFSM)s, combining data flow and control behavior. Key to our learning technique is a novel learning model based on so-called tree queries. The learning algorithm uses the tree queries to infer symbolic data constraints on parameters, e.g., sequence numbers, time stamps, identifiers, or even simple arithmetic. We describe sufficient conditions for the properties that the symbolic constraints provided by a tree query in general must have to be usable in our learning model. We have evaluated our algorithm in a black-box scenario, where tree queries are realized through (black-box) testing. Our case studies include connection establishment in TCP and a priority queue from the Java Class Library.

  4. Learning Machine Learning: A Case Study

    ERIC Educational Resources Information Center

    Lavesson, N.

    2010-01-01

    This correspondence reports on a case study conducted in the Master's-level Machine Learning (ML) course at Blekinge Institute of Technology, Sweden. The students participated in a self-assessment test and a diagnostic test of prerequisite subjects, and their results on these tests are correlated with their achievement of the course's learning…

  5. Solar Flare Forecasting Using Time Series of SDO/HMI Vector Magnetic Field Data and Machine Learning Methods

    NASA Astrophysics Data System (ADS)

    Ilonidis, Stathis; Bobra, Monica G.; Couvidat, Sebastien

    2015-04-01

    This project is motivated by the need to understand the physical mechanisms that generate solar flares, and assess whether reliable data-driven flare forecasts are possible. We build a flare forecasting model that takes into account the temporal evolution of the active regions and provides improved forecasts for the next 24 hours. We use SDO/HMI vector magnetic field data for all the flaring regions with magnitude M1.0 or higher that have been observed with HMI and several thousand non-flaring regions. Each region is characterized by hundreds of features, including physical properties, such as the current helicity and the Lorentz force, as well as parameters that describe the temporal evolution of these properties over a two-day interval, starting 3 days and ending 1 day before the flare eruption. All of these features were used to train a Support Vector Machine (SVM), which is a supervised machine learning method used in classification problems. The results show that the SVM algorithm can achieve a True Skill Statistic of 0.91, an accuracy of 0.985, and a Heidke skill score of 0.861, improving the results of Bobra and Couvidat (2015).
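
    A hedged sketch of the general workflow, using placeholder feature vectors rather than the SDO/HMI parameters: train an SVM on labeled active-region features and score it with the True Skill Statistic, TSS = TP/(TP+FN) - FP/(FP+TN). The data, feature count, and kernel settings are illustrative assumptions only.

```python
# Hedged sketch: SVM on (hypothetical) active-region feature vectors, scored
# with the True Skill Statistic, TSS = TP/(TP+FN) - FP/(FP+TN).
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def true_skill_statistic(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fn) - fp / (fp + tn)

# Placeholder data: rows = regions, columns = feature summaries (e.g. current
# helicity, Lorentz force); 1 = flaring, 0 = non-flaring.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 25))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", class_weight="balanced"))
model.fit(X_tr, y_tr)
print("TSS:", round(true_skill_statistic(y_te, model.predict(X_te)), 3))
```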

  6. Patient-centered yes/no prognosis using learning machines

    PubMed Central

    König, I.R.; Malley, J.D.; Pajevic, S.; Weimar, C.; Diener, H-C.

    2009-01-01

    In the last 15 years several machine learning approaches have been developed for classification and regression. In an intuitive manner we introduce the main ideas of classification and regression trees, support vector machines, bagging, boosting and random forests. We discuss differences in the use of machine learning in the biomedical community and the computer sciences. We propose methods for comparing machines on a sound statistical basis. Data from the German Stroke Study Collaboration is used for illustration. We compare the results from learning machines to those obtained by a published logistic regression and discuss similarities and differences. PMID:19216340

  7. The Higgs Machine Learning Challenge

    NASA Astrophysics Data System (ADS)

    Adam-Bourdarios, C.; Cowan, G.; Germain-Renaud, C.; Guyon, I.; Kégl, B.; Rousseau, D.

    2015-12-01

    The Higgs Machine Learning Challenge was an open data analysis competition that took place between May and September 2014. Samples of simulated data from the ATLAS Experiment at the LHC corresponding to signal events with Higgs bosons decaying to τ+τ- together with background events were made available to the public through the website of the data science organization Kaggle (kaggle.com). Participants attempted to identify the search region in a space of 30 kinematic variables that would maximize the expected discovery significance of the signal process. One of the primary goals of the Challenge was to promote communication of new ideas between the Machine Learning (ML) and HEP communities. In this regard it was a resounding success, with almost 2,000 participants from HEP, ML and other areas. The process of understanding and integrating the new ideas, particularly from ML into HEP, is currently underway.

  8. Applying Sparse Machine Learning Methods to Twitter: Analysis of the 2012 Change in Pap Smear Guidelines. A Sequential Mixed-Methods Study

    PubMed Central

    Godbehere, Andrew; Le, Gem; El Ghaoui, Laurent; Sarkar, Urmimala

    2016-01-01

    Background It is difficult to synthesize the vast amount of textual data available from social media websites. Capturing real-world discussions via social media could provide insights into individuals’ opinions and the decision-making process. Objective We conducted a sequential mixed methods study to determine the utility of sparse machine learning techniques in summarizing Twitter dialogues. We chose a narrowly defined topic for this approach: cervical cancer discussions over a 6-month time period surrounding a change in Pap smear screening guidelines. Methods We applied statistical methodologies, known as sparse machine learning algorithms, to summarize Twitter messages about cervical cancer before and after the 2012 change in Pap smear screening guidelines by the US Preventive Services Task Force (USPSTF). All messages containing the search terms “cervical cancer,” “Pap smear,” and “Pap test” were analyzed during: (1) January 1–March 13, 2012, and (2) March 14–June 30, 2012. Topic modeling was used to discern the most common topics from each time period, and determine the singular value criterion for each topic. The results were then qualitatively coded from the top 10 relevant topics to determine the efficiency of the clustering method in grouping distinct ideas, and how the discussion differed before vs. after the change in guidelines. Results This machine learning method was effective in grouping the relevant discussion topics about cervical cancer during the respective time periods (~20% overall irrelevant content in both time periods). Qualitative analysis determined that a significant portion of the top discussion topics in the second time period directly reflected the USPSTF guideline change (eg, “New Screening Guidelines for Cervical Cancer”), and many topics in both time periods were addressing basic screening promotion and education (eg, “It is Cervical Cancer Awareness Month! Click the link to see where you can receive a free or low
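
    The abstract does not name the specific sparse algorithm, so the sketch below uses non-negative matrix factorization on TF-IDF vectors purely as a generic stand-in for topic modeling of short messages; the four example "tweets" are invented placeholders.

```python
# Generic topic-modeling sketch (NMF on TF-IDF) standing in for the paper's
# sparse machine-learning approach; the tweet corpus here is a toy placeholder.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "new screening guidelines for cervical cancer announced today",
    "it is cervical cancer awareness month get a free pap smear",
    "uspstf says pap test every three years for most women",
    "ask your doctor about cervical cancer screening",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(tweets)

nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)            # tweet-by-topic weights
terms = vec.get_feature_names_out()
for k, comp in enumerate(nmf.components_):
    top = [terms[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {k}:", ", ".join(top))
```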

  9. A machine learning method for identifying morphological patterns in reflectance confocal microscopy mosaics of melanocytic skin lesions in-vivo

    NASA Astrophysics Data System (ADS)

    Kose, Kivanc; Alessi-Fox, Christi; Gill, Melissa; Dy, Jennifer G.; Brooks, Dana H.; Rajadhyaksha, Milind

    2016-02-01

    We present a machine learning algorithm that can imitate the clinician's qualitative and visual process of analyzing reflectance confocal microscopy (RCM) mosaics at the dermal-epidermal junction (DEJ) of skin. We divide the mosaics into localized areas of processing, and capture the textural appearance of each area using dense Speeded Up Robust Features (SURF). Using these features, we train a support vector machine (SVM) classifier that can distinguish between meshwork, ring, clod, aspecific and background patterns in benign conditions and melanomas. Preliminary results on 20 RCM mosaics labeled by expert readers show classification with 55-81% sensitivity and 81-89% specificity in distinguishing these patterns.

  10. Accurate prediction of polarised high order electrostatic interactions for hydrogen bonded complexes using the machine learning method kriging.

    PubMed

    Hughes, Timothy J; Kandathil, Shaun M; Popelier, Paul L A

    2015-02-01

    As intermolecular interactions such as the hydrogen bond are electrostatic in origin, rigorous treatment of this term within force field methodologies should be mandatory. We present a method capable of accurately reproducing such interactions for seven van der Waals complexes. It uses atomic multipole moments up to the hexadecupole moment, mapped to the positions of the nuclear coordinates by the machine learning method kriging. Models were built at three levels of theory: HF/6-31G(**), B3LYP/aug-cc-pVDZ and M06-2X/aug-cc-pVDZ. The quality of the kriging models was measured by their ability to predict the electrostatic interaction energy between atoms in external test examples for which the true energies are known. At all levels of theory, >90% of test cases for small van der Waals complexes were predicted within 1 kJ mol(-1), decreasing to 60-70% of test cases for larger base pair complexes. Models built on moments obtained at the B3LYP and M06-2X levels generally outperformed those at the HF level. For all systems the individual interactions were predicted with a mean unsigned error of less than 1 kJ mol(-1). PMID:24274986
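
    Kriging is closely related to Gaussian process regression, so a minimal sketch of that family is shown below on synthetic one-dimensional data; it is not the paper's multipole-moment model, and the kernel and noise settings are illustrative assumptions.

```python
# Minimal kriging-style sketch: Gaussian process regression mapping a geometric
# feature to an energy-like quantity. Data are synthetic placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(1)
X = rng.uniform(0.8, 3.0, size=(80, 1))           # e.g. an interatomic distance
y = -1.0 / X[:, 0] + 0.05 * rng.normal(size=80)   # toy "electrostatic" energy

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-3, normalize_y=True)
gp.fit(X, y)

X_test = np.linspace(0.8, 3.0, 5).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
for xi, m, s in zip(X_test[:, 0], mean, std):
    print(f"r={xi:.2f}  E={m:.3f} +/- {s:.3f}")
```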

  11. ETHNOPRED: a novel machine learning method for accurate continental and sub-continental ancestry identification and population stratification correction

    PubMed Central

    2013-01-01

    Background Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case–control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification but each has limitations. We provide an alternative technique to address population stratification. Results We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual’s continental and sub-continental ancestry. To predict an individual’s continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control’s λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of

  12. Scaling up: Distributed machine learning with cooperation

    SciTech Connect

    Provost, F.J.; Hennessy, D.N.

    1996-12-31

    Machine-learning methods are becoming increasingly popular for automated data analysis. However, standard methods do not scale up to massive scientific and business data sets without expensive hardware. This paper investigates a practical alternative for scaling up: the use of distributed processing to take advantage of the often dormant PCs and workstations available on local networks. Each workstation runs a common rule-learning program on a subset of the data. We first show that for commonly used rule-evaluation criteria, a simple form of cooperation can guarantee that a rule will look good to the set of cooperating learners if and only if it would look good to a single learner operating with the entire data set. We then show how such a system can further capitalize on different perspectives by sharing learned knowledge for significant reduction in search effort. We demonstrate the power of the method by learning from a massive data set taken from the domain of cellular fraud detection. Finally, we provide an overview of other methods for scaling up machine learning.

  13. Food category consumption and obesity prevalence across countries: an application of Machine Learning method to big data analysis

    NASA Astrophysics Data System (ADS)

    Dunstan, Jocelyn; Fallah-Fini, Saeideh; Nau, Claudia; Glass, Thomas; Global Obesity Prevention Center Team

    The application of sophisticated mathematical and numerical tools in public health has been demonstrated to be useful in predicting the outcomes of public interventions as well as in studying, for example, the main causes of obesity without doing experiments with the population. In this project we aim to understand which kinds of food consumed in different countries over time best explain the rate of obesity in those countries. The use of Machine Learning is particularly useful because we do not need to create a hypothesis and test it with the data, but instead we learn from the data to find the groups of food that best describe the prevalence of obesity.

  14. Introducing Machine Learning Concepts with WEKA.

    PubMed

    Smith, Tony C; Frank, Eibe

    2016-01-01

    This chapter presents an introduction to data mining with machine learning. It gives an overview of various types of machine learning, along with some examples. It explains how to download, install, and run the WEKA data mining toolkit on a simple data set, then proceeds to explain how one might approach a bioinformatics problem. Finally, it includes a brief summary of machine learning algorithms for other types of data mining problems, and provides suggestions about where to find additional information. PMID:27008023

  15. Topics in Machine Learning for Astronomers

    NASA Astrophysics Data System (ADS)

    Cisewski, Jessi

    2016-01-01

    As astronomical datasets continue to increase in size and complexity, innovative statistical and machine learning tools are required to address the scientific questions of interest in a computationally efficient manner. I will introduce some tools that astronomers can employ for such problems with a focus on clustering and classification techniques. I will introduce standard methods, but also get into more recent developments that may be of use to the astronomical community.

  16. Using machine learning methods for predicting inhospital mortality in patients undergoing open repair of abdominal aortic aneurysm.

    PubMed

    Monsalve-Torra, Ana; Ruiz-Fernandez, Daniel; Marin-Alonso, Oscar; Soriano-Payá, Antonio; Camacho-Mackenzie, Jaime; Carreño-Jaimes, Marisol

    2016-08-01

    An abdominal aortic aneurysm is an abnormal dilatation of the aortic vessel at abdominal level. This disease presents a high rate of mortality and complications, causing a decrease in quality of life and increasing the cost of treatment. Estimating the mortality risk of patients undergoing surgery is complex due to the variables associated. The use of clinical decision support systems based on machine learning could help medical staff improve the results of surgery and gain a better understanding of the disease. In this work, the authors present a predictive system of in-hospital mortality in patients who underwent open repair of abdominal aortic aneurysm. Different methods, such as multilayer perceptron, radial basis function and Bayesian networks, are used. Results are measured in terms of accuracy, sensitivity and specificity of the classifiers, achieving an accuracy higher than 95%. The development of a system based on the algorithms tested can be useful for medical staff in order to better plan care and reduce undesirable surgery results and the cost of post-surgical treatments. PMID:27395372
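
    A minimal sketch of one of the tested model families (a multilayer perceptron), trained on synthetic placeholder variables and reported with the same three metrics used in the paper; the network size and data are assumptions, not the authors' configuration.

```python
# Sketch (synthetic data): a multilayer-perceptron classifier for in-hospital
# mortality, reported as accuracy, sensitivity, and specificity.
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))          # placeholder clinical variables
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.7, size=600) > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                  random_state=0))
clf.fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print("accuracy   :", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
```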

  17. Machine Learning Methods for Binary and Multiclass Classification of Melanoma Thickness From Dermoscopic Images.

    PubMed

    Saez, Aurora; Sanchez-Monedero, Javier; Gutierrez, Pedro Antonio; Hervas-Martinez, Cesar

    2016-04-01

    Thickness of the melanoma is the most important factor associated with survival in patients with melanoma. It is most commonly reported as a measurement of depth given in millimeters (mm) and computed by means of pathological examination after a biopsy of the suspected lesion. In order to avoid the use of an invasive method in the estimation of the thickness of melanoma before surgery, we propose a computational image analysis system from dermoscopic images. The proposed feature extraction is based on the clinical findings that correlate certain characteristics present in dermoscopic images and tumor depth. Two supervised classification schemes are proposed: a binary classification in which melanomas are classified into thin or thick, and a three-class scheme (thin, intermediate, and thick). The performance of several nominal classification methods, including a recent interpretable method combining logistic regression with artificial neural networks (Logistic regression using Initial variables and Product Units, LIPU), is compared. For the three-class problem, a set of ordinal classification methods (considering ordering relation between the three classes) is included. For the binary case, LIPU outperforms all the other methods with an accuracy of 77.6%, while, for the second scheme, although LIPU reports the highest overall accuracy, the ordinal classification methods achieve a better balance between the performances of all classes. PMID:26672031

  18. An analysis of feature relevance in the classification of astronomical transients with machine learning methods

    NASA Astrophysics Data System (ADS)

    D'Isanto, A.; Cavuoti, S.; Brescia, M.; Donalek, C.; Longo, G.; Riccio, G.; Djorgovski, S. G.

    2016-04-01

    The exploitation of present and future synoptic (multiband and multi-epoch) surveys requires an extensive use of automatic methods for data processing and data interpretation. In this work, using data extracted from the Catalina Real Time Transient Survey (CRTS), we investigate the classification performance of some well tested methods: Random Forest, MultiLayer Perceptron with Quasi Newton Algorithm and K-Nearest Neighbours, paying special attention to the feature selection phase. In order to do so, several classification experiments were performed. Namely: identification of cataclysmic variables, separation between galactic and extragalactic objects and identification of supernovae.
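
    The feature-selection phase described above can be illustrated with a Random Forest's built-in importance ranking; the sketch below uses synthetic data and generic feature names in place of the CRTS light-curve features.

```python
# Sketch of the feature-relevance step: a Random Forest ranks (hypothetical)
# light-curve features by importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a transient-feature table.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # e.g. amplitude, period...

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
ranking = sorted(zip(rf.feature_importances_, feature_names), reverse=True)
for importance, name in ranking:
    print(f"{name}: {importance:.3f}")
```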

  19. Harnessing the power of big data: infusing the scientific method with machine learning to transform ecology

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Most efforts to harness the power of big data for ecology and environmental sciences focus on data and metadata sharing, standardization, and accuracy. However, many scientists have not accepted the data deluge as an integral part of their research because the current scientific method is not scalab...

  20. Machine Learning Methods for the Understanding and Prediction of Climate Systems: Tropical Pacific Ocean Thermocline and ENSO events

    NASA Astrophysics Data System (ADS)

    Lima, C. H.; Lall, U.

    2012-12-01

    In this work we explore recently developed methods from the machine learning community for dimensionality reduction and model selection on very large datasets. We apply the nonlinear maximum variance unfolding (MVU) method to find a low-dimensional space for the thermocline of the Tropical Pacific Ocean, as indicated by the ocean depth of the 20°C isotherm from the NOAA/NCEP GODAS dataset. The leading modes are then used as covariates in an ENSO forecast model based on LASSO regression, where parameters are shrunk in order to find the best subset of predictors. A comparison with Principal Component Analysis (PCA) reveals that MVU is able to reduce the thermocline data from 21009 dimensions to three main components that collectively explain 77% of the system variance, whereas the first three PCs account for only 47% of the variance. The series of the first three leading MVU- and PCA-based modes and their associated spatial patterns also show different features, including an enhanced monotonic upward trend in the first MVU mode that is hardly detected in the corresponding first PCA mode. Correlation analysis between the MVU components and the NINO3 index shows that each of the modes has peak correlations at different lag times, with statistically significant correlation coefficients up to two years. After combining the three MVU components across several lag times, a forecast model for NINO3 based on LASSO regression was built and tested using the ten-fold cross-validation method. Based on metrics such as the RMSE and correlation scores, the results show appreciable skill for lead times beyond ten months, particularly for the December NINO3, which is associated with several floods and droughts across the globe. (Figures: spatial signature of the first MVU mode; series of the first MVU mode featuring the upward trend.)
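
    A brief sketch of the forecasting step only: LASSO with ten-fold cross-validated shrinkage selecting a sparse subset of lagged predictors for a NINO3-like target. The predictors and target here are synthetic; the MVU dimensionality-reduction step is not reproduced.

```python
# Sketch of the forecasting step: LASSO with cross-validated shrinkage selects
# a sparse subset of lagged predictors for a NINO3-like target (synthetic data).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n_samples, n_lags = 400, 36            # e.g. 3 MVU modes x 12 monthly lags
X = rng.normal(size=(n_samples, n_lags))
y = (0.8 * X[:, 0] - 0.5 * X[:, 12] + 0.3 * X[:, 24]
     + rng.normal(scale=0.5, size=n_samples))

lasso = LassoCV(cv=10).fit(X, y)       # ten-fold cross-validation, as in the study
kept = np.flatnonzero(lasso.coef_)
print("selected predictors:", kept)
print("cross-validated alpha:", round(lasso.alpha_, 4))
```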

  1. Extracting semantic lexicons from discharge summaries using machine learning and the C-Value method.

    PubMed

    Jiang, Min; Denny, Josh C; Tang, Buzhou; Cao, Hongxin; Xu, Hua

    2012-01-01

    Semantic lexicons that link words and phrases to specific semantic types such as diseases are valuable assets for clinical natural language processing (NLP) systems. Although terminological terms with predefined semantic types can be generated easily from existing knowledge bases such as the Unified Medical Language System (UMLS), they are often limited and do not have good coverage for narrative clinical text. In this study, we developed a method for building semantic lexicons from a clinical corpus. It extracts candidate semantic terms using a conditional random field (CRF) classifier and then selects terms using the C-Value algorithm. We applied the method to a corpus containing 10 years of discharge summaries from Vanderbilt University Hospital (VUH) and extracted 44,957 new terms for three semantic groups: Problem, Treatment, and Test. A manual analysis of 200 randomly selected terms not found in the UMLS demonstrated that 59% of them were meaningful new clinical concepts and 25% were lexical variants of existing concepts in the UMLS. Furthermore, we compared the effectiveness of corpus-derived and UMLS-derived semantic lexicons in the concept extraction task of the 2010 i2b2 clinical NLP challenge. Our results showed that the classifier with corpus-derived semantic lexicons as features achieved a better performance (F-score 82.52%) than that with UMLS-derived semantic lexicons as features (F-score 82.04%). We conclude that such corpus-based methods are effective for generating semantic lexicons, which may improve named entity recognition tasks and may aid in augmenting synonymy within existing terminologies. PMID:23304311
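
    The term-selection step can be sketched with the standard C-Value score: log2 of the term length (in words) times its frequency, discounted by the mean frequency of longer candidates that contain it. The CRF extraction step is omitted and the candidate terms below are invented examples.

```python
# Sketch of the C-Value ranking step (the CRF candidate-extraction step is
# omitted). C-Value(a) = log2(|a|) * f(a) for terms never nested in a longer
# candidate, and log2(|a|) * (f(a) - mean frequency of the longer candidates
# containing a) otherwise.
import math
from collections import defaultdict

def c_value(term_freqs):
    """term_freqs: dict mapping candidate term (tuple of tokens) -> frequency."""
    nested = defaultdict(list)
    for longer in term_freqs:
        for shorter in term_freqs:
            if shorter != longer and len(shorter) < len(longer):
                n = len(shorter)
                # is `shorter` a contiguous sub-span of `longer`?
                if any(longer[i:i + n] == shorter
                       for i in range(len(longer) - n + 1)):
                    nested[shorter].append(term_freqs[longer])
    scores = {}
    for term, freq in term_freqs.items():
        discount = sum(nested[term]) / len(nested[term]) if nested[term] else 0.0
        scores[term] = math.log2(len(term)) * (freq - discount) if len(term) > 1 else 0.0
    return scores

candidates = {
    ("chest", "pain"): 40,
    ("acute", "chest", "pain"): 12,
    ("pain",): 90,
}
for term, score in sorted(c_value(candidates).items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 2))
```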

  2. Applying GIS and Machine Learning Methods to Twitter Data for Multiscale Surveillance of Influenza.

    PubMed

    Allen, Chris; Tsou, Ming-Hsiang; Aslam, Anoshe; Nagel, Anna; Gawron, Jean-Mark

    2016-01-01

    Traditional methods for monitoring influenza are haphazard and lack fine-grained details regarding the spatial and temporal dynamics of outbreaks. Twitter gives researchers and public health officials an opportunity to examine the spread of influenza in real-time and at multiple geographical scales. In this paper, we introduce an improved framework for monitoring influenza outbreaks using the social media platform Twitter. Relying upon techniques from geographic information science (GIS) and data mining, Twitter messages were collected, filtered, and analyzed for the thirty most populated cities in the United States during the 2013-2014 flu season. The results of this procedure are compared with national, regional, and local flu outbreak reports, revealing a statistically significant correlation between the two data sources. The main contribution of this paper is to introduce a comprehensive data mining process that enhances previous attempts to accurately identify tweets related to influenza. Additionally, geographical information systems allow us to target, filter, and normalize Twitter messages. PMID:27455108

  3. Applying GIS and Machine Learning Methods to Twitter Data for Multiscale Surveillance of Influenza

    PubMed Central

    Aslam, Anoshe; Nagel, Anna; Gawron, Jean-Mark

    2016-01-01

    Traditional methods for monitoring influenza are haphazard and lack fine-grained details regarding the spatial and temporal dynamics of outbreaks. Twitter gives researchers and public health officials an opportunity to examine the spread of influenza in real-time and at multiple geographical scales. In this paper, we introduce an improved framework for monitoring influenza outbreaks using the social media platform Twitter. Relying upon techniques from geographic information science (GIS) and data mining, Twitter messages were collected, filtered, and analyzed for the thirty most populated cities in the United States during the 2013–2014 flu season. The results of this procedure are compared with national, regional, and local flu outbreak reports, revealing a statistically significant correlation between the two data sources. The main contribution of this paper is to introduce a comprehensive data mining process that enhances previous attempts to accurately identify tweets related to influenza. Additionally, geographical information systems allow us to target, filter, and normalize Twitter messages. PMID:27455108

  4. Rapid estimation of compost enzymatic activity by spectral analysis method combined with machine learning.

    PubMed

    Chakraborty, Somsubhra; Das, Bhabani S; Ali, Md Nasim; Li, Bin; Sarathjith, M C; Majumdar, K; Ray, D P

    2014-03-01

    The aim of this study was to investigate the feasibility of using visible near-infrared (VisNIR) diffuse reflectance spectroscopy (DRS) as an easy, inexpensive, and rapid method to predict compost enzymatic activity, which is traditionally measured by the fluorescein diacetate hydrolysis (FDA-HR) assay. Compost samples representative of five different compost facilities were scanned by DRS, and the raw reflectance spectra were preprocessed using seven spectral transformations for predicting compost FDA-HR with six multivariate algorithms. Although principal component analysis for all spectral pretreatments satisfactorily identified the clusters by compost types, it could not separate different FDA contents. Furthermore, the artificial neural network multilayer perceptron (residual prediction deviation=3.2, validation r(2)=0.91 and RMSE=13.38 μg g(-1) h(-1)) outperformed other multivariate models in capturing the highly non-linear relationships between compost enzymatic activity and VisNIR reflectance spectra after Savitzky-Golay first derivative pretreatment. This work demonstrates the efficiency of VisNIR DRS for predicting compost enzymatic as well as microbial activity. PMID:24398221
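
    A sketch of the best-performing combination reported above, Savitzky-Golay first-derivative pretreatment followed by a multilayer-perceptron regressor, applied to synthetic placeholder spectra rather than the study's VisNIR data; window length and network size are assumptions.

```python
# Sketch: Savitzky-Golay first-derivative spectra fed to an MLP regressor.
# Spectra and FDA-HR values are synthetic placeholders.
import numpy as np
from scipy.signal import savgol_filter
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
spectra = rng.normal(size=(120, 200)).cumsum(axis=1)   # fake reflectance spectra
fda_hr = spectra[:, 50] - spectra[:, 150] + rng.normal(scale=0.5, size=120)

# First-derivative pretreatment along the wavelength axis.
X = savgol_filter(spectra, window_length=11, polyorder=2, deriv=1, axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X, fda_hr, random_state=0)
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000,
                                   random_state=0))
model.fit(X_tr, y_tr)
print("validation R^2:", round(model.score(X_te, y_te), 2))
```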

  5. Using Machine learning method to estimate Air Temperature from MODIS over Berlin

    NASA Astrophysics Data System (ADS)

    Marzban, F.; Preusker, R.; Sodoudi, S.; Taheri, H.; Allahbakhshi, M.

    2015-12-01

    Land Surface Temperature (LST) is defined as the temperature of the interface between the Earth's surface and its atmosphere and thus is a critical variable for understanding land-atmosphere interactions and a key parameter in meteorological and hydrological studies involving energy fluxes. Air temperature (Tair) is one of the most important input variables in different spatially distributed hydrological and ecological models. The estimation of near-surface air temperature is useful for a wide range of applications. Some applications, such as traffic or energy management, require Tair data at high spatial and temporal resolution at two meters height above the ground (T2m), sometimes in near-real-time. Thus, a parameterization based on boundary-layer physical principles was developed that determines the air temperature from remote sensing data (MODIS). Tair is commonly obtained from synoptic measurements at weather stations. However, the derivation of near-surface air temperature from satellite-derived LST is far from straightforward. T2m is not driven directly by the sun, but indirectly by LST; thus T2m can be parameterized from LST and other variables such as albedo, NDVI, and water vapor. Most previous studies have focused on estimating T2m based on simple and advanced statistical approaches, temperature-vegetation-index and energy-balance approaches, but the main objective of this research is to explore the relationships between T2m and LST in Berlin using artificial intelligence methods, with the aim of studying key variables that allow us to establish suitable techniques to obtain Tair from satellite products and ground data. Secondly, we attempted to identify an individual mix of attributes that reveals a particular pattern, to better understand the variation of T2m during day- and nighttime over different areas of Berlin. For this reason, a three-layer feedforward neural network is considered with the LMA algorithm

  6. Discriminative clustering via extreme learning machine.

    PubMed

    Huang, Gao; Liu, Tianchi; Yang, Yan; Lin, Zhiping; Song, Shiji; Wu, Cheng

    2015-10-01

    Discriminative clustering is an unsupervised learning framework which introduces the discriminative learning rule of supervised classification into clustering. The underlying assumption is that a good partition (clustering) of the data should yield high discrimination, namely, the partitioned data can be easily classified by some classification algorithms. In this paper, we propose three discriminative clustering approaches based on Extreme Learning Machine (ELM). The first algorithm iteratively trains weighted ELM (W-ELM) classifier to gradually maximize the data discrimination. The second and third methods are both built on Fisher's Linear Discriminant Analysis (LDA); but one approach adopts alternative optimization, while the other leverages kernel k-means. We show that the proposed algorithms can be easily implemented, and yield competitive clustering accuracy on real world data sets compared to state-of-the-art clustering methods. PMID:26143036

  7. Defect classification using machine learning

    NASA Astrophysics Data System (ADS)

    Carr, Adra; Kegelmeyer, L.; Liao, Z. M.; Abdulla, G.; Cross, D.; Kegelmeyer, W. P.; Ravizza, F.; Carr, C. W.

    2008-10-01

    Laser-induced damage growth on the surface of fused silica optics has been extensively studied and has been found to depend on a number of factors, including fluence and the surface on which the damage site resides. It has been demonstrated that damage sites as small as a few tens of microns can be detected and tracked on optics installed in a fusion-class laser; however, determining the surface of an optic on which a damage site resides in situ can be a significant challenge. In this work we demonstrate that a machine-learning algorithm can successfully predict the surface location of the damage site using an expanded set of characteristics for each damage site, some of which are not historically associated with growth rate.

  8. Defect Classification Using Machine Learning

    SciTech Connect

    Carr, A; Kegelmeyer, L; Liao, Z M; Abdulla, G; Cross, D; Kegelmeyer, W P; Raviza, F; Carr, C W

    2008-10-24

    Laser-induced damage growth on the surface of fused silica optics has been extensively studied and has been found to depend on a number of factors, including fluence and the surface on which the damage site resides. It has been demonstrated that damage sites as small as a few tens of microns can be detected and tracked on optics installed in a fusion-class laser; however, determining the surface of an optic on which a damage site resides in situ can be a significant challenge. In this work we demonstrate that a machine-learning algorithm can successfully predict the surface location of the damage site using an expanded set of characteristics for each damage site, some of which are not historically associated with growth rate.

  9. Medical Dataset Classification: A Machine Learning Paradigm Integrating Particle Swarm Optimization with Extreme Learning Machine Classifier.

    PubMed

    Subbulakshmi, C V; Deepa, S N

    2015-01-01

    Medical data classification is a prime data mining problem that has been discussed for a decade and has attracted several researchers around the world. Most classifiers are designed to learn from the data itself using a training process, because complete expert knowledge to determine classifier parameters is impracticable. This paper proposes a hybrid methodology based on the machine learning paradigm. This paradigm integrates the successful exploration mechanism called the self-regulated learning capability of the particle swarm optimization (PSO) algorithm with the extreme learning machine (ELM) classifier. As a recent off-line learning method, ELM is a single-hidden-layer feedforward neural network (FFNN), proven to be an excellent classifier with a large number of hidden layer neurons. In this research, PSO is used to determine the optimum set of parameters for the ELM, thus reducing the number of hidden layer neurons, and it further improves the network generalization performance. The proposed method is evaluated on five benchmark datasets of the UCI Machine Learning Repository for handling medical dataset classification. Simulation results show that the proposed approach is able to achieve good generalization performance, compared to the results of other classifiers. PMID:26491713

  10. Medical Dataset Classification: A Machine Learning Paradigm Integrating Particle Swarm Optimization with Extreme Learning Machine Classifier

    PubMed Central

    Subbulakshmi, C. V.; Deepa, S. N.

    2015-01-01

    Medical data classification is a prime data mining problem that has been discussed for a decade and has attracted several researchers around the world. Most classifiers are designed to learn from the data itself using a training process, because complete expert knowledge to determine classifier parameters is impracticable. This paper proposes a hybrid methodology based on the machine learning paradigm. This paradigm integrates the successful exploration mechanism called the self-regulated learning capability of the particle swarm optimization (PSO) algorithm with the extreme learning machine (ELM) classifier. As a recent off-line learning method, ELM is a single-hidden-layer feedforward neural network (FFNN), proven to be an excellent classifier with a large number of hidden layer neurons. In this research, PSO is used to determine the optimum set of parameters for the ELM, thus reducing the number of hidden layer neurons, and it further improves the network generalization performance. The proposed method is evaluated on five benchmark datasets of the UCI Machine Learning Repository for handling medical dataset classification. Simulation results show that the proposed approach is able to achieve good generalization performance, compared to the results of other classifiers. PMID:26491713

  11. Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries

    PubMed Central

    Xu, Yan; Hong, Kai; Tsujii, Junichi

    2012-01-01

    Objective A system that translates narrative text in the medical domain into structured representation is in great demand. The system performs three sub-tasks: concept extraction, assertion classification, and relation identification. Design The overall system consists of five steps: (1) pre-processing sentences, (2) marking noun phrases (NPs) and adjective phrases (APs), (3) extracting concepts that use a dosage-unit dictionary to dynamically switch two models based on Conditional Random Fields (CRF), (4) classifying assertions based on voting of five classifiers, and (5) identifying relations using normalized sentences with a set of effective discriminating features. Measurements Macro-averaged and micro-averaged precision, recall and F-measure were used to evaluate results. Results The performance is competitive with the state-of-the-art systems with micro-averaged F-measure of 0.8489 for concept extraction, 0.9392 for assertion classification and 0.7326 for relation identification. Conclusions The system exploits an array of common features and achieves state-of-the-art performance. Prudent feature engineering sets the foundation of our systems. In concept extraction, we demonstrated that switching models, one of which is especially designed for telegraphic sentences, improved extraction of the treatment concept significantly. In assertion classification, a set of features derived from a rule-based classifier were proven to be effective for the classes such as conditional and possible. These classes would suffer from data scarcity in conventional machine-learning methods. In relation identification, we use two-staged architecture, the second of which applies pairwise classifiers to possible candidate classes. This architecture significantly improves performance. PMID:22586067

  12. Adaptive Learning Systems: Beyond Teaching Machines

    ERIC Educational Resources Information Center

    Kara, Nuri; Sevim, Nese

    2013-01-01

    Since 1950s, teaching machines have changed a lot. Today, we have different ideas about how people learn, what instructor should do to help students during their learning process. We have adaptive learning technologies that can create much more student oriented learning environments. The purpose of this article is to present these changes and its…

  13. Probabilistic machine learning and artificial intelligence.

    PubMed

    Ghahramani, Zoubin

    2015-05-28

    How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, has a central role in scientific data analysis, machine learning, robotics, cognitive science and artificial intelligence. This Review provides an introduction to this framework, and discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery. PMID:26017444

  14. Probabilistic machine learning and artificial intelligence

    NASA Astrophysics Data System (ADS)

    Ghahramani, Zoubin

    2015-05-01

    How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, has a central role in scientific data analysis, machine learning, robotics, cognitive science and artificial intelligence. This Review provides an introduction to this framework, and discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.

  15. Optimizing transition states via kernel-based machine learning.

    PubMed

    Pozun, Zachary D; Hansen, Katja; Sheppard, Daniel; Rupp, Matthias; Müller, Klaus-Robert; Henkelman, Graeme

    2012-05-01

    We present a method for optimizing transition state theory dividing surfaces with support vector machines. The resulting dividing surfaces require no a priori information or intuition about reaction mechanisms. To generate optimal dividing surfaces, we apply a cycle of machine-learning and refinement of the surface by molecular dynamics sampling. We demonstrate that the machine-learned surfaces contain the relevant low-energy saddle points. The mechanisms of reactions may be extracted from the machine-learned surfaces in order to identify unexpected chemically relevant processes. Furthermore, we show that the machine-learned surfaces significantly increase the transmission coefficient for an adatom exchange involving many coupled degrees of freedom on a (100) surface when compared to a distance-based dividing surface. PMID:22583204

  16. Modeling quantum physics with machine learning

    NASA Astrophysics Data System (ADS)

    Lopez-Bezanilla, Alejandro; Arsenault, Louis-Francois; Millis, Andrew; Littlewood, Peter; von Lilienfeld, Anatole

    2014-03-01

    Machine Learning (ML) is a systematic way of inferring new results from sparse information. It directly allows for the resolution of computationally expensive sets of equations by making sense of accumulated knowledge and is therefore an attractive method for providing computationally inexpensive 'solvers' for some of the important systems of condensed matter physics. In this talk a non-linear regression statistical model is introduced to demonstrate the utility of ML methods in solving quantum-physics-related problems, and is applied to the calculation of electronic transport in 1D channels. DOE contract number DE-AC02-06CH11357.
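
    The abstract names only "a non-linear regression statistical model", so kernel ridge regression with an RBF kernel is used below purely as an illustrative assumption for learning a map from system descriptors to a transport-like quantity; the descriptors and target are synthetic.

```python
# Illustrative only: kernel ridge regression standing in for the unnamed
# non-linear regression model, mapping toy descriptors of a 1D channel to a
# transport-like quantity.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 4))          # toy system descriptors
y = np.sin(3 * X[:, 0]) + 0.3 * X[:, 1] ** 2 + rng.normal(scale=0.05, size=200)

search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    {"alpha": [1e-3, 1e-2, 1e-1], "gamma": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print("best params:", search.best_params_)
print("cross-validated R^2:", round(search.best_score_, 3))
```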

  17. Machine vision systems using machine learning for industrial product inspection

    NASA Astrophysics Data System (ADS)

    Lu, Yi; Chen, Tie Q.; Chen, Jie; Zhang, Jian; Tisler, Anthony

    2002-02-01

    Machine vision inspection requires efficient processing time and accurate results. In this paper, we present a machine vision inspection architecture, SMV (Smart Machine Vision). SMV decomposes a machine vision inspection problem into two stages, Learning Inspection Features (LIF) and On-Line Inspection (OLI). The LIF is designed to learn visual inspection features from design data and/or from inspected products. During the OLI stage, the inspection system uses the knowledge learnt by the LIF component to inspect the visual features of products. In this paper we present two machine vision inspection systems developed under the SMV architecture for two different types of products, Printed Circuit Board (PCB) and Vacuum Fluorescent Display (VFD) boards. In the VFD board inspection system, the LIF component learns inspection features from a VFD board and its display patterns. In the PCB board inspection system, the LIF learns the inspection features from the CAD file of a PCB board. In both systems, the LIF component also incorporates interactive learning to make the inspection system more powerful and efficient. The VFD system has been deployed successfully in three different manufacturing companies, and the PCB inspection system is in the process of being deployed in a manufacturing plant.

  18. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Astronomy Data Centre, Canadian

    2014-01-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors, and the local outlier factor. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.
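
    The outlier-detection step mentioned above can be sketched with the nearest-neighbour-based Local Outlier Factor; the small synthetic "catalog" below stands in for the 2MASS data, and the neighbourhood size is an arbitrary choice.

```python
# Sketch of the outlier-detection step: Local Outlier Factor applied to a small
# synthetic "catalog" of objects.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
catalog = np.vstack([
    rng.normal(size=(500, 3)),              # bulk population (e.g. colours/magnitudes)
    rng.normal(loc=6.0, size=(5, 3)),       # a handful of unusual objects
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(catalog)           # -1 marks outliers
print("flagged outliers:", np.flatnonzero(labels == -1))
```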

  19. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Gray, A.

    2014-04-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors. This is likely of particular interest to the radio astronomy community given, for example, that survey projects contain groups dedicated to this topic. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex

  20. Photonic Neurocomputers And Learning Machines

    NASA Astrophysics Data System (ADS)

    Farhat, Nabil H.

    1990-05-01

    The study of complex multidimensional nonlinear dynamical systems and the modeling and emulation of cognitive brain-like processing of sensory information (neural network research), including the study of chaos and its role in such systems, would benefit immensely from the development of a new generation of programmable analog computers capable of carrying out collective, nonlinear and iterative computations at very high speed. The massive interconnectivity and nonlinearity needed in such analog computing structures indicate that a mix of optics and electronics mediated by judicious choice of device physics offers benefits for realizing networks with the following desirable properties: (a) large scale nets, i.e. nets with a high number of decision making elements (neurons), (b) modifiable structure, i.e. ability to partition the net into any desired number of layers of prescribed size (number of neurons per layer) with any prescribed pattern of communication between them (e.g. feed-forward or feedback (recurrent)), (c) programmable and/or adaptive connectivity weights between the neurons for self-organization and learning, (d) both synchronous and asynchronous update rules are possible, (e) high speed update, i.e. neurons with μsec response time to enable rapid iteration and convergence, (f) can be used in the study and evaluation of a variety of adaptive learning algorithms, (g) can be used in rapid solution by fast simulated annealing of complex optimization problems of the kind encountered in adaptive learning, pattern recognition, and image processing. The aim of this paper is to describe recent efforts and progress made towards achieving these desirable attributes in analog photonic (optoelectronic and/or electron optical) hardware that utilizes primarily incoherent light. A specific example, hardware implementation of a stochastic Boltzmann learning machine, is used as a vehicle for identifying generic issues and clarifying research and development areas for further

  1. Applying Machine Learning to Star Cluster Classification

    NASA Astrophysics Data System (ADS)

    Fedorenko, Kristina; Grasha, Kathryn; Calzetti, Daniela; Mahadevan, Sridhar

    2016-01-01

    Catalogs describing populations of star clusters are essential in investigating a range of important issues, from star formation to galaxy evolution. Star cluster catalogs are typically created in a two-step process: in the first step, a catalog of sources is automatically produced; in the second step, each of the extracted sources is visually inspected by 3-to-5 human classifiers and assigned a category. Classification by humans is labor-intensive and time consuming, thus it creates a bottleneck, and substantially slows down progress in star cluster research.We seek to automate the process of labeling star clusters (the second step) through applying supervised machine learning techniques. This will provide a fast, objective, and reproducible classification. Our data is HST (WFC3 and ACS) images of galaxies in the distance range of 3.5-12 Mpc, with a few thousand star clusters already classified by humans as a part of the LEGUS (Legacy ExtraGalactic UV Survey) project. The classification is based on 4 labels (Class 1 - symmetric, compact cluster; Class 2 - concentrated object with some degree of asymmetry; Class 3 - multiple peak system, diffuse; and Class 4 - spurious detection). We start by looking at basic machine learning methods such as decision trees. We then proceed to evaluate performance of more advanced techniques, focusing on convolutional neural networks and other Deep Learning methods. We analyze the results, and suggest several directions for further improvement.

  2. Galaxy morphology - An unsupervised machine learning approach

    NASA Astrophysics Data System (ADS)

    Schutter, A.; Shamir, L.

    2015-09-01

    Structural properties provide valuable information about the formation and evolution of galaxies, and are important for understanding the past, present, and future universe. Here we use an unsupervised machine learning methodology to analyze a network of similarities between galaxy morphological types, and automatically deduce a morphological sequence of galaxies. Application of the method to the EFIGI catalog shows that the morphological scheme produced by the algorithm is largely in agreement with the De Vaucouleurs system, demonstrating the ability of computer vision and machine learning methods to automatically profile galaxy morphological sequences. The unsupervised analysis method is based on comprehensive computer vision techniques that compute the visual similarities between the different morphological types. Rather than relying on human cognition, the proposed system deduces the similarities between sets of galaxy images in an automatic manner, and is therefore not limited by the number of galaxies being analyzed. The source code of the method is publicly available, and the protocol of the experiment is included in the paper so that the experiment can be replicated, and the method can be used to analyze user-defined datasets of galaxy images.

  3. Multistrategy machine-learning vision system

    NASA Astrophysics Data System (ADS)

    Roberts, Barry A.

    1993-04-01

    Advances in the field of machine learning technology have yielded learning techniques with solid theoretical foundations that are applicable to the problems being encountered by object recognition systems. At Honeywell an object recognition system that works with high-level, symbolic, object features is under development. This system, named object recognition accomplished through combined learning expertise (ORACLE), employs both an inductive learning technique (i.e., conceptual clustering, CC) and a deductive technique (i.e., explanation-based learning, EBL) that are combined in a synergistic manner. This paper provides an overview of the ORACLE system, describes the machine learning mechanisms (EBL and CC) that it employs, and provides example results of system operation. The paper emphasizes the beneficial effect of integrating machine learning into object recognition systems.

  4. Machine learning in soil classification.

    PubMed

    Bhattacharya, B; Solomatine, D P

    2006-03-01

    In a number of engineering problems, e.g. in geotechnics or petroleum engineering, intervals of measured series data (signals) are to be attributed a class while maintaining the constraint of contiguity, and standard classification methods can be inadequate. Classification in this case needs the involvement of an expert who observes the magnitude and trends of the signals in addition to any a priori information that might be available. In this paper, an approach for automating this classification procedure is presented. Firstly, a segmentation algorithm is developed and applied to segment the measured signals. Secondly, the salient features of these segments are extracted using the boundary energy method. Classifiers are then built, based on the measured data and extracted features, to assign classes to the segments; they employ Decision Trees, ANN and Support Vector Machines. The methodology was tested in classifying sub-surface soil using measured data from Cone Penetration Testing, and satisfactory results were obtained. PMID:16530382

  5. Machine Learning and Cosmological Simulations

    NASA Astrophysics Data System (ADS)

    Kamdar, Harshil; Turk, Matthew; Brunner, Robert

    2016-01-01

    We explore the application of machine learning (ML) to the problem of galaxy formation and evolution in a hierarchical universe. Our motivations are two-fold: (1) presenting a new, promising technique to study galaxy formation, and (2) quantitatively evaluating the extent of the influence of dark matter halo properties on small-scale structure formation. For our analyses, we use both semi-analytical models (Millennium simulation) and N-body + hydrodynamical simulations (Illustris simulation). The ML algorithms are trained on important dark matter halo properties (inputs) and galaxy properties (outputs). The trained models are able to robustly predict the gas mass, stellar mass, black hole mass, star formation rate, $g-r$ color, and stellar metallicity. Moreover, the ML simulated galaxies obey fundamental observational constraints implying that the population of ML predicted galaxies is physically and statistically robust. Next, ML algorithms are trained on an N-body + hydrodynamical simulation and applied to an N-body only simulation (Dark Sky simulation, Illustris Dark), populating this new simulation with galaxies. We can examine how structure formation changes with different cosmological parameters and are able to mimic a full-blown hydrodynamical simulation in a computation time that is orders of magnitude smaller. We find that the set of ML simulated galaxies in Dark Sky obey the same observational constraints, further solidifying ML's place as an intriguing and promising technique in future galaxy formation studies and rapid mock galaxy catalog creation.

  6. Memristor models for machine learning.

    PubMed

    Carbajal, Juan Pablo; Dambre, Joni; Hermans, Michiel; Schrauwen, Benjamin

    2015-03-01

    In the quest for alternatives to traditional complementary metal-oxide-semiconductor (CMOS) technology, it is being suggested that digital computing efficiency and power can be improved by matching the precision to the application. Many applications do not need the high precision that is being used today. In particular, large gains in area and power efficiency could be achieved by dedicated analog realizations of approximate computing engines. In this work we explore the use of memristor networks for analog approximate computation, based on a machine learning framework called reservoir computing. Most experimental investigations on the dynamics of memristors focus on their nonvolatile behavior. Hence, the volatility that is present in the developed technologies is usually unwanted and is not included in simulation models. In contrast, in reservoir computing, volatility is not only desirable but necessary. Therefore, in this work, we propose two different ways to incorporate it into memristor simulation models. The first is an extension of Strukov's model, and the second is an equivalent Wiener model approximation. We analyze and compare the dynamical properties of these models and discuss their implications for the memory and the nonlinear processing capacity of memristor networks. Our results indicate that device variability, increasingly causing problems in traditional computer design, is an asset in the context of reservoir computing. We conclude that although both models could lead to useful memristor-based reservoir computing systems, their computational performance will differ. Therefore, experimental modeling research is required for the development of accurate volatile memristor models. PMID:25602769
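
    For readers unfamiliar with reservoir computing, the sketch below shows the generic framework the paper builds on: a fixed random dynamical system (here an ordinary tanh echo-state reservoir, not a memristor model) is driven by the input, and only a linear readout is trained. The task, sizes and scaling factors are arbitrary choices for illustration.

        # Generic reservoir-computing readout sketch (NumPy only).
        import numpy as np
        from numpy.linalg import lstsq

        rng = np.random.default_rng(0)
        T, n_res = 2000, 100
        u = rng.uniform(-1, 1, T)                     # input signal
        target = np.roll(u, 3)                        # toy task: recall the input 3 steps back

        W_in = rng.uniform(-0.5, 0.5, n_res)
        W = rng.normal(0, 1, (n_res, n_res))
        W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1 (fading memory)

        x = np.zeros(n_res)
        states = np.zeros((T, n_res))
        for t in range(T):
            x = np.tanh(W @ x + W_in * u[t])          # volatile, fading-memory dynamics
            states[t] = x

        washout = 100                                  # discard the initial transient
        W_out, *_ = lstsq(states[washout:], target[washout:], rcond=None)
        pred = states[washout:] @ W_out
        print("readout MSE:", np.mean((pred - target[washout:]) ** 2))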

  7. Extreme Learning Machine for Multilayer Perceptron.

    PubMed

    Tang, Jiexiong; Deng, Chenwei; Huang, Guang-Bin

    2016-04-01

    Extreme learning machine (ELM) is an emerging learning algorithm for the generalized single hidden layer feedforward neural networks, of which the hidden node parameters are randomly generated and the output weights are analytically computed. However, due to its shallow architecture, feature learning using ELM may not be effective for natural signals (e.g., images/videos), even with a large number of hidden nodes. To address this issue, in this paper, a new ELM-based hierarchical learning framework is proposed for multilayer perceptron. The proposed architecture is divided into two main components: 1) self-taught feature extraction followed by supervised feature classification, and 2) these components are bridged by randomly initialized hidden weights. The novelties of this paper are as follows: 1) unsupervised multilayer encoding is conducted for feature extraction, and an ELM-based sparse autoencoder is developed via an l1 constraint. By doing so, it achieves more compact and meaningful feature representations than the original ELM; 2) by exploiting the advantages of ELM random feature mapping, the hierarchically encoded outputs are randomly projected before final decision making, which leads to better generalization with faster learning speed; and 3) unlike the greedy layerwise training of deep learning (DL), the hidden layers of the proposed framework are trained in a forward manner. Once the previous layer is established, the weights of the current layer are fixed without fine-tuning. Therefore, it has much better learning efficiency than DL. Extensive experiments on various widely used classification data sets show that the proposed algorithm achieves better and faster convergence than the existing state-of-the-art hierarchical learning methods. Furthermore, multiple applications in computer vision further confirm the generality and capability of the proposed learning scheme. PMID:25966483
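
    A minimal sketch of the basic single-hidden-layer ELM that the hierarchical framework builds on (not the full H-ELM): input weights are random and never tuned, and the output weights are obtained analytically with a pseudoinverse. Hidden-layer size and weight scales are arbitrary choices.

        # Basic ELM classifier sketch (NumPy, with scikit-learn for data only).
        import numpy as np
        from sklearn.datasets import load_digits
        from sklearn.model_selection import train_test_split

        X, y = load_digits(return_X_y=True)
        Y = np.eye(10)[y]                               # one-hot targets
        X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

        rng = np.random.default_rng(0)
        n_hidden = 500
        W = rng.normal(0, 0.1, (X.shape[1], n_hidden))  # random input weights, never tuned
        b = rng.normal(0, 0.1, n_hidden)

        H_tr = np.tanh(X_tr @ W + b)                    # random feature mapping
        beta = np.linalg.pinv(H_tr) @ Y_tr              # analytic output weights

        H_te = np.tanh(X_te @ W + b)
        acc = (np.argmax(H_te @ beta, axis=1) == np.argmax(Y_te, axis=1)).mean()
        print("ELM test accuracy:", acc)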

  8. Classification of collective behavior: a comparison of tracking and machine learning methods to study the effect of ambient light on fish shoaling.

    PubMed

    Butail, Sachit; Salerno, Philip; Bollt, Erik M; Porfiri, Maurizio

    2015-12-01

    Traditional approaches for the analysis of collective behavior entail digitizing the position of each individual, followed by evaluation of pertinent group observables, such as cohesion and polarization. Machine learning may enable considerable advancements in this area by affording the classification of these observables directly from images. While such methods have been successfully implemented in the classification of individual behavior, their potential in the study of collective behavior is largely untested. In this paper, we compare three methods for the analysis of collective behavior: simple tracking (ST) without resolving occlusions, machine learning with real data (MLR), and machine learning with synthetic data (MLS). These methods are evaluated on videos recorded from an experiment studying the effect of ambient light on the shoaling tendency of Giant danios. In particular, we compute the average nearest-neighbor distance (ANND) and polarization using the three methods and compare the values with manually-verified ground-truth data. To further assess possible dependence on sampling rate for computing ANND, the comparison is also performed at a low frame rate. Results show that while ST is the most accurate at the higher frame rate for both ANND and polarization, at the low frame rate there is no significant difference in ANND accuracy between the three methods. In terms of computational speed, MLR and MLS take significantly less time to process an image, with MLS better addressing constraints related to the generation of training data. Finally, all methods are able to successfully detect a significant difference in ANND as the ambient light intensity is varied, irrespective of the direction of intensity change. PMID:25294042
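
    The two group observables named above can be stated precisely; the sketch below computes both from tracked positions and headings, using synthetic coordinates in place of the digitized fish trajectories.

        # ANND and polarization from positions/headings (NumPy + SciPy); data are synthetic.
        import numpy as np
        from scipy.spatial.distance import cdist

        rng = np.random.default_rng(0)
        pos = rng.uniform(0, 1, (30, 2))                 # 30 individuals, 2-D positions
        headings = rng.uniform(0, 2 * np.pi, 30)         # heading angles in radians

        d = cdist(pos, pos)
        np.fill_diagonal(d, np.inf)                      # ignore self-distances
        annd = d.min(axis=1).mean()                      # average nearest-neighbour distance

        unit = np.column_stack([np.cos(headings), np.sin(headings)])
        polarization = np.linalg.norm(unit.mean(axis=0)) # 0 = disordered, 1 = fully aligned

        print(f"ANND = {annd:.3f}, polarization = {polarization:.3f}")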

  9. Alternating minimization and Boltzmann machine learning.

    PubMed

    Byrne, W

    1992-01-01

    Training a Boltzmann machine with hidden units is appropriately treated in information geometry using the information divergence and the technique of alternating minimization. The resulting algorithm is shown to be closely related to gradient descent Boltzmann machine learning rules, and the close relationship of both to the EM algorithm is described. An iterative proportional fitting procedure for training machines without hidden units is described and incorporated into the alternating minimization algorithm. PMID:18276461

  10. Predicting fecal sources in waters with diverse pollution loads using general and molecular host-specific indicators and applying machine learning methods.

    PubMed

    Casanovas-Massana, Arnau; Gómez-Doñate, Marta; Sánchez, David; Belanche-Muñoz, Lluís A; Muniesa, Maite; Blanch, Anicet R

    2015-03-15

    In this study we use machine learning software (Ichnaea) to generate predictive models for water samples with different concentrations of fecal contamination (point source, moderate and low). We applied several MST methods (host-specific Bacteroides phages, mitochondrial DNA genetic markers, Bifidobacterium adolescentis and Bifidobacterium dentium markers, and bifidobacterial host-specific qPCR), and general indicators (Escherichia coli, enterococci and somatic coliphages) to evaluate the source of contamination in the samples. The results provided data to the Ichnaea software, which evaluated the performance of each method in the different scenarios and determined the source of the contamination. Almost all MST methods in this study correctly determined the origin of fecal contamination in point source and moderate concentration samples. When the dilution of the fecal pollution increased (below 3 log10 CFU E. coli/100 ml), some of these indicators (bifidobacterial host-specific qPCR, some mitochondrial markers or the B. dentium marker) were not suitable because their concentrations decreased below the detection limit. Using the data from point source samples, the Ichnaea software produced models for waters with low levels of fecal pollution. These models included some MST methods, selected on the basis of their performance, that were used to determine the source of pollution in this area. Regardless of the methods selected, which may vary depending on the scenario, inductive machine learning methods are a promising tool in MST studies and may represent a leap forward in solving MST cases. PMID:25585145

  11. On machine learning classification of otoneurological data.

    PubMed

    Juhola, Martti

    2008-01-01

    A dataset including cases of six otoneurological diseases was analysed using machine learning methods to investigate the classification problem of these diseases and to compare the effectiveness of different methods for this data. Linear discriminant analysis was the best method, followed by multilayer perceptron neural networks, provided that the data were input into the network in the form of principal components. Nearest neighbour searching, k-means clustering and Kohonen neural networks achieved almost as good results as the former, while decision trees performed slightly worse. Thus, these methods fared well, but the Naïve Bayes rule could not be used since some data matrices were singular. Otoneurological cases subject to the six diseases given can be reliably distinguished. PMID:18487733

  12. Development of E-Learning Materials for Machining Safety Education

    NASA Astrophysics Data System (ADS)

    Nakazawa, Tsuyoshi; Mita, Sumiyoshi; Matsubara, Masaaki; Takashima, Takeo; Tanaka, Koichi; Izawa, Satoru; Kawamura, Takashi

    We developed two e-learning materials for Manufacturing Practice safety education: movie learning materials and hazard-detection learning materials. The movie learning materials use video and sound to teach students how to operate machines safely, which raises the effectiveness of preparation and review for Manufacturing Practice and helps students carry out operations safely. The hazard-detection learning materials let students apply knowledge learned in lectures to the detection of hazards during machine operation and teach study methods for hazard detection. In particular, the hazard-detection learning materials raise students' safety consciousness and increase their comprehension of knowledge from lectures and of operations during Manufacturing Practice.

  13. Machine Learning for Biomedical Literature Triage

    PubMed Central

    Almeida, Hayda; Meurs, Marie-Jean; Kosseim, Leila; Butler, Greg; Tsang, Adrian

    2014-01-01

    This paper presents a machine learning system for supporting the first task of the biological literature manual curation process, called triage. We compare the performance of various classification models, by experimenting with dataset sampling factors and a set of features, as well as three different machine learning algorithms (Naive Bayes, Support Vector Machine and Logistic Model Trees). The results show that the most fitting model to handle the imbalanced datasets of the triage classification task is obtained by using domain relevant features, an under-sampling technique, and the Logistic Model Trees algorithm. PMID:25551575

  14. Photometric Supernova Classification with Machine Learning

    NASA Astrophysics Data System (ADS)

    Lochner, Michelle; McEwen, Jason D.; Peiris, Hiranya V.; Lahav, Ofer; Winter, Max K.

    2016-08-01

    Automated photometric supernova classification has become an active area of research in recent years in light of current and upcoming imaging surveys such as the Dark Energy Survey (DES) and the Large Synoptic Survey Telescope, given that spectroscopic confirmation of type for all supernovae discovered will be impossible. Here, we develop a multi-faceted classification pipeline, combining existing and new approaches. Our pipeline consists of two stages: extracting descriptive features from the light curves and classification using a machine learning algorithm. Our feature extraction methods vary from model-dependent techniques, namely SALT2 fits, to more independent techniques that fit parametric models to curves, to a completely model-independent wavelet approach. We cover a range of representative machine learning algorithms, including naive Bayes, k-nearest neighbors, support vector machines, artificial neural networks, and boosted decision trees (BDTs). We test the pipeline on simulated multi-band DES light curves from the Supernova Photometric Classification Challenge. Using the commonly used area under the curve (AUC) of the Receiver Operating Characteristic as a metric, we find that the SALT2 fits and the wavelet approach, with the BDTs algorithm, each achieve an AUC of 0.98, where 1 represents perfect classification. We find that a representative training set is essential for good classification, whatever the feature set or algorithm, with implications for spectroscopic follow-up. Importantly, we find that by using either the SALT2 or the wavelet feature sets with a BDT algorithm, accurate classification is possible purely from light curve data, without the need for any redshift information.
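
    A minimal sketch of the second pipeline stage only: given per-object light-curve features (synthetic stand-ins below rather than SALT2 or wavelet coefficients), train a boosted-decision-tree classifier and evaluate it with the ROC AUC, the metric quoted above.

        # Boosted trees + AUC sketch (scikit-learn); features and labels are synthetic.
        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import roc_auc_score

        # stand-in for extracted light-curve features; y = 1 for type Ia (toy labelling)
        X, y = make_classification(n_samples=3000, n_features=20, weights=[0.75], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        bdt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
        scores = bdt.predict_proba(X_te)[:, 1]
        print("AUC:", roc_auc_score(y_te, scores))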

  15. Coupling machine learning methods with wavelet transforms and the bootstrap and boosting ensemble approaches for drought prediction

    NASA Astrophysics Data System (ADS)

    Belayneh, A.; Adamowski, J.; Khalil, B.; Quilty, J.

    2016-05-01

    This study explored the ability of coupled machine learning models and ensemble techniques to predict drought conditions in the Awash River Basin of Ethiopia. The potential of wavelet transforms coupled with the bootstrap and boosting ensemble techniques to develop reliable artificial neural network (ANN) and support vector regression (SVR) models was explored in this study for drought prediction. Wavelet analysis was used as a pre-processing tool and was shown to improve drought predictions. The Standardized Precipitation Index (SPI) (in this case SPI 3, SPI 12 and SPI 24) is a meteorological drought index that was forecasted using the aforementioned models and these SPI values represent short and long-term drought conditions. The performances of all models were compared using RMSE, MAE, and R2. The prediction results indicated that the use of the boosting ensemble technique consistently improved the correlation between observed and predicted SPIs. In addition, the use of wavelet analysis improved the prediction results of all models. Overall, the wavelet boosting ANN (WBS-ANN) and wavelet boosting SVR (WBS-SVR) models provided better prediction results compared to the other model types evaluated.
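
    A rough sketch of the wavelet-coupling idea under stated assumptions: a toy SPI-like series is decomposed with a discrete wavelet transform, the per-level reconstructions serve as predictors, and an SVR forecasts the next value. The bootstrap/boosting ensembling and the specific SPI time scales of the study are omitted for brevity.

        # Wavelet-preprocessed SVR forecast sketch (PyWavelets + scikit-learn); data are synthetic.
        import numpy as np
        import pywt
        from sklearn.svm import SVR

        rng = np.random.default_rng(0)
        t = np.arange(600)
        spi = np.sin(2 * np.pi * t / 48) + 0.3 * rng.normal(size=t.size)   # toy drought-index series

        # decompose, then reconstruct each wavelet level separately as an input feature
        coeffs = pywt.wavedec(spi, "db4", level=3)
        features = []
        for i in range(len(coeffs)):
            kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
            features.append(pywt.waverec(kept, "db4")[:t.size])
        X = np.column_stack(features)[:-1]               # wavelet components at time t
        y = spi[1:]                                      # index value at time t+1

        split = 500
        model = SVR(C=10.0).fit(X[:split], y[:split])
        print("test R^2:", model.score(X[split:], y[split:]))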

  16. A method for the evaluation of image quality according to the recognition effectiveness of objects in the optical remote sensing image using machine learning algorithm.

    PubMed

    Yuan, Tao; Zheng, Xinqi; Hu, Xuan; Zhou, Wei; Wang, Wei

    2014-01-01

    Objective and effective image quality assessment (IQA) is directly related to the application of optical remote sensing images (ORSI). In this study, a new IQA method of standardizing the target object recognition rate (ORR) is presented to reflect quality. First, several quality degradation treatments with high-resolution ORSIs are implemented to model the ORSIs obtained in different imaging conditions; then, a machine learning algorithm is adopted for recognition experiments on a chosen target object to obtain ORRs; finally, a comparison with commonly used IQA indicators was performed to reveal their applicability and limitations. The results showed that the ORR of the original ORSI was calculated to be up to 81.95%, whereas the ORR ratios of the quality-degraded images to the original images were 65.52%, 64.58%, 71.21%, and 73.11%. The results show that these data can more accurately reflect the advantages and disadvantages of different images in object identification and information extraction when compared with conventional digital image assessment indexes. By recognizing the difference in image quality from the application effect perspective, using a machine learning algorithm to extract regional gray scale features of typical objects in the image for analysis, and quantitatively assessing quality of ORSI according to the difference, this method provides a new approach for objective ORSI assessment. PMID:24489739

  17. A Machine Learning Based Framework for Adaptive Mobile Learning

    NASA Astrophysics Data System (ADS)

    Al-Hmouz, Ahmed; Shen, Jun; Yan, Jun

    Advances in wireless technology and handheld devices have created significant interest in mobile learning (m-learning) in recent years. Students nowadays are able to learn anywhere and at any time. Mobile learning environments must also cater for different user preferences and various devices with limited capability, where not all of the information is relevant and critical to each learning environment. To address this issue, this paper presents a framework that depicts the process of adapting learning content to satisfy individual learner characteristics by taking into consideration his/her learning style. We use a machine learning based algorithm for acquiring, representing, storing, reasoning about and updating each learner's acquired profile.

  18. Machine Learning for Biological Trajectory Classification Applications

    NASA Technical Reports Server (NTRS)

    Sbalzarini, Ivo F.; Theriot, Julie; Koumoutsakos, Petros

    2002-01-01

    Machine-learning techniques, including clustering algorithms, support vector machines and hidden Markov models, are applied to the task of classifying trajectories of moving keratocyte cells. The different algorithms are compared to each other as well as to expert and non-expert test persons, using concepts from signal-detection theory. The algorithms performed very well as compared to humans, suggesting a robust tool for trajectory classification in biological applications.

  19. Extreme Learning Machines for spatial environmental data

    NASA Astrophysics Data System (ADS)

    Leuenberger, Michael; Kanevski, Mikhail

    2015-12-01

    The use of machine learning algorithms has increased in a wide variety of domains (from finance to biocomputing and astronomy), and nowadays has a significant impact on the geoscience community. In most real cases geoscience data modelling problems are multivariate, high dimensional, variable at several spatial scales, and are generated by non-linear processes. For such complex data, the spatial prediction of continuous (or categorical) variables is a challenging task. The aim of this paper is to investigate the potential of the recently developed Extreme Learning Machine (ELM) for environmental data analysis, modelling and spatial prediction purposes. An important contribution of this study deals with an application of a generic self-consistent methodology for environmental data driven modelling based on Extreme Learning Machine. Both real and simulated data are used to demonstrate applicability of ELM at different stages of the study to understand and justify the results.

  20. Introduction to machine learning for brain imaging.

    PubMed

    Lemm, Steven; Blankertz, Benjamin; Dickhaus, Thorsten; Müller, Klaus-Robert

    2011-05-15

    Machine learning and pattern recognition algorithms have in the past years developed to become a workhorse in brain imaging and the computational neurosciences, as they are instrumental for mining vast amounts of neural data of ever increasing measurement precision and detecting minuscule signals from an overwhelming noise floor. They provide the means to decode and characterize task-relevant brain states and to distinguish them from non-informative brain signals. While undoubtedly this machinery has helped to gain novel biological insights, it also holds the danger of potential unintentional abuse. Ideally, machine learning techniques should be usable by any non-expert; unfortunately, they typically are not. Overfitting and other pitfalls may occur and lead to spurious and nonsensical interpretation. The goal of this review is therefore to provide an accessible and clear introduction to the strengths and also the inherent dangers of machine learning usage in the neurosciences. PMID:21172442

  1. A study of the effectiveness of machine learning methods for classification of clinical interview fragments into a large number of categories.

    PubMed

    Hasan, Mehedi; Kotov, Alexander; Idalski Carcone, April; Dong, Ming; Naar, Sylvie; Brogan Hartlieb, Kathryn

    2016-08-01

    This study examines the effectiveness of state-of-the-art supervised machine learning methods in conjunction with different feature types for the task of automatic annotation of fragments of clinical text based on codebooks with a large number of categories. We used a collection of motivational interview transcripts consisting of 11,353 utterances, which were manually annotated by two human coders as the gold standard, and experimented with state-of-the-art classifiers, including Naïve Bayes, J48 Decision Tree, Support Vector Machine (SVM), Random Forest (RF), AdaBoost, DiscLDA, Conditional Random Fields (CRF) and Convolutional Neural Network (CNN) in conjunction with lexical, contextual (label of the previous utterance) and semantic (distribution of words in the utterance across the Linguistic Inquiry and Word Count dictionaries) features. We found that, when the number of classes is large, the performance of CNN and CRF is inferior to SVM. When only lexical features were used, interview transcripts were automatically annotated by SVM with the highest classification accuracy among all classifiers of 70.8%, 61% and 53.7% based on the codebooks consisting of 17, 20 and 41 codes, respectively. Using contextual and semantic features, as well as their combination, in addition to lexical ones, improved the accuracy of SVM for annotation of utterances in motivational interview transcripts with a codebook consisting of 17 classes to 71.5%, 74.2%, and 75.1%, respectively. Our results demonstrate the potential of using machine learning methods in conjunction with lexical, semantic and contextual features for automatic annotation of clinical interview transcripts with near-human accuracy. PMID:27185608
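
    A minimal sketch of the lexical-features-plus-SVM configuration: bag-of-words TF-IDF features feeding a linear SVM. The utterances and codes below are invented placeholders, not data or labels from the study.

        # TF-IDF + linear SVM utterance classifier sketch (scikit-learn).
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC
        from sklearn.pipeline import make_pipeline

        utterances = ["I really want to cut down on sugary drinks",
                      "I don't think I can change right now",
                      "My mom keeps reminding me to exercise",
                      "I could try walking to school twice a week"]
        codes = ["change_talk", "sustain_talk", "other", "change_talk"]   # toy codebook

        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
        clf.fit(utterances, codes)
        print(clf.predict(["Maybe I will start drinking water instead"]))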

  2. Learning in brains and machines.

    PubMed

    Poggio, T; Shelton, C R

    2000-01-01

    The problem of learning is arguably at the very core of the problem of intelligence, both biological and artificial. In this paper we sketch some of our work over the last ten years in the area of supervised learning, focusing on three interlinked directions of research: theory, engineering applications (that is, making intelligent software) and neuroscience (that is, understanding the brain's mechanisms of learning). PMID:11198239

  3. Machine Learning for Dynamical Mean Field Theory

    NASA Astrophysics Data System (ADS)

    Arsenault, Louis-Francois; Lopez-Bezanilla, Alejandro; von Lilienfeld, O. Anatole; Littlewood, P. B.; Millis, Andy

    2014-03-01

    Machine Learning (ML), an approach that infers new results from accumulated knowledge, is in use for a variety of tasks ranging from face and voice recognition to internet searching and has recently been gaining increasing importance in chemistry and physics. In this talk, we investigate the possibility of using ML to solve the equations of dynamical mean field theory, which otherwise require the (numerically very expensive) solution of a quantum impurity model. Our ML scheme requires learning the relation between two functions: the hybridization function describing the bare (local) electronic structure of a material and the self-energy describing the many-body physics. We discuss the parameterization of the two functions for the exact diagonalization solver and present examples, beginning with the Anderson Impurity model with a fixed bath density of states, demonstrating the advantages and the pitfalls of the method. DOE contract DE-AC02-06CH11357.

  4. Machine learning in cell biology - teaching computers to recognize phenotypes.

    PubMed

    Sommer, Christoph; Gerlich, Daniel W

    2013-12-15

    Recent advances in microscope automation provide new opportunities for high-throughput cell biology, such as image-based screening. Highly complex image analysis tasks often make the implementation of static and predefined processing rules a cumbersome effort. Machine-learning methods, instead, seek to use intrinsic data structure, as well as the expert annotations of biologists to infer models that can be used to solve versatile data analysis tasks. Here, we explain how machine-learning methods work and what needs to be considered for their successful application in cell biology. We outline how microscopy images can be converted into a data representation suitable for machine learning, and then introduce various state-of-the-art machine-learning algorithms, highlighting recent applications in image-based screening. Our Commentary aims to provide the biologist with a guide to the application of machine learning to microscopy assays and we therefore include extensive discussion on how to optimize experimental workflow as well as the data analysis pipeline. PMID:24259662

  5. 3D Visualization of Machine Learning Algorithms with Astronomical Data

    NASA Astrophysics Data System (ADS)

    Kent, Brian R.

    2016-01-01

    We present innovative machine learning (ML) methods using unsupervised clustering with minimum spanning trees (MSTs) to study 3D astronomical catalogs. Utilizing Python code to build trees based on galaxy catalogs, we can render the results with the visualization suite Blender to produce interactive 360 degree panoramic videos. The catalogs and their ML results can be explored in a 3D space using mobile devices, tablets or desktop browsers. We compare the statistics of the MST results to a number of machine learning methods relating to optimization and efficiency.

  6. Distributed fuzzy learning using the MULTISOFT machine.

    PubMed

    Russo, M

    2001-01-01

    Describes PARGEFREX, a distributed approach to genetic-neuro-fuzzy learning which has been implemented using the MULTISOFT machine, a low-cost parallel machine built from personal computers at the University of Messina. The performance of the serial version is hugely enhanced with the simple parallelization scheme described in the paper. Once a learning dataset is fixed, there is a very high superlinear speedup in the average time needed to reach a prefixed learning error, i.e., if the number of personal computers increases by n times, the mean learning time becomes less than 1/n of the original. PMID:18249882

  7. Accurate Identification of Cancerlectins through Hybrid Machine Learning Technology

    PubMed Central

    Ju, Ying

    2016-01-01

    Cancerlectins are cancer-related proteins that function as lectins. They have been identified through computational identification techniques, but these techniques have sometimes failed to identify proteins because of sequence diversity among the cancerlectins. Advanced machine learning identification methods, such as support vector machine and basic sequence features (n-gram), have also been used to identify cancerlectins. In this study, various protein fingerprint features and advanced classifiers, including ensemble learning techniques, were utilized to identify this group of proteins. We improved the prediction accuracy of the original feature extraction methods and classification algorithms by more than 10% on average. Our work provides a basis for the computational identification of cancerlectins and reveals the power of hybrid machine learning techniques in computational proteomics. PMID:27478823

  8. Machine Learning Toolkit for Extreme Scale

    SciTech Connect

    2014-03-31

    Support Vector Machines (SVM) is a popular machine learning technique, which has been applied to a wide range of domains such as science, finance, and social networks for supervised learning. MaTEx undertakes the challenge of designing a scalable parallel SVM training algorithm for large scale systems, which include commodity multi-core machines, tightly connected supercomputers and cloud computing systems. Several techniques are proposed for improved speed and memory space usage, including adaptive and aggressive elimination of samples for faster convergence, and sparse format representation of data samples. Several heuristics, ranging from earliest-possible to lazy elimination of non-contributing samples, are considered in MaTEx. In many cases, where an early sample elimination might result in a false positive, low overhead mechanisms for reconstruction of key data structures are proposed. The proposed algorithm and heuristics are implemented and evaluated on various publicly available datasets.

  9. Machine Learning Toolkit for Extreme Scale

    Energy Science and Technology Software Center (ESTSC)

    2014-03-31

    Support Vector Machines (SVM) is a popular machine learning technique, which has been applied to a wide range of domains such as science, finance, and social networks for supervised learning. MaTEx undertakes the challenge of designing a scalable parallel SVM training algorithm for large scale systems, which include commodity multi-core machines, tightly connected supercomputers and cloud computing systems. Several techniques are proposed for improved speed and memory space usage, including adaptive and aggressive elimination of samples for faster convergence, and sparse format representation of data samples. Several heuristics, ranging from earliest-possible to lazy elimination of non-contributing samples, are considered in MaTEx. In many cases, where an early sample elimination might result in a false positive, low overhead mechanisms for reconstruction of key data structures are proposed. The proposed algorithm and heuristics are implemented and evaluated on various publicly available datasets.

  10. Hypervelocity cutting machine and method

    DOEpatents

    Powell, J.R.; Reich, M.

    1996-11-12

    A method and machine are provided for cutting a workpiece such as concrete. A gun barrel is provided for repetitively loading projectiles therein and is supplied with a pressurized propellant from a storage tank. A thermal storage tank is disposed between the propellant storage tank and the gun barrel for repetitively receiving and heating propellant charges which are released in the gun barrel for repetitively firing projectiles therefrom toward the workpiece. In a preferred embodiment, hypervelocity of the projectiles is obtained for cutting the concrete workpiece by fracturing thereof. 10 figs.

  11. Hypervelocity cutting machine and method

    DOEpatents

    Powell, James R.; Reich, Morris

    1996-11-12

    A method and machine 14 are provided for cutting a workpiece 12 such as concrete. A gun barrel 16 is provided for repetitively loading projectiles 22 therein and is supplied with a pressurized propellant from a storage tank 28. A thermal storage tank 32,32A is disposed between the propellant storage tank 28 and the gun barrel 16 for repetitively receiving and heating propellant charges which are released in the gun barrel 16 for repetitively firing projectiles 22 therefrom toward the workpiece 12. In a preferred embodiment, hypervelocity of the projectiles 22 is obtained for cutting the concrete workpiece 12 by fracturing thereof.

  12. Movement error rate for evaluation of machine learning methods for sEMG-based hand movement classification.

    PubMed

    Gijsberts, Arjan; Atzori, Manfredo; Castellini, Claudio; Muller, Henning; Caputo, Barbara

    2014-07-01

    There has been increasing interest in applying learning algorithms to improve the dexterity of myoelectric prostheses. In this work, we present a large-scale benchmark evaluation on the second iteration of the publicly released NinaPro database, which contains surface electromyography data for 6 DOF force activations as well as for 40 discrete hand movements. The evaluation involves a modern kernel method and compares performance of three feature representations and three kernel functions. Both the force regression and movement classification problems can be learned successfully when using a nonlinear kernel function, while the exp-χ² kernel outperforms the more popular radial basis function kernel in all cases. Furthermore, combining surface electromyography and accelerometry in a multimodal classifier results in significant increases in accuracy as compared to when either modality is used individually. Since window-based classification accuracy should not be considered in isolation to estimate prosthetic controllability, we also provide results in terms of classification mistakes and prediction delay. To this end, we propose the movement error rate as an alternative to the standard window-based accuracy. This error rate is insensitive to prediction delays and it allows us therefore to quantify mistakes and delays as independent performance characteristics. This type of analysis confirms that the inclusion of accelerometry is superior, as it results in fewer mistakes while at the same time reducing prediction delay. PMID:24760932

  13. Using Simple Machines to Leverage Learning

    ERIC Educational Resources Information Center

    Dotger, Sharon

    2008-01-01

    What would your students say if you told them they could lift you off the ground using a block and a board? Using a simple machine, they'll find out they can, and they'll learn about work, energy, and motion in the process! In addition, this integrated lesson gives students the opportunity to investigate variables while practicing measurement…

  14. Machine learning for real time remote detection

    NASA Astrophysics Data System (ADS)

    Labbé, Benjamin; Fournier, Jérôme; Henaff, Gilles; Bascle, Bénédicte; Canu, Stéphane

    2010-10-01

    Infrared systems are key to providing enhanced capability to military forces such as automatic control of threats and prevention from air, naval and ground attacks. Key requirements for such a system to produce operational benefits are real-time processing as well as high efficiency in terms of detection and false alarm rate. These are serious issues since the system must deal with a large number of objects and categories to be recognized (small vehicles, armored vehicles, planes, buildings, etc.). Statistical learning based algorithms are promising candidates to meet these requirements when using selected discriminant features and real-time implementation. This paper proposes a new decision architecture benefiting from recent advances in machine learning by using an effective method for level set estimation. While building the decision function, the proposed approach performs variable selection based on a discriminative criterion. Moreover, the use of level sets makes it possible to manage rejection of unknown or ambiguous objects, thus preserving the false alarm rate. Experimental evidence reported on real-world infrared images demonstrates the validity of our approach.

  15. Application of advanced machine learning methods on resting-state fMRI network for identification of mild cognitive impairment and Alzheimer's disease.

    PubMed

    Khazaee, Ali; Ebrahimzadeh, Ata; Babajani-Feremi, Abbas

    2016-09-01

    The study of brain networks by resting-state functional magnetic resonance imaging (rs-fMRI) is a promising method for identifying patients with dementia from healthy controls (HC). Using graph theory, different aspects of the brain network can be efficiently characterized by calculating measures of integration and segregation. In this study, we combined a graph theoretical approach with advanced machine learning methods to study the brain network in 89 patients with mild cognitive impairment (MCI), 34 patients with Alzheimer's disease (AD), and 45 age-matched HC. The rs-fMRI connectivity matrix was constructed using a brain parcellation based on 264 putative functional areas. Using the optimal features extracted from the graph measures, we were able to accurately classify the three groups (i.e., HC, MCI, and AD) with an accuracy of 88.4%. We also investigated the performance of our proposed method for binary classification of one group (e.g., MCI) from the two other groups (e.g., HC and AD). The classification accuracies for identifying HC from AD and MCI, AD from HC and MCI, and MCI from HC and AD were 87.3%, 97.5%, and 72.0%, respectively. In addition, results based on the parcellation of 264 regions were compared to those of the automated anatomical labeling (AAL) atlas, which consists of 90 regions. The accuracy of classification of the three groups using AAL degraded to 83.2%. Our results show that combining the graph measures with the machine learning approach, on the basis of the rs-fMRI connectivity analysis, may assist in the diagnosis of AD and MCI. PMID:26363784
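
    A sketch of the feature-extraction side of this approach: a connectivity matrix is thresholded into a graph and a few integration/segregation measures are computed, which would then feed the classifier. The matrix below is random and much smaller than the 264-region parcellation; a reasonably recent NetworkX is assumed.

        # Graph-measure features from a connectivity matrix (NumPy + NetworkX); data are synthetic.
        import numpy as np
        import networkx as nx

        rng = np.random.default_rng(0)
        n_regions = 50                                   # toy size (264 regions in the study)
        C = rng.uniform(0, 1, (n_regions, n_regions))
        C = (C + C.T) / 2                                # symmetric "connectivity" matrix
        np.fill_diagonal(C, 0)

        G = nx.from_numpy_array(C > 0.8)                 # binarize at an arbitrary threshold

        features = {
            "global_efficiency": nx.global_efficiency(G),    # integration measure
            "mean_clustering": nx.average_clustering(G),     # segregation measure
            "mean_degree": float(np.mean([d for _, d in G.degree()])),
        }
        print(features)   # one such feature vector per subject would feed an SVM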

  16. Energy landscapes for a machine learning application to series data

    NASA Astrophysics Data System (ADS)

    Ballard, Andrew J.; Stevenson, Jacob D.; Das, Ritankar; Wales, David J.

    2016-03-01

    Methods developed to explore and characterise potential energy landscapes are applied to the corresponding landscapes obtained from optimisation of a cost function in machine learning. We consider neural network predictions for the outcome of local geometry optimisation in a triatomic cluster, where four distinct local minima exist. The accuracy of the predictions is compared for fits using data from single and multiple points in the series of atomic configurations resulting from local geometry optimisation and for alternative neural networks. The machine learning solution landscapes are visualised using disconnectivity graphs, and signatures in the effective heat capacity are analysed in terms of distributions of local minima and their properties.

  17. Recognition of printed Arabic text using machine learning

    NASA Astrophysics Data System (ADS)

    Amin, Adnan

    1998-04-01

    Many papers have been concerned with the recognition of Latin, Chinese and Japanese characters. However, although almost a third of a billion people worldwide, in several different languages, use Arabic characters for writing, little research progress has been achieved towards the automatic recognition of Arabic characters, either on-line or off-line. This is a result of the lack of adequate support in terms of funding and other utilities such as Arabic text databases, dictionaries, etc., and of course of the cursive nature of its writing rules. The main theme of this paper is the automatic recognition of Arabic printed text using the machine learning algorithm C4.5. Symbolic machine learning algorithms are designed to accept example descriptions in the form of feature vectors which include a label that identifies the class to which an example belongs. The output of the algorithm is a set of rules that classifies unseen examples based on generalization from the training set. This ability to generalize is the main attraction of machine learning for handwriting recognition. Samples of a character can be preprocessed into a feature vector representation for presentation to a machine learning algorithm that creates rules for recognizing characters of the same class. Symbolic machine learning has several advantages over other learning methods. It is fast in training and in recognition, generalizes well, is noise tolerant, and the symbolic representation is easy to understand. The technique can be divided into three major steps: the first step is pre-processing, in which the original image is transformed into a binary image using a 300 dpi scanner and the connected components are formed. Second, global features of the input Arabic word are extracted, such as the number of subwords, the number of peaks within each subword, and the number and position of the complementary characters. Finally, C4.5 is used for character classification to generate a decision tree.

  18. Acceleration of saddle-point searches with machine learning.

    PubMed

    Peterson, Andrew A

    2016-08-21

    In atomistic simulations, the location of the saddle point on the potential-energy surface (PES) gives important information on transitions between local minima, for example, via transition-state theory. However, the search for saddle points often involves hundreds or thousands of ab initio force calls, which are typically all done at full accuracy. This results in the vast majority of the computational effort being spent calculating the electronic structure of states not important to the researcher, and very little time performing the calculation of the saddle point state itself. In this work, we describe how machine learning (ML) can reduce the number of intermediate ab initio calculations needed to locate saddle points. Since machine-learning models can learn from, and thus mimic, atomistic simulations, the saddle-point search can be conducted rapidly in the machine-learning representation. The saddle-point prediction can then be verified by an ab initio calculation; if it is incorrect, the verification has strategically identified regions of the PES where the machine-learning representation has insufficient training data. When these training data are used to improve the machine-learning model, the estimates greatly improve. This approach can be systematized, and in two simple example problems we demonstrate a dramatic reduction in the number of ab initio force calls. We expect that this approach and future refinements will greatly accelerate searches for saddle points, as well as other searches on the potential energy surface, as machine-learning methods see greater adoption by the atomistics community. PMID:27544086
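
    The accelerate-and-verify loop described above can be written down generically: fit a cheap surrogate to the expensive evaluations made so far, search on the surrogate, verify the proposed point with the expensive calculation, and retrain when the surrogate is wrong. In the sketch below a 1-D toy potential stands in for the ab initio PES and a plain minimum search stands in for the saddle-point search, so this only illustrates the loop, not the paper's method.

        # Surrogate-accelerated search loop sketch (scikit-learn + SciPy); toy 1-D problem.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from scipy.optimize import minimize_scalar

        def true_energy(x):                      # stand-in for an expensive ab initio call
            return np.sin(3 * x) + 0.5 * x**2

        X = list(np.linspace(-2, 2, 5))          # initial training geometries
        E = [true_energy(x) for x in X]

        for it in range(5):
            gp = GaussianProcessRegressor().fit(np.array(X)[:, None], E)
            # search on the cheap surrogate instead of the expensive function
            res = minimize_scalar(lambda x: gp.predict([[x]])[0], bounds=(-2, 2), method="bounded")
            x_new, e_true = res.x, true_energy(res.x)
            if abs(gp.predict([[x_new]])[0] - e_true) < 1e-3:
                break                            # surrogate agrees with the verification
            X.append(x_new); E.append(e_true)    # otherwise add the data point and refit

        print("converged point:", x_new, "true energy:", e_true)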

  19. Research on knowledge representation, machine learning, and knowledge acquisition

    NASA Technical Reports Server (NTRS)

    Buchanan, Bruce G.

    1987-01-01

    Research in knowledge representation, machine learning, and knowledge acquisition performed at the Knowledge Systems Laboratory is summarized. The major goal of the research was to develop flexible, effective methods for representing the qualitative knowledge necessary for solving large problems that require symbolic reasoning as well as numerical computation. The research focused on integrating different representation methods to describe different kinds of knowledge more effectively than any one method can alone. In particular, emphasis was placed on representing and using spatial information about three dimensional objects and constraints on the arrangement of these objects in space. Another major theme is the development of robust machine learning programs that can be integrated with a variety of intelligent systems. To achieve this goal, learning methods were designed, implemented and experimented with in several different problem-solving environments.

  20. Combining data mining and machine learning for effective user profiling

    SciTech Connect

    Fawcett, T.; Provost, F.

    1996-12-31

    This paper describes the automatic design of methods for detecting fraudulent behavior. Much of the design is accomplished using a series of machine learning methods. In particular, we combine data mining and constructive induction with more standard machine learning techniques to design methods for detecting fraudulent usage of cellular telephones based on profiling customer behavior. Specifically, we use a rule-learning program to uncover indicators of fraudulent behavior from a large database of cellular calls. These indicators are used to create profilers, which then serve as features to a system that combines evidence from multiple profilers to generate high-confidence alarms. Experiments indicate that this automatic approach performs nearly as well as the best hand-tuned methods for detecting fraud.

  1. Machine learning techniques for fault isolation and sensor placement

    NASA Technical Reports Server (NTRS)

    Carnes, James R.; Fisher, Douglas H.

    1993-01-01

    Fault isolation and sensor placement are vital for monitoring and diagnosis. A sensor conveys information about a system's state that guides troubleshooting if problems arise. We are using machine learning methods to uncover behavioral patterns over snapshots of system simulations that will aid fault isolation and sensor placement, with an eye towards minimality, fault coverage, and noise tolerance.

  2. Machine Learning of Maritime Fog Forecast Rules.

    NASA Astrophysics Data System (ADS)

    Tag, Paul M.; Peak, James E.

    1996-05-01

    In recent years, the field of artificial intelligence has contributed significantly to the science of meteorology, most notably in the now familiar form of expert systems. Expert systems have focused on rules or heuristics by establishing, in computer code, the reasoning process of a weather forecaster predicting, for example, thunderstorms or fog. In addition to the years of effort that go into developing such a knowledge base is the time-consuming task of extracting such knowledge and experience from experts. In this paper, the induction of rules directly from meteorological data is explored, a process called machine learning. A commercial machine learning program called C4.5 is applied to a meteorological problem, forecasting maritime fog, for which a reliable expert system has been previously developed. Two datasets are used: 1) weather ship observations originally used for testing and evaluating the expert system, and 2) buoy measurements taken off the coast of California. For both datasets, the rules produced by C4.5 are reasonable and make physical sense, thus demonstrating that an objective induction approach can reveal physical processes directly from data. For the ship database, the machine-generated rules are not as accurate as those from the expert system but are still significantly better than persistence forecasts. For the buoy data, the forecast accuracies are very high, but only slightly superior to persistence. The results indicate that the machine learning approach is a viable tool for developing meteorological expertise, but only when applied to reliable data with sufficient cases of known outcome. In those instances when such databases are available, the use of machine learning can provide useful insight that otherwise might take considerable human analysis to produce.
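
    The rule-induction step can be illustrated with a CART decision tree standing in for C4.5 (which is not available in scikit-learn); the point is that readable forecast rules fall directly out of labelled observations. The features, thresholds and labelling rule below are invented for illustration.

        # Decision-tree rule induction sketch (scikit-learn); observations are synthetic.
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier, export_text

        rng = np.random.default_rng(0)
        n = 500
        air_sea_dT = rng.normal(0, 2, n)        # air minus sea temperature (deg C)
        rel_hum = rng.uniform(50, 100, n)       # relative humidity (%)
        wind = rng.uniform(0, 15, n)            # wind speed (m/s)
        # toy labelling rule: fog when humid, stable and light wind
        fog = ((rel_hum > 90) & (air_sea_dT > 0) & (wind < 7)).astype(int)

        X = np.column_stack([air_sea_dT, rel_hum, wind])
        tree = DecisionTreeClassifier(max_depth=3).fit(X, fog)
        print(export_text(tree, feature_names=["air_sea_dT", "rel_hum", "wind"]))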

  3. Many-body physics via machine learning

    NASA Astrophysics Data System (ADS)

    Arsenault, Louis-Francois; von Lilienfeld, O. Anatole; Millis, Andrew J.

    We demonstrate a method for using machine learning (ML) to solve the equations of many-body physics, which are functional equations linking a bare to an interacting Green's function (or self-energy), offering transferable predictive power for physical quantities in both the forward and the reverse engineering problem of materials. Functions are represented by coefficients in an orthogonal polynomial expansion and kernel ridge regression is used. The method is demonstrated using as an example a database built from Dynamical Mean Field Theory (DMFT) calculations on the three-dimensional Hubbard model. We discuss the extension to a database for real materials. We also discuss new areas of investigation concerning high-throughput predictions for real materials, offering a perspective on how our scheme is general enough to apply to other problems involving the inversion of integral equations from integrated knowledge, such as the analytical continuation of the Green's function and the reconstruction of lattice structures from X-ray spectra. Office of Science of the U.S. Department of Energy under SubContract DOE No. 3F-3138 and FG-ER04169.
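
    A minimal sketch of the function-to-function regression idea stated above: input and output functions are represented by coefficients in an orthogonal (Legendre) polynomial basis, and kernel ridge regression learns the map between the coefficient vectors. The "physics" below is a made-up smooth transformation, not a Green's-function relation.

        # Legendre-coefficient representation + kernel ridge regression sketch (NumPy + scikit-learn).
        import numpy as np
        from numpy.polynomial import legendre
        from sklearn.kernel_ridge import KernelRidge

        rng = np.random.default_rng(0)
        x = np.linspace(-1, 1, 200)
        deg = 8

        def to_coeffs(f):                        # sampled function -> Legendre coefficients
            return legendre.legfit(x, f, deg)

        inputs, outputs = [], []
        for _ in range(300):
            c = rng.normal(0, 1, 4)
            f_in = legendre.legval(x, c)                      # toy "bare" function
            f_out = np.tanh(f_in) + 0.1 * f_in**2             # toy "interacting" function
            inputs.append(to_coeffs(f_in)); outputs.append(to_coeffs(f_out))

        X, Y = np.array(inputs), np.array(outputs)
        model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1).fit(X[:250], Y[:250])
        print("held-out R^2:", model.score(X[250:], Y[250:]))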

  4. Outsmarting neural networks: an alternative paradigm for machine learning

    SciTech Connect

    Protopopescu, V.; Rao, N.S.V.

    1996-10-01

    We address three problems in machine learning, namely: (i) function learning, (ii) regression estimation, and (iii) sensor fusion, in the Probably and Approximately Correct (PAC) framework. We show that, under certain conditions, one can reduce the three problems above to the regression estimation. The latter is usually tackled with artificial neural networks (ANNs) that satisfy the PAC criteria, but have high computational complexity. We propose several computationally efficient PAC alternatives to ANNs to solve the regression estimation. Thereby we also provide efficient PAC solutions to the function learning and sensor fusion problems. The approach is based on cross-fertilizing concepts and methods from statistical estimation, nonlinear algorithms, and the theory of computational complexity, and is designed as part of a new, coherent paradigm for machine learning.

  5. Machine Learning for Treatment Assignment: Improving Individualized Risk Attribution

    PubMed Central

    Weiss, Jeremy; Kuusisto, Finn; Boyd, Kendrick; Liu, Jie; Page, David

    2015-01-01

    Clinical studies model the average treatment effect (ATE), but apply this population-level effect to future individuals. Due to recent developments of machine learning algorithms with useful statistical guarantees, we argue instead for modeling the individualized treatment effect (ITE), which has better applicability to new patients. We compare ATE-estimation using randomized and observational analysis methods against ITE-estimation using machine learning, and describe how the ITE theoretically generalizes to new population distributions, whereas the ATE may not. On a synthetic data set of statin use and myocardial infarction (MI), we show that a learned ITE model improves true ITE estimation and outperforms the ATE. We additionally argue that ITE models should be learned with a consistent, nonparametric algorithm from unweighted examples and show experiments in favor of our argument using our synthetic data model and a real data set of D-penicillamine use for primary biliary cirrhosis. PMID:26958271
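
    One simple way to estimate an ITE with off-the-shelf learners is the two-model ("T-learner") construction sketched below: fit separate outcome models for the treated and control groups and take the per-individual difference of their predictions. The data are synthetic, and this is only one of several estimators consistent with the argument above.

        # T-learner ITE estimation sketch (scikit-learn); data are synthetic.
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        n = 4000
        X = rng.normal(size=(n, 5))                       # patient covariates
        t = rng.integers(0, 2, n)                         # treatment indicator
        true_ite = 0.5 * X[:, 0]                          # effect varies with one covariate
        y = X[:, 1] + t * true_ite + 0.1 * rng.normal(size=n)

        m1 = RandomForestRegressor(random_state=0).fit(X[t == 1], y[t == 1])
        m0 = RandomForestRegressor(random_state=0).fit(X[t == 0], y[t == 0])
        ite_hat = m1.predict(X) - m0.predict(X)           # individualized effect estimates

        print("ATE estimate (mean ITE):", ite_hat.mean())
        print("corr(estimated ITE, true ITE):", np.corrcoef(ite_hat, true_ite)[0, 1])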

  6. Committee of machine learning predictors of hydrological models uncertainty

    NASA Astrophysics Data System (ADS)

    Kayastha, Nagendra; Solomatine, Dimitri

    2014-05-01

    In prediction of uncertainty based on machine learning methods, the results of various sampling schemes, namely Monte Carlo sampling (MCS), generalized likelihood uncertainty estimation (GLUE), Markov chain Monte Carlo (MCMC), the shuffled complex evolution metropolis algorithm (SCEMUA), differential evolution adaptive metropolis (DREAM), particle swarm optimization (PSO) and adaptive cluster covering (ACCO) [1], are used to build predictive models. These models predict the uncertainty (quantiles of the pdf) of a deterministic output from a hydrological model [2]. Inputs to these models are specially identified representative variables (past precipitation events and flows). The trained machine learning models are then employed to predict the model output uncertainty that is specific to the new input data. For each sampling scheme, three machine learning methods, namely artificial neural networks, model trees and locally weighted regression, are applied to predict output uncertainties. The problem here is that different sampling algorithms result in different data sets used to train different machine learning models, which leads to several models (21 predictive uncertainty models). There is no clear evidence which model is the best, since there is no basis for comparison. A solution could be to form a committee of all models and to use a dynamic averaging scheme to generate the final output [3]. This approach is applied to estimate the uncertainty of streamflow simulations from a conceptual hydrological model HBV in the Nzoia catchment in Kenya. [1] N. Kayastha, D. L. Shrestha and D. P. Solomatine. Experiments with several methods of parameter uncertainty estimation in hydrological modeling. Proc. 9th Intern. Conf. on Hydroinformatics, Tianjin, China, September 2010. [2] D. L. Shrestha, N. Kayastha, D. P. Solomatine, and R. Price. Encapsulation of parametric uncertainty statistics by various predictive machine learning models: MLUE method, Journal of Hydroinformatics, in press

  7. Paradigms for Realizing Machine Learning Algorithms.

    PubMed

    Agneeswaran, Vijay Srinivas; Tonpay, Pranay; Tiwary, Jayati

    2013-12-01

    The article explains the three generations of machine learning algorithms, all of which aim to operate on big data. The first generation tools are SAS, SPSS, etc., while second generation realizations include Mahout and RapidMiner (which work over Hadoop), and the third generation paradigms include Spark and GraphLab, among others. The essence of the article is that for a number of machine learning algorithms, it is important to look beyond Hadoop's Map-Reduce paradigm in order to make them work on big data. A number of promising contenders have emerged in the third generation that can be exploited to realize deep analytics on big data. PMID:27447253

  8. Distinguishing Asthma Phenotypes Using Machine Learning Approaches.

    PubMed

    Howard, Rebecca; Rattray, Magnus; Prosperi, Mattia; Custovic, Adnan

    2015-07-01

    Asthma is not a single disease, but an umbrella term for a number of distinct diseases, each of which is caused by a distinct underlying pathophysiological mechanism. These discrete disease entities are often labelled as 'asthma endotypes'. The discovery of different asthma subtypes has moved from subjective approaches in which putative phenotypes are assigned by experts to data-driven ones which incorporate machine learning. This review focuses on the methodological developments of one such machine learning technique, latent class analysis, and how it has contributed to distinguishing asthma and wheezing subtypes in childhood. It also gives a clinical perspective, presenting the findings of studies from the past 5 years that used this approach. The identification of true asthma endotypes may be a crucial step towards understanding their distinct pathophysiological mechanisms, which could ultimately lead to more precise prevention strategies, identification of novel therapeutic targets and the development of effective personalized therapies. PMID:26143394

  9. AstroML: Python-powered Machine Learning for Astronomy

    NASA Astrophysics Data System (ADS)

    Vander Plas, Jake; Connolly, A. J.; Ivezic, Z.

    2014-01-01

    As astronomical data sets grow in size and complexity, automated machine learning and data mining methods are becoming an increasingly fundamental component of research in the field. The astroML project (http://astroML.org) provides a common repository for practical examples of the data mining and machine learning tools used and developed by astronomical researchers, written in Python. The astroML module contains a host of general-purpose data analysis and machine learning routines, loaders for openly-available astronomical datasets, and fast implementations of specific computational methods often used in astronomy and astrophysics. The associated website features hundreds of examples of these routines being used for analysis of real astronomical datasets, while the associated textbook provides a curriculum resource for graduate-level courses focusing on practical statistics, machine learning, and data mining approaches within Astronomical research. This poster will highlight several of the more powerful and unique examples of analysis performed with astroML, all of which can be reproduced in their entirety on any computer with the proper packages installed.

  10. Machine Learning and Geometric Technique for SLAM

    NASA Astrophysics Data System (ADS)

    Bernal-Marin, Miguel; Bayro-Corrochano, Eduardo

    This paper describes a new approach for building 3D geometric maps using a laser rangefinder, a stereo camera system and a mathematical framework, Conformal Geometric Algebra. The use of known visual landmarks in the map helps to carry out a good localization of the robot. A machine learning technique is used for recognition of objects in the environment. These landmarks are found using the Viola and Jones algorithm and are represented with their position in the 3D virtual map.

  11. Prototype-based models in machine learning.

    PubMed

    Biehl, Michael; Hammer, Barbara; Villmann, Thomas

    2016-01-01

    An overview is given of prototype-based models in machine learning. In this framework, observations, i.e., data, are stored in terms of typical representatives. Together with a suitable measure of similarity, the systems can be employed in the context of unsupervised and supervised analysis of potentially high-dimensional, complex datasets. We discuss basic schemes of competitive vector quantization as well as the so-called neural gas approach and Kohonen's topology-preserving self-organizing map. Supervised learning in prototype systems is exemplified in terms of learning vector quantization. Most frequently, the familiar Euclidean distance serves as a dissimilarity measure. We present extensions of the framework to nonstandard measures and give an introduction to the use of adaptive distances in relevance learning. PMID:26800334
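
    A minimal LVQ1 sketch of the supervised prototype idea described above: class-labelled prototypes are attracted toward same-class samples and repelled from other-class samples, and new points are classified by the nearest prototype under the Euclidean distance. The learning rate, epoch count and one-prototype-per-class setup are arbitrary choices for illustration.

        # LVQ1 prototype learning sketch (NumPy, with scikit-learn for toy data).
        import numpy as np
        from sklearn.datasets import make_blobs

        X, y = make_blobs(n_samples=300, centers=3, random_state=0)
        rng = np.random.default_rng(0)

        # one prototype per class, initialized at a random sample of that class
        protos = np.array([X[rng.choice(np.where(y == c)[0])] for c in range(3)])
        proto_labels = np.arange(3)

        lr = 0.05
        for epoch in range(20):
            for i in rng.permutation(len(X)):
                w = np.argmin(np.linalg.norm(protos - X[i], axis=1))  # winning prototype
                sign = 1.0 if proto_labels[w] == y[i] else -1.0       # attract or repel
                protos[w] += sign * lr * (X[i] - protos[w])

        pred = proto_labels[np.argmin(np.linalg.norm(X[:, None] - protos, axis=2), axis=1)]
        print("training accuracy:", (pred == y).mean())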

  12. Dimension Reduction With Extreme Learning Machine.

    PubMed

    Kasun, Liyanaarachchi Lekamalage Chamara; Yang, Yan; Huang, Guang-Bin; Zhang, Zhengyou

    2016-08-01

    Data may often contain noise or irrelevant information, which negatively affects the generalization capability of machine learning algorithms. The objective of dimension reduction algorithms, such as principal component analysis (PCA), non-negative matrix factorization (NMF), random projection (RP), and auto-encoder (AE), is to reduce the noise or irrelevant information of the data. The features of PCA (eigenvectors) and linear AE are not able to represent data as parts (e.g., the nose in a face image). On the other hand, NMF and non-linear AE are hampered by slow learning speed, and RP only represents a subspace of the original data. This paper introduces a dimension reduction framework which to some extent represents data as parts, has fast learning speed, and learns the between-class scatter subspace. To this end, this paper investigates a linear and non-linear dimension reduction framework referred to as extreme learning machine AE (ELM-AE) and sparse ELM-AE (SELM-AE). In contrast to tied-weight AE, the hidden neurons in ELM-AE and SELM-AE need not be tuned, and their parameters (e.g., input weights in additive neurons) are initialized using orthogonal and sparse random weights, respectively. Experimental results on the USPS handwritten digit recognition, CIFAR-10 object recognition, and NORB object recognition data sets show the efficacy of linear and non-linear ELM-AE and SELM-AE in terms of discriminative capability, sparsity, training time, and normalized mean square error. PMID:27214902
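
    As a rough illustration of the ELM-AE idea described above (randomly generated, orthogonalized hidden-layer parameters with only the output weights solved analytically), the following NumPy sketch builds a toy projection. It is a minimal sketch under assumed settings (sigmoid activation, ridge parameter, synthetic data), not the authors' implementation.

    ```python
    import numpy as np

    def elm_autoencoder(X, n_hidden=10, ridge=1e-3, seed=0):
        """Minimal ELM-AE sketch: random orthogonal input weights, analytic output weights.

        X: (n_samples, n_features) data matrix, with n_hidden <= n_features.
        Returns output weights B; X @ B.T gives a reduced representation.
        """
        rng = np.random.default_rng(seed)
        n_features = X.shape[1]

        # Random input weights and biases; the weights are orthogonalized as in ELM-AE.
        W = rng.standard_normal((n_features, n_hidden))
        W, _ = np.linalg.qr(W)            # orthonormal columns
        b = rng.standard_normal(n_hidden)

        # Hidden-layer activations (sigmoid); W and b are never tuned.
        H = 1.0 / (1.0 + np.exp(-(X @ W + b)))

        # Output weights solved in closed form (ridge-regularized least squares),
        # reconstructing the input: H @ B ≈ X.
        B = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ X)
        return B

    # Example: project 100 random 20-dimensional points down to 10 dimensions.
    X = np.random.rand(100, 20)
    B = elm_autoencoder(X, n_hidden=10)
    X_reduced = X @ B.T
    print(X_reduced.shape)
    ```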

  13. Machine learning optimization of cross docking accuracy.

    PubMed

    Bjerrum, Esben J

    2016-06-01

    Performance of small molecule automated docking programs has conceptually been divided into docking, scoring, ranking and screening power, which focus on the crystal pose prediction, affinity prediction, ligand ranking and database screening capabilities of the docking program, respectively. Benchmarks show that different docking programs can excel in individual benchmarks, which suggests that the scoring function employed by the programs can be optimized for a particular task. Here the scoring function of Smina is re-optimized towards enhancing the docking power using a supervised machine learning approach and a manually curated database of ligands and cross docking receptor pairs. The optimization method does not need associated binding data for the receptor-ligand examples used in the data set and works with small training sets. The re-optimization of the weights for the scoring function results in a similar docking performance with regard to docking power towards a cross docking test set. A ligand decoy based benchmark indicates a better discrimination between poses with high and low RMSD. The reported parameters for Smina are compatible with Autodock Vina and represent ready-to-use alternative parameters for researchers who aim at pose prediction rather than affinity prediction. PMID:27179709

  14. Machine learning: how to get more out of HEP data and the Higgs Boson Machine Learning Challenge

    NASA Astrophysics Data System (ADS)

    Wolter, Marcin

    2015-09-01

    Multivariate techniques using machine learning algorithms have become an integral part of many High Energy Physics (HEP) data analyses. The article shows the gain in physics reach of physics experiments due to the adoption of machine learning techniques. Rapid development in the field of machine learning in recent years is a challenge for the HEP community. The open competition for machine learning experts, the "Higgs Boson Machine Learning Challenge", shows that modern techniques developed outside HEP can significantly improve the analysis of data from HEP experiments and improve the sensitivity of searches for new particles and processes.

  15. Relative optical navigation around small bodies via Extreme Learning Machine

    NASA Astrophysics Data System (ADS)

    Law, Andrew M.

    To perform close proximity operations in a low-gravity environment, relative and absolute positions are vital information for the maneuver; hence navigation is inseparably integrated into space travel. Extreme Learning Machine (ELM) is presented as an optical navigation method around small celestial bodies. Optical navigation uses visual observation instruments such as a camera to acquire useful data and determine spacecraft position. The required input data for operation are merely a single image strip and a nadir image. ELM is a machine learning algorithm for single-hidden-layer feedforward networks (SLFNs), a type of neural network (NN). The algorithm is built on the premise that input weights and biases can be randomly assigned, and it does not require back-propagation. The learned model is the set of output-layer weights, which are used to calculate a prediction. Together, Extreme Learning Machine Optical Navigation (ELM OpNav) utilizes optical images and the ELM algorithm to train the machine to navigate around a target body. In this thesis the asteroid Vesta is the designated celestial body. The trained ELMs estimate the position of the spacecraft during operation with a single data set. The results show the approach is promising and potentially suitable for on-board navigation.
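
    The ELM recipe summarized above (randomly assigned input weights and biases, no back-propagation, output-layer weights solved in closed form) can be sketched in a few lines of NumPy. This is a generic, minimal regression sketch under assumed choices (tanh activation, ridge regularization, a toy target), not the navigation code from the thesis.

    ```python
    import numpy as np

    def elm_train(X, y, n_hidden=200, ridge=1e-2, seed=0):
        """Minimal single-hidden-layer ELM for regression."""
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights (never tuned)
        b = rng.standard_normal(n_hidden)                 # random biases (never tuned)
        H = np.tanh(X @ W + b)                            # hidden-layer activations
        # The "learned model" is just the output-layer weights, solved in closed form.
        beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
        return W, b, beta

    def elm_predict(X, W, b, beta):
        return np.tanh(X @ W + b) @ beta

    # Toy example: learn a smooth 2-D function (stand-in for feature -> position regression).
    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, (1000, 2))
    y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
    W, b, beta = elm_train(X, y)
    print("train RMSE:", np.sqrt(np.mean((elm_predict(X, W, b, beta) - y) ** 2)))
    ```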

  16. Testing and Validating Machine Learning Classifiers by Metamorphic Testing

    PubMed Central

    Xie, Xiaoyuan; Ho, Joshua W. K.; Murphy, Christian; Kaiser, Gail; Xu, Baowen; Chen, Tsong Yueh

    2011-01-01

    Machine Learning algorithms have provided core functionality to many application domains - such as bioinformatics, computational linguistics, etc. However, it is difficult to detect faults in such applications because often there is no “test oracle” to verify the correctness of the computed outputs. To help address software quality, in this paper we present a technique for testing the implementations of machine learning classification algorithms which support such applications. Our approach is based on the technique “metamorphic testing”, which has been shown to be effective in alleviating the oracle problem. Also presented are a case study on a real-world machine learning application framework, and a discussion of how programmers implementing machine learning algorithms can avoid the common pitfalls discovered in our study. We also conduct mutation analysis and cross-validation, which reveal that our method has high effectiveness in killing mutants, and that observing the expected cross-validation result alone is not sufficiently effective to detect faults in a supervised classification program. The effectiveness of metamorphic testing is further confirmed by the detection of real faults in a popular open-source classification program. PMID:21532969
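
    To make the metamorphic-testing idea concrete, the sketch below checks two simple metamorphic relations on a k-nearest-neighbour classifier: predictions should be unchanged when the training instances are permuted, and when all instances (training and test) are translated by a common offset, since Euclidean distances are translation invariant. The scikit-learn classifier, the specific relations and the random data are illustrative assumptions, not the relations or framework studied in the paper.

    ```python
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def predict(train_X, train_y, test_X):
        """System under test: a kNN classifier (illustrative stand-in, no test oracle needed)."""
        return KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y).predict(test_X)

    rng = np.random.default_rng(0)
    train_X, train_y = rng.random((60, 4)), rng.integers(0, 2, 60)
    test_X = rng.random((10, 4))
    baseline = predict(train_X, train_y, test_X)

    # Metamorphic relation 1: permuting the order of the training instances
    # must leave every prediction unchanged.
    perm = rng.permutation(len(train_y))
    assert np.array_equal(baseline, predict(train_X[perm], train_y[perm], test_X))

    # Metamorphic relation 2: translating every instance (train and test) by the
    # same offset preserves Euclidean distances, so predictions must not change.
    offset = rng.random(4)
    assert np.array_equal(baseline, predict(train_X + offset, train_y, test_X + offset))
    print("both metamorphic relations hold")
    ```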

  17. Binary classification of a large collection of environmental chemicals from estrogen receptor assays by quantitative structure-activity relationship and machine learning methods.

    PubMed

    Zang, Qingda; Rotroff, Daniel M; Judson, Richard S

    2013-12-23

    There are thousands of environmental chemicals subject to regulatory decisions for endocrine disrupting potential. The ToxCast and Tox21 programs have tested ∼8200 chemicals in a broad screening panel of in vitro high-throughput screening (HTS) assays for estrogen receptor (ER) agonist and antagonist activity. The present work uses this large data set to develop in silico quantitative structure-activity relationship (QSAR) models using machine learning (ML) methods and a novel approach to manage the imbalanced data distribution. Training compounds from the ToxCast project were categorized as active or inactive (binding or nonbinding) classes based on a composite ER Interaction Score derived from a collection of 13 ER in vitro assays. A total of 1537 chemicals from ToxCast were used to derive and optimize the binary classification models while 5073 additional chemicals from the Tox21 project, evaluated in 2 of the 13 in vitro assays, were used to externally validate the model performance. In order to handle the imbalanced distribution of active and inactive chemicals, we developed a cluster-selection strategy to minimize information loss and increase predictive performance and compared this strategy to three currently popular techniques: cost-sensitive learning, oversampling of the minority class, and undersampling of the majority class. QSAR classification models were built to relate the molecular structures of chemicals to their ER activities using linear discriminant analysis (LDA), classification and regression trees (CART), and support vector machines (SVM) with 51 molecular descriptors from QikProp and 4328 bits of structural fingerprints as explanatory variables. A random forest (RF) feature selection method was employed to extract the structural features most relevant to the ER activity. The best model was obtained using SVM in combination with a subset of descriptors identified from a large set via the RF algorithm, which recognized the active and
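
    The pipeline outlined in the abstract - rank molecular descriptors with a random forest, then train an SVM classifier on the reduced descriptor set while compensating for class imbalance - can be illustrated as follows. The synthetic data, the number of retained descriptors and the use of class weighting (rather than the paper's cluster-selection strategy) are assumptions for this sketch.

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import balanced_accuracy_score

    # Synthetic stand-in for the descriptor matrix: imbalanced binary labels.
    X, y = make_classification(n_samples=1500, n_features=200, n_informative=20,
                               weights=[0.85, 0.15], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Step 1: random-forest feature ranking; keep the top-k descriptors.
    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
    top = np.argsort(rf.feature_importances_)[::-1][:30]

    # Step 2: SVM on the reduced descriptor set; class_weight='balanced' is one
    # simple alternative to the cluster-selection strategy described in the paper.
    svm = SVC(kernel="rbf", class_weight="balanced").fit(X_tr[:, top], y_tr)
    print("balanced accuracy:", balanced_accuracy_score(y_te, svm.predict(X_te[:, top])))
    ```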

  18. Stacking for machine learning redshifts applied to SDSS galaxies

    NASA Astrophysics Data System (ADS)

    Zitlau, Roman; Hoyle, Ben; Paech, Kerstin; Weller, Jochen; Rau, Markus Michael; Seitz, Stella

    2016-08-01

    We present an analysis of a general machine learning technique called 'stacking' for the estimation of photometric redshifts. Stacking techniques can feed the photometric redshift estimate, as output by a base algorithm, back into the same algorithm as an additional input feature in a subsequent learning round. We show how all tested base algorithms benefit from at least one additional stacking round (or layer). To demonstrate the benefit of stacking, we apply the method to both unsupervised machine learning techniques based on self-organising maps (SOMs), and supervised machine learning methods based on decision trees. We explore a range of stacking architectures, such as the number of layers and the number of base learners per layer. Finally we explore the effectiveness of stacking even when using a successful algorithm such as AdaBoost. We observe a significant improvement of between 1.9% and 21% on all computed metrics when stacking is applied to weak learners (such as SOMs and decision trees). When applied to strong learning algorithms (such as AdaBoost) the ratio of improvement shrinks, but still remains positive and is between 0.4% and 2.5% for the explored metrics and comes at almost no additional computational cost.
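
    A minimal sketch of one stacking round as described above: a base regressor's redshift estimate is appended as an extra input feature and the same algorithm is retrained. Decision trees, the synthetic "photometry", and the use of out-of-fold predictions for the training set (to avoid information leakage) are illustrative choices, not the paper's exact architecture.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split, cross_val_predict
    from sklearn.metrics import median_absolute_error

    # Synthetic stand-in for photometric features and redshifts.
    rng = np.random.default_rng(0)
    X = rng.random((4000, 5))
    z = X @ np.array([0.8, 0.3, 0.5, 0.1, 0.2]) + 0.05 * rng.standard_normal(4000)
    X_tr, X_te, z_tr, z_te = train_test_split(X, z, random_state=0)

    # Layer 1: base learner. Out-of-fold predictions are used as the training-set
    # estimates so that layer 2 does not see leaked in-sample fits.
    base = DecisionTreeRegressor(max_depth=6, random_state=0)
    z_tr_hat = cross_val_predict(base, X_tr, z_tr, cv=5)
    base.fit(X_tr, z_tr)
    z_te_hat = base.predict(X_te)

    # Layer 2: the same algorithm, retrained with the layer-1 estimate appended
    # as an additional input feature (one extra stacking round).
    stacked = DecisionTreeRegressor(max_depth=6, random_state=0)
    stacked.fit(np.column_stack([X_tr, z_tr_hat]), z_tr)

    print("layer 1 MAD:", median_absolute_error(z_te, z_te_hat))
    print("layer 2 MAD:", median_absolute_error(z_te, stacked.predict(np.column_stack([X_te, z_te_hat]))))
    ```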

  19. Stacking for machine learning redshifts applied to SDSS galaxies

    NASA Astrophysics Data System (ADS)

    Zitlau, Roman; Hoyle, Ben; Paech, Kerstin; Weller, Jochen; Rau, Markus Michael; Seitz, Stella

    2016-08-01

    We present an analysis of a general machine learning technique called `stacking' for the estimation of photometric redshifts. Stacking techniques can feed the photometric redshift estimate, as output by a base algorithm, back into the same algorithm as an additional input feature in a subsequent learning round. We show how all tested base algorithms benefit from at least one additional stacking round (or layer). To demonstrate the benefit of stacking, we apply the method to both unsupervised machine learning techniques based on self-organizing maps (SOMs), and supervised machine learning methods based on decision trees. We explore a range of stacking architectures, such as the number of layers and the number of base learners per layer. Finally we explore the effectiveness of stacking even when using a successful algorithm such as AdaBoost. We observe a significant improvement of between 1.9 per cent and 21 per cent on all computed metrics when stacking is applied to weak learners (such as SOMs and decision trees). When applied to strong learning algorithms (such as AdaBoost) the ratio of improvement shrinks, but still remains positive and is between 0.4 per cent and 2.5 per cent for the explored metrics and comes at almost no additional computational cost.

  20. Protein secondary structure prediction using logic-based machine learning.

    PubMed

    Muggleton, S; King, R D; Sternberg, M J

    1992-10-01

    Many attempts have been made to solve the problem of predicting protein secondary structure from the primary sequence but the best performance results are still disappointing. In this paper, the use of a machine learning algorithm which allows relational descriptions is shown to lead to improved performance. The Inductive Logic Programming computer program, Golem, was applied to learning secondary structure prediction rules for alpha/alpha domain type proteins. The input to the program consisted of 12 non-homologous proteins (1612 residues) of known structure, together with a background knowledge describing the chemical and physical properties of the residues. Golem learned a small set of rules that predict which residues are part of the alpha-helices--based on their positional relationships and chemical and physical properties. The rules were tested on four independent non-homologous proteins (416 residues) giving an accuracy of 81% (+/- 2%). This is an improvement, on identical data, over the previously reported result of 73% by King and Sternberg (1990, J. Mol. Biol., 216, 441-457) using the machine learning program PROMIS, and of 72% using the standard Garnier-Osguthorpe-Robson method. The best previously reported result in the literature for the alpha/alpha domain type is 76%, achieved using a neural net approach. Machine learning also has the advantage over neural network and statistical methods in producing more understandable results. PMID:1480619

  1. Finding new perovskite halides via machine learning

    DOE PAGES Beta

    Pilania, Ghanshyam; Balachandran, Prasanna V.; Kim, Chiho; Lookman, Turab

    2016-04-26

    Advanced materials with improved properties have the potential to fuel future technological advancements. However, identification and discovery of these optimal materials for a specific application is a non-trivial task, because of the vastness of the chemical search space with enormous compositional and configurational degrees of freedom. Materials informatics provides an efficient approach toward rational design of new materials, via learning from known data to make decisions on new and previously unexplored compounds in an accelerated manner. Here, we demonstrate the power and utility of such statistical learning (or machine learning, henceforth referred to as ML) via building a support vector machine (SVM) based classifier that uses elemental features (or descriptors) to predict the formability of a given ABX3 halide composition (where A and B represent monovalent and divalent cations, respectively, and X is F, Cl, Br, or I anion) in the perovskite crystal structure. The classification model is built by learning from a dataset of 185 experimentally known ABX3 compounds. After exploring a wide range of features, we identify ionic radii, tolerance factor, and octahedral factor to be the most important factors for the classification, suggesting that steric and geometric packing effects govern the stability of these halides. As a result, the trained and validated models then predict, with a high degree of confidence, several novel ABX3 compositions with perovskite crystal structure.
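
    The kind of classifier described above can be sketched with a handful of geometric descriptors. The Goldschmidt tolerance factor t = (rA + rX) / (sqrt(2)(rB + rX)) and the octahedral factor rB / rX are standard definitions; the synthetic radii, the toy formability labels and the SVM hyperparameters below are assumptions for illustration, not the curated dataset of known compounds used in the paper.

    ```python
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # Synthetic stand-in for a table of ABX3 candidates: ionic radii in angstroms.
    r_A = rng.uniform(1.0, 1.8, 400)      # monovalent A-site cation radius
    r_B = rng.uniform(0.5, 1.2, 400)      # divalent B-site cation radius
    r_X = rng.uniform(1.3, 2.2, 400)      # halide anion radius

    # Standard geometric descriptors used for perovskite formability.
    t = (r_A + r_X) / (np.sqrt(2) * (r_B + r_X))   # Goldschmidt tolerance factor
    mu = r_B / r_X                                  # octahedral factor

    # Toy labels: "formable" when both factors fall in a plausible window
    # (purely illustrative; the paper's labels come from experimentally known compounds).
    y = ((t > 0.81) & (t < 1.11) & (mu > 0.41)).astype(int)
    X = np.column_stack([r_A, r_B, r_X, t, mu])

    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean())
    ```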

  2. Finding New Perovskite Halides via Machine learning

    NASA Astrophysics Data System (ADS)

    Pilania, Ghanshyam; Balachandran, Prasanna V.; Kim, Chiho; Lookman, Turab

    2016-04-01

    Advanced materials with improved properties have the potential to fuel future technological advancements. However, identification and discovery of these optimal materials for a specific application is a non-trivial task, because of the vastness of the chemical search space with enormous compositional and configurational degrees of freedom. Materials informatics provides an efficient approach towards rational design of new materials, via learning from known data to make decisions on new and previously unexplored compounds in an accelerated manner. Here, we demonstrate the power and utility of such statistical learning (or machine learning) via building a support vector machine (SVM) based classifier that uses elemental features (or descriptors) to predict the formability of a given ABX3 halide composition (where A and B represent monovalent and divalent cations, respectively, and X is F, Cl, Br or I anion) in the perovskite crystal structure. The classification model is built by learning from a dataset of 181 experimentally known ABX3 compounds. After exploring a wide range of features, we identify ionic radii, tolerance factor and octahedral factor to be the most important factors for the classification, suggesting that steric and geometric packing effects govern the stability of these halides. The trained and validated models then predict, with a high degree of confidence, several novel ABX3 compositions with perovskite crystal structure.

  3. In Silico Calculation of Infinite Dilution Activity Coefficients of Molecular Solutes in Ionic Liquids: Critical Review of Current Methods and New Models Based on Three Machine Learning Algorithms.

    PubMed

    Paduszyński, Kamil

    2016-08-22

    The aim of the paper is to address all the disadvantages of currently available models for calculating infinite dilution activity coefficients (γ(∞)) of molecular solutes in ionic liquids (ILs)-a relevant property from the point of view of many applications of ILs, particularly in separations. Three new models are proposed, each of them based on a distinct machine learning algorithm: stepwise multiple linear regression (SWMLR), feed-forward artificial neural network (FFANN), and least-squares support vector machine (LSSVM). The models were established based on the most comprehensive γ(∞) data bank reported so far (>34 000 data points for 188 ILs and 128 solutes). Following the paper published previously [J. Chem. Inf. Model 2014, 54, 1311-1324], the ILs were treated in terms of group contributions, whereas the Abraham solvation parameters were used to quantify the impact of solute structure. Temperature is also included in the input data of the models so that they can be utilized to obtain temperature-dependent data and thus related thermodynamic functions. Both internal and external validation techniques were applied to assess the statistical significance and explanatory power of the final correlations. A comparative study of the overall performance of the investigated SWMLR/FFANN/LSSVM approaches is presented in terms of root-mean-square error and average absolute relative deviation between calculated and experimental γ(∞), evaluated for different families of ILs and solutes, as well as between calculated and experimental infinite dilution selectivity for the separation problems of benzene from n-hexane and of thiophene from n-heptane. LSSVM is shown to be the method with the lowest values of both training and generalization errors. It is finally demonstrated that the established models exhibit an improved accuracy compared to the state-of-the-art model, namely, temperature-dependent group contribution linear solvation energy relationship, published in 2011 [J. Chem

  4. Data Mining and Machine Learning in Astronomy

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Brunner, Robert J.

    We review the current state of data mining and machine learning in astronomy. Data Mining can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those in which data mining techniques directly contributed to improving science, and important current and future directions, including probability density functions, parallel algorithms, Peta-Scale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.

  5. Prediction of drug-induced nephrotoxicity and injury mechanisms with human induced pluripotent stem cell-derived cells and machine learning methods

    PubMed Central

    Kandasamy, Karthikeyan; Chuah, Jacqueline Kai Chin; Su, Ran; Huang, Peng; Eng, Kim Guan; Xiong, Sijing; Li, Yao; Chia, Chun Siang; Loo, Lit-Hsin; Zink, Daniele

    2015-01-01

    The renal proximal tubule is a main target for drug-induced toxicity. The prediction of proximal tubular toxicity during drug development remains difficult. No in vitro methods based on induced pluripotent stem cell-derived renal cells had been developed so far. Here, we developed a rapid 1-step protocol for the differentiation of human induced pluripotent stem cells (hiPSC) into proximal tubular-like cells. These proximal tubular-like cells had a purity of >90% after 8 days of differentiation and could be directly applied for compound screening. The nephrotoxicity prediction performance of the cells was determined by evaluating their responses to 30 compounds. The results were automatically determined using a machine learning algorithm called random forest. In this way, proximal tubular toxicity in humans could be predicted with 99.8% training accuracy and 87.0% test accuracy. Further, we studied the underlying mechanisms of injury and drug-induced cellular pathways in these hiPSC-derived renal cells, and the results were in agreement with human and animal data. Our methods will enable the development of personalized or disease-specific hiPSC-based renal in vitro models for compound screening and nephrotoxicity prediction. PMID:26212763

  6. Machine learning strategies for systems with invariance properties

    NASA Astrophysics Data System (ADS)

    Ling, Julia; Jones, Reese; Templeton, Jeremy

    2016-08-01

    In many scientific fields, empirical models are employed to facilitate computational simulations of engineering systems. For example, in fluid mechanics, empirical Reynolds stress closures enable computationally-efficient Reynolds Averaged Navier Stokes simulations. Likewise, in solid mechanics, constitutive relations between the stress and strain in a material are required in deformation analysis. Traditional methods for developing and tuning empirical models usually combine physical intuition with simple regression techniques on limited data sets. The rise of high performance computing has led to a growing availability of high fidelity simulation data. These data open up the possibility of using machine learning algorithms, such as random forests or neural networks, to develop more accurate and general empirical models. A key question when using data-driven algorithms to develop these empirical models is how domain knowledge should be incorporated into the machine learning process. This paper will specifically address physical systems that possess symmetry or invariance properties. Two different methods for teaching a machine learning model an invariance property are compared. In the first method, a basis of invariant inputs is constructed, and the machine learning model is trained upon this basis, thereby embedding the invariance into the model. In the second method, the algorithm is trained on multiple transformations of the raw input data until the model learns invariance to that transformation. Results are discussed for two case studies: one in turbulence modeling and one in crystal elasticity. It is shown that in both cases embedding the invariance property into the input features yields higher performance at significantly reduced computational training costs.
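
    The two strategies compared in the paper can be illustrated on a toy rotation-invariant target: (1) construct an invariant input basis (here the squared radius) and train on it, versus (2) train on rotated copies of the raw coordinates so the model has to learn the invariance from data. The small MLPs, the target function and the rotation check below are assumptions for this sketch, not the turbulence or elasticity models from the study.

    ```python
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (2000, 2))
    y = np.exp(-np.sum(X**2, axis=1))              # target depends only on the radius

    # Method 1: embed the invariance via an invariant input basis (squared radius).
    X_inv = np.sum(X**2, axis=1, keepdims=True)
    m1 = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0).fit(X_inv, y)

    # Method 2: learn the invariance from rotation-augmented raw coordinates.
    rots = [np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
            for a in rng.uniform(0, 2 * np.pi, 4)]
    X_aug = np.vstack([X @ R.T for R in rots])
    y_aug = np.tile(y, len(rots))
    m2 = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000, random_state=0).fit(X_aug, y_aug)

    # Check how much each model's prediction changes under a 90-degree rotation.
    X_te = rng.uniform(-1, 1, (500, 2))
    R90 = np.array([[0.0, -1.0], [1.0, 0.0]])
    drift1 = np.abs(m1.predict(np.sum(X_te**2, axis=1, keepdims=True))
                    - m1.predict(np.sum((X_te @ R90.T)**2, axis=1, keepdims=True))).max()
    drift2 = np.abs(m2.predict(X_te) - m2.predict(X_te @ R90.T)).max()
    print("invariant-basis model drift:", drift1)   # exactly zero by construction
    print("augmented model drift:", drift2)         # small but nonzero
    ```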

  7. A two-layered machine learning method to identify protein O-GlcNAcylation sites with O-GlcNAc transferase substrate motifs.

    PubMed

    Kao, Hui-Ju; Huang, Chien-Hsun; Bretaña, Neil Arvin; Lu, Cheng-Tsung; Huang, Kai-Yao; Weng, Shun-Long; Lee, Tzong-Yi

    2015-01-01

    Protein O-GlcNAcylation, involving the β-attachment of single N-acetylglucosamine (GlcNAc) to the hydroxyl group of serine or threonine residues, is an O-linked glycosylation catalyzed by O-GlcNAc transferase (OGT). Molecular level investigation of the basis for OGT's substrate specificity should aid understanding how O-GlcNAc contributes to diverse cellular processes. Due to an increasing number of O-GlcNAcylated peptides with site-specific information identified by mass spectrometry (MS)-based proteomics, we were motivated to characterize substrate site motifs of O-GlcNAc transferases. In this investigation, a non-redundant dataset of 410 experimentally verified O-GlcNAcylation sites were manually extracted from dbOGAP, OGlycBase and UniProtKB. After detection of conserved motifs by using maximal dependence decomposition, profile hidden Markov model (profile HMM) was adopted to learn a first-layered model for each identified OGT substrate motif. Support Vector Machine (SVM) was then used to generate a second-layered model learned from the output values of profile HMMs in first layer. The two-layered predictive model was evaluated using a five-fold cross validation which yielded a sensitivity of 85.4%, a specificity of 84.1%, and an accuracy of 84.7%. Additionally, an independent testing set from PhosphoSitePlus, which was really non-homologous to the training data of predictive model, was used to demonstrate that the proposed method could provide a promising accuracy (84.05%) and outperform other O-GlcNAcylation site prediction tools. A case study indicated that the proposed method could be a feasible means of conducting preliminary analyses of protein O-GlcNAcylation and has been implemented as a web-based system, OGTSite, which is now freely available at http://csb.cse.yzu.edu.tw/OGTSite/. PMID:26680539

  8. Machine learning: An artificial intelligence approach. Vol. II

    SciTech Connect

    Michalski, R.S.; Carbonell, J.G.; Mitchell, T.M.

    1986-01-01

    This book reflects the expansion of machine learning research through presentation of recent advances in the field. The book provides an account of current research directions. Major topics covered include the following: learning concepts and rules from examples; cognitive aspects of learning; learning by analogy; learning by observation and discovery; and an exploration of general aspects of learning.

  9. Machine learning strategies for systems with invariance properties

    SciTech Connect

    Ling, Julia; Jones, Reese E.; Templeton, Jeremy Alan

    2016-01-01

    Here, in many scientific fields, empirical models are employed to facilitate computational simulations of engineering systems. For example, in fluid mechanics, empirical Reynolds stress closures enable computationally-efficient Reynolds-Averaged Navier-Stokes simulations. Likewise, in solid mechanics, constitutive relations between the stress and strain in a material are required in deformation analysis. Traditional methods for developing and tuning empirical models usually combine physical intuition with simple regression techniques on limited data sets. The rise of high-performance computing has led to a growing availability of high-fidelity simulation data, which open up the possibility of using machine learning algorithms, such as random forests or neural networks, to develop more accurate and general empirical models. A key question when using data-driven algorithms to develop these models is how domain knowledge should be incorporated into the machine learning process. This paper will specifically address physical systems that possess symmetry or invariance properties. Two different methods for teaching a machine learning model an invariance property are compared. In the first method, a basis of invariant inputs is constructed, and the machine learning model is trained upon this basis, thereby embedding the invariance into the model. In the second method, the algorithm is trained on multiple transformations of the raw input data until the model learns invariance to that transformation. Results are discussed for two case studies: one in turbulence modeling and one in crystal elasticity. It is shown that in both cases embedding the invariance property into the input features yields higher performance with significantly reduced computational training costs.

  10. Machine learning strategies for systems with invariance properties

    DOE PAGES Beta

    Ling, Julia; Jones, Reese E.; Templeton, Jeremy Alan

    2016-05-06

    Here, in many scientific fields, empirical models are employed to facilitate computational simulations of engineering systems. For example, in fluid mechanics, empirical Reynolds stress closures enable computationally-efficient Reynolds-Averaged Navier-Stokes simulations. Likewise, in solid mechanics, constitutive relations between the stress and strain in a material are required in deformation analysis. Traditional methods for developing and tuning empirical models usually combine physical intuition with simple regression techniques on limited data sets. The rise of high-performance computing has led to a growing availability of high-fidelity simulation data, which open up the possibility of using machine learning algorithms, such as random forests or neural networks, to develop more accurate and general empirical models. A key question when using data-driven algorithms to develop these models is how domain knowledge should be incorporated into the machine learning process. This paper will specifically address physical systems that possess symmetry or invariance properties. Two different methods for teaching a machine learning model an invariance property are compared. In the first method, a basis of invariant inputs is constructed, and the machine learning model is trained upon this basis, thereby embedding the invariance into the model. In the second method, the algorithm is trained on multiple transformations of the raw input data until the model learns invariance to that transformation. Results are discussed for two case studies: one in turbulence modeling and one in crystal elasticity. It is shown that in both cases embedding the invariance property into the input features yields higher performance with significantly reduced computational training costs.

  11. An Evolutionary Machine Learning Framework for Big Data Sequence Mining

    ERIC Educational Resources Information Center

    Kamath, Uday Krishna

    2014-01-01

    Sequence classification is an important problem in many real-world applications. Unlike other machine learning data, there are no "explicit" features or signals in sequence data that can help traditional machine learning algorithms learn and predict from the data. Sequence data exhibits inter-relationships in the elements that are…

  12. AZOrange - High performance open source machine learning for QSAR modeling in a graphical programming environment

    PubMed Central

    2011-01-01

    Background Machine learning has a vast range of applications. In particular, advanced machine learning methods are routinely and increasingly used in quantitative structure activity relationship (QSAR) modeling. QSAR data sets often encompass tens of thousands of compounds and the size of proprietary, as well as public data sets, is rapidly growing. Hence, there is a demand for computationally efficient machine learning algorithms, easily available to researchers without extensive machine learning knowledge. In granting the scientific principles of transparency and reproducibility, Open Source solutions are increasingly acknowledged by regulatory authorities. Thus, an Open Source state-of-the-art high performance machine learning platform, interfacing multiple, customized machine learning algorithms for both graphical programming and scripting, to be used for large scale development of QSAR models of regulatory quality, is of great value to the QSAR community. Results This paper describes the implementation of the Open Source machine learning package AZOrange. AZOrange is specially developed to support batch generation of QSAR models in providing the full work flow of QSAR modeling, from descriptor calculation to automated model building, validation and selection. The automated work flow relies upon the customization of the machine learning algorithms and a generalized, automated model hyper-parameter selection process. Several high performance machine learning algorithms are interfaced for efficient data set specific selection of the statistical method, promoting model accuracy. Using the high performance machine learning algorithms of AZOrange does not require programming knowledge as flexible applications can be created, not only at a scripting level, but also in a graphical programming environment. Conclusions AZOrange is a step towards meeting the needs for an Open Source high performance machine learning platform, supporting the efficient development of

  13. Normal tissue complication probability (NTCP) modelling using spatial dose metrics and machine learning methods for severe acute oral mucositis resulting from head and neck radiotherapy

    PubMed Central

    Dean, Jamie A; Wong, Kee H; Welsh, Liam C; Jones, Ann-Britt; Schick, Ulrike; Newbold, Kate L; Bhide, Shreerang A; Harrington, Kevin J; Nutting, Christopher M; Gulliford, Sarah L

    2016-01-01

    Background and Purpose Severe acute mucositis commonly results from head and neck (chemo)radiotherapy. A predictive model of mucositis could guide clinical decision-making and inform treatment planning. We aimed to generate such a model using spatial dose metrics and machine learning. Material and Methods Predictive models of severe acute mucositis were generated using radiotherapy dose (dose-volume and spatial dose metrics) and clinical data. Penalised logistic regression, support vector classification and random forest classification (RFC) models were generated and compared. Internal validation was performed (with 100-iteration cross-validation), using multiple metrics, including area under the receiver operating characteristic curve (AUC) and calibration slope, to assess performance. Associations between covariates and severe mucositis were explored using the models. Results The dose-volume-based models (standard) performed equally to those incorporating spatial information. Discrimination was similar between models, but the RFCstandard had the best calibration. The mean AUC and calibration slope for this model were 0.71 (s.d.=0.09) and 3.9 (s.d.=2.2), respectively. The volumes of oral cavity receiving intermediate and high doses were associated with severe mucositis. Conclusions The RFCstandard model performance is modest-to-good, but should be improved, and requires external validation. Reducing the volumes of oral cavity receiving intermediate and high doses may reduce mucositis incidence. PMID:27240717
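
    A compact sketch of the internal-validation scheme described above: repeated stratified cross-validation of a random forest classifier, scoring each fold with the AUC and with a calibration slope obtained by refitting a logistic model on the logit of the predicted probabilities. The synthetic data, the number of repeats and the classifier settings are assumptions, not the clinical dataset or the 100-iteration protocol of the study.

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import roc_auc_score

    # Synthetic stand-in for dose metrics plus clinical covariates and a binary toxicity outcome.
    X, y = make_classification(n_samples=400, n_features=12, weights=[0.7, 0.3], random_state=0)

    aucs, slopes = [], []
    for seed in range(10):                                   # repeated cross-validation
        for tr, te in StratifiedKFold(5, shuffle=True, random_state=seed).split(X, y):
            model = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X[tr], y[tr])
            p = np.clip(model.predict_proba(X[te])[:, 1], 1e-6, 1 - 1e-6)
            aucs.append(roc_auc_score(y[te], p))
            # Calibration slope: coefficient of the outcome regressed on logit(p).
            logit_p = np.log(p / (1 - p)).reshape(-1, 1)
            slopes.append(LogisticRegression().fit(logit_p, y[te]).coef_[0, 0])

    print("mean AUC:", round(np.mean(aucs), 3), "mean calibration slope:", round(np.mean(slopes), 2))
    ```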

  14. Recognition of explosives fingerprints on objects for courier services using machine learning methods and laser-induced breakdown spectroscopy.

    PubMed

    Moros, J; Serrano, J; Gallego, F J; Macías, J; Laserna, J J

    2013-06-15

    During recent years laser-induced breakdown spectroscopy (LIBS) has been considered one of the techniques with greater ability for trace detection of explosives. However, despite the high sensitivity exhibited for this application, LIBS suffers from a limited selectivity due to difficulties in assigning the molecular origin of the spectral emissions observed. This circumstance makes the recognition of fingerprints a latent challenging problem. In the present manuscript the sorting of six explosives (chloratite, ammonal, DNT, TNT, RDX and PETN) against a broad list of potential harmless interferents (butter, fuel oil, hand cream, olive oil, …), all of them in the form of fingerprints deposited on the surfaces of objects for courier services, has been carried out. When LIBS information is processed through a multi-stage architecture algorithm built from a suitable combination of 3 learning classifiers, an unknown fingerprint may be labeled into a particular class. Neural network classifiers trained by the Levenberg-Marquardt rule were decided within 3D scatter plots projected onto the subspace of the most useful features extracted from the LIBS spectra. Experimental results demonstrate that the presented algorithm sorts fingerprints according to their hazardous character, although their spectral information is virtually identical in appearance, with rates of false negatives and false positives not beyond 10%. These reported achievements mean a step forward in the technology readiness level of LIBS for this complex application related to defense, homeland security and force protection. PMID:23618183

  15. Machine learning of user profiles: Representational issues

    SciTech Connect

    Bloedorn, E.; Mani, I.; MacMillan, T.R.

    1996-12-31

    As more information becomes available electronically, tools for finding information of interest to users become increasingly important. The goal of the research described here is to build a system for generating comprehensible user profiles that accurately capture user interest with minimum user interaction. The research described here focuses on the importance of a suitable generalization hierarchy and representation for learning profiles which are predictively accurate and comprehensible. In our experiments we evaluated both traditional features based on weighted term vectors as well as subject features corresponding to categories which could be drawn from a thesaurus. Our experiments, conducted in the context of a content-based profiling system for on-line newspapers on the World Wide Web (the IDD News Browser), demonstrate the importance of a generalization hierarchy and the promise of combining natural language processing techniques with machine learning (ML) to address an information retrieval (IR) problem.

  16. Estimation of octanol/water partition coefficient and aqueous solubility of environmental chemicals using molecular fingerprints and machine learning methods

    EPA Science Inventory

    Octanol/water partition coefficient (logP) and aqueous solubility (logS) are two important parameters in pharmacology and toxicology studies, and experimental measurements are usually time-consuming and expensive. In the present research, novel methods are presented for the estim...

  17. A study of machine learning regression methods for major elemental analysis of rocks using laser-induced breakdown spectroscopy

    NASA Astrophysics Data System (ADS)

    Boucher, Thomas F.; Ozanne, Marie V.; Carmosino, Marco L.; Dyar, M. Darby; Mahadevan, Sridhar; Breves, Elly A.; Lepore, Kate H.; Clegg, Samuel M.

    2015-05-01

    The ChemCam instrument on the Mars Curiosity rover is generating thousands of LIBS spectra and bringing interest in this technique to public attention. The key to interpreting Mars or any other types of LIBS data are calibrations that relate laboratory standards to unknowns examined in other settings and enable predictions of chemical composition. Here, LIBS spectral data are analyzed using linear regression methods including partial least squares (PLS-1 and PLS-2), principal component regression (PCR), least absolute shrinkage and selection operator (lasso), elastic net, and linear support vector regression (SVR-Lin). These were compared against results from nonlinear regression methods including kernel principal component regression (K-PCR), polynomial kernel support vector regression (SVR-Py) and k-nearest neighbor (kNN) regression to discern the most effective models for interpreting chemical abundances from LIBS spectra of geological samples. The results were evaluated for 100 samples analyzed with 50 laser pulses at each of five locations averaged together. Wilcoxon signed-rank tests were employed to evaluate the statistical significance of differences among the nine models using their predicted residual sum of squares (PRESS) to make comparisons. For MgO, SiO2, Fe2O3, CaO, and MnO, the sparse models outperform all the others except for linear SVR, while for Na2O, K2O, TiO2, and P2O5, the sparse methods produce inferior results, likely because their emission lines in this energy range have lower transition probabilities. The strong performance of the sparse methods in this study suggests that use of dimensionality-reduction techniques as a preprocessing step may improve the performance of the linear models. Nonlinear methods tend to overfit the data and predict less accurately, while the linear methods proved to be more generalizable with better predictive performance. These results are attributed to the high dimensionality of the data (6144 channels

  18. Measure Transformer Semantics for Bayesian Machine Learning

    NASA Astrophysics Data System (ADS)

    Borgström, Johannes; Gordon, Andrew D.; Greenberg, Michael; Margetson, James; van Gael, Jurgen

    The Bayesian approach to machine learning amounts to inferring posterior distributions of random variables from a probabilistic model of how the variables are related (that is, a prior distribution) and a set of observations of variables. There is a trend in machine learning towards expressing Bayesian models as probabilistic programs. As a foundation for this kind of programming, we propose a core functional calculus with primitives for sampling prior distributions and observing variables. We define combinators for measure transformers, based on theorems in measure theory, and use these to give a rigorous semantics to our core calculus. The original features of our semantics include its support for discrete, continuous, and hybrid measures, and, in particular, for observations of zero-probability events. We compile our core language to a small imperative language that has a straightforward semantics via factor graphs, data structures that enable many efficient inference algorithms. We use an existing inference engine for efficient approximate inference of posterior marginal distributions, treating thousands of observations per second for large instances of realistic models.

  19. Mining the Kepler Data using Machine Learning

    NASA Astrophysics Data System (ADS)

    Walkowicz, Lucianne; Howe, A. R.; Nayar, R.; Turner, E. L.; Scargle, J.; Meadows, V.; Zee, A.

    2014-01-01

    Kepler's high cadence and incredible precision have provided an unprecedented view into stars and their planetary companions, revealing both expected and novel phenomena and systems. Due to the large number of Kepler lightcurves, the discovery of novel phenomena in particular has often been serendipitous in the course of searching for known forms of variability (for example, the discovery of the doubly pulsating elliptical binary KOI-54, originally identified by the transiting planet search pipeline). In this talk, we discuss progress on mining the Kepler data through both supervised and unsupervised machine learning, intended both to systematically search the Kepler lightcurves for rare or anomalous variability, and to create a variability catalog for community use. Mining the dataset in this way also allows for a quantitative identification of anomalous variability, and so may also be used as a signal-agnostic form of optical SETI. As the Kepler data are exceptionally rich, they provide an interesting counterpoint to machine learning efforts typically performed on sparser and/or noisier survey data, and will inform similar characterization carried out on future survey datasets.

  20. A Fast Reduced Kernel Extreme Learning Machine.

    PubMed

    Deng, Wan-Yu; Ong, Yew-Soon; Zheng, Qing-Hua

    2016-04-01

    In this paper, we present a fast and accurate kernel-based supervised algorithm referred to as the Reduced Kernel Extreme Learning Machine (RKELM). In contrast to the work on Support Vector Machine (SVM) or Least Square SVM (LS-SVM), which identifies the support vectors or weight vectors iteratively, the proposed RKELM randomly selects a subset of the available data samples as support vectors (or mapping samples). By avoiding the iterative steps of SVM, significant cost savings in the training process can be readily attained, especially on big datasets. RKELM is established based on the rigorous proof of universal learning involving reduced kernel-based SLFN. In particular, we prove that RKELM can approximate any nonlinear function accurately under the condition of support vector sufficiency. Experimental results on a wide variety of real-world small instance size and large instance size applications in the context of binary classification, multi-class problems and regression are then reported to show that RKELM can perform at a competitive level of generalized performance as the SVM/LS-SVM at only a fraction of the computational effort incurred. PMID:26829605
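
    The core RKELM step described above - pick a random subset of training samples as mapping/support vectors, build the kernel matrix against that subset, and solve the output weights in closed form - is sketched below for binary classification with an RBF kernel. The kernel width, regularization constant and synthetic data are assumptions; this is not the authors' implementation.

    ```python
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    def rkelm_fit(X, y, n_support=100, gamma=0.5, C=1.0, seed=0):
        """Minimal reduced-kernel-ELM-style sketch for binary classification (labels ±1)."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=n_support, replace=False)   # random "support" subset
        S = X[idx]
        K = rbf_kernel(X, S, gamma=gamma)                          # (n_samples, n_support)
        # Closed-form, ridge-regularised output weights: no iterative SV selection.
        beta = np.linalg.solve(K.T @ K + np.eye(n_support) / C, K.T @ y)
        return S, beta

    def rkelm_predict(X, S, beta, gamma=0.5):
        return np.sign(rbf_kernel(X, S, gamma=gamma) @ beta)

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    y = 2 * y - 1                                                  # map {0,1} -> {-1,+1}
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    S, beta = rkelm_fit(X_tr, y_tr)
    print("test accuracy:", (rkelm_predict(X_te, S, beta) == y_te).mean())
    ```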

  1. Predicting Methylphenidate Response in ADHD Using Machine Learning Approaches

    PubMed Central

    Kim, Jae-Won; Sharma, Vinod

    2015-01-01

    Background: There are no objective, biological markers that can robustly predict methylphenidate response in attention deficit hyperactivity disorder. This study aimed to examine whether applying machine learning approaches to pretreatment demographic, clinical questionnaire, environmental, neuropsychological, neuroimaging, and genetic information can predict therapeutic response following methylphenidate administration. Methods: The present study included 83 attention deficit hyperactivity disorder youth. At baseline, parents completed the ADHD Rating Scale-IV and Disruptive Behavior Disorder rating scale, and participants undertook the continuous performance test, Stroop color word test, and resting-state functional MRI scans. The dopamine transporter gene, dopamine D4 receptor gene, alpha-2A adrenergic receptor gene (ADRA2A) and norepinephrine transporter gene polymorphisms, and blood lead and urine cotinine levels were also measured. The participants were enrolled in an 8-week, open-label trial of methylphenidate. Four different machine learning algorithms were used for data analysis. Results: Support vector machine classification accuracy was 84.6% (area under receiver operating characteristic curve 0.84) for predicting methylphenidate response. The age, weight, ADRA2A MspI and DraI polymorphisms, lead level, Stroop color word test performance, and oppositional symptoms of Disruptive Behavior Disorder rating scale were identified as the most differentiating subset of features. Conclusions: Our results provide preliminary support to the translational development of support vector machine as an informative method that can assist in predicting treatment response in attention deficit hyperactivity disorder, though further work is required to provide enhanced levels of classification performance. PMID:25964505

  2. TEMPTING system: a hybrid method of rule and machine learning for temporal relation extraction in patient discharge summaries.

    PubMed

    Chang, Yung-Chun; Dai, Hong-Jie; Wu, Johnny Chi-Yang; Chen, Jian-Ming; Tsai, Richard Tzong-Han; Hsu, Wen-Lian

    2013-12-01

    Patient discharge summaries provide detailed medical information about individuals who have been hospitalized. To make a precise and legitimate assessment of the abundant data, a proper time layout of the sequence of relevant events should be compiled and used to drive a patient-specific timeline, which could further assist medical personnel in making clinical decisions. The process of identifying the chronological order of entities is called temporal relation extraction. In this paper, we propose a hybrid method to identify appropriate temporal links between a pair of entities. The method combines two approaches: one is rule-based and the other is based on the maximum entropy model. We develop an integration algorithm to fuse the results of the two approaches. All rules and the integration algorithm are formally stated so that one can easily reproduce the system and results. To optimize the system's configuration, we used the 2012 i2b2 challenge TLINK track dataset and applied threefold cross validation to the training set. Then, we evaluated its performance on the training and test datasets. The experiment results show that the proposed TEMPTING (TEMPoral relaTion extractING) system (ranked seventh) achieved an F-score of 0.563, which was at least 30% better than that of the baseline system, which randomly selects TLINK candidates from all pairs and assigns the TLINK types. The TEMPTING system using the hybrid method also outperformed the stage-based TEMPTING system. Its F-scores were 3.51% and 0.97% better than those of the stage-based system on the training set and test set, respectively. PMID:24060600

  3. A duct mapping method using least squares support vector machines

    NASA Astrophysics Data System (ADS)

    Douvenot, RéMi; Fabbro, Vincent; Gerstoft, Peter; Bourlier, Christophe; Saillard, Joseph

    2008-12-01

    This paper introduces a "refractivity from clutter" (RFC) approach with an inversion method based on a pregenerated database. The RFC method exploits the information contained in the radar sea clutter return to estimate the refractive index profile. Whereas initial efforts were based on algorithms giving good accuracy but involving high computational cost, the present method is based on a learning machine algorithm in order to obtain a real-time system. This paper shows the feasibility of an RFC technique based on the least squares support vector machine inversion method by comparing it to a genetic algorithm on simulated and noise-free data, at 1 and 5 GHz. These data are simulated in the presence of ideal trilinear surface-based ducts. The learning machine is based on a pregenerated database computed using Latin hypercube sampling to improve the efficiency of the learning. The results show that little accuracy is lost compared to a genetic algorithm approach. The computational time of a genetic algorithm is very high, whereas the learning machine approach is real time. The advantage of a real-time RFC system is that it could work on several azimuths in near real time.
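
    A toy sketch of the database-driven inversion described above: a Latin hypercube sample of the duct-parameter space is pushed through a (stand-in) forward clutter model, and a kernel regressor is trained to map clutter curves back to the parameters, giving near-instant inversion at run time. The toy forward model, the parameter bounds, and the use of scikit-learn's SVR in place of the paper's least-squares SVM are assumptions.

    ```python
    import numpy as np
    from scipy.stats import qmc
    from sklearn.svm import SVR
    from sklearn.multioutput import MultiOutputRegressor

    def forward_model(params, n_range=64):
        """Toy stand-in for the clutter simulator: maps duct parameters to a 'clutter' curve."""
        h, slope = params
        r = np.linspace(0.0, 1.0, n_range)
        return np.exp(-slope * r) * np.cos(2 * np.pi * h * r)

    # Pregenerated database: Latin hypercube sample of the duct-parameter space.
    sampler = qmc.LatinHypercube(d=2, seed=0)
    params = qmc.scale(sampler.random(500), l_bounds=[0.5, 0.1], u_bounds=[3.0, 2.0])
    clutter = np.array([forward_model(p) for p in params])

    # Learning machine: regress parameters from clutter curves (SVR used here as a
    # stand-in for the paper's least-squares SVM).
    inverse = MultiOutputRegressor(SVR(kernel="rbf", C=10.0)).fit(clutter, params)

    # Near-real-time inversion of a new "observed" clutter return.
    truth = np.array([1.7, 0.8])
    estimate = inverse.predict(forward_model(truth).reshape(1, -1))[0]
    print("true parameters:", truth, "estimated:", estimate)
    ```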

  4. Online Sequential Extreme Learning Machine With Kernels.

    PubMed

    Scardapane, Simone; Comminiello, Danilo; Scarpiniti, Michele; Uncini, Aurelio

    2015-09-01

    The extreme learning machine (ELM) was recently proposed as a unifying framework for different families of learning algorithms. The classical ELM model consists of a linear combination of a fixed number of nonlinear expansions of the input vector. Learning in ELM is hence equivalent to finding the optimal weights that minimize the error on a dataset. The update works in batch mode, either with explicit feature mappings or with implicit mappings defined by kernels. Although an online version has been proposed for the former, no work has been done up to this point for the latter, and whether an efficient learning algorithm for online kernel-based ELM exists remains an open problem. By explicating some connections between nonlinear adaptive filtering and ELM theory, in this brief, we present an algorithm for this task. In particular, we propose a straightforward extension of the well-known kernel recursive least-squares, belonging to the kernel adaptive filtering (KAF) family, to the ELM framework. We call the resulting algorithm the kernel online sequential ELM (KOS-ELM). Moreover, we consider two different criteria used in the KAF field to obtain sparse filters and extend them to our context. We show that KOS-ELM, with their integration, can result in a highly efficient algorithm, both in terms of obtained generalization error and training time. Empirical evaluations demonstrate interesting results on some benchmarking datasets. PMID:25561597

  5. Introduction to machine learning: k-nearest neighbors.

    PubMed

    Zhang, Zhongheng

    2016-06-01

    Machine learning techniques have been widely used in many scientific fields, but their use in the medical literature is limited, partly because of technical difficulties. k-nearest neighbors (kNN) is a simple method of machine learning. The article introduces some basic ideas underlying the kNN algorithm, and then focuses on how to perform kNN modeling with R. The dataset should be prepared before running the knn() function in R. After prediction of the outcome with the kNN algorithm, the diagnostic performance of the model should be checked. Average accuracy is the most widely used statistic to reflect the performance of the kNN algorithm. Factors such as the k value, distance calculation and choice of appropriate predictors all have a significant impact on the model performance. PMID:27386492
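
    The workflow described above (prepare the dataset, fit kNN, predict the outcome, check accuracy) is written for R's knn() function; the sketch below is an assumed Python analogue using scikit-learn, with k=5 and feature standardization as illustrative choices rather than the article's R code.

    ```python
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # 1. Prepare the dataset: split and scale (distance-based methods need comparable scales).
    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    scaler = StandardScaler().fit(X_tr)

    # 2. Fit kNN and predict the outcome; k is a key tuning factor, as the article notes.
    knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)
    pred = knn.predict(scaler.transform(X_te))

    # 3. Check diagnostic performance; average accuracy is the statistic highlighted above.
    print("accuracy:", accuracy_score(y_te, pred))
    ```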

  6. Introduction to machine learning: k-nearest neighbors

    PubMed Central

    2016-01-01

    Machine learning techniques have been widely used in many scientific fields, but their use in the medical literature is limited, partly because of technical difficulties. k-nearest neighbors (kNN) is a simple method of machine learning. The article introduces some basic ideas underlying the kNN algorithm, and then focuses on how to perform kNN modeling with R. The dataset should be prepared before running the knn() function in R. After prediction of the outcome with the kNN algorithm, the diagnostic performance of the model should be checked. Average accuracy is the most widely used statistic to reflect the performance of the kNN algorithm. Factors such as the k value, distance calculation and choice of appropriate predictors all have a significant impact on the model performance. PMID:27386492

  7. Machine Shop I. Learning Activity Packets (LAPs). Section D--Power Saws and Drilling Machines.

    ERIC Educational Resources Information Center

    Oklahoma State Board of Vocational and Technical Education, Stillwater. Curriculum and Instructional Materials Center.

    This document contains two learning activity packets (LAPs) for the "power saws and drilling machines" instructional area of a Machine Shop I course. The two LAPs cover the following topics: power saws and drill press. Each LAP contains a cover sheet that describes its purpose, an introduction, and the tasks included in the LAP; learning steps…

  8. Learning Activity Packets for Milling Machines. Unit I--Introduction to Milling Machines.

    ERIC Educational Resources Information Center

    Oklahoma State Board of Vocational and Technical Education, Stillwater. Curriculum and Instructional Materials Center.

    This learning activity packet (LAP) outlines the study activities and performance tasks covered in a related curriculum guide on milling machines. The course of study in this LAP is intended to help students learn to identify parts and attachments of vertical and horizontal milling machines, identify work-holding devices, state safety rules, and…

  9. Machine learning classification of SDSS transient survey images

    NASA Astrophysics Data System (ADS)

    du Buisson, L.; Sivanandam, N.; Bassett, Bruce A.; Smith, M.

    2015-12-01

    We show that multiple machine learning algorithms can match human performance in classifying transient imaging data from the Sloan Digital Sky Survey (SDSS) supernova survey into real objects and artefacts. This is a first step in any transient science pipeline and is currently still done by humans, but future surveys such as the Large Synoptic Survey Telescope (LSST) will necessitate fully machine-enabled solutions. Using features trained from eigenimage analysis (principal component analysis, PCA) of single-epoch g, r and i difference images, we can reach a completeness (recall) of 96 per cent, while only incorrectly classifying at most 18 per cent of artefacts as real objects, corresponding to a precision (purity) of 84 per cent. In general, random forests performed best, followed by the k-nearest neighbour and the SkyNet artificial neural net algorithms, compared to other methods such as naive Bayes and kernel support vector machine. Our results show that PCA-based machine learning can match human success levels and can naturally be extended by including multiple epochs of data, transient colours and host galaxy information which should allow for significant further improvements, especially at low signal-to-noise.
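
    The eigenimage-plus-classifier pipeline described above can be sketched on synthetic image stamps: project difference-image cutouts onto their leading principal components, train a random forest, and report completeness (recall) and purity (precision). The fake 21x21 stamps, the number of components and the forest size are assumptions, not the SDSS data or tuned pipeline from the paper.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_score, recall_score

    rng = np.random.default_rng(0)

    # Synthetic 21x21 "difference image" stamps: real transients get a central blob,
    # artefacts are pure noise (purely illustrative stand-in for survey cutouts).
    yy, xx = np.mgrid[-10:11, -10:11]
    psf = np.exp(-(xx**2 + yy**2) / 8.0).ravel()
    n = 2000
    labels = rng.integers(0, 2, n)                      # 1 = real object, 0 = artefact
    stamps = rng.standard_normal((n, 441)) + labels[:, None] * 3.0 * psf

    X_tr, X_te, y_tr, y_te = train_test_split(stamps, labels, random_state=0)

    # Eigenimage features: project each stamp onto the leading principal components.
    pca = PCA(n_components=20).fit(X_tr)
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(pca.transform(X_tr), y_tr)
    pred = clf.predict(pca.transform(X_te))

    print("completeness (recall):", recall_score(y_te, pred))
    print("purity (precision):   ", precision_score(y_te, pred))
    ```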

  10. Multivariate Mapping of Environmental Data Using Extreme Learning Machines

    NASA Astrophysics Data System (ADS)

    Leuenberger, Michael; Kanevski, Mikhail

    2014-05-01

    In most real cases environmental data are multivariate, highly variable at several spatio-temporal scales, and generated by nonlinear and complex phenomena. Mapping - spatial prediction of such data - is a challenging problem. Machine learning algorithms, being universal nonlinear tools, have demonstrated their efficiency in modelling of environmental spatial and space-time data (Kanevski et al. 2009). Recently, a new approach in machine learning - the Extreme Learning Machine (ELM) - has gained great popularity. ELM is a fast and powerful approach belonging to the machine learning algorithm category. Developed by G.-B. Huang et al. (2006), it follows the structure of a multilayer perceptron (MLP) with a single hidden layer, i.e., a single-hidden-layer feedforward neural network (SLFN). The learning step of classical artificial neural networks, like the MLP, deals with the optimization of weights and biases by using a gradient-based learning algorithm (e.g. the back-propagation algorithm). As opposed to this optimization phase, which can fall into local minima, ELM generates randomly the weights between the input layer and the hidden layer and also the biases in the hidden layer. With this initialization, it optimizes just the weight vector between the hidden layer and the output layer, in a single step. The main advantage of this algorithm is the speed of the learning step. In a theoretical context and by growing the number of hidden nodes, the algorithm can learn any set of training data with zero error. To avoid overfitting, the cross-validation method or "true validation" (by randomly splitting data into training, validation and testing subsets) is recommended in order to find an optimal number of neurons. With its universal property and solid theoretical basis, ELM is a good machine learning algorithm which can push the field forward. The present research deals with an extension of ELM to multivariate output modelling and application of ELM to the real data case study - pollution of the sediments in