Sample records for unsupervised cluster analysis

  1. Manifold Learning in MR spectroscopy using nonlinear dimensionality reduction and unsupervised clustering.

    PubMed

    Yang, Guang; Raschke, Felix; Barrick, Thomas R; Howe, Franklyn A

    2015-09-01

    To investigate whether nonlinear dimensionality reduction improves unsupervised classification of ¹H MRS brain tumor data compared with a linear method. In vivo single-voxel ¹H magnetic resonance spectroscopy (55 patients) and ¹H magnetic resonance spectroscopy imaging (MRSI) (29 patients) data were acquired from histopathologically diagnosed gliomas. Data reduction using Laplacian eigenmaps (LE) or independent component analysis (ICA) was followed by k-means clustering or agglomerative hierarchical clustering (AHC) for unsupervised learning to assess tumor grade and for tissue type segmentation of MRSI data. An accuracy of 93% in classification of glioma grade II and grade IV, with 100% accuracy in distinguishing tumor and normal spectra, was obtained by LE with unsupervised clustering, but not with the combination of k-means and ICA. With ¹H MRSI data, LE provided a more linear distribution of data for cluster analysis and better cluster stability than ICA. LE combined with k-means or AHC provided 91% accuracy for classifying tumor grade and 100% accuracy for identifying normal tissue voxels. Color-coded visualization of normal brain, tumor core, and infiltration regions was achieved with LE combined with AHC. The LE method is promising for unsupervised clustering to separate brain and tumor tissue with automated color-coding for visualization of ¹H MRSI data after cluster analysis. © 2014 Wiley Periodicals, Inc.

  2. An unsupervised classification technique for multispectral remote sensing data.

    NASA Technical Reports Server (NTRS)

    Su, M. Y.; Cummings, R. E.

    1973-01-01

    Description of a two-part clustering technique consisting of (a) a sequential statistical clustering, which is essentially a sequential variance analysis, and (b) a generalized K-means clustering. In this composite clustering technique, the output of (a) is a set of initial clusters which are input to (b) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by traditional supervised maximum-likelihood classification techniques.
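
    The two-stage idea in this record (a one-pass sequential clustering whose output seeds an iterative K-means) can be sketched roughly as below. This is only an illustration under stated assumptions: the distance threshold stands in for the report's sequential variance test, plain Lloyd iterations stand in for its "generalized" K-means, and the names `sequential_seed`, `kmeans`, and the `pixels` values are hypothetical.

```python
import math

def sequential_seed(points, threshold):
    """One pass over the data: a point joins the nearest running-mean cluster
    if it lies within `threshold` (an assumed stand-in for the variance test),
    otherwise it starts a new cluster."""
    means, counts = [], []
    for p in points:
        if means:
            j = min(range(len(means)), key=lambda i: math.dist(p, means[i]))
            if math.dist(p, means[j]) <= threshold:
                counts[j] += 1
                # incremental update of the running mean
                means[j] = tuple(m + (x - m) / counts[j] for m, x in zip(means[j], p))
                continue
        means.append(tuple(p))
        counts.append(1)
    return means

def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the supplied initial centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        new_centers = []
        for i, c in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(c)  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Hypothetical two-band "pixels" forming two spectral groups
pixels = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
seeds = sequential_seed(pixels, threshold=1.0)   # stage (a): initial clusters
centers, labels = kmeans(pixels, seeds)          # stage (b): iterative refinement
```

    Seeding from a single pass means the number of clusters does not have to be fixed in advance; in this sketch it is determined by the threshold.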

  3. Unsupervised classification of earth resources data.

    NASA Technical Reports Server (NTRS)

    Su, M. Y.; Jayroe, R. R., Jr.; Cummings, R. E.

    1972-01-01

    A new clustering technique is presented. It consists of two parts: (a) a sequential statistical clustering which is essentially a sequential variance analysis and (b) a generalized K-means clustering. In this composite clustering technique, the output of (a) is a set of initial clusters which are input to (b) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by the existing supervised maximum likelihood classification technique.

  4. Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm

    PubMed Central

    Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong

    2016-01-01

    In single-particle cryo-electron microscopy (cryo-EM), the K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, the traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development of clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alternative to the traditional K-means algorithm in single-particle cryo-EM analysis. PMID:27959895

  5. Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm.

    PubMed

    Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong

    2016-01-01

    In single-particle cryo-electron microscopy (cryo-EM), the K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, the traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development of clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alternative to the traditional K-means algorithm in single-particle cryo-EM analysis.

  6. The composite sequential clustering technique for analysis of multispectral scanner data

    NASA Technical Reports Server (NTRS)

    Su, M. Y.

    1972-01-01

    The clustering technique consists of two parts: (1) a sequential statistical clustering which is essentially a sequential variance analysis, and (2) a generalized K-means clustering. In this composite clustering technique, the output of (1) is a set of initial clusters which are input to (2) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by traditional supervised maximum likelihood classification techniques. The mathematical algorithms for the composite sequential clustering program and a detailed computer program description with job setup are given.

  7. An unsupervised classification approach for analysis of Landsat data to monitor land reclamation in Belmont county, Ohio

    NASA Technical Reports Server (NTRS)

    Brumfield, J. O.; Bloemer, H. H. L.; Campbell, W. J.

    1981-01-01

    Two unsupervised classification procedures for analyzing Landsat data used to monitor land reclamation in a surface mining area in east central Ohio are compared for agreement with data collected from the corresponding locations on the ground. One procedure is based on a traditional unsupervised-clustering/maximum-likelihood algorithm sequence that assumes spectral groupings in the Landsat data in n-dimensional space; the other is based on a nontraditional unsupervised-clustering/canonical-transformation/clustering algorithm sequence that not only assumes spectral groupings in n-dimensional space but also includes an additional feature-extraction technique. It is found that the nontraditional procedure provides an appreciable improvement in spectral groupings and apparently increases the level of accuracy in the classification of land cover categories.

  8. Identifying influential individuals on intensive care units: using cluster analysis to explore culture.

    PubMed

    Fong, Allan; Clark, Lindsey; Cheng, Tianyi; Franklin, Ella; Fernandez, Nicole; Ratwani, Raj; Parker, Sarah Henrickson

    2017-07-01

    The objective of this paper is to identify attribute patterns of influential individuals in intensive care units using unsupervised cluster analysis. Despite the acknowledgement that culture of an organisation is critical to improving patient safety, specific methods to shift culture have not been explicitly identified. A social network analysis survey was conducted and an unsupervised cluster analysis was used. A total of 100 surveys were gathered. Unsupervised cluster analysis was used to group individuals with similar dimensions highlighting three general genres of influencers: well-rounded, knowledge and relational. Culture is created locally by individual influencers. Cluster analysis is an effective way to identify common characteristics among members of an intensive care unit team that are noted as highly influential by their peers. To change culture, identifying and then integrating the influencers in intervention development and dissemination may create more sustainable and effective culture change. Additional studies are ongoing to test the effectiveness of utilising these influencers to disseminate patient safety interventions. This study offers an approach that can be helpful in both identifying and understanding influential team members and may be an important aspect of developing methods to change organisational culture. © 2017 John Wiley & Sons Ltd.

  9. Unsupervised classification of remote multispectral sensing data

    NASA Technical Reports Server (NTRS)

    Su, M. Y.

    1972-01-01

    The new unsupervised classification technique for classifying multispectral remote sensing data, which can be either from the multispectral scanner or from digitized color-separation aerial photographs, consists of two parts: (a) a sequential statistical clustering which is a one-pass sequential variance analysis and (b) a generalized K-means clustering. In this composite clustering technique, the output of (a) is a set of initial clusters which are input to (b) for further improvement by an iterative scheme. Applications of the technique using an IBM-7094 computer on multispectral data sets over Purdue's Flight Line C-1 and the Yellowstone National Park test site have been accomplished. Comparisons between the classification maps by the unsupervised technique and the supervised maximum likelihood technique indicate that the classification accuracies are in agreement.

  10. Down-Regulation of Olfactory Receptors in Response to Traumatic Brain Injury Promotes Risk for Alzheimer's Disease

    DTIC Science & Technology

    2015-12-01

    group assignment of samples in unsupervised hierarchical clustering by the Unweighted Pair-Group Method using Arithmetic averages (UPGMA) based on...log2 transformed MAS5.0 signal values; probe set clustering was performed by the UPGMA method using Cosine correlation as the similarity metric. For...differentially-regulated genes identified were subjected to unsupervised hierarchical clustering analysis using the UPGMA algorithm with cosine correlation as
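
    As a rough illustration of UPGMA (average-linkage) clustering with a cosine-based dissimilarity, the sketch below repeatedly merges the two clusters with the smallest mean pairwise distance. It is a naive, assumed implementation for small inputs, not the software used in the report; `cosine_distance`, `upgma`, and the `samples` values are hypothetical, and any centering of the log2 values is omitted.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def upgma(profiles):
    """Return the merge history [(members_a, members_b, height), ...]."""
    clusters = [(i,) for i in range(len(profiles))]
    merges = []

    def avg_dist(pair):
        # UPGMA linkage: unweighted mean of all between-cluster pair distances
        a, b = pair
        total = sum(cosine_distance(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=avg_dist)
        merges.append((a, b, avg_dist((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical log2 expression profiles (one row per sample)
samples = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.1],
    [3.0, 1.0, 0.5],
    [6.0, 2.1, 1.0],
]
history = upgma(samples)
for a, b, height in history:
    print(a, b, round(height, 5))
```

    Because cosine distance ignores vector length, samples with the same expression pattern at different overall intensity merge early, which is usually the motivation for choosing it over Euclidean distance.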

  11. Novel Histogram Based Unsupervised Classification Technique to Determine Natural Classes From Biophysically Relevant Fit Parameters to Hyperspectral Data

    DOE PAGES

    McCann, Cooper; Repasky, Kevin S.; Morin, Mikindra; ...

    2017-05-23

    Hyperspectral image analysis has benefited from an array of methods that take advantage of the increased spectral depth compared to multispectral sensors; however, the focus of these developments has been on supervised classification methods. Lack of a priori knowledge regarding land cover characteristics can make unsupervised classification methods preferable under certain circumstances. An unsupervised classification technique is presented in this paper that utilizes physically relevant basis functions to model the reflectance spectra. These fit parameters used to generate the basis functions allow clustering based on spectral characteristics rather than spectral channels and provide both noise and data reduction. Histogram splitting of the fit parameters is then used as a means of producing an unsupervised classification. Unlike current unsupervised classification techniques that rely primarily on Euclidean distance measures to determine similarity, the unsupervised classification technique uses the natural splitting of the fit parameters associated with the basis functions creating clusters that are similar in terms of physical parameters. The data set used in this work utilizes the publicly available data collected at Indian Pines, Indiana. This data set provides reference data allowing for comparisons of the efficacy of different unsupervised data analysis. The unsupervised histogram splitting technique presented in this paper is shown to be better than the standard unsupervised ISODATA clustering technique with an overall accuracy of 34.3/19.0% before merging and 40.9/39.2% after merging. Finally, this improvement is also seen as an improvement of kappa before/after merging of 24.8/30.5 for the histogram splitting technique compared to 15.8/28.5 for ISODATA.
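
    The two figures of merit quoted in this abstract, overall accuracy and kappa, are conventionally derived from a confusion matrix of reference classes against assigned classes. The sketch below shows the standard Cohen's kappa calculation; the function name and the example matrix are hypothetical and are not taken from the paper.

```python
def overall_accuracy_and_kappa(confusion):
    """confusion[i][j] = number of reference-class-i samples assigned to class j."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # agreement expected by chance from the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# Hypothetical 2-class confusion matrix
accuracy, kappa = overall_accuracy_and_kappa([[40, 10], [20, 30]])
print(accuracy, kappa)
```

    Kappa discounts the agreement that would occur by chance, so it can rank two classifiers differently from raw accuracy when class sizes are unbalanced.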

  12. Novel Histogram Based Unsupervised Classification Technique to Determine Natural Classes From Biophysically Relevant Fit Parameters to Hyperspectral Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McCann, Cooper; Repasky, Kevin S.; Morin, Mikindra

    Hyperspectral image analysis has benefited from an array of methods that take advantage of the increased spectral depth compared to multispectral sensors; however, the focus of these developments has been on supervised classification methods. Lack of a priori knowledge regarding land cover characteristics can make unsupervised classification methods preferable under certain circumstances. An unsupervised classification technique is presented in this paper that utilizes physically relevant basis functions to model the reflectance spectra. These fit parameters used to generate the basis functions allow clustering based on spectral characteristics rather than spectral channels and provide both noise and data reduction. Histogram splitting of the fit parameters is then used as a means of producing an unsupervised classification. Unlike current unsupervised classification techniques that rely primarily on Euclidean distance measures to determine similarity, the unsupervised classification technique uses the natural splitting of the fit parameters associated with the basis functions creating clusters that are similar in terms of physical parameters. The data set used in this work utilizes the publicly available data collected at Indian Pines, Indiana. This data set provides reference data allowing for comparisons of the efficacy of different unsupervised data analysis. The unsupervised histogram splitting technique presented in this paper is shown to be better than the standard unsupervised ISODATA clustering technique with an overall accuracy of 34.3/19.0% before merging and 40.9/39.2% after merging. Finally, this improvement is also seen as an improvement of kappa before/after merging of 24.8/30.5 for the histogram splitting technique compared to 15.8/28.5 for ISODATA.

  13. Evaluating Mixture Modeling for Clustering: Recommendations and Cautions

    ERIC Educational Resources Information Center

    Steinley, Douglas; Brusco, Michael J.

    2011-01-01

    This article provides a large-scale investigation into several of the properties of mixture-model clustering techniques (also referred to as latent class cluster analysis, latent profile analysis, model-based clustering, probabilistic clustering, Bayesian classification, unsupervised learning, and finite mixture models; see Vermunt & Magdison,…

  14. Down-Regulation of Olfactory Receptors in Response to Traumatic Brain Injury Promotes Risk for Alzheimer’s Disease

    DTIC Science & Technology

    2013-10-01

    correct group assignment of samples in unsupervised hierarchical clustering by the Unweighted Pair-Group Method using Arithmetic averages (UPGMA) based on...centering of log2 transformed MAS5.0 signal values; probe set clustering was performed by the UPGMA method using Cosine correlation as the similarity met...A) The 108 differentially-regulated genes identified were subjected to unsupervised hierarchical clustering analysis using the UPGMA algorithm with

  15. Discrete Wavelet Transform-Based Whole-Spectral and Subspectral Analysis for Improved Brain Tumor Clustering Using Single Voxel MR Spectroscopy.

    PubMed

    Yang, Guang; Nawaz, Tahir; Barrick, Thomas R; Howe, Franklyn A; Slabaugh, Greg

    2015-12-01

    Many approaches have been considered for automatic grading of brain tumors by means of pattern recognition with magnetic resonance spectroscopy (MRS). Providing an improved technique which can assist clinicians in accurately identifying brain tumor grades is our main objective. The proposed technique, which is based on the discrete wavelet transform (DWT) of whole-spectral or subspectral information of key metabolites, combined with unsupervised learning, inspects the separability of the extracted wavelet features from the MRS signal to aid the clustering. In total, we included 134 short echo time single voxel MRS spectra (SV MRS) in our study that cover normal controls, low grade and high grade tumors. The combination of DWT-based whole-spectral or subspectral analysis and unsupervised clustering achieved an overall clustering accuracy of 94.8% and a balanced error rate of 7.8%. To the best of our knowledge, it is the first study using DWT combined with unsupervised learning to cluster brain SV MRS. Instead of dimensionality reduction on SV MRS or feature selection using model fitting, our study provides an alternative method of extracting features to obtain promising clustering results.

  16. Unsupervised analysis of small animal dynamic Cerenkov luminescence imaging

    NASA Astrophysics Data System (ADS)

    Spinelli, Antonello E.; Boschi, Federico

    2011-12-01

    Clustering analysis (CA) and principal component analysis (PCA) were applied to dynamic Cerenkov luminescence images (dCLI). In order to investigate the performance of the proposed approaches, two distinct dynamic data sets obtained by injecting mice with 32P-ATP and 18F-FDG were acquired using the IVIS 200 optical imager. The k-means clustering algorithm was applied to dCLI and was implemented using Interactive Data Language (IDL) 8.1. We show that cluster analysis allows us to obtain good agreement between the clustered and the corresponding emission regions like the bladder, the liver, and the tumor. We also show a good correspondence between the time activity curves of the different regions obtained by using CA and manual region of interest analysis on dCLIT and PCA images. We conclude that CA provides an automatic unsupervised method for the analysis of preclinical dynamic Cerenkov luminescence image data.

  17. Comparisons of non-Gaussian statistical models in DNA methylation analysis.

    PubMed

    Ma, Zhanyu; Teschendorff, Andrew E; Yu, Hong; Taghia, Jalil; Guo, Jun

    2014-06-16

    As a key regulatory mechanism of gene expression, DNA methylation patterns are widely altered in many complex genetic diseases, including cancer. DNA methylation is naturally quantified by bounded support data; therefore, it is non-Gaussian distributed. In order to capture such properties, we introduce some non-Gaussian statistical models to perform dimension reduction on DNA methylation data. Afterwards, non-Gaussian statistical model-based unsupervised clustering strategies are applied to cluster the data. Comparisons and analysis of different dimension reduction strategies and unsupervised clustering methods are presented. Experimental results show that the non-Gaussian statistical model-based methods are superior to the conventional Gaussian distribution-based method. They are meaningful tools for DNA methylation analysis. Moreover, among several non-Gaussian methods, the one that captures the bounded nature of DNA methylation data reveals the best clustering performance.

  18. Comparisons of Non-Gaussian Statistical Models in DNA Methylation Analysis

    PubMed Central

    Ma, Zhanyu; Teschendorff, Andrew E.; Yu, Hong; Taghia, Jalil; Guo, Jun

    2014-01-01

    As a key regulatory mechanism of gene expression, DNA methylation patterns are widely altered in many complex genetic diseases, including cancer. DNA methylation is naturally quantified by bounded support data; therefore, it is non-Gaussian distributed. In order to capture such properties, we introduce some non-Gaussian statistical models to perform dimension reduction on DNA methylation data. Afterwards, non-Gaussian statistical model-based unsupervised clustering strategies are applied to cluster the data. Comparisons and analysis of different dimension reduction strategies and unsupervised clustering methods are presented. Experimental results show that the non-Gaussian statistical model-based methods are superior to the conventional Gaussian distribution-based method. They are meaningful tools for DNA methylation analysis. Moreover, among several non-Gaussian methods, the one that captures the bounded nature of DNA methylation data reveals the best clustering performance. PMID:24937687

  19. Cluster analysis of sputum cytokine-high profiles reveals diversity in T(h)2-high asthma patients.

    PubMed

    Seys, Sven F; Scheers, Hans; Van den Brande, Paul; Marijsse, Gudrun; Dilissen, Ellen; Van Den Bergh, Annelies; Goeminne, Pieter C; Hellings, Peter W; Ceuppens, Jan L; Dupont, Lieven J; Bullens, Dominique M A

    2017-02-23

    Asthma is characterized by a heterogeneous inflammatory profile and can be subdivided into T(h)2-high and T(h)2-low airway inflammation. Profiling of a broader panel of airway cytokines in large unselected patient cohorts is lacking. Patients (n = 205) were defined as being "cytokine-low/high" if sputum mRNA expression of a particular cytokine was outside the respective 10th/90th percentile range of the control group (n = 80). Unsupervised hierarchical clustering was used to determine clusters based on sputum cytokine profiles. Half of patients (n = 108; 52.6%) had a classical T(h)2-high ("IL-4-, IL-5- and/or IL-13-high") sputum cytokine profile. Unsupervised cluster analysis revealed 5 clusters. Patients with an "IL-4- and/or IL-13-high" pattern surprisingly did not cluster but were equally distributed among the 5 clusters. Patients with an "IL-5-, IL-17A-/F- and IL-25-high" profile were restricted to cluster 1 (n = 24) with increased sputum eosinophil as well as neutrophil counts and poor lung function parameters at baseline and 2 years later. Four other clusters were identified: "IL-5-high or IL-10-high" (n = 16), "IL-6-high" (n = 8), "IL-22-high" (n = 25). Cluster 5 (n = 132) consists of patients without "cytokine-high" pattern or patients with only high IL-4 and/or IL-13. We identified 5 unique asthma molecular phenotypes by biological clustering. Type 2 cytokines cluster with non-type 2 cytokines in 4 out of 5 clusters. Unsupervised analysis thus does not support a priori type 2 versus non-type 2 molecular phenotypes. www.clinicaltrials.gov NCT01224938. Registered 18 October 2010.
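
    The "cytokine-low/high" rule in this abstract (a patient value falling outside the 10th/90th percentile range of the control group) can be sketched as follows. This is a minimal illustration, assuming the percentile interpolation of Python's `statistics.quantiles`, which may differ from the method the authors used; `classify_cytokine` and the control values are hypothetical.

```python
from statistics import quantiles

def classify_cytokine(value, control_values):
    """Label a patient's expression value relative to the control group's
    10th and 90th percentiles."""
    deciles = quantiles(control_values, n=10)  # nine cut points
    low_cut, high_cut = deciles[0], deciles[-1]
    if value > high_cut:
        return "high"
    if value < low_cut:
        return "low"
    return "normal"

# Hypothetical control-group expression values
controls = [float(v) for v in range(1, 101)]
for patient_value in (5.0, 50.0, 95.0):
    print(patient_value, classify_cytokine(patient_value, controls))
```

    Applying the rule once per cytokine yields a categorical profile per patient, which is the kind of input a hierarchical clustering step could then group.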

  20. Using preoperative unsupervised cluster analysis of chronic rhinosinusitis to inform patient decision and endoscopic sinus surgery outcome.

    PubMed

    Adnane, Choaib; Adouly, Taoufik; Khallouk, Amine; Rouadi, Sami; Abada, Redallah; Roubal, Mohamed; Mahtar, Mohamed

    2017-02-01

    The purpose of this study is to use unsupervised cluster methodology to identify phenotype and mucosal eosinophilia endotype subgroups of patients with medical refractory chronic rhinosinusitis (CRS), and evaluate the difference in quality of life (QOL) outcomes after endoscopic sinus surgery (ESS) between these clusters for better surgical case selection. A prospective cohort study included 131 patients with medical refractory CRS who elected ESS. The Sino-Nasal Outcome Test (SNOT-22) was used to evaluate QOL before and 12 months after surgery. Unsupervised two-step clustering method was performed. One hundred and thirteen subjects were retained in this study: 46 patients with CRS without nasal polyps and 67 patients with nasal polyps. Nasal polyps, gender, mucosal eosinophilia profile, and prior sinus surgery were the most discriminating factors in the generated clusters. Three clusters were identified. A significant clinical improvement was observed in all clusters 12 months after surgery with a reduction of SNOT-22 scores. There was a significant difference in QOL outcomes between clusters; cluster 1 had the worst QOL improvement after FESS in comparison with the other clusters 2 and 3. All patients in cluster 1 presented CRSwNP with the highest mucosal eosinophilia endotype. Clustering method is able to classify CRS phenotypes and endotypes with different associated surgical outcomes.

  1. Penalized unsupervised learning with outliers

    PubMed Central

    Witten, Daniela M.

    2013-01-01

    We consider the problem of performing unsupervised learning in the presence of outliers – that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-means clustering will often assign each outlier to its own cluster, or alternatively may yield distorted clusters in order to accommodate the outliers. In this paper, we take a new approach to extending existing unsupervised learning techniques to accommodate outliers. Our approach is an extension of a recent proposal for outlier detection in the regression setting. We allow each observation to take on an “error” term, and we penalize the errors using a group lasso penalty in order to encourage most of the observations’ errors to exactly equal zero. We show that this approach can be used in order to develop extensions of K-means clustering and principal components analysis that result in accurate outlier detection, as well as improved performance in the presence of outliers. These methods are illustrated in a simulation study and on two gene expression data sets, and connections with M-estimation are explored. PMID:23875057
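
    One way to read the approach described in this abstract is as alternating minimization of an objective of the form sum_i ||x_i - e_i - mu_c(i)||^2 + lambda * sum_i ||e_i||_2, where e_i is the per-observation error term. The sketch below follows that reading; it is an assumed illustration, not the paper's published algorithm, and `soft_threshold_row`, `robust_kmeans`, the penalty value, and the data are hypothetical.

```python
import math

def soft_threshold_row(residual, lam):
    """Group soft-thresholding: shrink a whole residual vector toward zero,
    setting it exactly to zero when its norm is at most lam / 2."""
    norm = math.sqrt(sum(x * x for x in residual))
    if norm <= lam / 2:
        return [0.0] * len(residual)
    scale = 1.0 - lam / (2 * norm)
    return [scale * x for x in residual]

def robust_kmeans(X, init_centers, lam, n_iter=50):
    centers = [list(c) for c in init_centers]
    E = [[0.0] * len(X[0]) for _ in X]   # one error vector per observation
    labels = [0] * len(X)
    for _ in range(n_iter):
        # K-means step on the "cleaned" observations x_i - e_i
        cleaned = [[x - e for x, e in zip(xi, ei)] for xi, ei in zip(X, E)]
        labels = [min(range(len(centers)), key=lambda k: math.dist(c, centers[k]))
                  for c in cleaned]
        for k in range(len(centers)):
            members = [c for c, lab in zip(cleaned, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
        # error step: penalized update from the residuals of the raw data
        E = [soft_threshold_row([x - m for x, m in zip(xi, centers[lab])], lam)
             for xi, lab in zip(X, labels)]
    outliers = [i for i, e in enumerate(E) if any(v != 0.0 for v in e)]
    return centers, labels, outliers

# Hypothetical data: two tight groups plus one distant observation
X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
     (5.0, 5.0), (5.2, 5.0), (5.0, 5.2),
     (20.0, 20.0)]
centers, labels, outliers = robust_kmeans(X, [(0.0, 0.0), (5.0, 5.0)], lam=4.0)
print(labels, outliers)
```

    Observations whose error vector stays non-zero are the ones flagged as outliers; a larger penalty flags fewer points, and with a very large penalty the procedure reduces to ordinary K-means.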

  2. Noise-robust unsupervised spike sorting based on discriminative subspace learning with outlier handling.

    PubMed

    Keshtkaran, Mohammad Reza; Yang, Zhi

    2017-06-01

    Spike sorting is a fundamental preprocessing step for many neuroscience studies which rely on the analysis of spike trains. Most of the feature extraction and dimensionality reduction techniques that have been used for spike sorting give a projection subspace which is not necessarily the most discriminative one. Therefore, the clusters which appear inherently separable in some discriminative subspace may overlap if projected using conventional feature extraction approaches, leading to poor sorting accuracy, especially when the noise level is high. In this paper, we propose a noise-robust and unsupervised spike sorting algorithm based on learning discriminative spike features for clustering. The proposed algorithm uses discriminative subspace learning to extract low dimensional and most discriminative features from the spike waveforms and perform clustering with automatic detection of the number of clusters. The core part of the algorithm involves iterative subspace selection using linear discriminant analysis and clustering using a Gaussian mixture model with outlier detection. A statistical test in the discriminative subspace is proposed to automatically detect the number of clusters. Comparative results on publicly available simulated and real in vivo datasets demonstrate that our algorithm achieves substantially improved cluster distinction leading to higher sorting accuracy and more reliable detection of clusters which are highly overlapping and not detectable using conventional feature extraction techniques such as principal component analysis or wavelets. By providing more accurate information about the activity of a larger number of individual neurons with high robustness to neural noise and outliers, the proposed unsupervised spike sorting algorithm facilitates more detailed and accurate analysis of single- and multi-unit activities in neuroscience and brain machine interface studies.

  3. Noise-robust unsupervised spike sorting based on discriminative subspace learning with outlier handling

    NASA Astrophysics Data System (ADS)

    Keshtkaran, Mohammad Reza; Yang, Zhi

    2017-06-01

    Objective. Spike sorting is a fundamental preprocessing step for many neuroscience studies which rely on the analysis of spike trains. Most of the feature extraction and dimensionality reduction techniques that have been used for spike sorting give a projection subspace which is not necessarily the most discriminative one. Therefore, the clusters which appear inherently separable in some discriminative subspace may overlap if projected using conventional feature extraction approaches, leading to poor sorting accuracy, especially when the noise level is high. In this paper, we propose a noise-robust and unsupervised spike sorting algorithm based on learning discriminative spike features for clustering. Approach. The proposed algorithm uses discriminative subspace learning to extract low dimensional and most discriminative features from the spike waveforms and perform clustering with automatic detection of the number of clusters. The core part of the algorithm involves iterative subspace selection using linear discriminant analysis and clustering using a Gaussian mixture model with outlier detection. A statistical test in the discriminative subspace is proposed to automatically detect the number of clusters. Main results. Comparative results on publicly available simulated and real in vivo datasets demonstrate that our algorithm achieves substantially improved cluster distinction leading to higher sorting accuracy and more reliable detection of clusters which are highly overlapping and not detectable using conventional feature extraction techniques such as principal component analysis or wavelets. Significance. By providing more accurate information about the activity of a larger number of individual neurons with high robustness to neural noise and outliers, the proposed unsupervised spike sorting algorithm facilitates more detailed and accurate analysis of single- and multi-unit activities in neuroscience and brain machine interface studies.

  4. Semi-supervised clustering for parcellating brain regions based on resting state fMRI data

    NASA Astrophysics Data System (ADS)

    Cheng, Hewei; Fan, Yong

    2014-03-01

    Many unsupervised clustering techniques have been adopted for parcellating brain regions of interest into functionally homogeneous subregions based on resting state fMRI data. However, the unsupervised clustering techniques are not able to take advantage of existing knowledge of the functional neuroanatomy readily available from studies of cytoarchitectonic parcellation or meta-analysis of the literature. In this study, we propose a semi-supervised clustering method for parcellating the amygdala into functionally homogeneous subregions based on resting state fMRI data. In particular, the semi-supervised clustering is implemented under the framework of graph partitioning, and adopts prior information and spatial consistency constraints to obtain a spatially contiguous parcellation result. The graph partitioning problem is solved using an efficient algorithm similar to the well-known weighted kernel k-means algorithm. Our method has been validated for parcellating the amygdala into 3 subregions based on resting state fMRI data of 28 subjects. The experimental results have demonstrated that the proposed method is more robust than unsupervised clustering and able to parcellate the amygdala into centromedial, laterobasal, and superficial parts with improved functional homogeneity compared with the cytoarchitectonic parcellation result. The validity of the parcellation results is also supported by distinctive functional and structural connectivity patterns of the subregions and high consistency between coactivation patterns derived from a meta-analysis and functional connectivity patterns of corresponding subregions.

  5. Unsupervised feature relevance analysis applied to improve ECG heartbeat clustering.

    PubMed

    Rodríguez-Sotelo, J L; Peluffo-Ordoñez, D; Cuesta-Frau, D; Castellanos-Domínguez, G

    2012-10-01

    The computer-assisted analysis of biomedical records has become an essential tool in clinical settings. However, current devices provide a growing amount of data that often exceeds the processing capacity of normal computers. As this amount of information rises, new demands for more efficient data extraction methods appear. This paper addresses the task of data mining in physiological records using a feature selection scheme. An unsupervised method based on relevance analysis is described. This scheme uses a least-squares optimization of the input feature matrix in a single iteration. The output of the algorithm is a feature weighting vector. The performance of the method was assessed using a heartbeat clustering test on real ECG records. The quantitative cluster validity measures yielded a correctly classified heartbeat rate of 98.69% (specificity), 85.88% (sensitivity) and 95.04% (general clustering performance), which is even higher than the performance achieved by other similar ECG clustering studies. The number of features was reduced on average from 100 to 18, and the temporal cost was 43% lower than in previous ECG clustering schemes. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.

  6. Semi-Supervised Clustering for High-Dimensional and Sparse Features

    ERIC Educational Resources Information Center

    Yan, Su

    2010-01-01

    Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised where class labels are unknown a priori. In real application domains, however, some "weak" form of side…

  7. Performance analysis of unsupervised optimal fuzzy clustering algorithm for MRI brain tumor segmentation.

    PubMed

    Blessy, S A Praylin Selva; Sulochana, C Helen

    2015-01-01

    Segmentation of brain tumor from Magnetic Resonance Imaging (MRI) becomes very complicated due to the structural complexities of human brain and the presence of intensity inhomogeneities. To propose a method that effectively segments brain tumor from MR images and to evaluate the performance of unsupervised optimal fuzzy clustering (UOFC) algorithm for segmentation of brain tumor from MR images. Segmentation is done by preprocessing the MR image to standardize intensity inhomogeneities followed by feature extraction, feature fusion and clustering. Different validation measures are used to evaluate the performance of the proposed method using different clustering algorithms. The proposed method using UOFC algorithm produces high sensitivity (96%) and low specificity (4%) compared to other clustering methods. Validation results clearly show that the proposed method with UOFC algorithm effectively segments brain tumor from MR images.

  8. A novel unsupervised spike sorting algorithm for intracranial EEG.

    PubMed

    Yadav, R; Shah, A K; Loeb, J A; Swamy, M N S; Agarwal, R

    2011-01-01

    This paper presents a novel, unsupervised spike classification algorithm for intracranial EEG. The method combines template matching and principal component analysis (PCA) for building a dynamic patient-specific codebook without a priori knowledge of the spike waveforms. The problem of misclassification due to overlapping classes is resolved by identifying similar classes in the codebook using hierarchical clustering. Cluster quality is visually assessed by projecting inter- and intra-clusters onto a 3D plot. Intracranial EEG from 5 patients was utilized to optimize the algorithm. The resulting codebook retains 82.1% of the detected spikes in non-overlapping and disjoint clusters. Initial results suggest a definite role of this method for both rapid review and quantitation of interictal spikes that could enhance both clinical treatment and research studies on epileptic patients.

  9. Class imbalance in unsupervised change detection - A diagnostic analysis from urban remote sensing

    NASA Astrophysics Data System (ADS)

    Leichtle, Tobias; Geiß, Christian; Lakes, Tobia; Taubenböck, Hannes

    2017-08-01

    Automatic monitoring of changes on the Earth's surface is an intrinsic capability and simultaneously a persistent methodological challenge in remote sensing, especially regarding imagery with very-high spatial resolution (VHR) and complex urban environments. In order to enable a high level of automatization, the change detection problem is solved in an unsupervised way to alleviate efforts associated with collection of properly encoded prior knowledge. In this context, this paper systematically investigates the nature and effects of class distribution and class imbalance in an unsupervised binary change detection application based on VHR imagery over urban areas. For this purpose, a diagnostic framework for sensitivity analysis of a large range of possible degrees of class imbalance is presented, which is of particular importance with respect to unsupervised approaches where the content of images and thus the occurrence and the distribution of classes are generally unknown a priori. Furthermore, this framework can serve as a general technique to evaluate model transferability in any two-class classification problem. The applied change detection approach is based on object-based difference features calculated from VHR imagery and subsequent unsupervised two-class clustering using k-means, genetic k-means and self-organizing map (SOM) clustering. The results from two test sites with different structural characteristics of the built environment demonstrated that classification performance is generally worse in imbalanced class distribution settings while best results were reached in balanced or close to balanced situations. Regarding suitable accuracy measures for evaluating model performance in imbalanced settings, this study revealed that the Kappa statistics show significant response to class distribution while the true skill statistic was widely insensitive to imbalanced classes. In general, the genetic k-means clustering algorithm achieved the most robust results with respect to class imbalance while the SOM clustering exhibited a distinct optimization towards a balanced distribution of classes.
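
    The contrast between the Kappa statistic and the true skill statistic (TSS) described above can be illustrated with a small worked example. The sketch below is not taken from the paper: the helper names, the class sizes and the fixed 90% sensitivity and specificity are illustrative assumptions. It holds the per-class error rates constant and changes only the class balance.

```python
def kappa_and_tss(tp, fn, fp, tn):
    """Return (Cohen's kappa, true skill statistic) for a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement from the row and column marginals.
    expected = ((tp + fn) * (tp + fp) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    # TSS = sensitivity + specificity - 1; it uses only per-class rates.
    tss = tp / (tp + fn) + tn / (tn + fp) - 1
    return kappa, tss


def confusion(n_changed, n_unchanged, sensitivity=0.9, specificity=0.9):
    """Hypothetical confusion matrix (tp, fn, fp, tn) with fixed per-class rates."""
    tp = sensitivity * n_changed
    tn = specificity * n_unchanged
    return tp, n_changed - tp, n_unchanged - tn, tn


balanced = kappa_and_tss(*confusion(500, 500))    # 50% changed objects
imbalanced = kappa_and_tss(*confusion(50, 950))   # 5% changed objects

print("balanced   kappa=%.3f  TSS=%.3f" % balanced)
print("imbalanced kappa=%.3f  TSS=%.3f" % imbalanced)
```

    Under these assumptions TSS stays the same in both settings, because it depends only on sensitivity and specificity, whereas Kappa drops once the changed class becomes rare.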

  10. An Efficient Optimization Method for Solving Unsupervised Data Classification Problems.

    PubMed

    Shabanzadeh, Parvaneh; Yusof, Rubiyah

    2015-01-01

    Unsupervised data classification (or clustering) analysis is one of the most useful tools and a descriptive task in data mining that seeks to classify homogeneous groups of objects based on similarity and is used in many medical disciplines and various applications. In general, there is no single algorithm that is suitable for all types of data, conditions, and applications. Each algorithm has its own advantages, limitations, and deficiencies. Hence, research for novel and effective approaches for unsupervised data classification is still active. In this paper a heuristic algorithm, Biogeography-Based Optimization (BBO) algorithm, was adapted for data clustering problems by modifying the main operators of BBO algorithm, which is inspired from the natural biogeography distribution of different species. Similar to other population-based algorithms, BBO algorithm starts with an initial population of candidate solutions to an optimization problem and an objective function that is calculated for them. To evaluate the performance of the proposed algorithm assessment was carried on six medical and real life datasets and was compared with eight well known and recent unsupervised data classification algorithms. Numerical results demonstrate that the proposed evolutionary optimization algorithm is efficient for unsupervised data classification.

  11. Using Machine Learning Techniques in the Analysis of Oceanographic Data

    NASA Astrophysics Data System (ADS)

    Falcinelli, K. E.; Abuomar, S.

    2017-12-01

    Acoustic Doppler Current Profilers (ADCPs) are oceanographic tools capable of collecting large amounts of current profile data. Using unsupervised machine learning techniques such as principal component analysis, fuzzy c-means clustering, and self-organizing maps, patterns and trends in an ADCP dataset are found. Cluster validity algorithms such as visual assessment of cluster tendency and clustering index are used to determine the optimal number of clusters in the ADCP dataset. These techniques prove to be useful in analysis of ADCP data and demonstrate potential for future use in other oceanographic applications.
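
    For readers unfamiliar with fuzzy c-means, one of the techniques named above, the following is a minimal pure-Python sketch of the standard algorithm. It is not the authors' implementation; the function name, the fuzzifier m = 2, the fixed iteration count and the toy 2-D points are all assumptions made for illustration.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means; returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Centers are means weighted by membership raised to the power m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )
        # Memberships are proportional to squared distance ** (-1 / (m - 1)).
        for i, p in enumerate(points):
            d2 = [max(sum((p[d] - ctr[d]) ** 2 for d in range(dim)), 1e-12)
                  for ctr in centers]
            inv = [x ** (-1.0 / (m - 1)) for x in d2]
            total = sum(inv)
            u[i] = [v / total for v in inv]
    return centers, u


# Toy data (illustrative only): two well-separated groups of 2-D points.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, memberships = fuzzy_c_means(data, c=2)
hard_labels = [row.index(max(row)) for row in memberships]
print(hard_labels)
```

    Unlike k-means, every point keeps a graded membership in every cluster; the hard labels above are only a convenience obtained by taking the largest membership.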

  12. Visualization and unsupervised predictive clustering of high-dimensional multimodal neuroimaging data.

    PubMed

    Mwangi, Benson; Soares, Jair C; Hasan, Khader M

    2014-10-30

    Neuroimaging machine learning studies have largely utilized supervised algorithms - meaning they require both neuroimaging scan data and corresponding target variables (e.g. healthy vs. diseased) to be successfully 'trained' for a prediction task. Noticeably, this approach may not be optimal or possible when the global structure of the data is not well known and the researcher does not have an a priori model to fit the data. We set out to investigate the utility of an unsupervised machine learning technique; t-distributed stochastic neighbour embedding (t-SNE) in identifying 'unseen' sample population patterns that may exist in high-dimensional neuroimaging data. Multimodal neuroimaging scans from 92 healthy subjects were pre-processed using atlas-based methods, integrated and input into the t-SNE algorithm. Patterns and clusters discovered by the algorithm were visualized using a 2D scatter plot and further analyzed using the K-means clustering algorithm. t-SNE was evaluated against classical principal component analysis. Remarkably, based on unlabelled multimodal scan data, t-SNE separated study subjects into two very distinct clusters which corresponded to subjects' gender labels (cluster silhouette index value=0.79). The resulting clusters were used to develop an unsupervised minimum distance clustering model which identified 93.5% of subjects' gender. Notably, from a neuropsychiatric perspective this method may allow discovery of data-driven disease phenotypes or sub-types of treatment responders. Copyright © 2014 Elsevier B.V. All rights reserved.
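
    The cluster silhouette index quoted above summarises how compact and well separated a partition is. A minimal sketch of the usual definition follows; the function name and the four toy points are assumptions for illustration and have nothing to do with the study's data.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette(points, labels):
    """Mean silhouette over all points: (b - a) / max(a, b) per point."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            # Singleton cluster or only one cluster: score defined as 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data (illustrative only): two tight pairs far apart.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(pts, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette(pts, [0, 1, 0, 1])   # labels cut across the geometry
print(round(good, 3), round(bad, 3))
```

    Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points sit closer to another cluster than to their own.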

  13. Massively parallel unsupervised single-particle cryo-EM data clustering via statistical manifold learning

    PubMed Central

    Wu, Jiayi; Ma, Yong-Bei; Congdon, Charles; Brett, Bevin; Chen, Shuobing; Xu, Yaofang; Ouyang, Qi

    2017-01-01

    Structural heterogeneity in single-particle cryo-electron microscopy (cryo-EM) data represents a major challenge for high-resolution structure determination. Unsupervised classification may serve as the first step in the assessment of structural heterogeneity. However, traditional algorithms for unsupervised classification, such as K-means clustering and maximum likelihood optimization, may classify images into wrong classes with decreasing signal-to-noise-ratio (SNR) in the image data, yet demand increased computational costs. Overcoming these limitations requires further development of clustering algorithms for high-performance cryo-EM data processing. Here we introduce an unsupervised single-particle clustering algorithm derived from a statistical manifold learning framework called generative topographic mapping (GTM). We show that unsupervised GTM clustering improves classification accuracy by about 40% in the absence of input references for data with lower SNRs. Applications to several experimental datasets suggest that our algorithm can detect subtle structural differences among classes via a hierarchical clustering strategy. After code optimization over a high-performance computing (HPC) environment, our software implementation was able to generate thousands of reference-free class averages within hours in a massively parallel fashion, which allows a significant improvement on ab initio 3D reconstruction and assists in the computational purification of homogeneous datasets for high-resolution visualization. PMID:28786986

  14. Massively parallel unsupervised single-particle cryo-EM data clustering via statistical manifold learning.

    PubMed

    Wu, Jiayi; Ma, Yong-Bei; Congdon, Charles; Brett, Bevin; Chen, Shuobing; Xu, Yaofang; Ouyang, Qi; Mao, Youdong

    2017-01-01

    Structural heterogeneity in single-particle cryo-electron microscopy (cryo-EM) data represents a major challenge for high-resolution structure determination. Unsupervised classification may serve as the first step in the assessment of structural heterogeneity. However, traditional algorithms for unsupervised classification, such as K-means clustering and maximum likelihood optimization, may classify images into wrong classes with decreasing signal-to-noise-ratio (SNR) in the image data, yet demand increased computational costs. Overcoming these limitations requires further development of clustering algorithms for high-performance cryo-EM data processing. Here we introduce an unsupervised single-particle clustering algorithm derived from a statistical manifold learning framework called generative topographic mapping (GTM). We show that unsupervised GTM clustering improves classification accuracy by about 40% in the absence of input references for data with lower SNRs. Applications to several experimental datasets suggest that our algorithm can detect subtle structural differences among classes via a hierarchical clustering strategy. After code optimization over a high-performance computing (HPC) environment, our software implementation was able to generate thousands of reference-free class averages within hours in a massively parallel fashion, which allows a significant improvement on ab initio 3D reconstruction and assists in the computational purification of homogeneous datasets for high-resolution visualization.

  15. Unsupervised spike sorting based on discriminative subspace learning.

    PubMed

    Keshtkaran, Mohammad Reza; Yang, Zhi

    2014-01-01

    Spike sorting is a fundamental preprocessing step for many neuroscience studies which rely on the analysis of spike trains. In this paper, we present two unsupervised spike sorting algorithms based on discriminative subspace learning. The first algorithm simultaneously learns the discriminative feature subspace and performs clustering. It uses histogram of features in the most discriminative projection to detect the number of neurons. The second algorithm performs hierarchical divisive clustering that learns a discriminative 1-dimensional subspace for clustering in each level of the hierarchy until achieving almost unimodal distribution in the subspace. The algorithms are tested on synthetic and in-vivo data, and are compared against two widely used spike sorting methods. The comparative results demonstrate that our spike sorting methods can achieve substantially higher accuracy in lower dimensional feature space, and they are highly robust to noise. Moreover, they provide significantly better cluster separability in the learned subspace than in the subspace obtained by principal component analysis or wavelet transform.

  16. An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data.

    PubMed

    Hsu, Arthur L; Tang, Sen-Lin; Halgamuge, Saman K

    2003-11-01

    Current Self-Organizing Maps (SOMs) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches. The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, therefore labelled as uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in data (cancerous and normal). JAVA software of dynamic SOM tree algorithm is available upon request for academic use. A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf

  17. Unsupervised learning on scientific ocean drilling datasets from the South China Sea

    NASA Astrophysics Data System (ADS)

    Tse, Kevin C.; Chiu, Hon-Chim; Tsang, Man-Yin; Li, Yiliang; Lam, Edmund Y.

    2018-06-01

    Unsupervised learning methods were applied to explore data patterns in multivariate geophysical datasets collected from ocean floor sediment core samples coming from scientific ocean drilling in the South China Sea. Compared to studies on similar datasets, but using supervised learning methods which are designed to make predictions based on sample training data, unsupervised learning methods require no a priori information and focus only on the input data. In this study, popular unsupervised learning methods including K-means, self-organizing maps, hierarchical clustering and random forest were coupled with different distance metrics to form exploratory data clusters. The resulting data clusters were externally validated with lithologic units and geologic time scales assigned to the datasets by conventional methods. Compact and connected data clusters displayed varying degrees of correspondence with existing classification by lithologic units and geologic time scales. K-means and self-organizing maps were observed to perform better with lithologic units while random forest corresponded best with geologic time scales. This study sets a pioneering example of how unsupervised machine learning methods can be used as an automatic processing tool for the increasingly high volume of scientific ocean drilling data.

  18. An improved clustering algorithm based on reverse learning in intelligent transportation

    NASA Astrophysics Data System (ADS)

    Qiu, Guoqing; Kou, Qianqian; Niu, Ting

    2017-05-01

    With the development of artificial intelligence and data mining technology, big data has gradually entered people's field of vision. In the process of dealing with large data, clustering is an important processing method. The reverse learning method is introduced into the clustering process of the PAM clustering algorithm to address the limitations of one-time clustering in unsupervised clustering learning and to increase the diversity of the clusters, thereby improving the quality of clustering. The algorithm analysis and experimental results show that the algorithm is feasible.
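
    As background for the record above, the following is a minimal sketch of plain PAM (partitioning around medoids) using a first-improvement swap search. It does not implement the reverse learning step proposed by the authors, whose details are not given in the abstract; the function names, the naive initialisation and the toy 1-D data are assumptions.

```python
from itertools import product


def total_cost(points, medoids, dist):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist):
    """Plain PAM swap search; returns (medoid indices, labels, cost)."""
    medoids = list(range(k))  # naive initialisation: first k points (assumption)
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid slot with each non-medoid point.
        for slot, cand in product(range(k), range(len(points))):
            if cand in medoids:
                continue
            trial = medoids[:slot] + [cand] + medoids[slot + 1:]
            cost = total_cost(points, trial, dist)
            if cost < best - 1e-12:  # accept only strict improvements
                medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels, best


# Toy data (illustrative only): two groups on a line.
values = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
medoids, labels, cost = pam(values, k=2, dist=lambda a, b: abs(a - b))
print(medoids, labels, round(cost, 3))
```

    Because medoids are actual data points, PAM needs only a pairwise distance function, which makes it usable with non-Euclidean dissimilarities.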

  19. Unsupervised pattern recognition methods in ciders profiling based on GCE voltammetric signals.

    PubMed

    Jakubowska, Małgorzata; Sordoń, Wanda; Ciepiela, Filip

    2016-07-15

    This work presents a complete methodology of distinguishing between different brands of cider and ageing degrees, based on voltammetric signals, utilizing dedicated data preprocessing procedures and unsupervised multivariate analysis. It was demonstrated that voltammograms recorded on glassy carbon electrode in Britton-Robinson buffer at pH 2 are reproducible for each brand. By application of clustering algorithms and principal component analysis visible homogenous clusters were obtained. Advanced signal processing strategy which included automatic baseline correction, interval scaling and continuous wavelet transform with dedicated mother wavelet, was a key step in the correct recognition of the objects. The results show that voltammetry combined with optimized univariate and multivariate data processing is a sufficient tool to distinguish between ciders from various brands and to evaluate their freshness. Copyright © 2016 Elsevier Ltd. All rights reserved.

  20. Advanced Treatment Monitoring for Olympic-Level Athletes Using Unsupervised Modeling Techniques

    PubMed Central

    Siedlik, Jacob A.; Bergeron, Charles; Cooper, Michael; Emmons, Russell; Moreau, William; Nabhan, Dustin; Gallagher, Philip; Vardiman, John P.

    2016-01-01

    Context: Analysis of injury and illness data collected at large international competitions provides the US Olympic Committee and the national governing bodies for each sport with information to best prepare for future competitions. Research in which authors have evaluated medical contacts to provide the expected level of medical care and sports medicine services at international competitions is limited. Objective: To analyze the medical-contact data for athletes, staff, and coaches who participated in the 2011 Pan American Games in Guadalajara, Mexico, using unsupervised modeling techniques to identify underlying treatment patterns. Design: Descriptive epidemiology study. Setting: Pan American Games. Patients or Other Participants: A total of 618 US athletes (337 males, 281 females) participated in the 2011 Pan American Games. Main Outcome Measure(s): Medical data were recorded from the injury-evaluation and injury-treatment forms used by clinicians assigned to the central US Olympic Committee Sport Medicine Clinic and satellite locations during the operational 17-day period of the 2011 Pan American Games. We used principal components analysis and agglomerative clustering algorithms to identify and define grouped modalities. Lift statistics were calculated for within-cluster subgroups. Results: Principal component analyses identified 3 components, accounting for 72.3% of the variability in datasets. Plots of the principal components showed that individual contacts focused on 4 treatment clusters: massage, paired manipulation and mobilization, soft tissue therapy, and general medical. Conclusions: Unsupervised modeling techniques were useful for visualizing complex treatment data and provided insights for improved treatment modeling in athletes. Given its ability to detect clinically relevant treatment pairings in large datasets, unsupervised modeling should be considered a feasible option for future analyses of medical-contact data from international competitions. PMID:26794628
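
    The lift statistic mentioned above measures how much more often two items occur together than would be expected if they were independent. The sketch below shows the usual definition on invented records; the function name, the modality labels and the eight toy contacts are hypothetical and are not data from the study.

```python
def lift(records, a, b):
    """Lift of a and b: P(a and b) / (P(a) * P(b)) over a list of sets."""
    n = len(records)
    p_a = sum(a in r for r in records) / n
    p_b = sum(b in r for r in records) / n
    p_ab = sum(a in r and b in r for r in records) / n
    return p_ab / (p_a * p_b)


# Hypothetical medical contacts, each a set of treatment modalities.
contacts = [
    {"manipulation", "mobilization"},
    {"manipulation", "mobilization"},
    {"massage"},
    {"massage"},
    {"manipulation", "mobilization", "massage"},
    {"soft tissue"},
    {"massage", "soft tissue"},
    {"manipulation"},
]

paired = lift(contacts, "manipulation", "mobilization")
unpaired = lift(contacts, "massage", "manipulation")
print(paired, unpaired)
```

    A lift above 1 indicates that two modalities are used together more often than chance would predict, and a lift below 1 indicates that they tend not to be combined.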

  1. GibbsCluster: unsupervised clustering and alignment of peptide sequences.

    PubMed

    Andreatta, Massimo; Alvarez, Bruno; Nielsen, Morten

    2017-07-03

    Receptor interactions with short linear peptide fragments (ligands) are at the base of many biological signaling processes. Conserved and information-rich amino acid patterns, commonly called sequence motifs, shape and regulate these interactions. Because of the properties of a receptor-ligand system or of the assay used to interrogate it, experimental data often contain multiple sequence motifs. GibbsCluster is a powerful tool for unsupervised motif discovery because it can simultaneously cluster and align peptide data. The GibbsCluster 2.0 presented here is an improved version incorporating insertion and deletions accounting for variations in motif length in the peptide input. In basic terms, the program takes as input a set of peptide sequences and clusters them into meaningful groups. It returns the optimal number of clusters it identified, together with the sequence alignment and sequence motif characterizing each cluster. Several parameters are available to customize cluster analysis, including adjustable penalties for small clusters and overlapping groups and a trash cluster to remove outliers. As an example application, we used the server to deconvolute multiple specificities in large-scale peptidome data generated by mass spectrometry. The server is available at http://www.cbs.dtu.dk/services/GibbsCluster-2.0. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  2. A comparative analysis of pixel- and object-based detection of landslides from very high-resolution images

    NASA Astrophysics Data System (ADS)

    Keyport, Ren N.; Oommen, Thomas; Martha, Tapas R.; Sajinkumar, K. S.; Gierke, John S.

    2018-02-01

    A comparative analysis of landslides detected by pixel-based and object-oriented analysis (OOA) methods was performed using very high-resolution (VHR) remotely sensed aerial images for San Juan La Laguna, Guatemala, which witnessed widespread devastation during the 2005 Hurricane Stan. A 3-band orthophoto of 0.5 m spatial resolution together with a 115 field-based landslide inventory were used for the analysis. A binary reference was assigned with a zero value for landslide and unity for non-landslide pixels. The pixel-based analysis was performed using unsupervised classification, which resulted in 11 different trial classes. Detection of landslides using OOA includes 2-step K-means clustering to eliminate regions based on brightness, followed by elimination of false positives using object properties such as rectangular fit, compactness, length/width ratio, mean difference of objects, and slope angle. Both overall accuracy and F-score for OOA methods outperformed pixel-based unsupervised classification methods in both landslide and non-landslide classes. The overall accuracy for OOA and pixel-based unsupervised classification was 96.5% and 94.3%, respectively, whereas the best F-scores for landslide identification for OOA and pixel-based unsupervised methods were 84.3% and 77.9%, respectively. Results indicate that the OOA is able to identify the majority of landslides with a few false positives when compared to pixel-based unsupervised classification.

  3. Personalized Medicine in Veterans with Traumatic Brain Injuries

    DTIC Science & Technology

    2013-05-01

    Pair-Group Method using Arithmetic averages (UPGMA) based on cosine correlation of row mean centered log2 signal values; this was the top 50%-tile...clustering was performed by the UPGMA method using Cosine correlation as the similarity metric. For comparative purposes, clustered heat maps included...non-mTBI cases were subjected to unsupervised hierarchical clustering analysis using the UPGMA algorithm with cosine correlation as the similarity

  4. Personalized Medicine in Veterans with Traumatic Brain Injuries

    DTIC Science & Technology

    2014-07-01

    9 control cases are subjected to unsupervised hierarchical clustering analysis using the UPGMA algorithm with cosine correlation as the similarity...in unsupervised hierarchical clustering by the Unweighted Pair-Group Method using Arithmetic averages (UPGMA) based on cosine correlation of row...of log2 transformed MAS5.0 signal values; probe set clustering was performed by the UPGMA method using Cosine correlation as the similarity

  5. Clustering and visualizing similarity networks of membrane proteins.

    PubMed

    Hu, Geng-Ming; Mai, Te-Lun; Chen, Chi-Ming

    2015-08-01

    We proposed a fast and unsupervised clustering method, minimum span clustering (MSC), for analyzing the sequence-structure-function relationship of biological networks, and demonstrated its validity in clustering the sequence/structure similarity networks (SSN) of 682 membrane protein (MP) chains. The MSC clustering of MPs based on their sequence information was found to be consistent with their tertiary structures and functions. For the largest seven clusters predicted by MSC, the consistency in chain function within the same cluster is found to be 100%. From analyzing the edge distribution of SSN for MPs, we found a characteristic threshold distance for the boundary between clusters, over which SSN of MPs could be properly clustered by an unsupervised sparsification of the network distance matrix. The clustering results of MPs from both MSC and the unsupervised sparsification methods are consistent with each other, and have high intracluster similarity and low intercluster similarity in sequence, structure, and function. Our study showed a strong sequence-structure-function relationship of MPs. We discussed evidence of convergent evolution of MPs and suggested applications in finding structural similarities and predicting biological functions of MP chains based on their sequence information. © 2015 Wiley Periodicals, Inc.

  6. Identification of chronic rhinosinusitis phenotypes using cluster analysis.

    PubMed

    Soler, Zachary M; Hyer, J Madison; Ramakrishnan, Viswanathan; Smith, Timothy L; Mace, Jess; Rudmik, Luke; Schlosser, Rodney J

    2015-05-01

    Current clinical classifications of chronic rhinosinusitis (CRS) have been largely defined based upon preconceived notions of factors thought to be important, such as polyp or eosinophil status. Unfortunately, these classification systems have little correlation with symptom severity or treatment outcomes. Unsupervised clustering can be used to identify phenotypic subgroups of CRS patients, describe clinical differences in these clusters and define simple algorithms for classification. A multi-institutional, prospective study of 382 patients with CRS who had failed initial medical therapy completed the Sino-Nasal Outcome Test (SNOT-22), Rhinosinusitis Disability Index (RSDI), Medical Outcomes Study Short Form-12 (SF-12), Pittsburgh Sleep Quality Index (PSQI), and Patient Health Questionnaire (PHQ-2). Objective measures of CRS severity included Brief Smell Identification Test (B-SIT), CT, and endoscopy scoring. All variables were reduced and unsupervised hierarchical clustering was performed. After clusters were defined, variations in medication usage were analyzed. Discriminant analysis was performed to develop a simplified, clinically useful algorithm for clustering. Clustering was largely determined by age, severity of patient reported outcome measures, depression, and fibromyalgia. CT and endoscopy varied somewhat among clusters. Traditional clinical measures, including polyp/atopic status, prior surgery, B-SIT and asthma, did not vary among clusters. A simplified algorithm based upon productivity loss, SNOT-22 score, and age predicted clustering with 89% accuracy. Medication usage among clusters did vary significantly. A simplified algorithm based upon hierarchical clustering is able to classify CRS patients and predict medication usage. Further studies are warranted to determine if such clustering predicts treatment outcomes. © 2015 ARS-AAOA, LLC.

  7. Unsupervised spatiotemporal analysis of fMRI data using graph-based visualizations of self-organizing maps.

    PubMed

    Katwal, Santosh B; Gore, John C; Marois, Rene; Rogers, Baxter P

    2013-09-01

    We present novel graph-based visualizations of self-organizing maps for unsupervised functional magnetic resonance imaging (fMRI) analysis. A self-organizing map is an artificial neural network model that transforms high-dimensional data into a low-dimensional (often a 2-D) map using unsupervised learning. However, a postprocessing scheme is necessary to correctly interpret similarity between neighboring node prototypes (feature vectors) on the output map and delineate clusters and features of interest in the data. In this paper, we used graph-based visualizations to capture fMRI data features based upon 1) the distribution of data across the receptive fields of the prototypes (density-based connectivity); and 2) temporal similarities (correlations) between the prototypes (correlation-based connectivity). We applied this approach to identify task-related brain areas in an fMRI reaction time experiment involving a visuo-manual response task, and we correlated the time-to-peak of the fMRI responses in these areas with reaction time. Visualization of self-organizing maps outperformed independent component analysis and voxelwise univariate linear regression analysis in identifying and classifying relevant brain regions. We conclude that the graph-based visualizations of self-organizing maps help in advanced visualization of cluster boundaries in fMRI data enabling the separation of regions with small differences in the timings of their brain responses.

  8. A taxonomy of epithelial human cancer and their metastases

    PubMed Central

    2009-01-01

    Background Microarray technology has made it possible to molecularly characterize many different cancer sites. This technology has the potential to individualize therapy and to discover new drug targets. However, due to technological differences and issues in standardized sample collection, no study has evaluated the molecular profile of epithelial human cancer in a large number of samples and tissues. Additionally, it has not yet been extensively investigated whether metastases resemble their tissue of origin or tissue of destination. Methods We studied the expression profiles of a series of 1566 primary tumors and 178 metastases by unsupervised hierarchical clustering. The clustering profile was subsequently investigated and correlated with clinico-pathological data. Statistical enrichment of clinico-pathological annotations of groups of samples was investigated using Fisher exact test. Gene set enrichment analysis (GSEA) and DAVID functional enrichment analysis were used to investigate the molecular pathways. Kaplan-Meier survival analysis and log-rank tests were used to investigate prognostic significance of gene signatures. Results Large clusters corresponding to breast, gastrointestinal, ovarian and kidney primary tissues emerged from the data. Chromophobe renal cell carcinoma clustered together with follicular differentiated thyroid carcinoma, which supports recent morphological descriptions of thyroid follicular carcinoma-like tumors in the kidney and suggests that they represent a subtype of chromophobe carcinoma. We also found an expression signature identifying primary tumors of squamous cell histology in multiple tissues. Next, a subset of ovarian tumors enriched with endometrioid histology clustered together with endometrium tumors, confirming that they share their etiopathogenesis, which strongly differs from serous ovarian tumors. In addition, the clustering of colon and breast tumors correlated with clinico-pathological characteristics. Moreover, a signature was developed based on our unsupervised clustering of breast tumors and this was predictive for disease-specific survival in three independent studies. Next, the metastases from ovarian, breast, lung and vulva cluster with their tissue of origin while metastases from colon showed a bimodal distribution. A significant part clusters with tissue of origin while the remaining tumors cluster with the tissue of destination. Conclusion Our molecular taxonomy of epithelial human cancer indicates surprising correlations over tissues. This may have a significant impact on the classification of many cancer sites and may guide pathologists, both in research and daily practice. Moreover, these results based on unsupervised analysis yielded a signature predictive of clinical outcome in breast cancer. Additionally, we hypothesize that metastases from gastrointestinal origin either remember their tissue of origin or adapt to the tissue of destination. More specifically, colon metastases in the liver show strong evidence for such a bimodal tissue specific profile. PMID:20017941
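    The enrichment step mentioned in the abstract above (Fisher exact test on clinico-pathological annotations of sample groups) can be illustrated with a small sketch. It is not the authors' code: the 2x2 table layout, the function name `fisher_exact_greater`, and the choice of a one-sided test are assumptions made for illustration only.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for a 2x2 table (illustrative sketch).

    Assumed (hypothetical) table layout:
                        annotated   not annotated
        in cluster          a            b
        outside cluster     c            d

    Returns P(X >= a) under the hypergeometric null, i.e. the chance of
    seeing at least `a` annotated samples in the cluster by luck alone.
    """
    n_cluster = a + b          # samples in the cluster
    n_annot = a + c            # samples carrying the annotation
    total = a + b + c + d      # all samples
    denom = comb(total, n_cluster)
    upper = min(n_cluster, n_annot)
    numer = 0
    for k in range(a, upper + 1):
        # ways to pick k annotated and (n_cluster - k) non-annotated samples
        numer += comb(n_annot, k) * comb(total - n_annot, n_cluster - k)
    return numer / denom


if __name__ == "__main__":
    # Toy table: all 3 annotated samples fall inside a cluster of size 3.
    print(fisher_exact_greater(3, 0, 0, 3))
```

    A small p-value suggests that the annotation is over-represented in the cluster; with many annotations tested, a multiple-testing correction would normally follow.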

  9. Copy number gain at 8q12.1-q22.1 is associated with a malignant tumor phenotype in salivary gland myoepitheliomas.

    PubMed

    Vékony, Hedy; Röser, Kerstin; Löning, Thomas; Ylstra, Bauke; Meijer, Gerrit A; van Wieringen, Wessel N; van de Wiel, Mark A; Carvalho, Beatriz; Kok, Klaas; Leemans, C René; van der Waal, Isaäc; Bloemena, Elisabeth

    2009-02-01

    Salivary gland myoepithelial tumors are relatively uncommon tumors with an unpredictable clinical course. More knowledge about their genetic profiles is necessary to identify novel predictors of disease. In this study, we subjected 27 primary tumors (15 myoepitheliomas and 12 myoepithelial carcinomas) to genome-wide microarray-based comparative genomic hybridization (array CGH). We set out to delineate known chromosomal aberrations in more detail and to unravel chromosomal differences between benign myoepitheliomas and myoepithelial carcinomas. Patterns of DNA copy number aberrations were analyzed by unsupervised hierarchical cluster analysis. Both benign and malignant tumors revealed a limited amount of chromosomal alterations (median of 5 and 7.5, respectively). In both tumor groups, high frequency gains (≥20%) were found mainly at loci of growth factors and growth factor receptors (e.g., PDGF, FGF(R)s, and EGFR). In myoepitheliomas, high frequency losses (≥20%) were detected at regions of proto-cadherins. Cluster analysis of the array CGH data identified three clusters. Differential copy numbers on chromosome arm 8q and chromosome 17 set the clusters apart. Cluster 1 contained a mixture of the two phenotypes (n = 10), cluster 2 included mostly benign tumors (n = 10), and cluster 3 only contained carcinomas (n = 7). Supervised analysis between malignant and benign tumors revealed a 36 Mbp-region at 8q being more frequently gained in malignant tumors (P = 0.007, FDR = 0.05). This is the first study investigating genomic differences between benign and malignant myoepithelial tumors of the salivary glands at a genomic level. Both unsupervised and supervised analysis of the genomic profiles revealed chromosome arm 8q to be involved in the malignant phenotype of salivary gland myoepitheliomas.

  10. Automatic Clustering Using FSDE-Forced Strategy Differential Evolution

    NASA Astrophysics Data System (ADS)

    Yasid, A.

    2018-01-01

    Clustering analysis is important in data mining in unsupervised settings, because no adequate prior knowledge is available. One of the important tasks is determining the number of clusters without user involvement, which is known as automatic clustering. This study aims to acquire the cluster number automatically using forced strategy differential evolution (AC-FSDE). Two mutation parameters, namely a constant parameter and a variable parameter, are employed to boost differential evolution performance. Four well-known benchmark datasets were used to evaluate the algorithm. Moreover, the result is compared with other state-of-the-art automatic clustering methods. The experimental results show that AC-FSDE is better than or competitive with other existing automatic clustering algorithms.

  11. Promising Ideas for Collective Advancement of Communal Knowledge Using Temporal Analytics and Cluster Analysis

    ERIC Educational Resources Information Center

    Lee, Alwyn Vwen Yen; Tan, Seng Chee

    2017-01-01

    Understanding ideas in a discourse is challenging, especially in textual discourse analysis. We propose using temporal analytics with unsupervised machine learning techniques to investigate promising ideas for the collective advancement of communal knowledge in an online knowledge building discourse. A discourse unit network was constructed and…

  12. SAR image segmentation using skeleton-based fuzzy clustering

    NASA Astrophysics Data System (ADS)

    Cao, Yun Yi; Chen, Yan Qiu

    2003-06-01

    SAR image segmentation can be converted to a clustering problem in which pixels or small patches are grouped together based on local feature information. In this paper, we present a novel framework for segmentation. The segmentation goal is achieved by unsupervised clustering upon characteristic descriptors extracted from local patches. The mixture model of characteristic descriptor, which combines intensity and texture feature, is investigated. The unsupervised algorithm is derived from the recently proposed Skeleton-Based Data Labeling method. Skeletons are constructed as prototypes of clusters to represent arbitrary latent structures in image data. Segmentation using Skeleton-Based Fuzzy Clustering is able to detect the types of surfaces appearing in SAR images automatically without any user input.

  13. Application of unsupervised pattern recognition approaches for exploration of rare earth elements in Se-Chahun iron ore, central Iran

    NASA Astrophysics Data System (ADS)

    Sarparandeh, Mohammadali; Hezarkhani, Ardeshir

    2017-12-01

    The use of efficient methods for data processing has always been of interest to researchers in the field of earth sciences. Pattern recognition techniques are appropriate methods for high-dimensional data such as geochemical data. Evaluation of the geochemical distribution of rare earth elements (REEs) requires the use of such methods. In particular, the multivariate nature of REE data makes them a good target for numerical analysis. The main subject of this paper is application of unsupervised pattern recognition approaches in evaluating geochemical distribution of REEs in the Kiruna type magnetite-apatite deposit of Se-Chahun. For this purpose, 42 bulk lithology samples were collected from the Se-Chahun iron ore deposit. In this study, 14 rare earth elements were measured with inductively coupled plasma mass spectrometry (ICP-MS). Pattern recognition makes it possible to evaluate the relations between the samples based on all these 14 features, simultaneously. In addition to providing easy solutions, discovery of the hidden information and relations of data samples is the advantage of these methods. Therefore, four clustering methods (unsupervised pattern recognition) - including a modified basic sequential algorithmic scheme (MBSAS), hierarchical (agglomerative) clustering, k-means clustering and self-organizing map (SOM) - were applied and results were evaluated using the silhouette criterion. Samples were clustered in four types. Finally, the results of this study were validated with geological facts and analysis results from, for example, scanning electron microscopy (SEM), X-ray diffraction (XRD), ICP-MS and optical mineralogy. The results of the k-means clustering and SOM methods have the best matches with reality, with experimental studies of samples and with field surveys. Since only the rare earth elements are used in this division, a good agreement of the results with lithology is considerable. 
It is concluded that the combination of the proposed methods and geological studies leads to finding some hidden information, and this approach has the best results compared to using only one of them.
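    The silhouette criterion used above to compare clustering results can be sketched as follows. This is a minimal illustration, not the study's implementation: the function name, the Euclidean distance, and the convention that singleton clusters score 0 are assumptions.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette width over all points (illustrative sketch).

    For each point, a = mean distance to the other members of its own
    cluster and b = smallest mean distance to any other cluster; the
    point's silhouette is (b - a) / max(a, b).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # assumed convention: a singleton contributes 0
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
    print(silhouette_score(pts, [0, 0, 1, 1]))  # compact, well-separated clusters
    print(silhouette_score(pts, [0, 1, 0, 1]))  # poor assignment
```

    Values near 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own, which is why the criterion can rank competing clusterings of the same samples.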

  14. Clustering performance comparison using K-means and expectation maximization algorithms.

    PubMed

    Jung, Yong Gyu; Kang, Min Soo; Heo, Jun

    2014-11-14

    Clustering is an important means of data mining based on separating data categories by similar features. Unlike the classification algorithm, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K-means and the expectation maximization (EM) algorithm. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, the logistic regression analysis is applied to EM clusters and the K-means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.

  15. Gene expression profiles of breast biopsies from healthy women identify a group with claudin-low features.

    PubMed

    Haakensen, Vilde D; Lingjaerde, Ole Christian; Lüders, Torben; Riis, Margit; Prat, Aleix; Troester, Melissa A; Holmen, Marit M; Frantzen, Jan Ole; Romundstad, Linda; Navjord, Dina; Bukholm, Ida K; Johannesen, Tom B; Perou, Charles M; Ursin, Giske; Kristensen, Vessela N; Børresen-Dale, Anne-Lise; Helland, Aslaug

    2011-11-01

    Increased understanding of the variability in normal breast biology will enable us to identify mechanisms of breast cancer initiation and the origin of different subtypes, and to better predict breast cancer risk. Gene expression patterns in breast biopsies from 79 healthy women referred to breast diagnostic centers in Norway were explored by unsupervised hierarchical clustering and supervised analyses, such as gene set enrichment analysis and gene ontology analysis and comparison with previously published genelists and independent datasets. Unsupervised hierarchical clustering identified two separate clusters of normal breast tissue based on gene-expression profiling, regardless of clustering algorithm and gene filtering used. Comparison of the expression profile of the two clusters with several published gene lists describing breast cells revealed that the samples in cluster 1 share characteristics with stromal cells and stem cells, and to a certain degree with mesenchymal cells and myoepithelial cells. The samples in cluster 1 also share many features with the newly identified claudin-low breast cancer intrinsic subtype, which also shows characteristics of stromal and stem cells. More women belonging to cluster 1 have a family history of breast cancer and there is a slight overrepresentation of nulliparous women in cluster 1. Similar findings were seen in a separate dataset consisting of histologically normal tissue from both breasts harboring breast cancer and from mammoplasty reductions. This is the first study to explore the variability of gene expression patterns in whole biopsies from normal breasts and identified distinct subtypes of normal breast tissue. Further studies are needed to determine the specific cell contribution to the variation in the biology of normal breasts, how the clusters identified relate to breast cancer risk and their possible link to the origin of the different molecular subtypes of breast cancer.

  16. A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain.

    PubMed

    Hall, L O; Bensaid, A M; Clarke, L P; Velthuizen, R P; Silbiger, M S; Bezdek, J C

    1992-01-01

    Magnetic resonance (MR) brain section images are segmented and then synthetically colored to give visual representations of the original data with three approaches: the literal and approximate fuzzy c-means unsupervised clustering algorithms, and a supervised computational neural network. Initial clinical results are presented on normal volunteers and selected patients with brain tumors surrounded by edema. Supervised and unsupervised segmentation techniques provide broadly similar results. Unsupervised fuzzy algorithms were visually observed to show better segmentation when compared with raw image data for volunteer studies. For a more complex segmentation problem with tumor/edema or cerebrospinal fluid boundary, where the tissues have similar MR relaxation behavior, inconsistency in rating among experts was observed, with fuzzy c-means approaches being slightly preferred over feedforward cascade correlation results. Various facets of both approaches, such as supervised versus unsupervised learning, time complexity, and utility for the diagnostic process, are compared.
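    For readers unfamiliar with fuzzy c-means, the standard alternating update (soft memberships, then membership-weighted centers) can be sketched as below. This is a generic textbook form, not the literal or approximate variants evaluated in the paper; the function names, the fuzzifier default `m=2.0`, and the fixed iteration count are assumptions.

```python
from math import dist


def _memberships(points, centers, m):
    """Soft membership of every point in every cluster (rows sum to 1)."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        d = [dist(p, c) for c in centers]
        if min(d) == 0.0:
            # Point coincides with one or more centers: share membership there.
            hits = [1.0 if dk == 0.0 else 0.0 for dk in d]
            rows.append([h / sum(hits) for h in hits])
        else:
            rows.append([1.0 / sum((dk / dj) ** exponent for dj in d) for dk in d])
    return rows


def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    """Alternate membership and center updates for a fixed number of steps."""
    centers = [tuple(c) for c in centers]
    n_dim = len(points[0])
    for _ in range(n_iter):
        u = _memberships(points, centers, m)
        updated = []
        for k, old in enumerate(centers):
            w = [row[k] ** m for row in u]  # fuzzified weights
            total = sum(w)
            if total == 0.0:
                updated.append(old)  # no support for this cluster: keep center
                continue
            updated.append(
                tuple(sum(wi * p[i] for wi, p in zip(w, points)) / total
                      for i in range(n_dim))
            )
        centers = updated
    return centers, _memberships(points, centers, m)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
    final_centers, u = fuzzy_c_means(pts, [(0.0,), (11.0,)])
    print(final_centers)
    print(u[0])
```

    Unlike hard clustering, each pixel (here, each point) keeps a graded membership in every class, which is what makes the method attractive near ambiguous tissue boundaries.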

  17. Accuracy of latent-variable estimation in Bayesian semi-supervised learning.

    PubMed

    Yamazaki, Keisuke

    2015-09-01

    Hierarchical probabilistic models, such as Gaussian mixture models, are widely used for unsupervised learning tasks. These models consist of observable and latent variables, which represent the observable data and the underlying data-generation process, respectively. Unsupervised learning tasks, such as cluster analysis, are regarded as estimations of latent variables based on the observable ones. The estimation of latent variables in semi-supervised learning, where some labels are observed, will be more precise than that in unsupervised, and one of the concerns is to clarify the effect of the labeled data. However, there has not been sufficient theoretical analysis of the accuracy of the estimation of latent variables. In a previous study, a distribution-based error function was formulated, and its asymptotic form was calculated for unsupervised learning with generative models. It has been shown that, for the estimation of latent variables, the Bayes method is more accurate than the maximum-likelihood method. The present paper reveals the asymptotic forms of the error function in Bayesian semi-supervised learning for both discriminative and generative models. The results show that the generative model, which uses all of the given data, performs better when the model is well specified. Copyright © 2015 Elsevier Ltd. All rights reserved.

  18. Insights into quasar UV spectra using unsupervised clustering analysis

    NASA Astrophysics Data System (ADS)

    Tammour, A.; Gallagher, S. C.; Daley, M.; Richards, G. T.

    2016-06-01

    Machine learning techniques can provide powerful tools to detect patterns in multidimensional parameter space. We use K-means - a simple yet powerful unsupervised clustering algorithm which picks out structure in unlabelled data - to study a sample of quasar UV spectra from the Quasar Catalog of the 10th Data Release of the Sloan Digital Sky Survey (SDSS-DR10) of Paris et al. Detecting patterns in large data sets helps us gain insights into the physical conditions and processes giving rise to the observed properties of quasars. We use K-means to find clusters in the parameter space of the equivalent width (EW), the blue- and red-half-width at half-maximum (HWHM) of the Mg II 2800 Å line, the C IV 1549 Å line, and the C III] 1908 Å blend in samples of broad absorption line (BAL) and non-BAL quasars at redshift 1.6-2.1. Using this method, we successfully recover correlations well-known in the UV regime such as the anti-correlation between the EW and blueshift of the C IV emission line and the shape of the ionizing spectral energy distribution (SED) probed by the strength of He II and the Si III]/C III] ratio. We find this to be particularly evident when the properties of C III] are used to find the clusters, while those of Mg II proved to be less strongly correlated with the properties of the other lines in the spectra such as the width of C IV or the Si III]/C III] ratio. We conclude that unsupervised clustering methods (such as K-means) are powerful methods for finding 'natural' binning boundaries in multidimensional data sets and discuss caveats and future work.
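    As a rough illustration of the K-means procedure referred to above, the sketch below implements the basic assign-and-update loop (Lloyd's algorithm). It is not the authors' pipeline: the function name, the random initialisation from the data, the seed, and the toy one-dimensional data are all assumptions, and any feature scaling the study may have applied is not shown.

```python
import random
from math import dist


def k_means(points, k, n_iter=100, seed=0):
    """Basic K-means: assign points to the nearest center, then re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialisation: k random data points
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
    final_centers, final_labels = k_means(pts, k=2)
    print(sorted(final_centers), final_labels)
```

    In practice the result depends on the initial centers and on the relative scales of the input parameters, so repeated runs with different seeds are a common precaution.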

  19. Employing broadband spectra and cluster analysis to assess thermal defoliation of cotton

    USDA-ARS's Scientific Manuscript database

    Growers and field scouts need assistance in surveying cotton (Gossypium hirsutum L.) fields subjected to thermal defoliation to reap the benefits provided by this nonchemical defoliation method. A study was conducted to evaluate broadband spectral data and unsupervised classification as tools for s...

  20. AHIMSA - Ad hoc histogram information measure sensing algorithm for feature selection in the context of histogram inspired clustering techniques

    NASA Technical Reports Server (NTRS)

    Dasarathy, B. V.

    1976-01-01

    An algorithm is proposed for dimensionality reduction in the context of clustering techniques based on histogram analysis. The approach is based on an evaluation of the hills and valleys in the unidimensional histograms along the different features and provides an economical means of assessing the significance of the features in a nonparametric unsupervised data environment. The method has relevance to remote sensing applications.

  1. Graph Based Models for Unsupervised High Dimensional Data Clustering and Network Analysis

    DTIC Science & Technology

    2015-01-01

    ...algorithms we proposed improve the time efficiency significantly for large scale datasets. In the last chapter, we also propose an incremental reseeding...plume detection in hyper-spectral video data. These graph based clustering algorithms we proposed improve the time efficiency significantly for large

  2. DAFi: A directed recursive data filtering and clustering approach for improving and interpreting data clustering identification of cell populations from polychromatic flow cytometry data.

    PubMed

    Lee, Alexandra J; Chang, Ivan; Burel, Julie G; Lindestam Arlehamn, Cecilia S; Mandava, Aishwarya; Weiskopf, Daniela; Peters, Bjoern; Sette, Alessandro; Scheuermann, Richard H; Qian, Yu

    2018-04-17

    Computational methods for identification of cell populations from polychromatic flow cytometry data are changing the paradigm of cytometry bioinformatics. Data clustering is the most common computational approach to unsupervised identification of cell populations from multidimensional cytometry data. However, interpretation of the identified data clusters is labor-intensive. Certain types of user-defined cell populations are also difficult to identify by fully automated data clustering analysis. Both are roadblocks before a cytometry lab can adopt the data clustering approach for cell population identification in routine use. We found that combining recursive data filtering and clustering with constraints converted from the user manual gating strategy can effectively address these two issues. We named this new approach DAFi: Directed Automated Filtering and Identification of cell populations. Design of DAFi preserves the data-driven characteristics of unsupervised clustering for identifying novel cell subsets, but also makes the results interpretable to experimental scientists through mapping and merging the multidimensional data clusters into the user-defined two-dimensional gating hierarchy. The recursive data filtering process in DAFi helped identify small data clusters which are otherwise difficult to resolve by a single run of the data clustering method due to the statistical interference of the irrelevant major clusters. Our experiment results showed that the proportions of the cell populations identified by DAFi, while being consistent with those by expert centralized manual gating, have smaller technical variances across samples than those from individual manual gating analysis and the nonrecursive data clustering analysis. Compared with manual gating segregation, DAFi-identified cell populations avoided the abrupt cut-offs on the boundaries. 
DAFi has been implemented to be used with multiple data clustering methods including K-means, FLOCK, FlowSOM, and the ClusterR package. For cell population identification, DAFi supports multiple options including clustering, bisecting, slope-based gating, and reversed filtering to meet various autogating needs from different scientific use cases. © 2018 International Society for Advancement of Cytometry.

  3. Unsupervised consensus cluster analysis of [18F]-fluoroethyl-L-tyrosine positron emission tomography identified textural features for the diagnosis of pseudoprogression in high-grade glioma.

    PubMed

    Kebir, Sied; Khurshid, Zain; Gaertner, Florian C; Essler, Markus; Hattingen, Elke; Fimmers, Rolf; Scheffler, Björn; Herrlinger, Ulrich; Bundschuh, Ralph A; Glas, Martin

    2017-01-31

    Timely detection of pseudoprogression (PSP) is crucial for the management of patients with high-grade glioma (HGG) but remains difficult. Textural features of O-(2-[18F]fluoroethyl)-L-tyrosine positron emission tomography (FET-PET) mirror tumor uptake heterogeneity; some of them may be associated with tumor progression. Fourteen patients with HGG and suspected of PSP underwent FET-PET imaging. A set of 19 conventional and textural FET-PET features were evaluated and subjected to unsupervised consensus clustering. The final diagnosis of true progression vs. PSP was based on follow-up MRI using RANO criteria. Three robust clusters have been identified based on 10 predominantly textural FET-PET features. None of the patients with PSP fell into cluster 2, which was associated with high values for textural FET-PET markers of uptake heterogeneity. Three out of 4 patients with PSP were assigned to cluster 3 that was largely associated with low values of textural FET-PET features. By comparison, tumor-to-normal brain ratio (TNRmax) at the optimal cutoff 2.1 was less predictive of PSP (negative predictive value 57% for detecting true progression, p=0.07 vs. 75% with cluster 3, p=0.04). Clustering based on textural O-(2-[18F]fluoroethyl)-L-tyrosine PET features may provide valuable information in assessing the elusive phenomenon of pseudoprogression.

  4. Bacterial community comparisons by taxonomy-supervised analysis independent of sequence alignment and clustering

    PubMed Central

    Sul, Woo Jun; Cole, James R.; Jesus, Ederson da C.; Wang, Qiong; Farris, Ryan J.; Fish, Jordan A.; Tiedje, James M.

    2011-01-01

    High-throughput sequencing of 16S rRNA genes has increased our understanding of microbial community structure, but now even higher-throughput methods to the Illumina scale allow the creation of much larger datasets with more samples and orders-of-magnitude more sequences that swamp current analytic methods. We developed a method capable of handling these larger datasets on the basis of assignment of sequences into an existing taxonomy using a supervised learning approach (taxonomy-supervised analysis). We compared this method with a commonly used clustering approach based on sequence similarity (taxonomy-unsupervised analysis). We sampled 211 different bacterial communities from various habitats and obtained ∼1.3 million 16S rRNA sequences spanning the V4 hypervariable region by pyrosequencing. Both methodologies gave similar ecological conclusions in that β-diversity measures calculated by using these two types of matrices were significantly correlated to each other, as were the ordination configurations and hierarchical clustering dendrograms. In addition, our taxonomy-supervised analyses were also highly correlated with phylogenetic methods, such as UniFrac. The taxonomy-supervised analysis has the advantages that it is not limited by the exhaustive computation required for the alignment and clustering necessary for the taxonomy-unsupervised analysis, is more tolerant of sequencing errors, and allows comparisons when sequences are from different regions of the 16S rRNA gene. With the tremendous expansion in 16S rRNA data acquisition underway, the taxonomy-supervised approach offers the potential to provide more rapid and extensive community comparisons across habitats and samples. PMID:21873204

  5. Collected Notes on the Workshop for Pattern Discovery in Large Databases

    NASA Technical Reports Server (NTRS)

    Buntine, Wray (Editor); Delalto, Martha (Editor)

    1991-01-01

    These collected notes are a record of material presented at the Workshop. They address core data analysis tasks that have traditionally required statistical or pattern recognition techniques. Some of the core tasks include classification, discrimination, clustering, supervised and unsupervised learning, discovery and diagnosis, i.e., general pattern discovery.

  6. Unsupervised discovery of information structure in biomedical documents.

    PubMed

    Kiela, Douwe; Guo, Yufan; Stenius, Ulla; Korhonen, Anna

    2015-04-01

    Information structure (IS) analysis is a text mining technique, which classifies text in biomedical articles into categories that capture different types of information, such as objectives, methods, results and conclusions of research. It is a highly useful technique that can support a range of Biomedical Text Mining tasks and can help readers of biomedical literature find information of interest faster, accelerating the highly time-consuming process of literature review. Several approaches to IS analysis have been presented in the past, with promising results in real-world biomedical tasks. However, all existing approaches, even weakly supervised ones, require several hundreds of hand-annotated training sentences specific to the domain in question. Because biomedicine is subject to considerable domain variation, such annotations are expensive to obtain. This makes the application of IS analysis across biomedical domains difficult. In this article, we investigate an unsupervised approach to IS analysis and evaluate the performance of several unsupervised methods on a large corpus of biomedical abstracts collected from PubMed. Our best unsupervised algorithm (multilevel-weighted graph clustering algorithm) performs very well on the task, obtaining over 0.70 F scores for most IS categories when applied to well-known IS schemes. This level of performance is close to that of lightly supervised IS methods and has proven sufficient to aid a range of practical tasks. Thus, using an unsupervised approach, IS could be applied to support a wide range of tasks across sub-domains of biomedicine. We also demonstrate that unsupervised learning brings novel insights into IS of biomedical literature and discovers information categories that are not present in any of the existing IS schemes. The annotated corpus and software are available at http://www.cl.cam.ac.uk/∼dk427/bio14info.html. © The Author 2014. Published by Oxford University Press. All rights reserved. 

  7. Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification.

    PubMed

    Wu, Dingming; Wang, Dongfang; Zhang, Michael Q; Gu, Jin

    2015-12-01

    One major goal of large-scale cancer omics studies is to identify molecular subtypes for more accurate cancer diagnoses and treatments. To deal with high-dimensional cancer multi-omics data, a promising strategy is to find an effective low-dimensional subspace of the original data and then cluster cancer samples in the reduced subspace. However, due to data-type diversity and big data volume, few methods can integratively and efficiently find the principal low-dimensional manifold of the high-dimensional cancer multi-omics data. In this study, we proposed a novel low-rank approximation based integrative probabilistic model to quickly find the shared principal subspace across multiple data types: the convexity of the low-rank regularized likelihood function of the probabilistic model ensures efficient and stable model fitting. Candidate molecular subtypes can be identified by unsupervised clustering of hundreds of cancer samples in the reduced low-dimensional subspace. On testing datasets, our method LRAcluster (low-rank approximation based multi-omics data clustering) runs much faster with better clustering performance than the existing method. Then, we applied LRAcluster on large-scale cancer multi-omics data from TCGA. The pan-cancer analysis results show that the cancers of different tissue origins are generally grouped as independent clusters, except squamous-like carcinomas. The single cancer type analysis, in contrast, suggests that the omics data have different subtyping abilities for different cancer types. LRAcluster is a very useful method for fast dimension reduction and unsupervised clustering of large-scale multi-omics data. LRAcluster is implemented in R and freely available via http://bioinfo.au.tsinghua.edu.cn/software/lracluster/ .

  8. A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain

    NASA Technical Reports Server (NTRS)

    Hall, Lawrence O.; Bensaid, Amine M.; Clarke, Laurence P.; Velthuizen, Robert P.; Silbiger, Martin S.; Bezdek, James C.

    1992-01-01

    Magnetic resonance (MR) brain section images are segmented and then synthetically colored to give visual representations of the original data with three approaches: the literal and approximate fuzzy c-means unsupervised clustering algorithms and a supervised computational neural network, a dynamic multilayered perceptron trained with the cascade correlation learning algorithm. Initial clinical results are presented on both normal volunteers and selected patients with brain tumors surrounded by edema. Supervised and unsupervised segmentation techniques provide broadly similar results. Unsupervised fuzzy algorithms were visually observed to show better segmentation when compared with raw image data for volunteer studies. However, for a more complex segmentation problem with tumor/edema or cerebrospinal fluid boundary, where the tissues have similar MR relaxation behavior, inconsistency in rating among experts was observed.

  9. Molecular subtyping of bladder cancer using Kohonen self-organizing maps

    PubMed Central

    Borkowska, Edyta M; Kruk, Andrzej; Jedrzejczyk, Adam; Rozniecki, Marek; Jablonowski, Zbigniew; Traczyk, Magdalena; Constantinou, Maria; Banaszkiewicz, Monika; Pietrusinski, Michal; Sosnowski, Marek; Hamdy, Freddie C; Peter, Stefan; Catto, James WF; Kaluzewski, Bogdan

    2014-01-01

    Kohonen self-organizing maps (SOMs) are unsupervised Artificial Neural Networks (ANNs) that are good for low-density data visualization. They easily deal with complex and nonlinear relationships between variables. We evaluated molecular events that characterize high- and low-grade BC pathways in the tumors from 104 patients. We compared the ability of statistical clustering with a SOM to stratify tumors according to the risk of progression to more advanced disease. In univariable analysis, tumor stage (log rank P = 0.006) and grade (P < 0.001), HPV DNA (P < 0.004), Chromosome 9 loss (P = 0.04) and the A148T polymorphism (rs 3731249) in CDKN2A (P = 0.02) were associated with progression. Multivariable analysis of these parameters identified that tumor grade (Cox regression, P = 0.001, OR 2.9 (95% CI 1.6–5.2)) and the presence of HPV DNA (P = 0.017, OR 3.8 (95% CI 1.3–11.4)) were the only independent predictors of progression. Unsupervised hierarchical clustering grouped the tumors into discrete branches but did not stratify according to progression free survival (log rank P = 0.39). These genetic variables were presented to SOM input neurons. SOMs are suitable for complex data integration, allow easy visualization of outcomes, and may stratify BC progression more robustly than hierarchical clustering. PMID:25142434

  10. Hierarchical clustering of HPV genotype patterns in the ASCUS-LSIL triage study

    PubMed Central

    Wentzensen, Nicolas; Wilson, Lauren E.; Wheeler, Cosette M.; Carreon, Joseph D.; Gravitt, Patti E.; Schiffman, Mark; Castle, Philip E.

    2010-01-01

    Anogenital cancers are associated with about 13 carcinogenic HPV types in a broader group that cause cervical intraepithelial neoplasia (CIN). Multiple concurrent cervical HPV infections are common, which complicates the attribution of HPV types to different grades of CIN. Here we report the analysis of HPV genotype patterns in the ASCUS-LSIL triage study using unsupervised hierarchical clustering. Women who underwent colposcopy at baseline (n = 2780) were grouped into 20 disease categories based on histology and cytology. Disease groups and HPV genotypes were clustered using complete linkage. Risk of 2-year cumulative CIN3+, viral load, colposcopic impression, and age were compared between disease groups and major clusters. Hierarchical clustering yielded four major disease clusters: Cluster 1 included all CIN3 histology with abnormal cytology; Cluster 2 included CIN3 histology with normal cytology and combinations with either CIN2 or high-grade squamous intraepithelial lesion (HSIL) cytology; Cluster 3 included older women with normal or low grade histology/cytology and low viral load; Cluster 4 included younger women with low grade histology/cytology, multiple infections, and the highest viral load. Three major groups of HPV genotypes were identified: Group 1 included only HPV16; Group 2 included nine carcinogenic types plus non-carcinogenic HPV53 and HPV66; and Group 3 included non-carcinogenic types plus carcinogenic HPV33 and HPV45. Clustering results suggested that colposcopy missed a prevalent precancer in many women with no biopsy/normal histology and HSIL. This result was confirmed by an elevated 2-year risk of CIN3+ in these groups. Our novel approach to study multiple genotype infections in cervical disease using unsupervised hierarchical clustering can address complex genotype distributions on a population level. PMID:20959485
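
    The complete-linkage step mentioned above can be illustrated with a minimal sketch. It is a toy, cubic-time version in plain Python, not the software used in the study; the function name `complete_linkage` and the five 2-D points are hypothetical and stand in for the real disease-group profiles.

```python
import math
from itertools import combinations


def complete_linkage(points, n_clusters):
    """Agglomerative clustering with complete linkage (toy version).

    Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest complete-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return [sorted(c) for c in clusters]


# Hypothetical data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
groups = complete_linkage(points, 3)
```

    Complete linkage tends to produce compact clusters because a merge is scored by its worst-case pair; single or average linkage would only require changing `max` in `cluster_distance`.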

  11. Statistical Significance for Hierarchical Clustering

    PubMed Central

    Kimes, Patrick K.; Liu, Yufeng; Hayes, D. Neil; Marron, J. S.

    2017-01-01

    Summary Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this paper, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets. PMID:28099990
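
    The idea of testing a cluster split against a Monte Carlo null can be sketched in a heavily simplified form. The example below is a one-dimensional, single-split analogue and is not the authors' sequential procedure: the 2-means cluster index as test statistic, the Gaussian null fitted by sample mean and standard deviation, and all function names are assumptions made for illustration.

```python
import random
import statistics


def sum_sq(xs):
    """Sum of squared deviations from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def cluster_index(values):
    """Best 2-cluster within-cluster SS divided by total SS (exact in 1-D)."""
    xs = sorted(values)
    best = min(sum_sq(xs[:i]) + sum_sq(xs[i:]) for i in range(1, len(xs)))
    return best / sum_sq(xs)


def monte_carlo_p_value(values, n_sim=200, seed=0):
    """Share of Gaussian null samples whose split is at least as strong."""
    rng = random.Random(seed)
    mu = sum(values) / len(values)
    sigma = statistics.stdev(values)
    observed = cluster_index(values)
    hits = sum(
        cluster_index([rng.gauss(mu, sigma) for _ in values]) <= observed
        for _ in range(n_sim)
    )
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_sim + 1)


# Hypothetical data: two well-separated groups of ten values each.
values = [0.1 * i for i in range(10)] + [10 + 0.1 * i for i in range(10)]
p_value = monte_carlo_p_value(values)
```

    A small cluster index means the split explains most of the variance; the p-value asks how often a single Gaussian population would produce a split that clean by chance.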

  12. Unsupervised consensus cluster analysis of [18F]-fluoroethyl-L-tyrosine positron emission tomography identified textural features for the diagnosis of pseudoprogression in high-grade glioma

    PubMed Central

    Kebir, Sied; Khurshid, Zain; Gaertner, Florian C.; Essler, Markus; Hattingen, Elke; Fimmers, Rolf; Scheffler, Björn; Herrlinger, Ulrich; Bundschuh, Ralph A.; Glas, Martin

    2017-01-01

    Rationale Timely detection of pseudoprogression (PSP) is crucial for the management of patients with high-grade glioma (HGG) but remains difficult. Textural features of O-(2-[18F]fluoroethyl)-L-tyrosine positron emission tomography (FET-PET) mirror tumor uptake heterogeneity; some of them may be associated with tumor progression. Methods Fourteen patients with HGG and suspected of PSP underwent FET-PET imaging. A set of 19 conventional and textural FET-PET features were evaluated and subjected to unsupervised consensus clustering. The final diagnosis of true progression vs. PSP was based on follow-up MRI using RANO criteria. Results Three robust clusters have been identified based on 10 predominantly textural FET-PET features. None of the patients with PSP fell into cluster 2, which was associated with high values for textural FET-PET markers of uptake heterogeneity. Three out of 4 patients with PSP were assigned to cluster 3 that was largely associated with low values of textural FET-PET features. By comparison, tumor-to-normal brain ratio (TNRmax) at the optimal cutoff 2.1 was less predictive of PSP (negative predictive value 57% for detecting true progression, p=0.07 vs. 75% with cluster 3, p=0.04). Principal Conclusions Clustering based on textural O-(2-[18F]fluoroethyl)-L-tyrosine PET features may provide valuable information in assessing the elusive phenomenon of pseudoprogression. PMID:28030820

  13. A comparison of unsupervised classification procedures on LANDSAT MSS data for an area of complex surface conditions in Basilicata, Southern Italy

    NASA Technical Reports Server (NTRS)

    Justice, C.; Townshend, J. (Principal Investigator)

    1981-01-01

    Two unsupervised classification procedures were applied to ratioed and unratioed LANDSAT multispectral scanner data of an area of spatially complex vegetation and terrain. An objective accuracy assessment was undertaken on each classification and comparison was made of the classification accuracies. The two unsupervised procedures use the same clustering algorithm. By one procedure the entire area is clustered and by the other a representative sample of the area is clustered and the resulting statistics are extrapolated to the remaining area using a maximum likelihood classifier. Explanation is given of the major steps in the classification procedures including image preprocessing; classification; interpretation of cluster classes; and accuracy assessment. Of the four classifications undertaken, the monocluster block approach on the unratioed data gave the highest accuracy of 80% for five coarse cover classes. This accuracy was increased to 84% by applying a 3 x 3 contextual filter to the classified image. A detailed description and partial explanation is provided for the major misclassification. The classification of the unratioed data produced higher percentage accuracies than for the ratioed data and the monocluster block approach gave higher accuracies than clustering the entire area. The monocluster block approach was additionally the most economical in terms of computing time.

  14. Mastication Evaluation With Unsupervised Learning: Using an Inertial Sensor-Based System.

    PubMed

    Lucena, Caroline Vieira; Lacerda, Marcelo; Caldas, Rafael; De Lima Neto, Fernando Buarque; Rativa, Diego

    2018-01-01

    There is a direct relationship between the prevalence of musculoskeletal disorders of the temporomandibular joint and orofacial disorders. A well-elaborated analysis of the jaw movements provides relevant information for healthcare professionals to conclude their diagnosis. Different approaches have been explored to track jaw movements such that the mastication analysis is getting less subjective; however, all methods are still highly subjective, and the quality of the assessments depends much on the experience of the health professional. In this paper, an accurate and non-invasive method based on a commercial low-cost inertial sensor (MPU6050) to measure jaw movements is proposed. The jaw-movement feature values are compared to those obtained with clinical analysis, showing no statistically significant difference between both methods. Moreover, we propose to use unsupervised paradigm approaches to cluster mastication patterns of healthy subjects and simulated patients with facial trauma. Two techniques were used in this paper to instantiate the method: Kohonen's Self-Organizing Maps and K-Means Clustering. Both algorithms show excellent performance in processing jaw-movement data, showing encouraging results and potential to bring a full assessment of the masticatory function. The proposed method can be applied in real-time providing relevant dynamic information for health-care professionals.

  15. Automated segmentation of white matter fiber bundles using diffusion tensor imaging data and a new density based clustering algorithm.

    PubMed

    Kamali, Tahereh; Stashuk, Daniel

    2016-10-01

    Robust and accurate segmentation of brain white matter (WM) fiber bundles assists in diagnosing and assessing progression or remission of neuropsychiatric diseases such as schizophrenia, autism and depression. Supervised segmentation methods are infeasible in most applications since generating gold standards is too costly. Hence, there is a growing interest in designing unsupervised methods. However, most conventional unsupervised methods require that the number of clusters be known in advance, which is not possible in most applications. The purpose of this study is to design an unsupervised segmentation algorithm for brain white matter fiber bundles which can automatically segment fiber bundles using intrinsic diffusion tensor imaging data information without considering any prior information or assumption about data distributions. Here, a new density based clustering algorithm called neighborhood distance entropy consistency (NDEC) is proposed, which discovers natural clusters within data by simultaneously utilizing both local and global density information. The performance of NDEC is compared with other state of the art clustering algorithms including chameleon, spectral clustering, DBSCAN and k-means using Johns Hopkins University publicly available diffusion tensor imaging data. The performance of NDEC and other employed clustering algorithms was evaluated using dice ratio as an external evaluation criterion and density based clustering validation (DBCV) index as an internal evaluation metric. Across all employed clustering algorithms, NDEC obtained the highest average dice ratio (0.94) and DBCV value (0.71). NDEC can find clusters with arbitrary shapes and densities and consequently can be used for WM fiber bundle segmentation where there is no distinct boundary between various bundles. NDEC may also be used as an effective tool in other pattern recognition and medical diagnostic systems in which discovering natural clusters within data is a necessity. Copyright © 2016 Elsevier B.V. All rights reserved.

  16. Gene expression profiles of breast biopsies from healthy women identify a group with claudin-low features

    PubMed Central

    2011-01-01

    Background Increased understanding of the variability in normal breast biology will enable us to identify mechanisms of breast cancer initiation and the origin of different subtypes, and to better predict breast cancer risk. Methods Gene expression patterns in breast biopsies from 79 healthy women referred to breast diagnostic centers in Norway were explored by unsupervised hierarchical clustering and supervised analyses, such as gene set enrichment analysis and gene ontology analysis and comparison with previously published gene lists and independent datasets. Results Unsupervised hierarchical clustering identified two separate clusters of normal breast tissue based on gene-expression profiling, regardless of clustering algorithm and gene filtering used. Comparison of the expression profile of the two clusters with several published gene lists describing breast cells revealed that the samples in cluster 1 share characteristics with stromal cells and stem cells, and to a certain degree with mesenchymal cells and myoepithelial cells. The samples in cluster 1 also share many features with the newly identified claudin-low breast cancer intrinsic subtype, which also shows characteristics of stromal and stem cells. More women belonging to cluster 1 have a family history of breast cancer and there is a slight overrepresentation of nulliparous women in cluster 1. Similar findings were seen in a separate dataset consisting of histologically normal tissue from both breasts harboring breast cancer and from mammoplasty reductions. Conclusion This is the first study to explore the variability of gene expression patterns in whole biopsies from normal breasts, and it identified distinct subtypes of normal breast tissue. Further studies are needed to determine the specific cell contribution to the variation in the biology of normal breasts, how the clusters identified relate to breast cancer risk and their possible link to the origin of the different molecular subtypes of breast cancer. PMID:22044755

  17. Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data

    PubMed Central

    Andreev, Victor P; Gillespie, Brenda W; Helfand, Brian T; Merion, Robert M

    2016-01-01

    Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to reflect molecular mechanisms of the subtypes of the disease and to lead to more targeted and successful interventions for the identified subtypes. Multiple classification algorithms exist but none is ideal for all types of data. Importantly, there are no established methods to estimate sample size in unsupervised classification (unlike power analysis in hypothesis testing). Therefore, we developed a simulation approach allowing comparison of misclassification errors and estimating the required sample size for a given effect size, number, and correlation matrix of the differentially abundant proteins in targeted proteomics studies. All the experiments were performed in silico. The simulated data imitated the data expected from the study of the plasma of patients with lower urinary tract dysfunction with the aptamer proteomics assay Somascan (SomaLogic Inc, Boulder, CO), which targeted 1129 proteins, including 330 involved in inflammation, 180 in stress response, 80 in aging, etc. Three popular clustering methods (hierarchical, k-means, and k-medoids) were compared. K-means clustering performed much better for the simulated data than the other two methods and enabled classification with misclassification error below 5% in the simulated cohort of 100 patients based on the molecular signatures of 40 differentially abundant proteins (effect size 1.5) from among the 1129-protein panel. PMID:27524871
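
    The simulation idea described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' simulator: proteins are drawn as independent standard normal features (the study also models a correlation matrix), "effect size" is read as a mean shift in standard deviations, the panel is shrunk to 200 features to keep the run short, and every function name is hypothetical.

```python
import random


def simulate_cohort(n_per_group, n_features, n_signal, effect_size, rng):
    """Two groups; the first n_signal features are shifted in group 1."""
    data, labels = [], []
    for group in (0, 1):
        for _ in range(n_per_group):
            row = [
                rng.gauss(effect_size if group == 1 and j < n_signal else 0.0, 1.0)
                for j in range(n_features)
            ]
            data.append(row)
            labels.append(group)
    return data, labels


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(data, rng, max_iter=100):
    """Plain Lloyd's algorithm for k = 2 with random initial centers."""
    centers = [list(row) for row in rng.sample(data, 2)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            0 if sq_dist(row, centers[0]) <= sq_dist(row, centers[1]) else 1
            for row in data
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in (0, 1):
            members = [row for row, a in zip(data, assignment) if a == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def misclassification_error(true_labels, cluster_labels):
    """Error rate after choosing the better of the two label matchings."""
    rate = sum(t != c for t, c in zip(true_labels, cluster_labels)) / len(true_labels)
    return min(rate, 1.0 - rate)


rng = random.Random(0)
data, truth = simulate_cohort(n_per_group=50, n_features=200, n_signal=40,
                              effect_size=1.5, rng=rng)
error = misclassification_error(truth, two_means(data, rng))
```

    Repeating the last three lines over a grid of cohort sizes and effect sizes, and averaging `error` over many seeds, gives the kind of sample-size curve the abstract refers to.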

  18. Spectral gene set enrichment (SGSE).

    PubMed

    Frost, H Robert; Li, Zhigang; Moore, Jason H

    2015-03-03

    Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
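
    The weighted Z-method named above is the weighted form of Stouffer's method for combining one-sided p-values. The sketch below shows only that combination step; the function name `weighted_z` is hypothetical, and the weights in the usage line are arbitrary placeholders rather than the PC-variance and Tracy-Widom weights used by SGSE.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z(p_values, weights):
    """Combine one-sided p-values with the weighted Z (Stouffer) method.

    Each p-value must lie strictly between 0 and 1.
    """
    norm = NormalDist()
    # Convert each p-value to a standard normal score.
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    # Weighted sum of z-scores, rescaled to unit variance under the null.
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return 1.0 - norm.cdf(combined_z)


# Hypothetical PC-level p-values, with larger weights on the leading PCs.
combined = weighted_z([0.04, 0.20, 0.60], [3.0, 2.0, 1.0])
```

    Because the weights enter both numerator and denominator, only their relative sizes matter: a PC with a near-zero weight contributes almost nothing to the combined p-value.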

  19. Application of diffusion maps to identify human factors of self-reported anomalies in aviation.

    PubMed

    Andrzejczak, Chris; Karwowski, Waldemar; Mikusinski, Piotr

    2012-01-01

    A study was conducted to investigate which factors lead pilots to submit voluntary anomaly reports regarding their flight performance. Diffusion Maps (DM) were selected as the method of choice for performing dimensionality reduction on text records for this study. Diffusion Maps have seen successful use in other domains such as image classification and pattern recognition. High-dimensionality data in the form of narrative text reports from the NASA Aviation Safety Reporting System (ASRS) were clustered and categorized by way of dimensionality reduction. Supervised analyses were performed to create a baseline document clustering system. Dimensionality reduction techniques identified concepts or keywords within records, and allowed the creation of a framework for an unsupervised document classification system. Results from the unsupervised clustering algorithm performed similarly to the supervised methods outlined in the study. The dimensionality reduction was performed on 100 of the most commonly occurring words within 126,000 text records describing commercial aviation incidents. This study demonstrates that unsupervised machine clustering and organization of incident reports is possible based on unbiased inputs. Findings from this study reinforced traditional views on what factors contribute to civil aviation anomalies; however, new associations between previously unrelated factors and conditions were also found.

  20. Semisupervised Clustering by Iterative Partition and Regression with Neuroscience Applications

    PubMed Central

    Qian, Guoqi; Wu, Yuehua; Ferrari, Davide; Qiao, Puxue; Hollande, Frédéric

    2016-01-01

    Regression clustering is a method that mixes unsupervised and supervised statistical learning and data mining, and it is found in a wide range of applications including artificial intelligence and neuroscience. It performs unsupervised learning when it clusters the data according to their respective unobserved regression hyperplanes. The method also performs supervised learning when it fits regression hyperplanes to the corresponding data clusters. Applying regression clustering in practice requires means of determining the underlying number of clusters in the data, finding the cluster label of each data point, and estimating the regression coefficients of the model. In this paper, we review the estimation and selection issues in regression clustering with regard to the least squares and robust statistical methods. We also provide a model selection based technique to determine the number of regression clusters underlying the data. We further develop a computing procedure for regression clustering estimation and selection. Finally, simulation studies are presented for assessing the procedure, together with analyzing a real data set on RGB cell marking in neuroscience to illustrate and interpret the method. PMID:27212939
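
    The alternation between partitioning and regression can be sketched with a least-squares toy example. It assumes one predictor, a known number of clusters and a user-supplied starting partition; it does not cover the robust estimators or the model selection step discussed in the paper, and the function names are hypothetical.

```python
def fit_line(pts):
    """Ordinary least-squares line through (x, y) pairs: (slope, intercept)."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx if sxx > 0 else 0.0  # flat line if x has no spread
    return slope, mean_y - slope * mean_x


def regression_clustering(points, labels, n_clusters, max_iter=100):
    """Alternate between fitting one line per cluster and relabeling points."""
    labels = list(labels)
    lines = []
    for _ in range(max_iter):
        # Regression step: fit a line to each current cluster.
        lines = []
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Placeholder line for an empty cluster (a simplification).
            lines.append(fit_line(members) if members else (0.0, 0.0))
        # Partition step: move each point to the line with the smallest residual.
        new_labels = [
            min(range(n_clusters),
                key=lambda c: (y - (lines[c][0] * x + lines[c][1])) ** 2)
            for x, y in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, lines


# Hypothetical data: five points on y = 0 and five on y = 10.
points = [(float(x), 0.0) for x in range(5)] + [(float(x), 10.0) for x in range(5)]
start = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # last point deliberately mislabeled
labels, lines = regression_clustering(points, start, n_clusters=2)
```

    Like k-means, this procedure only finds a local optimum, so in practice it would be run from several starting partitions.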

  1. Malignant pleural mesothelioma and mesothelial hyperplasia: A new molecular tool for the differential diagnosis.

    PubMed

    Bruno, Rossella; Alì, Greta; Giannini, Riccardo; Proietti, Agnese; Lucchi, Marco; Chella, Antonio; Melfi, Franca; Mussi, Alfredo; Fontanini, Gabriella

    2017-01-10

    Malignant pleural mesothelioma (MPM) is a rare asbestos related cancer, aggressive and unresponsive to therapies. Histological examination of pleural lesions is the gold standard of MPM diagnosis, although it is sometimes hard to discriminate the epithelioid type of MPM from benign mesothelial hyperplasia (MH). This work aims to define a new molecular tool for the differential diagnosis of MPM, using the expression profile of 117 genes deregulated in this tumour. The gene expression analysis was performed by nanoString System on tumour tissues from 36 epithelioid MPM and 17 MH patients, and on 14 mesothelial pleural samples analysed in a blind way. Data analysis included raw nanoString data normalization, unsupervised cluster analysis by Pearson correlation, non-parametric Mann Whitney U-test and molecular classification by the Uncorrelated Shrunken Centroid (USC) Algorithm. The Mann-Whitney U-test found 35 genes upregulated and 31 downregulated in MPM. The unsupervised cluster analysis revealed two clusters, one composed only of MPM and one only of MH samples, thus revealing class-specific gene profiles. The Uncorrelated Shrunken Centroid algorithm identified two classifiers, one including 22 genes and the other 40 genes, able to properly classify all the samples as benign or malignant using gene expression data; both classifiers were also able to correctly determine, in a blind analysis, the diagnostic categories of all the 14 unknown samples. In conclusion we delineated a diagnostic tool combining molecular data (gene expression) and computational analysis (USC algorithm), which can be applied in the clinical practice for the differential diagnosis of MPM.

  2. Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition

    PubMed Central

    Saeed, Isaam; Tang, Sen-Lin; Halgamuge, Saman K.

    2012-01-01

    An approach to infer the unknown microbial population structure within a metagenome is to cluster nucleotide sequences based on common patterns in base composition, otherwise referred to as binning. When functional roles are assigned to the identified populations, a deeper understanding of microbial communities can be attained, more so than gene-centric approaches that explore overall functionality. In this study, we propose an unsupervised, model-based binning method with two clustering tiers, which uses a novel transformation of the oligonucleotide frequency-derived error gradient and GC content to generate coarse groups at the first tier of clustering; and tetranucleotide frequency to refine these groups at the secondary clustering tier. The proposed method has a demonstrated improvement over PhyloPythia, S-GSOM, TACOA and TaxSOM on all three benchmarks that were used for evaluation in this study. The proposed method is then applied to a pyrosequenced metagenomic library of mud volcano sediment sampled in southwestern Taiwan, with the inferred population structure validated against complementary sequencing of 16S ribosomal RNA marker genes. Finally, the proposed method was further validated against four publicly available metagenomes, including a highly complex Antarctic whale-fall bone sample, which was previously assumed to be too complex for binning prior to functional analysis. PMID:22180538

  3. Improved regional-scale Brazilian cropping systems' mapping based on a semi-automatic object-based clustering approach

    NASA Astrophysics Data System (ADS)

    Bellón, Beatriz; Bégué, Agnès; Lo Seen, Danny; Lebourgeois, Valentine; Evangelista, Balbino Antônio; Simões, Margareth; Demonte Ferraz, Rodrigo Peçanha

    2018-06-01

    Cropping systems' maps at fine scale over large areas provide key information for further agricultural production and environmental impact assessments, and thus represent a valuable tool for effective land-use planning. There is, therefore, a growing interest in mapping cropping systems in an operational manner over large areas, and remote sensing approaches based on vegetation index time series analysis have proven to be an efficient tool. However, supervised pixel-based approaches are commonly adopted, requiring resource consuming field campaigns to gather training data. In this paper, we present a new object-based unsupervised classification approach tested on an annual MODIS 16-day composite Normalized Difference Vegetation Index time series and a Landsat 8 mosaic of the State of Tocantins, Brazil, for the 2014-2015 growing season. Two variants of the approach are compared: a hyperclustering approach, and a landscape-clustering approach involving a previous stratification of the study area into landscape units on which the clustering is then performed. The main cropping systems of Tocantins, characterized by the crop types and cropping patterns, were efficiently mapped with the landscape-clustering approach. Results show that stratification prior to clustering significantly improves the classification accuracies for underrepresented and sparsely distributed cropping systems. This study illustrates the potential of unsupervised classification for large area cropping systems' mapping and contributes to the development of generic tools for supporting large-scale agricultural monitoring across regions.

  4. Characterization of Rhinitis According to the Asthma Status in Adults Using an Unsupervised Approach in the EGEA Study.

    PubMed

    Burte, Emilie; Bousquet, Jean; Varraso, Raphaëlle; Gormand, Frédéric; Just, Jocelyne; Matran, Régis; Pin, Isabelle; Siroux, Valérie; Jacquemin, Bénédicte; Nadif, Rachel

    2015-01-01

    The classification of rhinitis in adults is missing in epidemiological studies. The aim was to identify phenotypes of adult rhinitis using an unsupervised (data-driven) approach compared with a classical hypothesis-driven approach. 983 adults of the French Epidemiological Study on the Genetics and Environment of Asthma (EGEA) were studied. Self-reported symptoms related to rhinitis such as nasal symptoms, hay fever, sinusitis, conjunctivitis, and sensitivities to different triggers (dust, animals, hay/flowers, cold air…) were used. Allergic sensitization was defined by at least one positive skin prick test to 12 aeroallergens. A mixture model was used to cluster participants, independently in those without (Asthma-, n = 582) and with asthma (Asthma+, n = 401). Three clusters were identified in both groups: 1) Cluster A (55% in Asthma-, and 22% in Asthma+) mainly characterized by the absence of nasal symptoms, 2) Cluster B (23% in Asthma-, 36% in Asthma+) mainly characterized by nasal symptoms all over the year, sinusitis and a low prevalence of positive skin prick tests, and 3) Cluster C (22% in Asthma-, 42% in Asthma+) mainly characterized by a peak of nasal symptoms during spring, a high prevalence of positive skin prick tests and a high report of hay fever, allergic rhinitis and conjunctivitis. The highest rate of polysensitization (80%) was found in participants with comorbid asthma and allergic rhinitis. This cluster analysis highlighted three clusters of rhinitis with characteristics similar to those known by clinicians but differing according to allergic sensitization, regardless of asthma status. These clusters could be easily rebuilt using a small number of variables.

  5. Data mining with unsupervised clustering using photonic micro-ring resonators

    NASA Astrophysics Data System (ADS)

    McAulay, Alastair D.

    2013-09-01

    Data is commonly moved through optical fiber in modern data centers and may be stored optically. We propose an optical method of data mining for future data centers to enhance performance. For example, in clustering, a form of unsupervised learning, we propose that parameters corresponding to information in a database are converted from analog values to frequencies, as in the brain's neurons, where similar data will have close frequencies. We describe the Wilson-Cowan model for oscillating neurons. In optics we implement the frequencies with micro ring resonators. Due to the influence of weak coupling, a group of resonators will form clusters of similar frequencies that will indicate the desired parameters having close relations. Fewer clusters are formed as clustering proceeds, which allows the creation of a tree showing topics of importance and their relationships in the database. The tree can be used for instance to target advertising and for planning.

  6. The evaluation of alternate methodologies for land cover classification in an urbanizing area

    NASA Technical Reports Server (NTRS)

    Smekofski, R. M.

    1981-01-01

    The usefulness of LANDSAT in classifying land cover and in identifying and classifying land use change was investigated using an urbanizing area as the study area. The question of what was the best technique for classification was the primary focus of the study. The many computer-assisted techniques available to analyze LANDSAT data were evaluated. Techniques of statistical training (polygons from CRT, unsupervised clustering, polygons from digitizer and binary masks) were tested with minimum distance to the mean, maximum likelihood and canonical analysis with minimum distance to the mean classifiers. The twelve output images were compared to photointerpreted samples, ground verified samples and a current land use data base. Results indicate that for a reconnaissance inventory, the unsupervised training with canonical analysis-minimum distance classifier is the most efficient. If more detailed ground truth and ground verification is available, the polygons from the digitizer training with the canonical analysis minimum distance is more accurate.

  7. Unsupervised classification of major depression using functional connectivity MRI.

    PubMed

    Zeng, Ling-Li; Shen, Hui; Liu, Li; Hu, Dewen

    2014-04-01

    The current diagnosis of psychiatric disorders including major depressive disorder based largely on self-reported symptoms and clinical signs may be prone to patients' behaviors and psychiatrists' bias. This study aims at developing an unsupervised machine learning approach for the accurate identification of major depression based on single resting-state functional magnetic resonance imaging scans in the absence of clinical information. Twenty-four medication-naive patients with major depression and 29 demographically similar healthy individuals underwent resting-state functional magnetic resonance imaging. We first clustered the voxels within the perigenual cingulate cortex into two subregions, a subgenual region and a pregenual region, according to their distinct resting-state functional connectivity patterns and showed that a maximum margin clustering-based unsupervised machine learning approach extracted sufficient information from the subgenual cingulate functional connectivity map to differentiate depressed patients from healthy controls with a group-level clustering consistency of 92.5% and an individual-level classification consistency of 92.5%. It was also revealed that the subgenual cingulate functional connectivity network with the highest discriminative power primarily included the ventrolateral and ventromedial prefrontal cortex, superior temporal gyri and limbic areas, indicating that these connections may play critical roles in the pathophysiology of major depression. The current study suggests that subgenual cingulate functional connectivity network signatures may provide promising objective biomarkers for the diagnosis of major depression and that maximum margin clustering-based unsupervised machine learning approaches may have the potential to inform clinical practice and aid in research on psychiatric disorders. Copyright © 2013 Wiley Periodicals, Inc.

  8. Exploring supervised and unsupervised methods to detect topics in biomedical text

    PubMed Central

    Lee, Minsuk; Wang, Weiqing; Yu, Hong

    2006-01-01

    Background: Topic detection is a task that automatically identifies topics (e.g., "biochemistry" and "protein structure") in scientific articles based on information content. Topic detection will benefit many other natural language processing tasks including information retrieval, text summarization and question answering, and is a necessary step towards building an information system that provides an efficient way for biologists to seek information from an ocean of literature. Results: We have explored the methods of Topic Spotting, a text categorization task that applies the supervised machine-learning technique naïve Bayes to automatically assign a document to one or more predefined topics; and Topic Clustering, which applies unsupervised hierarchical clustering algorithms to aggregate documents into clusters such that each cluster represents a topic. We have applied our methods to detect topics of more than fifteen thousand articles that represent over sixteen thousand entries in the Online Mendelian Inheritance in Man (OMIM) database. We have explored bag of words as the features. Additionally, we have explored semantic features, namely the Medical Subject Headings (MeSH) that are assigned to the MEDLINE records and the Unified Medical Language System (UMLS) semantic types that correspond to the MeSH terms, in addition to bag of words, to facilitate the tasks of topic detection. Our results indicate that incorporating the MeSH terms and the UMLS semantic types as additional features enhances the performance of topic detection, and that naïve Bayes has the highest accuracy, 66.4%, for predicting the topic of an OMIM article as one of twenty-five topics. Conclusion: Our results indicate that the supervised topic spotting methods outperformed the unsupervised topic clustering; on the other hand, the unsupervised topic clustering methods have the advantages of being robust and applicable in real world settings. PMID:16539745

  9. Effective implementation of hierarchical clustering

    NASA Astrophysics Data System (ADS)

    Verma, Mudita; Vijayarajan, V.; Sivashanmugam, G.; Bessie Amali, D. Geraldine

    2017-11-01

    Hierarchical clustering is a form of cluster analysis in which a hierarchy of clusters is built up. A large number of observations are examined to decide which cluster should be split. Here, a data set of US-based personalities is considered for clustering. After applying hierarchical clustering to the data set, we group it into three clusters: politicians, sports persons and musicians. In classification, the training set is the main parameter that decides which category is assigned to the observations being collected; the category of these training observations must be known. Recognition comes from the formulation of classification. Classification is the main instance of supervised learning, whereas clustering is an instance of an unsupervised procedure. Clustering consists of grouping data that have similar properties, which are either their own or inherited from other sources.

  10. Multispectral and Panchromatic used Enhancement Resolution and Study Effective Enhancement on Supervised and Unsupervised Classification Land – Cover

    NASA Astrophysics Data System (ADS)

    Salman, S. S.; Abbas, W. A.

    2018-05-01

    The goal of the study is to analyze resolution enhancement and its effect on classification methods applied to the spectral band information, using specific and quantitative approaches. This study introduces a method to enhance the resolution of Landsat 8 imagery by combining the spectral bands of 30-meter resolution with panchromatic band 8 of 15-meter resolution, because of the importance of multispectral imagery for extracting land cover. Classification methods are used in this study to classify several land covers recorded in OLI-8 imagery. Data mining methods can be classified as either supervised or unsupervised. In supervised methods, there is a particular predefined target, meaning that the algorithm learns which values of the target are associated with which values of the predictor sample. K-nearest neighbors and maximum likelihood algorithms are examined in this work as supervised methods. In unsupervised methods, on the other hand, no sample is identified as a target, and the algorithm searches for structure and patterns among all the variables; the fuzzy C-means clustering method represents the unsupervised methods here. The NDVI vegetation index is used to compare the results of the classification methods; the percentage of dense vegetation obtained with the maximum likelihood method gives the best results.

  11. Classification of neocortical interneurons using affinity propagation.

    PubMed

    Santana, Roberto; McGarry, Laura M; Bielza, Concha; Larrañaga, Pedro; Yuste, Rafael

    2013-01-01

    In spite of over a century of research on cortical circuits, it is still unknown how many classes of cortical neurons exist. In fact, neuronal classification is a difficult problem because it is unclear how to designate a neuronal cell class and what are the best characteristics to define them. Recently, unsupervised classifications using cluster analysis based on morphological, physiological, or molecular characteristics, have provided quantitative and unbiased identification of distinct neuronal subtypes, when applied to selected datasets. However, better and more robust classification methods are needed for increasingly complex and larger datasets. Here, we explored the use of affinity propagation, a recently developed unsupervised classification algorithm imported from machine learning, which gives a representative example or exemplar for each cluster. As a case study, we applied affinity propagation to a test dataset of 337 interneurons belonging to four subtypes, previously identified based on morphological and physiological characteristics. We found that affinity propagation correctly classified most of the neurons in a blind, non-supervised manner. Affinity propagation outperformed Ward's method, a current standard clustering approach, in classifying the neurons into 4 subtypes. Affinity propagation could therefore be used in future studies to validly classify neurons, as a first step to help reverse engineer neural circuits.

  12. A semi-supervised classification algorithm using the TAD-derived background as training data

    NASA Astrophysics Data System (ADS)

    Fan, Lei; Ambeau, Brittany; Messinger, David W.

    2013-05-01

    In general, spectral image classification algorithms fall into one of two categories: supervised and unsupervised. In unsupervised approaches, the algorithm automatically identifies clusters in the data without a priori information about those clusters (except perhaps the expected number of them). Supervised approaches require an analyst to identify training data to learn the characteristics of the clusters such that they can then classify all other pixels into one of the pre-defined groups. The classification algorithm presented here is a semi-supervised approach based on the Topological Anomaly Detection (TAD) algorithm. The TAD algorithm defines background components based on a mutual k-Nearest Neighbor graph model of the data, along with a spectral connected components analysis. Here, the largest components produced by TAD are used as regions of interest (ROIs), or training data for a supervised classification scheme. By combining those ROIs with a Gaussian Maximum Likelihood (GML) or a Minimum Distance to the Mean (MDM) algorithm, we are able to achieve a semi-supervised classification method. We test this classification algorithm against data collected by the HyMAP sensor over the Cooke City, MT area and the University of Pavia scene.
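
    The graph step described in this abstract can be illustrated with a minimal sketch. It is not the TAD algorithm itself: it only builds a mutual k-nearest-neighbour graph and extracts its connected components, largest first. The function name, the toy "pixels" and the value of k are assumptions made for illustration.

```python
import math

def mutual_knn_components(points, k):
    """Connected components of a mutual k-NN graph, largest first (illustrative only)."""
    n = len(points)
    # k nearest neighbours of each point (excluding itself), by Euclidean distance
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        neighbours.append(set(order[:k]))
    # keep edge i-j only when each point is among the other's k nearest neighbours
    adjacency = [{j for j in neighbours[i] if i in neighbours[j]} for i in range(n)]
    # depth-first search to collect connected components
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adjacency[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(comp))
    return sorted(components, key=len, reverse=True)

# hypothetical 2-D "spectra": two dense groups and one isolated point
pixels = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
components = mutual_knn_components(pixels, k=2)
print(components)
```

    In this toy case the isolated point forms its own single-member component, while the two dense groups are the kind of large components that could serve as training regions for a supervised classifier.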

  13. Glaucomatous patterns in Frequency Doubling Technology (FDT) perimetry data identified by unsupervised machine learning classifiers.

    PubMed

    Bowd, Christopher; Weinreb, Robert N; Balasubramanian, Madhusudhanan; Lee, Intae; Jang, Giljin; Yousefi, Siamak; Zangwill, Linda M; Medeiros, Felipe A; Girkin, Christopher A; Liebmann, Jeffrey M; Goldbaum, Michael H

    2014-01-01

    The variational Bayesian independent component analysis-mixture model (VIM), an unsupervised machine-learning classifier, was used to automatically separate Matrix Frequency Doubling Technology (FDT) perimetry data into clusters of healthy and glaucomatous eyes, and to identify axes representing statistically independent patterns of defect in the glaucoma clusters. FDT measurements were obtained from 1,190 eyes with normal FDT results and 786 eyes with abnormal FDT results from the UCSD-based Diagnostic Innovations in Glaucoma Study (DIGS) and African Descent and Glaucoma Evaluation Study (ADAGES). For all eyes, VIM input was 52 threshold test points from the 24-2 test pattern, plus age. FDT mean deviation was -1.00 dB (S.D. = 2.80 dB) and -5.57 dB (S.D. = 5.09 dB) in FDT-normal eyes and FDT-abnormal eyes, respectively (p<0.001). VIM identified meaningful clusters of FDT data and positioned a set of statistically independent axes through the mean of each cluster. The optimal VIM model separated the FDT fields into 3 clusters. Cluster N contained primarily normal fields (1109/1190, specificity 93.1%) and clusters G1 and G2 combined, contained primarily abnormal fields (651/786, sensitivity 82.8%). For clusters G1 and G2 the optimal number of axes were 2 and 5, respectively. Patterns automatically generated along axes within the glaucoma clusters were similar to those known to be indicative of glaucoma. Fields located farther from the normal mean on each glaucoma axis showed increasing field defect severity. VIM successfully separated FDT fields from healthy and glaucoma eyes without a priori information about class membership, and identified familiar glaucomatous patterns of loss.

  14. Hierarchical Adaptive Means (HAM) clustering for hardware-efficient, unsupervised and real-time spike sorting.

    PubMed

    Paraskevopoulou, Sivylla E; Wu, Di; Eftekhar, Amir; Constandinou, Timothy G

    2014-09-30

    This work presents a novel unsupervised algorithm for real-time adaptive clustering of neural spike data (spike sorting). The proposed Hierarchical Adaptive Means (HAM) clustering method combines centroid-based clustering with hierarchical cluster connectivity to classify incoming spikes using groups of clusters. It is described how the proposed method can adaptively track the incoming spike data without requiring any past history, iteration or training and autonomously determines the number of spike classes. Its performance (classification accuracy) has been tested using multiple datasets (both simulated and recorded) achieving a near-identical accuracy compared to k-means (using 10-iterations and provided with the number of spike classes). Also, its robustness in applying to different feature extraction methods has been demonstrated by achieving classification accuracies above 80% across multiple datasets. Last but crucially, its low complexity, that has been quantified through both memory and computation requirements makes this method hugely attractive for future hardware implementation. Copyright © 2014 Elsevier B.V. All rights reserved.

  15. The clustering-based case-based reasoning for imbalanced business failure prediction: a hybrid approach through integrating unsupervised process with supervised process

    NASA Astrophysics Data System (ADS)

    Li, Hui; Yu, Jun-Ling; Yu, Le-An; Sun, Jie

    2014-05-01

    Case-based reasoning (CBR) is one of the main forecasting methods in business forecasting, which performs well in prediction and holds the ability of giving explanations for the results. In business failure prediction (BFP), the number of failed enterprises is relatively small, compared with the number of non-failed ones. However, the loss is huge when an enterprise fails. Therefore, it is necessary to develop methods (trained on imbalanced samples) which forecast well for this small proportion of failed enterprises while remaining accurate overall. Commonly used methods constructed on the assumption of balanced samples do not perform well in predicting minority samples on imbalanced samples consisting of the minority/failed enterprises and the majority/non-failed ones. This article develops a new method called clustering-based CBR (CBCBR), which integrates clustering analysis, an unsupervised process, with CBR, a supervised process, to enhance the efficiency of retrieving information from both minority and majority in CBR. In CBCBR, various case classes are firstly generated through hierarchical clustering inside stored experienced cases, and class centres are calculated by integrating case information in the same clustered class. When predicting the label of a target case, its nearest clustered case class is firstly retrieved by ranking similarities between the target case and each clustered case class centre. Then, nearest neighbours of the target case in the determined clustered case class are retrieved. Finally, labels of the nearest experienced cases are used in prediction. In the empirical experiment with two imbalanced samples from China, the performance of CBCBR was compared with the classical CBR, a support vector machine, a logistic regression and a multi-variant discriminate analysis. The results show that compared with the other four methods, CBCBR performed significantly better in terms of sensitivity for identifying the minority samples while also achieving high total accuracy. The proposed approach makes CBR useful in imbalanced forecasting.
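
    The two-stage retrieval described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' implementation: the clustering step is assumed to have been done already (the `clusters` list is given by hand), similarity is taken to be Euclidean distance, and the final prediction is a plain majority vote. All names and data are hypothetical.

```python
import math
from collections import Counter

def centre(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def predict(target, cases, labels, clusters, k=3):
    """Retrieve the nearest cluster centre, then vote among the k nearest cases inside it."""
    # Stage 1: rank clustered case classes by distance from the target to each class centre
    centres = [centre([cases[i] for i in members]) for members in clusters]
    best = min(range(len(clusters)), key=lambda c: math.dist(target, centres[c]))
    # Stage 2: k nearest stored cases within the retrieved class
    nearest = sorted(clusters[best], key=lambda i: math.dist(target, cases[i]))[:k]
    # Majority vote over the labels of the retrieved cases (an assumed combination rule)
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# hypothetical stored cases with two features each
cases = [(0.9, 0.8), (1.0, 1.0), (1.1, 0.9),
         (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.25, 0.15)]
labels = ["healthy", "healthy", "healthy", "failed", "failed", "healthy", "failed"]
clusters = [[0, 1, 2], [3, 4, 5, 6]]   # assumed output of a prior hierarchical clustering

result = predict((0.2, 0.2), cases, labels, clusters, k=3)
print(result)
```

    Restricting the neighbour search to one clustered class keeps minority cases from being outvoted by distant majority cases, which is the motivation the abstract gives for combining the two processes.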

  16. Flow Cytometry Data Preparation Guidelines for Improved Automated Phenotypic Analysis.

    PubMed

    Jimenez-Carretero, Daniel; Ligos, José M; Martínez-López, María; Sancho, David; Montoya, María C

    2018-05-15

    Advances in flow cytometry (FCM) increasingly demand adoption of computational analysis tools to tackle the ever-growing data dimensionality. In this study, we tested different data input modes to evaluate how cytometry acquisition configuration and data compensation procedures affect the performance of unsupervised phenotyping tools. An analysis workflow was set up and tested for the detection of changes in reference bead subsets and in a rare subpopulation of murine lymph node CD103+ dendritic cells acquired by conventional or spectral cytometry. Raw spectral data or pseudospectral data acquired with the full set of available detectors by conventional cytometry consistently outperformed datasets acquired and compensated according to FCM standards. Our results thus challenge the paradigm of one-fluorochrome/one-parameter acquisition in FCM for unsupervised cluster-based analysis. Instead, we propose to configure instrument acquisition to use all available fluorescence detectors and to avoid integration and compensation procedures, thereby using raw spectral or pseudospectral data for improved automated phenotypic analysis. Copyright © 2018 by The American Association of Immunologists, Inc.

  17. Mapping of rock types using a joint approach by combining the multivariate statistics, self-organizing map and Bayesian neural networks: an example from IODP 323 site

    NASA Astrophysics Data System (ADS)

    Karmakar, Mampi; Maiti, Saumen; Singh, Amrita; Ojha, Maheswar; Maity, Bhabani Sankar

    2017-07-01

    Modeling and classification of the subsurface lithology is very important to understand the evolution of the earth system. However, precise classification and mapping of lithology using a single framework are difficult due to the complexity and the nonlinearity of the problem driven by limited core sample information. Here, we implement a joint approach by combining the unsupervised and the supervised methods in a single framework for better classification and mapping of rock types. In the unsupervised method, we use the principal component analysis (PCA), K-means cluster analysis (K-means), dendrogram analysis, Fuzzy C-means (FCM) cluster analysis and self-organizing map (SOM). In the supervised method, we use the Bayesian neural networks (BNN) optimized by the Hybrid Monte Carlo (HMC) (BNN-HMC) and the scaled conjugate gradient (SCG) (BNN-SCG) techniques. We use P-wave velocity, density, neutron porosity, resistivity and gamma ray logs of the well U1343E of the Integrated Ocean Drilling Program (IODP) Expedition 323 in the Bering Sea slope region. While the SOM algorithm allows us to visualize the clustering results in spatial domain, the combined classification schemes (supervised and unsupervised) uncover the different patterns of lithology such as clayey-silt, diatom-silt and silty-clay from an un-cored section of the drilled hole. In addition, the BNN approach is capable of estimating uncertainty in the predictive modeling of three types of rocks over the entire lithology section at site U1343. Alternate succession of clayey-silt, diatom-silt and silty-clay may be representative of crustal inhomogeneity in general and thus could be a basis for detailed study related to the productivity of methane gas in the oceans worldwide. Moreover, at 530 m depth below seafloor (DSF), the transition from Pliocene to Pleistocene could be linked to lithological alternation between the clayey-silt and the diatom-silt. The present results could provide the basis for detailed study to gain deeper insight into the Bering Sea's sediment deposition and sequence.

  18. Unsupervised classification of Space Acceleration Measurement System (SAMS) data using ART2-A

    NASA Technical Reports Server (NTRS)

    Smith, A. D.; Sinha, A.

    1999-01-01

    The Space Acceleration Measurement System (SAMS) has been developed by NASA to monitor the microgravity acceleration environment aboard the space shuttle. The amount of data collected by a SAMS unit during a shuttle mission is in the several gigabytes range. Adaptive Resonance Theory 2-A (ART2-A), an unsupervised neural network, has been used to cluster these data and to develop cause and effect relationships among disturbances and the acceleration environment. Using input patterns formed on the basis of power spectral densities (psd), data collected from two missions, STS-050 and STS-057, have been clustered.

  19. Molecular subtyping of bladder cancer using Kohonen self-organizing maps.

    PubMed

    Borkowska, Edyta M; Kruk, Andrzej; Jedrzejczyk, Adam; Rozniecki, Marek; Jablonowski, Zbigniew; Traczyk, Magdalena; Constantinou, Maria; Banaszkiewicz, Monika; Pietrusinski, Michal; Sosnowski, Marek; Hamdy, Freddie C; Peter, Stefan; Catto, James W F; Kaluzewski, Bogdan

    2014-10-01

    Kohonen self-organizing maps (SOMs) are unsupervised Artificial Neural Networks (ANNs) that are good for low-density data visualization. They easily deal with complex and nonlinear relationships between variables. We evaluated molecular events that characterize high- and low-grade BC pathways in the tumors from 104 patients. We compared the ability of statistical clustering with a SOM to stratify tumors according to the risk of progression to more advanced disease. In univariable analysis, tumor stage (log rank P = 0.006) and grade (P < 0.001), HPV DNA (P < 0.004), Chromosome 9 loss (P = 0.04) and the A148T polymorphism (rs 3731249) in CDKN2A (P = 0.02) were associated with progression. Multivariable analysis of these parameters identified that tumor grade (Cox regression, P = 0.001, OR 2.9 (95% CI 1.6-5.2)) and the presence of HPV DNA (P = 0.017, OR 3.8 (95% CI 1.3-11.4)) were the only independent predictors of progression. Unsupervised hierarchical clustering grouped the tumors into discrete branches but did not stratify according to progression free survival (log rank P = 0.39). These genetic variables were presented to SOM input neurons. SOMs are suitable for complex data integration, allow easy visualization of outcomes, and may stratify BC progression more robustly than hierarchical clustering. © 2014 The Authors. Cancer Medicine published by John Wiley & Sons Ltd.

  20. Unsupervised EEG analysis for automated epileptic seizure detection

    NASA Astrophysics Data System (ADS)

    Birjandtalab, Javad; Pouyan, Maziyar Baran; Nourani, Mehrdad

    2016-07-01

    Epilepsy is a neurological disorder which can, if not controlled, potentially cause unexpected death. It is crucial to have accurate automatic pattern recognition and data mining techniques to detect the onset of seizures and inform care-givers to help the patients. EEG signals are the preferred biosignals for diagnosis of epileptic patients. Most of the existing pattern recognition techniques used in EEG analysis rely on supervised machine learning algorithms. Since seizure data are heavily under-represented, such techniques are not always practical, particularly when labeled data are not sufficiently available or when disease progression is rapid and the corresponding EEG footprint pattern will not be robust. Furthermore, EEG pattern change is highly individual dependent and requires experienced specialists to annotate the seizure and non-seizure events. In this work, we present an unsupervised technique to discriminate seizure and non-seizure events. We employ the power spectral density of EEG signals in different frequency bands as informative features to accurately cluster seizure and non-seizure events. The experimental results so far indicate more than 90% accuracy in clustering seizure and non-seizure events without any prior knowledge of the patient's history.

  1. Clustering algorithm evaluation and the development of a replacement for procedure 1. [for crop inventories

    NASA Technical Reports Server (NTRS)

    Lennington, R. K.; Johnson, J. K.

    1979-01-01

    An efficient procedure which clusters data using a completely unsupervised clustering algorithm and then uses labeled pixels to label the resulting clusters or perform a stratified estimate using the clusters as strata is developed. Three clustering algorithms, CLASSY, AMOEBA, and ISOCLS, are compared for efficiency. Three stratified estimation schemes and three labeling schemes are also considered and compared.
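
    The two uses of labeled pixels mentioned in this abstract, labeling clusters and forming a stratified estimate with clusters as strata, can be sketched under simple assumptions. The cluster assignments are taken as given (CLASSY, AMOEBA and ISOCLS are not reproduced), cluster labels come from a majority vote, and the stratified estimate weights each cluster's sample proportion by its share of all pixels. Function names, class names and data are hypothetical.

```python
from collections import Counter, defaultdict

def label_clusters(assignments, labelled):
    """Majority-vote class per cluster.

    assignments: cluster id for every pixel; labelled: {pixel index: class name}.
    """
    votes = defaultdict(Counter)
    for pixel, cls in labelled.items():
        votes[assignments[pixel]][cls] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

def stratified_proportion(assignments, labelled, target):
    """Estimate the share of `target` pixels, treating clusters as strata.

    Clusters without any labelled pixel contribute nothing in this simple sketch.
    """
    sizes = Counter(assignments)
    sampled = defaultdict(list)
    for pixel, cls in labelled.items():
        sampled[assignments[pixel]].append(cls == target)
    total = len(assignments)
    # weight each stratum's sample proportion by the stratum's share of all pixels
    return sum(sizes[c] / total * (sum(hits) / len(hits)) for c, hits in sampled.items())

# hypothetical scene: 10 pixels in two clusters, 5 of them labelled by an analyst
assignments = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
labelled = {0: "wheat", 1: "wheat", 2: "other", 6: "other", 7: "other"}

cluster_labels = label_clusters(assignments, labelled)
wheat_share = stratified_proportion(assignments, labelled, "wheat")
print(cluster_labels, round(wheat_share, 3))
```

    Labeling assigns the whole of cluster 0 to wheat, whereas the stratified estimate credits only two thirds of it, which shows how the two schemes can give different acreage figures from the same labeled pixels.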

  2. The cascaded moving k-means and fuzzy c-means clustering algorithms for unsupervised segmentation of malaria images

    NASA Astrophysics Data System (ADS)

    Abdul-Nasir, Aimi Salihah; Mashor, Mohd Yusoff; Halim, Nurul Hazwani Abd; Mohamed, Zeehaida

    2015-05-01

    Malaria is a life-threatening parasitic infectious disease that accounts for nearly one million deaths each year. Because malaria requires prompt and accurate diagnosis, the current study proposes an unsupervised pixel segmentation based on a clustering algorithm in order to obtain fully segmented red blood cells (RBCs) infected with malaria parasites from thin blood smear images of the P. vivax species. To obtain the segmented infected cell, the malaria images are first enhanced using a modified global contrast stretching technique. Then, an unsupervised segmentation technique based on a clustering algorithm is applied to the intensity component of the malaria image in order to segment the infected cell from its blood cell background. In this study, cascaded moving k-means (MKM) and fuzzy c-means (FCM) clustering algorithms are proposed for malaria slide image segmentation. After that, a median filter algorithm is applied to smooth the image and to remove unwanted regions such as small background pixels. Finally, a seeded region growing area extraction algorithm is applied to remove large unwanted regions that still appear in the image because they are too large to be cleaned by the median filter. The effectiveness of the proposed cascaded MKM and FCM clustering algorithms has been analyzed qualitatively and quantitatively by comparing the proposed cascaded clustering algorithm with the MKM and FCM clustering algorithms. Overall, the results indicate that segmentation using the proposed cascaded clustering algorithm produced the best segmentation performance, achieving acceptable sensitivity as well as high specificity and accuracy values compared to the segmentation results provided by the MKM and FCM algorithms.

  3. Mastication Evaluation With Unsupervised Learning: Using an Inertial Sensor-Based System

    PubMed Central

    Lucena, Caroline Vieira; Lacerda, Marcelo; Caldas, Rafael; De Lima Neto, Fernando Buarque

    2018-01-01

    There is a direct relationship between the prevalence of musculoskeletal disorders of the temporomandibular joint and orofacial disorders. A well-elaborated analysis of the jaw movements provides relevant information for healthcare professionals to conclude their diagnosis. Different approaches have been explored to track jaw movements such that the mastication analysis is getting less subjective; however, all methods are still highly subjective, and the quality of the assessments depends much on the experience of the health professional. In this paper, an accurate and non-invasive method based on a commercial low-cost inertial sensor (MPU6050) to measure jaw movements is proposed. The jaw-movement feature values are compared to those obtained with clinical analysis, showing no statistically significant difference between the two methods. Moreover, we propose to use unsupervised paradigm approaches to cluster mastication patterns of healthy subjects and simulated patients with facial trauma. Two techniques were used in this paper to instantiate the method: Kohonen’s Self-Organizing Maps and K-Means Clustering. Both algorithms perform excellently in processing jaw-movement data, showing encouraging results and potential to bring a full assessment of the masticatory function. The proposed method can be applied in real-time providing relevant dynamic information for health-care professionals. PMID:29651365

  4. Rough-Fuzzy Clustering and Unsupervised Feature Selection for Wavelet Based MR Image Segmentation

    PubMed Central

    Maji, Pradipta; Roy, Shaswati

    2015-01-01

    Image segmentation is an indispensable process in the visualization of human tissues, particularly during clinical analysis of brain magnetic resonance (MR) images. For many human experts, manual segmentation is a difficult and time consuming task, which makes an automated brain MR image segmentation method desirable. In this regard, this paper presents a new segmentation method for brain MR images, integrating judiciously the merits of rough-fuzzy computing and multiresolution image analysis technique. The proposed method assumes that the major brain tissues, namely, gray matter, white matter, and cerebrospinal fluid from the MR images are considered to have different textural properties. The dyadic wavelet analysis is used to extract the scale-space feature vector for each pixel, while the rough-fuzzy clustering is used to address the uncertainty problem of brain MR image segmentation. An unsupervised feature selection method is introduced, based on maximum relevance-maximum significance criterion, to select relevant and significant textural features for segmentation problem, while the mathematical morphology based skull stripping preprocessing step is proposed to remove the non-cerebral tissues like skull. The performance of the proposed method, along with a comparison with related approaches, is demonstrated on a set of synthetic and real brain MR images using standard validity indices. PMID:25848961

  5. On the Implementation of a Land Cover Classification System for SAR Images Using Khoros

    NASA Technical Reports Server (NTRS)

    Medina Revera, Edwin J.; Espinosa, Ramon Vasquez

    1997-01-01

    The Synthetic Aperture Radar (SAR) sensor is widely used to record data about the ground under all atmospheric conditions. The SAR acquired images have very good resolution, which necessitates the development of a classification system that processes the SAR images to extract useful information for different applications. In this work, a complete system for land cover classification was designed and programmed using Khoros, a data flow visual language environment, taking full advantage of the polymorphic data services that it provides. Image analysis was applied to SAR images to improve and automate the processes of recognition and classification of different regions like mountains and lakes. Both unsupervised and supervised classification utilities were used. The unsupervised classification routines included the use of several Classification/Clustering algorithms like the K-means, ISO2, Weighted Minimum Distance, and the Localized Receptive Field (LRF) training/classifier. Different texture analysis approaches such as Invariant Moments, Fractal Dimension and Second Order statistics were implemented for supervised classification of the images. The results and conclusions for SAR image classification using the various unsupervised and supervised procedures are presented based on their accuracy and performance.

  6. Symmetric nonnegative matrix factorization: algorithms and applications to probabilistic clustering.

    PubMed

    He, Zhaoshui; Xie, Shengli; Zdunek, Rafal; Zhou, Guoxu; Cichocki, Andrzej

    2011-12-01

    Nonnegative matrix factorization (NMF) is an unsupervised learning method useful in various applications including image processing and semantic analysis of documents. This paper focuses on symmetric NMF (SNMF), which is a special case of NMF decomposition. Three parallel multiplicative update algorithms using level 3 basic linear algebra subprograms directly are developed for this problem. First, by minimizing the Euclidean distance, a multiplicative update algorithm is proposed, and its convergence under mild conditions is proved. Based on it, we further propose another two fast parallel methods: α-SNMF and β-SNMF algorithms. All of them are easy to implement. These algorithms are applied to probabilistic clustering. We demonstrate their effectiveness for facial image clustering, document categorization, and pattern clustering in gene expression.

  7. Overcoming confounded controls in the analysis of gene expression data from microarray experiments.

    PubMed

    Bhattacharya, Soumyaroop; Long, Dang; Lyons-Weiler, James

    2003-01-01

    A potential limitation of data from microarray experiments exists when improper control samples are used. In cancer research, comparison of tumour expression profiles to those from normal samples is challenging due to tissue heterogeneity (mixed cell populations). A specific example exists in a published colon cancer dataset, in which tissue heterogeneity was reported among the normal samples. In this paper, we show how to overcome or avoid the problem of using normal samples that do not derive from the same tissue of origin as the tumour. We advocate an exploratory unsupervised bootstrap analysis that can reveal unexpected and undesired, but strongly supported, clusters of samples that reflect tissue differences instead of tumour versus normal differences. All of the algorithms used in the analysis, including the maximum difference subset algorithm, unsupervised bootstrap analysis, pooled variance t-test for finding differentially expressed genes and the jackknife to reduce false positives, are incorporated into our online Gene Expression Data Analyzer (http://bioinformatics.upmc.edu/GE2/GEDA.html).

  8. Segmentation of fluorescence microscopy cell images using unsupervised mining.

    PubMed

    Du, Xian; Dua, Sumeet

    2010-05-28

    The accurate measurement of cell and nuclei contours is critical for the sensitive and specific detection of changes in normal cells in several medical informatics disciplines. Within microscopy, this task is facilitated using fluorescence cell stains, and segmentation is often the first step in such approaches. Due to the complex nature of cell issues and problems inherent to microscopy, unsupervised mining approaches of clustering can be incorporated in the segmentation of cells. In this study, we have developed and evaluated the performance of multiple unsupervised data mining techniques in cell image segmentation. We adapt four distinctive, yet complementary, methods for unsupervised learning, including those based on k-means clustering, EM, Otsu's threshold, and GMAC. Validation measures are defined, and the performance of the techniques is evaluated both quantitatively and qualitatively using synthetic and recently published real data. Experimental results demonstrate that k-means, Otsu's threshold, and GMAC perform similarly, and have more precise segmentation results than EM. We report that EM has higher recall values and lower precision results from under-segmentation due to its Gaussian model assumption. We also demonstrate that these methods need spatial information to segment complex real cell images with a high degree of efficacy, as expected in many medical informatics applications.
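
    Of the four methods named above, Otsu's threshold is the simplest to write down. The following is a minimal sketch of the standard histogram-based formulation, assuming integer grey levels in the range 0..255 supplied as a flat list. It is not the authors' adapted version, and the function name and sample intensities are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Pixels with value <= t form the low class, the rest form the high class.
    Assumes integer intensities in range(levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the low class
    sum0 = 0    # sum of intensities in the low class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # low class still empty
        w1 = total - w0
        if w1 == 0:
            break               # high class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2   # proportional to between-class variance
        if between > best_var:
            best_t, best_var = t, between
    return best_t

# Hypothetical intensities: dark background plus a brighter stained region.
sample = [10] * 50 + [12] * 50 + [200] * 40 + [205] * 60
t = otsu_threshold(sample)
mask = [p > t for p in sample]   # True marks foreground pixels
```

    The threshold uses intensities only, which is consistent with the abstract's remark that such methods need added spatial information for complex real images.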

  9. Unsupervised color image segmentation using a lattice algebra clustering technique

    NASA Astrophysics Data System (ADS)

    Urcid, Gonzalo; Ritter, Gerhard X.

    2011-08-01

    In this paper we introduce a lattice algebra clustering technique for segmenting digital images in the Red-Green-Blue (RGB) color space. The proposed technique is a two-step procedure. Given an input color image, the first step determines the finite set of its extreme pixel vectors within the color cube by means of the scaled min-W and max-M lattice auto-associative memory matrices, including the minimum and maximum vector bounds. In the second step, maximal rectangular boxes enclosing each extreme color pixel are found using the Chebychev distance between color pixels; afterwards, clustering is performed by assigning each image pixel to its corresponding maximal box. The two steps in our proposed method are completely unsupervised or autonomous. Illustrative examples are provided to demonstrate the color segmentation results, including a brief numerical comparison with two other non-maximal variations of the same clustering technique.

  10. Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions.

    PubMed

    Yang, Yang; Saleemi, Imran; Shah, Mubarak

    2013-07-01

    This paper proposes a novel representation of articulated human actions, gestures, and facial expressions. The main goals of the proposed approach are: 1) to enable recognition using very few examples, i.e., one or k-shot learning, and 2) meaningful organization of unlabeled datasets by unsupervised clustering. Our proposed representation is obtained by automatically discovering high-level subactions or motion primitives, by hierarchical clustering of observed optical flow in four-dimensional, spatial, and motion flow space. The completely unsupervised proposed method, in contrast to state-of-the-art representations like bag of video words, provides a meaningful representation conducive to visual interpretation and textual labeling. Each primitive action depicts an atomic subaction, like directional motion of limb or torso, and is represented by a mixture of four-dimensional Gaussian distributions. For one-shot and k-shot learning, the sequence of primitive labels discovered in a test video is labeled using KL divergence, and can then be represented as a string and matched against similar strings of training videos. The same sequence can also be collapsed into a histogram of primitives or be used to learn a Hidden Markov model to represent classes. We have performed extensive experiments on recognition by one and k-shot learning as well as unsupervised action clustering on six human actions and gesture datasets, a composite dataset, and a database of facial expressions. These experiments confirm the validity and discriminative nature of the proposed representation.

  11. Gastric cancer differentiation using Fourier transform near-infrared spectroscopy with unsupervised pattern recognition

    NASA Astrophysics Data System (ADS)

    Yi, Wei-song; Cui, Dian-sheng; Li, Zhi; Wu, Lan-lan; Shen, Ai-guo; Hu, Ji-ming

    2013-01-01

    The manuscript investigated the application of near-infrared (NIR) spectroscopy for the differentiation of gastric cancer. The 90 spectra from cancerous and normal tissues were collected from a total of 30 surgical specimens using Fourier transform near-infrared spectroscopy (FT-NIR) equipped with a fiber-optic probe. Major spectral differences were observed in the CH-stretching second overtone (9000-7000 cm⁻¹), CH-stretching first overtone (6000-5200 cm⁻¹), and CH-stretching combination (4500-4000 cm⁻¹) regions. By use of unsupervised pattern recognition, such as principal component analysis (PCA) and cluster analysis (CA), all spectra were classified into cancerous and normal tissue groups with accuracy up to 81.1%. The sensitivity and specificity were 100% and 68.2%, respectively. The present results indicate that the CH-stretching first overtone, combination band, and second overtone regions can serve as diagnostic markers for gastric cancer.

  12. Generalized Self-Organizing Maps for Automatic Determination of the Number of Clusters and Their Multiprototypes in Cluster Analysis.

    PubMed

    Gorzalczany, Marian B; Rudzinski, Filip

    2017-06-07

    This paper presents a generalization of self-organizing maps with 1-D neighborhoods (neuron chains) that can be effectively applied to complex cluster analysis problems. The essence of the generalization consists in introducing mechanisms that allow the neuron chain--during learning--to disconnect into subchains, to reconnect some of the subchains again, and to dynamically regulate the overall number of neurons in the system. These features enable the network--working in a fully unsupervised way (i.e., using unlabeled data without a predefined number of clusters)--to automatically generate collections of multiprototypes that are able to represent a broad range of clusters in data sets. First, the operation of the proposed approach is illustrated on some synthetic data sets. Then, this technique is tested using several real-life, complex, and multidimensional benchmark data sets available from the University of California at Irvine (UCI) Machine Learning repository and the Knowledge Extraction based on Evolutionary Learning data set repository. A sensitivity analysis of our approach to changes in control parameters and a comparative analysis with an alternative approach are also performed.

  13. Cluster analysis of polymers using laser-induced breakdown spectroscopy with K-means

    NASA Astrophysics Data System (ADS)

    Yangmin, GUO; Yun, TANG; Yu, DU; Shisong, TANG; Lianbo, GUO; Xiangyou, LI; Yongfeng, LU; Xiaoyan, ZENG

    2018-06-01

    Laser-induced breakdown spectroscopy (LIBS) combined with the K-means algorithm was employed to automatically differentiate industrial polymers under atmospheric conditions. The unsupervised learning algorithm K-means was utilized for the clustering of the LIBS dataset measured from twenty kinds of industrial polymers. To prevent interference from metallic elements, three atomic emission lines (C I 247.86 nm, H I 656.3 nm, and O I 777.3 nm) and one molecular line C–N (0, 0) 388.3 nm were used. The cluster analysis results were obtained through an iterative process. The Davies–Bouldin index was employed to determine the initial number of clusters. The average relative standard deviation values of characteristic spectral lines were used as the iterative criterion. With the proposed approach, the classification accuracy for twenty kinds of industrial polymers reached 99.6%. The results demonstrated that this approach has great potential for industrial polymer recycling by LIBS.
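
    The Davies–Bouldin index mentioned above scores a partition by comparing within-cluster scatter with between-centroid separation, and lower values are better. The sketch below computes the standard index with Euclidean distances for a labelled set of points. How the authors applied it to the LIBS line intensities is not stated in the abstract, so the function names and the two-dimensional toy data are assumptions.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def davies_bouldin(points, labels):
    """Davies-Bouldin index of a partition (lower means tighter, better separated).

    Assumes at least two clusters and no two clusters with identical centroids.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    keys = sorted(clusters)
    cents = {k: centroid(clusters[k]) for k in keys}
    # Scatter of a cluster: mean distance of its members to the centroid.
    scatter = {k: sum(math.dist(p, cents[k]) for p in clusters[k]) / len(clusters[k])
               for k in keys}
    total = 0.0
    for i in keys:
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in keys if j != i)
    return total / len(keys)

# Hypothetical 2-D features standing in for spectral line intensities.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = davies_bouldin(pts, [0, 0, 0, 1, 1, 1])   # natural grouping
bad = davies_bouldin(pts, [0, 1, 0, 1, 0, 1])    # scrambled grouping
```

    To choose a number of clusters in this way, one would run a clustering method for each candidate k and keep the k with the smallest index.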

  14. Semi-supervised clustering methods.

    PubMed

    Bair, Eric

    2013-01-01

    Cluster analysis methods seek to partition a data set into homogeneous subgroups. They are useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as "semi-supervised clustering" methods) that can be applied in these situations. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided.
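
    To make the idea of modifying k-means with partial labels concrete, here is a minimal sketch of one simple variant: centers are initialized from the labelled observations, and those observations keep their known labels on every iteration. It is an illustration only, and the abstract does not say that the review presents the method in this exact form. The function names and toy data are hypothetical.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def seeded_kmeans(points, known, k, iters=20):
    """k-means in which `known` maps a point index to its fixed label (0..k-1).

    Assumes every cluster has at least one labelled observation, so no
    cluster can become empty.
    """
    # Initialize each center from the labelled members of that cluster.
    centers = [mean([points[i] for i, lab in known.items() if lab == j])
               for j in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            if i in known:
                labels[i] = known[i]     # labelled points never change cluster
            else:
                labels[i] = min(range(k), key=lambda j: math.dist(p, centers[j]))
        centers = [mean([p for p, lab in zip(points, labels) if lab == j])
                   for j in range(k)]
    return labels, centers

# Toy data: only the first and fourth observations carry known labels.
pts = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
labels, centers = seeded_kmeans(pts, known={0: 0, 3: 1}, k=2)
```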

  15. Cloud classification from satellite data using a fuzzy sets algorithm: A polar example

    NASA Technical Reports Server (NTRS)

    Key, J. R.; Maslanik, J. A.; Barry, R. G.

    1988-01-01

    Where spatial boundaries between phenomena are diffuse, classification methods which construct mutually exclusive clusters seem inappropriate. The Fuzzy c-means (FCM) algorithm assigns each observation to all clusters, with membership values as a function of distance to the cluster center. The FCM algorithm is applied to AVHRR data for the purpose of classifying polar clouds and surfaces. Careful analysis of the fuzzy sets can provide information on which spectral channels are best suited to the classification of particular features, and can help determine likely areas of misclassification. General agreement in the resulting classes and cloud fraction was found between the FCM algorithm, a manual classification, and an unsupervised maximum likelihood classifier.
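
    The membership rule described above can be sketched with the standard fuzzy c-means formula, in which an observation's membership in cluster i is 1 / Σ_k (d_i / d_k)^(2/(m-1)), where d_i is its distance to center i. The fuzzifier m = 2, the Euclidean distance, and the function name are assumptions for the example; the abstract does not give the settings used for the AVHRR data.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one observation in every cluster.

    The returned values are nonnegative and sum to 1. m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centers]
    # An observation lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

# One observation at distance 1 from the first center and 3 from the second.
u = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
```

    With these distances the memberships come out at roughly 0.9 and 0.1, so the observation is assigned to both clusters but mostly to the nearer one. A full FCM run alternates this step with a membership-weighted update of the centers.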

  16. Clustervision: Visual Supervision of Unsupervised Clustering.

    PubMed

    Kwon, Bum Chul; Eysenbach, Ben; Verma, Janu; Ng, Kenney; De Filippi, Christopher; Stewart, Walter F; Perer, Adam

    2018-01-01

    Clustering, the process of grouping together similar items into distinct partitions, is a common type of unsupervised machine learning that can be useful for summarizing and aggregating complex multi-dimensional data. However, data can be clustered in many ways, and there exists a large body of algorithms designed to reveal different patterns. While having access to a wide variety of algorithms is helpful, in practice, it is quite difficult for data scientists to choose and parameterize algorithms to get the clustering results relevant for their dataset and analytical tasks. To alleviate this problem, we built Clustervision, a visual analytics tool that helps ensure data scientists find the right clustering among the large number of techniques and parameters available. Our system clusters data using a variety of clustering techniques and parameters and then ranks clustering results utilizing five quality metrics. In addition, users can guide the system to produce more relevant results by providing task-relevant constraints on the data. Our visual user interface allows users to find high-quality clustering results, explore the clusters using several coordinated visualization techniques, and select the cluster result that best suits their task. We demonstrate this novel approach using a case study with a team of researchers in the medical domain and showcase that our system empowers users to choose an effective representation of their complex data.

  17. A Fast Implementation of the ISOCLUS Algorithm

    NASA Technical Reports Server (NTRS)

    Memarsadeghi, Nargess; Mount, David M.; Netanyahu, Nathan S.; LeMoigne, Jacqueline

    2003-01-01

    Unsupervised clustering is a fundamental tool in numerous image processing and remote sensing applications. For example, unsupervised clustering is often used to obtain vegetation maps of an area of interest. This approach is useful when reliable training data are either scarce or expensive, and when relatively little a priori information about the data is available. Unsupervised clustering methods play a significant role in the pursuit of unsupervised classification. One of the most popular and widely used clustering schemes for remote sensing applications is the ISOCLUS algorithm, which is based on the ISODATA method. The algorithm is given a set of n data points (or samples) in d-dimensional space, an integer k indicating the initial number of clusters, and a number of additional parameters. The general goal is to compute a set of cluster centers in d-space. Although there is no specific optimization criterion, the algorithm is similar in spirit to the well-known k-means clustering method in which the objective is to minimize the average squared distance of each point to its nearest center, called the average distortion. One significant feature of ISOCLUS over k-means is that clusters may be merged or split, and so the final number of clusters may be different from the number k supplied as part of the input. This algorithm will be described later in this paper. The ISOCLUS algorithm can run very slowly, particularly on large data sets. Given its wide use in remote sensing, its efficient computation is an important goal. We have developed a fast implementation of the ISOCLUS algorithm. Our improvement is based on a recent acceleration to the k-means algorithm, the filtering algorithm, by Kanungo et al. They showed that, by storing the data in a kd-tree, it was possible to significantly reduce the running time of k-means. We have adapted this method for the ISOCLUS algorithm. For technical reasons, which are explained later, it is necessary to make a minor modification to the ISOCLUS specification. We provide empirical evidence, on both synthetic and Landsat image data sets, that our algorithm's performance is essentially the same as that of ISOCLUS, but with significantly lower running times. We show that our algorithm runs from 3 to 30 times faster than a straightforward implementation of ISOCLUS. Our adaptation of the filtering algorithm involves the efficient computation of a number of cluster statistics that are needed for ISOCLUS, but not for k-means.
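
    The average distortion that k-means minimizes, and the assign-then-recenter iteration that reduces it, can be sketched as follows. This is the straightforward brute-force version. It includes neither the kd-tree filtering acceleration nor the ISOCLUS merge and split steps, and the function names and toy data are hypothetical.

```python
import math

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def average_distortion(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(math.dist(p, centers[nearest(p, centers)]) ** 2
               for p in points) / len(points)

def lloyd_step(points, centers):
    """One k-means iteration: assign points, then move centers to cluster means."""
    groups = [[] for _ in centers]
    for p in points:
        groups[nearest(p, centers)].append(p)
    new_centers = []
    for old, members in zip(centers, groups):
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)   # leave the center of an empty cluster in place
    return new_centers

pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
start = [(1, 1), (2, 1)]
before = average_distortion(pts, start)
after = average_distortion(pts, lloyd_step(pts, start))
```

    Each step scans every point against every center, which is the cost that the filtering approach described in the abstract reduces by storing the data in a kd-tree.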

  18. Unsupervised clustering of gene expression data points at hypoxia as possible trigger for metabolic syndrome.

    PubMed

    Ptitsyn, Andrey; Hulver, Matthew; Cefalu, William; York, David; Smith, Steven R

    2006-12-19

    Classification of large volumes of data produced in a microarray experiment allows for the extraction of important clues as to the nature of a disease. Using the multi-dimensional unsupervised FOREL (FORmal ELement) algorithm, we have re-analyzed three public datasets of skeletal muscle gene expression in connection with insulin resistance and type 2 diabetes (DM2). Our analysis revealed the major line of variation between expression profiles of normal, insulin resistant, and diabetic skeletal muscle. A cluster of the most "metabolically sound" samples occupied one end of this line. The distance along this line coincided with the classic markers of diabetes risk, namely obesity and insulin resistance, but did not follow the accepted clinical diagnosis of DM2 as defined by the presence or absence of hyperglycemia. Genes implicated in this expression pattern are those controlling skeletal muscle fiber type and glycolytic metabolism. Additionally, myoglobin and hemoglobin were upregulated and ribosomal genes deregulated in insulin resistant patients. Our findings are concordant with the changes seen in skeletal muscle with altitude hypoxia. This suggests that hypoxia and a shift to glycolytic metabolism may also drive insulin resistance.

  19. Semi-supervised clustering methods

    PubMed Central

    Bair, Eric

    2013-01-01

    Cluster analysis methods seek to partition a data set into homogeneous subgroups. They are useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as “semi-supervised clustering” methods) that can be applied in these situations. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided. PMID:24729830

  20. Unsupervised detection and removal of muscle artifacts from scalp EEG recordings using canonical correlation analysis, wavelets and random forests.

    PubMed

    Anastasiadou, Maria N; Christodoulakis, Manolis; Papathanasiou, Eleftherios S; Papacostas, Savvas S; Mitsis, Georgios D

    2017-09-01

    This paper proposes supervised and unsupervised algorithms for automatic muscle artifact detection and removal from long-term EEG recordings, which combine canonical correlation analysis (CCA) and wavelets with random forests (RF). The proposed algorithms first perform CCA and continuous wavelet transform of the canonical components to generate a number of features which include component autocorrelation values and wavelet coefficient magnitude values. A subset of the most important features is subsequently selected using RF and labelled observations (supervised case) or synthetic data constructed from the original observations (unsupervised case). The proposed algorithms are evaluated using realistic simulation data as well as 30-min epochs of non-invasive EEG recordings obtained from ten patients with epilepsy. We assessed the performance of the proposed algorithms using classification performance and goodness-of-fit values for noisy and noise-free signal windows. In the simulation study, where the ground truth was known, the proposed algorithms yielded almost perfect performance. In the case of experimental data, where expert marking was performed, the results suggest that both the supervised and unsupervised algorithm versions were able to remove artifacts without affecting noise-free channels considerably, outperforming standard CCA, independent component analysis (ICA) and Lagged Auto-Mutual Information Clustering (LAMIC). The proposed algorithms achieved excellent performance for both simulation and experimental data. Importantly, for the first time to our knowledge, we were able to perform entirely unsupervised artifact removal, i.e. without using already marked noisy data segments, achieving performance that is comparable to the supervised case. Overall, the results suggest that the proposed algorithms yield significant future potential for improving EEG signal quality in research or clinical settings without the need for marking by expert neurophysiologists, EMG signal recording and user visual inspection. Copyright © 2017 International Federation of Clinical Neurophysiology. Published by Elsevier B.V. All rights reserved.

  1. Clustering approach for unsupervised segmentation of malarial Plasmodium vivax parasite

    NASA Astrophysics Data System (ADS)

    Abdul-Nasir, Aimi Salihah; Mashor, Mohd Yusoff; Mohamed, Zeehaida

    2017-10-01

    Malaria is a global health problem, particularly in Africa and South Asia, where it causes countless deaths and morbidity cases. Efficient control and prompt treatment of this disease require early detection and accurate diagnosis due to the large number of cases reported yearly. To achieve this aim, this paper proposes an image segmentation approach via unsupervised pixel segmentation of malaria parasite to automate the diagnosis of malaria. In this study, a modified clustering algorithm, namely enhanced k-means (EKM) clustering, is proposed for malaria image segmentation. In the proposed EKM clustering, the concept of variance and a new version of the transferring process for clustered members are used to assist the assignment of data to the proper centre during clustering, so that a well-segmented malaria image can be generated. The effectiveness of the proposed EKM clustering has been analyzed qualitatively and quantitatively by comparing this algorithm with two popular image segmentation techniques, namely Otsu's thresholding and k-means clustering. The experimental results show that the proposed EKM clustering has successfully segmented 100 malaria images of P. vivax species with segmentation accuracy, sensitivity and specificity of 99.20%, 87.53% and 99.58%, respectively. Hence, the proposed EKM clustering can be considered as an image segmentation tool for segmenting malaria images.
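
    The three reported measures are usually computed pixel by pixel from a predicted mask and a reference mask. The sketch below assumes that convention, with parasite pixels as the positive class, since the abstract does not spell out the definitions used. The function name and the eight-pixel example are hypothetical.

```python
def segmentation_scores(predicted, truth):
    """Accuracy, sensitivity and specificity for two equal-length binary masks.

    Assumes both classes occur in `truth`, so neither denominator is zero.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)           # parasite found
    tn = sum(1 for p, t in pairs if not p and not t)   # background kept
    fp = sum(1 for p, t in pairs if p and not t)       # background marked as parasite
    fn = sum(1 for p, t in pairs if not p and t)       # parasite missed
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Flattened toy masks: 1 marks a parasite pixel, 0 marks background.
pred = [1, 1, 0, 0, 0, 0, 1, 0]
ref = [1, 1, 1, 0, 0, 0, 0, 0]
acc, sens, spec = segmentation_scores(pred, ref)
```

    Because background pixels usually far outnumber parasite pixels, accuracy and specificity can be high while sensitivity is noticeably lower, which matches the pattern of the figures quoted in the abstract.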

  2. Detection of molecular signatures of oral squamous cell carcinoma and normal epithelium - application of a novel methodology for unsupervised segmentation of imaging mass spectrometry data.

    PubMed

    Widlak, Piotr; Mrukwa, Grzegorz; Kalinowska, Magdalena; Pietrowska, Monika; Chekan, Mykola; Wierzgon, Janusz; Gawin, Marta; Drazek, Grzegorz; Polanska, Joanna

    2016-06-01

    Intra-tumor heterogeneity is a vivid problem of molecular oncology that could be addressed by imaging mass spectrometry. Here we aimed to assess molecular heterogeneity of oral squamous cell carcinoma and to detect signatures discriminating normal and cancerous epithelium. Tryptic peptides were analyzed by MALDI-IMS in tissue specimens from five patients with oral cancer. A novel algorithm for IMS data analysis was developed and implemented, which included Gaussian mixture modeling for detection of spectral components and an iterative k-means algorithm for unsupervised spectra clustering performed in a domain reduced to a subset of the most dispersed components. About 4% of the detected peptides showed significantly different abundances between normal epithelium and tumor, and could be considered as a molecular signature of oral cancer. Moreover, unsupervised clustering revealed two major sub-regions within expert-defined tumor areas. One of them showed molecular similarity with histologically normal epithelium. The other one showed similarity with connective tissue, yet was markedly different from normal epithelium. Pathologist's re-inspection of tissue specimens confirmed distinct features in both tumor sub-regions: foci of actual cancer cells or cancer microenvironment-related cells prevailed in corresponding areas. Hence, molecular differences detected during automated segmentation of IMS data had an apparent reflection in real structures present in the tumor. © 2016 The Authors. Proteomics Published by Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

  3. Ellipsoidal fuzzy learning for smart car platoons

    NASA Astrophysics Data System (ADS)

    Dickerson, Julie A.; Kosko, Bart

    1993-12-01

    A neural-fuzzy system combined supervised and unsupervised learning to find and tune the fuzzy rules. An additive fuzzy system approximates a function by covering its graph with fuzzy rules. A fuzzy rule patch can take the form of an ellipsoid in the input-output space. Unsupervised competitive learning found the statistics of data clusters. The covariance matrix of each synaptic quantization vector defined an ellipsoid centered at the centroid of the data cluster. Tightly clustered data gave smaller ellipsoids or more certain rules. Sparse data gave larger ellipsoids or less certain rules. Supervised learning tuned the ellipsoids to improve the approximation. The supervised neural system used gradient descent to find the ellipsoidal fuzzy patches. It locally minimized the mean-squared error of the fuzzy approximation. Hybrid ellipsoidal learning estimated the control surface for a smart car controller.

  4. Characterizing Heterogeneity within Head and Neck Lesions Using Cluster Analysis of Multi-Parametric MRI Data.

    PubMed

    Borri, Marco; Schmidt, Maria A; Powell, Ceri; Koh, Dow-Mu; Riddell, Angela M; Partridge, Mike; Bhide, Shreerang A; Nutting, Christopher M; Harrington, Kevin J; Newbold, Katie L; Leach, Martin O

    2015-01-01

    To describe a methodology, based on cluster analysis, to partition multi-parametric functional imaging data into groups (or clusters) of similar functional characteristics, with the aim of characterizing functional heterogeneity within head and neck tumour volumes. To evaluate the performance of the proposed approach on a set of longitudinal MRI data, analysing the evolution of the obtained sub-sets with treatment. The cluster analysis workflow was applied to a combination of dynamic contrast-enhanced and diffusion-weighted imaging MRI data from a cohort of squamous cell carcinoma of the head and neck patients. Cumulative distributions of voxels, containing pre and post-treatment data and including both primary tumours and lymph nodes, were partitioned into k clusters (k = 2, 3 or 4). Principal component analysis and cluster validation were employed to investigate data composition and to independently determine the optimal number of clusters. The evolution of the resulting sub-regions with induction chemotherapy treatment was assessed relative to the number of clusters. The clustering algorithm was able to separate clusters which significantly reduced in voxel number following induction chemotherapy from clusters with a non-significant reduction. Partitioning with the optimal number of clusters (k = 4), determined with cluster validation, produced the best separation between reducing and non-reducing clusters. The proposed methodology was able to identify tumour sub-regions with distinct functional properties, independently separating clusters which were affected differently by treatment. This work demonstrates that unsupervised cluster analysis, with no prior knowledge of the data, can be employed to provide a multi-parametric characterization of functional heterogeneity within tumour volumes.

  5. Meta-Analytical Online Repository of Gene Expression Profiles of MDS Stem Cells

    DTIC Science & Technology

    2015-12-01

    Myelodysplastic syndrome, AML: Acute myeloid leukemia, ALL: Acute lymphoblastic leukemia Numbers in brackets are reference numbers. doi:10.1371/journal.pone...disorders such as acute leukemias and myelodysplastic syndromes would be distinguishable in our analysis. Unsupervised clustering showed that even though...al. Angiogenesis in acute and chronic leukemias and myelodysplastic syndromes. Blood. 2000;96:2240–2245. [PubMed: 10979972] 17. Yoon SY, Li CY, Lloyd

  6. Chemical modeling of groundwater in the Banat Plain, southwestern Romania, with elevated As content and co-occurring species by combining diagrams and unsupervised multivariate statistical approaches.

    PubMed

    Butaciu, Sinziana; Senila, Marin; Sarbu, Costel; Ponta, Michaela; Tanaselia, Claudiu; Cadar, Oana; Roman, Marius; Radu, Emil; Sima, Mihaela; Frentiu, Tiberiu

    2017-04-01

    The study proposes a combined model based on diagrams (Gibbs, Piper, Stuyfzand Hydrogeochemical Classification System) and unsupervised statistical approaches (Cluster Analysis, Principal Component Analysis, Fuzzy Principal Component Analysis, Fuzzy Hierarchical Cross-Clustering) to describe natural enrichment of inorganic arsenic and co-occurring species in groundwater in the Banat Plain, southwestern Romania. Speciation of inorganic As (arsenite, arsenate), ion concentrations (Na⁺, K⁺, Ca²⁺, Mg²⁺, HCO₃⁻, Cl⁻, F⁻, SO₄²⁻, PO₄³⁻, NO₃⁻), pH, redox potential, conductivity and total dissolved substances were determined. Classical diagrams provided the hydrochemical characterization, while statistical approaches were helpful to establish (i) the natural mechanism of occurrence of As and F⁻ species and the anthropogenic one for NO₃⁻, SO₄²⁻, PO₄³⁻ and K⁺ and (ii) classification of groundwater based on content of arsenic species. The HCO₃⁻ type of local groundwater and alkaline pH (8.31-8.49) were found to be responsible for the enrichment of arsenic species and occurrence of F⁻, but by different paths. The PO₄³⁻-AsO₄³⁻ ion exchange and water-rock interaction (silicates hydrolysis and desorption from clay) were associated with arsenate enrichment in the oxidizing aquifer. Fuzzy Hierarchical Cross-Clustering was the strongest tool for the rapid simultaneous classification of groundwaters as a function of arsenic content and hydrogeochemical characteristics. The approach indicated the Na⁺-F⁻-pH cluster as a marker for groundwater with naturally elevated As and highlighted which parameters need to be monitored. A chemical conceptual model illustrating the natural and anthropogenic paths and enrichment of As and co-occurring species in the local groundwater, supported by mineralogical analysis of rocks, was established. Copyright © 2016 Elsevier Ltd. All rights reserved.

  7. Subtyping of Children with Developmental Dyslexia via Bootstrap Aggregated Clustering and the Gap Statistic: Comparison with the Double-Deficit Hypothesis

    ERIC Educational Resources Information Center

    King, Wayne M.; Giess, Sally A.; Lombardino, Linda J.

    2007-01-01

    Background: The marked degree of heterogeneity in persons with developmental dyslexia has motivated the investigation of possible subtypes. Attempts have proceeded both from theoretical models of reading and the application of unsupervised learning (clustering) methods. Previous cluster analyses of data obtained from persons with reading…

  8. Geospatiotemporal Data Mining of Remotely Sensed Phenology for Unsupervised Forest Threat Detection

    NASA Astrophysics Data System (ADS)

    Mills, R. T.; Hoffman, F. M.; Kumar, J.; Vulli, S. S.; Hargrove, W. W.; Spruce, J.

    2010-12-01

    Hargrove and Hoffman have previously developed and applied a scalable geospatiotemporal data mining approach to define a set of categorical, multivariate classes or states for describing and tracking the behavior of ecosystem properties through time within a multi-dimensional phase or state space. The method employs a standard k-means cluster analysis with enhancements that reduce the number of required comparisons, dramatically accelerating iterative convergence. In support of efforts by the USDA Forest Service to develop a National Early Warning System for Forest Disturbances, we have applied this geospatiotemporal cluster analysis procedure to annual phenology patterns derived from Moderate Resolution Imaging Spectroradiometer (MODIS) Normalized Difference Vegetation Index (NDVI) for unsupervised change detection. We will present initial results from the analysis of seven years of 250-m MODIS NDVI data for the conterminous United States. While determining what constitutes a "normal" phenological pattern for any given location is challenging due to interannual climate variability, a spatially varying climate change trend, and the relatively short record of MODIS NDVI observations, these results demonstrate the utility of the method for detecting significant mortality events, like the progressive damage from mountain pine beetle, and suggest that the technique may be successfully implemented as a key component in an early warning system for identifying forest threats from natural and anthropogenic disturbances at a continental scale.

  9. Supervised versus unsupervised categorization: two sides of the same coin?

    PubMed

    Pothos, Emmanuel M; Edwards, Darren J; Perlman, Amotz

    2011-09-01

    Supervised and unsupervised categorization have been studied in separate research traditions. A handful of studies have attempted to explore a possible convergence between the two. The present research builds on these studies by comparing the unsupervised categorization results of Pothos et al. (2011; Pothos et al., 2008) with the results from two procedures of supervised categorization. In two experiments, we tested 375 participants with nine different stimulus sets and examined the relation between ease of learning of a classification, memory for a classification, and spontaneous preference for a classification. After taking into account the role of the number of category labels (clusters) in supervised learning, we found the three variables to be closely associated with each other. Our results provide encouragement for researchers seeking unified theoretical explanations for supervised and unsupervised categorization, but raise a range of challenging theoretical questions.

  10. Noise-enhanced clustering and competitive learning algorithms.

    PubMed

    Osoba, Osonde; Kosko, Bart

    2013-01-01

    Noise can provably speed up convergence in many centroid-based clustering algorithms. This includes the popular k-means clustering algorithm. The clustering noise benefit follows from the general noise benefit for the expectation-maximization algorithm because many clustering algorithms are special cases of the expectation-maximization algorithm. Simulations show that noise also speeds up convergence in stochastic unsupervised competitive learning, supervised competitive learning, and differential competitive learning. Copyright © 2012 Elsevier Ltd. All rights reserved.

  11. Shadow detection and removal in RGB VHR images for land use unsupervised classification

    NASA Astrophysics Data System (ADS)

    Movia, A.; Beinat, A.; Crosilla, F.

    2016-09-01

    Nowadays, high resolution aerial images are widely available thanks to the diffusion of advanced technologies such as UAVs (Unmanned Aerial Vehicles) and new satellite missions. Although these developments offer new opportunities for accurate land use analysis and change detection, cloud and terrain shadows actually limit benefits and possibilities of modern sensors. Focusing on the problem of shadow detection and removal in VHR color images, the paper proposes new solutions and analyses how they can enhance common unsupervised classification procedures for identifying land use classes related to the CO2 absorption. To this aim, an improved fully automatic procedure has been developed for detecting image shadows using exclusively RGB color information, and avoiding user interaction. Results show a significant accuracy enhancement with respect to similar methods using RGB based indexes. Furthermore, novel solutions derived from Procrustes analysis have been applied to remove shadows and restore brightness in the images. In particular, two methods implementing the so called "anisotropic Procrustes" and the "not-centered oblique Procrustes" algorithms have been developed and compared with the linear correlation correction method based on the Cholesky decomposition. To assess how shadow removal can enhance unsupervised classifications, results obtained with classical methods such as k-means, maximum likelihood, and self-organizing maps, have been compared to each other and with a supervised clustering procedure.

  12. Automatic identification of the number of food items in a meal using clustering techniques based on the monitoring of swallowing and chewing.

    PubMed

    Lopez-Meyer, Paulo; Schuckers, Stephanie; Makeyev, Oleksandr; Fontana, Juan M; Sazonov, Edward

    2012-09-01

    The number of distinct foods consumed in a meal is of significant clinical concern in the study of obesity and other eating disorders. This paper proposes the use of information contained in chewing and swallowing sequences for meal segmentation by food types. Data collected from experiments of 17 volunteers were analyzed using two different clustering techniques. First, an unsupervised clustering technique, Affinity Propagation (AP), was used to automatically identify the number of segments within a meal. Second, performance of the unsupervised AP method was compared to a supervised learning approach based on Agglomerative Hierarchical Clustering (AHC). While the AP method was able to obtain 90% accuracy in predicting the number of food items, the AHC achieved an accuracy >95%. Experimental results suggest that the proposed models of automatic meal segmentation may be utilized as part of an integral application for objective Monitoring of Ingestive Behavior in free living conditions.

  13. Astrophysical properties of star clusters in the Magellanic Clouds homogeneously estimated by ASteCA

    NASA Astrophysics Data System (ADS)

    Perren, G. I.; Piatti, A. E.; Vázquez, R. A.

    2017-06-01

    Aims: We seek to produce a homogeneous catalog of astrophysical parameters of 239 resolved star clusters, located in the Small and Large Magellanic Clouds, observed in the Washington photometric system. Methods: The cluster sample was processed with the recently introduced Automated Stellar Cluster Analysis (ASteCA) package, which ensures both an automatized and a fully reproducible treatment, together with a statistically based analysis of their fundamental parameters and associated uncertainties. The fundamental parameters determined for each cluster with this tool, via a color-magnitude diagram (CMD) analysis, are metallicity, age, reddening, distance modulus, and total mass. Results: We generated a homogeneous catalog of structural and fundamental parameters for the studied cluster sample and performed a detailed internal error analysis along with a thorough comparison with values taken from 26 published articles. We studied the distribution of cluster fundamental parameters in both Clouds and obtained their age-metallicity relationships. Conclusions: The ASteCA package can be applied to an unsupervised determination of fundamental cluster parameters, which is a task of increasing relevance as more data becomes available through upcoming surveys. A table with the estimated fundamental parameters for the 239 clusters analyzed is only available at the CDS via anonymous ftp to http://cdsarc.u-strasbg.fr (http://130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/602/A89

  14. Structure-related clustering of gene expression fingerprints of thp-1 cells exposed to smaller polycyclic aromatic hydrocarbons.

    PubMed

    Wan, B; Yarbrough, J W; Schultz, T W

    2008-01-01

    This study was undertaken to test the hypothesis that structurally similar PAHs induce similar gene expression profiles. THP-1 cells were exposed to a series of 12 selected PAHs at 50 microM for 24 hours, and gene expression profiles were analyzed using both unsupervised and supervised methods. Clustering analysis of gene expression profiles revealed that the 12 tested chemicals were grouped into five clusters. Within each cluster, the gene expression profiles are more similar to each other than to the ones outside the cluster. 1-Methylanthracene and 1-methylfluorene were found to have the most similar profiles; dibenzothiophene and dibenzofuran were found to share common profiles with fluorene. As expression pattern comparisons were expanded, similarity in genomic fingerprint dropped off dramatically. Prediction analysis of microarrays (PAM) based on the clustering pattern generated 49 predictor genes that can be used for sample discrimination. Moreover, significance analysis of microarrays (SAM) identified 598 genes modulated by the tested chemicals, with a variety of biological processes, such as cell cycle, metabolism, and protein binding, and KEGG pathways being significantly (p < 0.05) affected. It is feasible to distinguish structurally different PAHs based on their genomic fingerprints, which are mechanism based.

  15. Unsupervised classification of surface defects in wire rod production obtained by eddy current sensors.

    PubMed

    Saludes-Rodil, Sergio; Baeyens, Enrique; Rodríguez-Juan, Carlos P

    2015-04-29

    An unsupervised approach to classify surface defects in wire rod manufacturing is developed in this paper. The defects are extracted from an eddy current signal and classified using a clustering technique that uses the dynamic time warping distance as the dissimilarity measure. The new approach has been successfully tested using industrial data. It is shown that it outperforms other classification alternatives, such as the modified Fourier descriptors.
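
    The abstract above names dynamic time warping (DTW) as the dissimilarity measure but does not say which clustering technique is used on top of it. The following is a minimal sketch, not the authors' method: it assumes 1-D signals, the classic unconstrained DTW recurrence with absolute difference as the local cost, and a simple threshold grouping (connected components, equivalent to cutting a single-linkage tree). The function names `dtw_distance` and `threshold_clusters` and the threshold value are illustrative assumptions.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def threshold_clusters(signals, threshold):
    """Group signals linked by DTW distances <= threshold (assumed grouping rule).

    Returns one representative index per signal; equal values mean same group.
    """
    parent = list(range(len(signals)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(signals)):
        for j in range(i + 1, len(signals)):
            if dtw_distance(signals[i], signals[j]) <= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(signals))]


if __name__ == "__main__":
    # Hypothetical signals: two time-shifted bumps and two flat segments.
    signals = [[0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0], [5, 5, 5, 5], [5, 5, 5, 5, 5]]
    print(threshold_clusters(signals, threshold=1.0))
```

    DTW tolerates local stretching in time, so the two bump-shaped signals of different length fall into one group even though a pointwise Euclidean distance could not compare them directly.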

  16. Self-organizing neural networks--an alternative way of cluster analysis in clinical chemistry.

    PubMed

    Reibnegger, G; Wachter, H

    1996-04-15

    Supervised learning schemes have been employed by several workers for training neural networks designed to solve clinical problems. We demonstrate that unsupervised techniques can also produce interesting and meaningful results. Using a data set on the chemical composition of milk from 22 different mammals, we demonstrate that self-organizing feature maps (Kohonen networks) as well as a modified version of error backpropagation technique yield results mimicking conventional cluster analysis. Both techniques are able to project a potentially multi-dimensional input vector onto a two-dimensional space whereby neighborhood relationships remain conserved. Thus, these techniques can be used for reducing dimensionality of complicated data sets and for enhancing comprehensibility of features hidden in the data matrix.

  17. Unsupervised Learning and Pattern Recognition of Biological Data Structures with Density Functional Theory and Machine Learning.

    PubMed

    Chen, Chien-Chang; Juan, Hung-Hui; Tsai, Meng-Yuan; Lu, Henry Horng-Shing

    2018-01-11

    By introducing the methods of machine learning into density functional theory, we made a detour for the construction of the most probable density function, which can be estimated by learning relevant features from the system of interest. Using the properties of the universal functional, the vital core of density functional theory, the most probable cluster numbers and the corresponding cluster boundaries in a system under study can be simultaneously and automatically determined, and the plausibility rests on the Hohenberg-Kohn theorems. For method validation and pragmatic applications, interdisciplinary problems from physical to biological systems were considered. The amalgamation of uncharged atomic clusters validated the unsupervised search for the cluster numbers, and the corresponding cluster boundaries were exhibited likewise. Highly accurate clustering results on Fisher's iris dataset showed the feasibility and the flexibility of the proposed scheme. Brain tumor detection from low-dimensional magnetic resonance imaging datasets and segmentation of high-dimensional neural network imagery in the Brainbow system were also used to inspect the practicality of the method. The experimental results exhibit the successful connection between the physical theory and the machine learning methods and will benefit clinical diagnoses.

  18. A harmonic linear dynamical system for prominent ECG feature extraction.

    PubMed

    Thi, Ngoc Anh Nguyen; Yang, Hyung-Jeong; Kim, SunHee; Do, Luu Ngoc

    2014-01-01

    Unsupervised mining of electrocardiography (ECG) time series is a crucial task in biomedical applications. For clustering to be effective, the prominent features extracted by preprocessing analysis of multiple ECG time series need to be investigated. In this paper, a Harmonic Linear Dynamical System is applied to discover vital prominent features by mining the evolving hidden dynamics and correlations in ECG time series. The comprehensible and interpretable features discovered by the proposed feature extraction methodology support the accuracy and the reliability of clustering results. In particular, the empirical evaluation results of the proposed method demonstrate improved clustering performance compared to the previous mainstream feature extraction approaches for ECG time series clustering tasks. Furthermore, the experimental results on real-world datasets show scalability, with computation time linear in the duration of the time series.

  19. Unsupervised Structure Detection in Biomedical Data.

    PubMed

    Vogt, Julia E

    2015-01-01

    A major challenge in computational biology is to find simple representations of high-dimensional data that best reveal the underlying structure. In this work, we present an intuitive and easy-to-implement method based on ranked neighborhood comparisons that detects structure in unsupervised data. The method is based on ordering objects in terms of similarity and on the mutual overlap of nearest neighbors. This basic framework was originally introduced in the field of social network analysis to detect actor communities. We demonstrate that the same ideas can successfully be applied to biomedical data sets in order to reveal complex underlying structure. The algorithm is very efficient and works on distance data directly without requiring a vectorial embedding of data. Comprehensive experiments demonstrate the validity of this approach. Comparisons with state-of-the-art clustering methods show that the presented method outperforms hierarchical methods as well as density based clustering methods and model-based clustering. A further advantage of the method is that it simultaneously provides a visualization of the data. Especially in biomedical applications, the visualization of data can be used as a first pre-processing step when analyzing real world data sets to get an intuition of the underlying data structure. We apply this model to synthetic data as well as to various biomedical data sets which demonstrate the high quality and usefulness of the inferred structure.

  20. Characterizing Heterogeneity within Head and Neck Lesions Using Cluster Analysis of Multi-Parametric MRI Data

    PubMed Central

    Borri, Marco; Schmidt, Maria A.; Powell, Ceri; Koh, Dow-Mu; Riddell, Angela M.; Partridge, Mike; Bhide, Shreerang A.; Nutting, Christopher M.; Harrington, Kevin J.; Newbold, Katie L.; Leach, Martin O.

    2015-01-01

    Purpose: To describe a methodology, based on cluster analysis, to partition multi-parametric functional imaging data into groups (or clusters) of similar functional characteristics, with the aim of characterizing functional heterogeneity within head and neck tumour volumes. To evaluate the performance of the proposed approach on a set of longitudinal MRI data, analysing the evolution of the obtained sub-sets with treatment. Material and Methods: The cluster analysis workflow was applied to a combination of dynamic contrast-enhanced and diffusion-weighted imaging MRI data from a cohort of squamous cell carcinoma of the head and neck patients. Cumulative distributions of voxels, containing pre and post-treatment data and including both primary tumours and lymph nodes, were partitioned into k clusters (k = 2, 3 or 4). Principal component analysis and cluster validation were employed to investigate data composition and to independently determine the optimal number of clusters. The evolution of the resulting sub-regions with induction chemotherapy treatment was assessed relative to the number of clusters. Results: The clustering algorithm was able to separate clusters which significantly reduced in voxel number following induction chemotherapy from clusters with a non-significant reduction. Partitioning with the optimal number of clusters (k = 4), determined with cluster validation, produced the best separation between reducing and non-reducing clusters. Conclusion: The proposed methodology was able to identify tumour sub-regions with distinct functional properties, independently separating clusters which were affected differently by treatment. This work demonstrates that unsupervised cluster analysis, with no prior knowledge of the data, can be employed to provide a multi-parametric characterization of functional heterogeneity within tumour volumes. PMID:26398888

  1. SC3 - consensus clustering of single-cell RNA-Seq data

    PubMed Central

    Kiselev, Vladimir Yu.; Kirschner, Kristina; Schaub, Michael T.; Andrews, Tallulah; Yiu, Andrew; Chandra, Tamir; Natarajan, Kedar N; Reik, Wolf; Barahona, Mauricio; Green, Anthony R; Hemberg, Martin

    2017-01-01

    Single-cell RNA-seq (scRNA-seq) enables a quantitative cell-type characterisation based on global transcriptome profiles. We present Single-Cell Consensus Clustering (SC3), a user-friendly tool for unsupervised clustering which achieves high accuracy and robustness by combining multiple clustering solutions through a consensus approach. We demonstrate that SC3 is capable of identifying subclones based on the transcriptomes from neoplastic cells collected from patients. PMID:28346451
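
    The abstract above says only that multiple clustering solutions are combined "through a consensus approach". One common way to do this, shown here as a hedged sketch rather than as SC3's actual implementation, is a co-association (consensus) matrix: entry (i, j) is the fraction of solutions in which items i and j share a cluster. The function name `consensus_matrix` and the toy label vectors are assumptions for illustration.

```python
def consensus_matrix(solutions):
    """Fraction of clustering solutions in which each pair of items co-clusters.

    `solutions` is a list of label lists, one per clustering run, all of equal
    length. Label values need not agree across runs; only co-membership counts.
    """
    n = len(solutions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in solutions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(solutions)
    return [[c / runs for c in row] for row in counts]


if __name__ == "__main__":
    # Three hypothetical clusterings of the same three cells.
    runs = [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
    for row in consensus_matrix(runs):
        print(row)
```

    Because only co-membership is counted, arbitrary label permutations between runs do not matter. A final partition can then be obtained by clustering the rows of this matrix, for example hierarchically.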

  2. Searching Remote Homology with Spectral Clustering with Symmetry in Neighborhood Cluster Kernels

    PubMed Central

    Maulik, Ujjwal; Sarkar, Anasua

    2013-01-01

    Remote homology detection among proteins utilizing only the unlabelled sequences is a central problem in comparative genomics. The existing cluster kernel methods based on neighborhoods and profiles and the Markov clustering algorithms are currently the most popular methods for protein family recognition. The deviation from random walks with inflation or dependency on hard threshold in similarity measure in those methods requires an enhancement for homology detection among multi-domain proteins. We propose to combine spectral clustering with neighborhood kernels in Markov similarity for enhancing sensitivity in detecting homology independent of “recent” paralogs. The spectral clustering approach with new combined local alignment kernels more effectively exploits the unsupervised protein sequences globally reducing inter-cluster walks. When combined with the corrections based on modified symmetry based proximity norm deemphasizing outliers, the technique proposed in this article outperforms other state-of-the-art cluster kernels among all twelve implemented kernels. The comparison with the state-of-the-art string and mismatch kernels also show the superior performance scores provided by the proposed kernels. Similar performance improvement also is found over an existing large dataset. Therefore the proposed spectral clustering framework over combined local alignment kernels with modified symmetry based correction achieves superior performance for unsupervised remote homolog detection even in multi-domain and promiscuous domain proteins from Genolevures database families with better biological relevance. Source code available upon request. Contact: sarkar@labri.fr. PMID:23457439

  3. Searching remote homology with spectral clustering with symmetry in neighborhood cluster kernels.

    PubMed

    Maulik, Ujjwal; Sarkar, Anasua

    2013-01-01

    Remote homology detection among proteins utilizing only the unlabelled sequences is a central problem in comparative genomics. The existing cluster kernel methods based on neighborhoods and profiles and the Markov clustering algorithms are currently the most popular methods for protein family recognition. The deviation from random walks with inflation or dependency on hard threshold in similarity measure in those methods requires an enhancement for homology detection among multi-domain proteins. We propose to combine spectral clustering with neighborhood kernels in Markov similarity for enhancing sensitivity in detecting homology independent of "recent" paralogs. The spectral clustering approach with new combined local alignment kernels more effectively exploits the unsupervised protein sequences globally reducing inter-cluster walks. When combined with the corrections based on modified symmetry based proximity norm deemphasizing outliers, the technique proposed in this article outperforms other state-of-the-art cluster kernels among all twelve implemented kernels. The comparison with the state-of-the-art string and mismatch kernels also show the superior performance scores provided by the proposed kernels. Similar performance improvement also is found over an existing large dataset. Therefore the proposed spectral clustering framework over combined local alignment kernels with modified symmetry based correction achieves superior performance for unsupervised remote homolog detection even in multi-domain and promiscuous domain proteins from Genolevures database families with better biological relevance. Source code available upon request. sarkar@labri.fr.

  4. Semi-automatic ground truth generation using unsupervised clustering and limited manual labeling: Application to handwritten character recognition

    PubMed Central

    Vajda, Szilárd; Rangoni, Yves; Cecotti, Hubert

    2015-01-01

    For training supervised classifiers to recognize different patterns, large data collections with accurate labels are necessary. In this paper, we propose a generic, semi-automatic labeling technique for large handwritten character collections. In order to speed up the creation of a large scale ground truth, the method combines unsupervised clustering and minimal expert knowledge. To exploit the potential discriminant complementarities across features, each character is projected into five different feature spaces. After clustering the images in each feature space, the human expert labels the cluster centers. Each data point inherits the label of its cluster’s center. A majority (or unanimity) vote decides the label of each character image. The amount of human involvement (labeling) is strictly controlled by the number of clusters produced by the chosen clustering approach. To test the efficiency of the proposed approach, we have compared and evaluated three state-of-the-art clustering methods (k-means, self-organizing maps, and growing neural gas) on the MNIST digit data set and a Lampung Indonesian character data set. Considering a k-nn classifier, we show that manually labeling only 1.3% (MNIST) and 3.2% (Lampung) of the training data provides the same range of performance as a completely labeled data set would. PMID:25870463
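
    The label-inheritance and voting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: cluster assignments are taken as given (the clustering itself is omitted), "majority" is read as a strict majority of feature spaces, and samples without a winning label are returned as `None`. The names `propagate_labels` and `vote` are hypothetical.

```python
from collections import Counter


def propagate_labels(assignments, center_labels):
    """Give each sample the expert label of the cluster center it belongs to.

    assignments[i] is the cluster index of sample i in one feature space;
    center_labels[c] is the label the expert gave to the center of cluster c.
    """
    return [center_labels[c] for c in assignments]


def vote(per_space_labels, unanimous=False):
    """Combine the labels one sample received in several feature spaces.

    Returns None for a sample when no label reaches a strict majority
    (or, with unanimous=True, when the feature spaces disagree at all).
    """
    decided = []
    for labels in zip(*per_space_labels):
        top_label, top_count = Counter(labels).most_common(1)[0]
        needed = len(labels) if unanimous else len(labels) // 2 + 1
        decided.append(top_label if top_count >= needed else None)
    return decided


if __name__ == "__main__":
    # Three hypothetical feature spaces, three samples, two clusters each.
    spaces = [
        propagate_labels([0, 1, 1], {0: "3", 1: "8"}),
        propagate_labels([1, 1, 0], {0: "8", 1: "3"}),
        propagate_labels([0, 0, 1], {0: "3", 1: "8"}),
    ]
    print(vote(spaces))                  # majority vote
    print(vote(spaces, unanimous=True))  # stricter: undecided samples are None
```

    The expert's effort scales with the number of cluster centers, not with the number of samples, which is the point the abstract makes about controlling human involvement.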

  5. Unsupervised text mining for assessing and augmenting GWAS results.

    PubMed

    Ailem, Melissa; Role, François; Nadif, Mohamed; Demenais, Florence

    2016-04-01

    Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply gain confirmation of hypothesized relationships between biological entities. We set this question in the context of genome-wide association studies (GWAS), an actively emerging field that has contributed to the identification of many genes associated with multifactorial diseases. These studies make it possible to identify groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Therefore, our objective is to leverage unsupervised text mining techniques, using text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors, in order to augment the GWAS results. We propose a generic framework which we used to characterize the relationships between 10 genes reported to be associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value<0.01). The clustering of observed and randomly selected genes also made it possible to generate hypotheses about potential functional relationships between these genes and thus contributed to the discovery of new candidate genes for asthma. Copyright © 2016 Elsevier Inc. All rights reserved.
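
    The comparison of candidate genes against random gene sets can be illustrated with a small sketch. The abstract does not give the exact statistic or test, so everything below is an assumption: gene vectors are plain lists of term weights, the group score is the mean pairwise cosine similarity, and the one-sided p-value is the usual empirical estimate from scores of randomly drawn groups. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs in a group."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def one_sided_p_value(observed, null_scores):
    """Empirical P(null score >= observed), with the standard +1 correction."""
    at_least = sum(1 for s in null_scores if s >= observed)
    return (1 + at_least) / (1 + len(null_scores))


if __name__ == "__main__":
    # Hypothetical term-weight vectors for a candidate gene group.
    candidate = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    score = mean_pairwise_cosine(candidate)
    # Scores of randomly drawn groups would be computed the same way;
    # the values here are placeholders, not results.
    print(score, one_sided_p_value(score, [0.1, 0.2, 0.95]))
```

    With the +1 correction the estimate can never be exactly zero, so reporting p < 0.01 this way requires at least 100 random groups.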

  6. Colour image segmentation using unsupervised clustering technique for acute leukemia images

    NASA Astrophysics Data System (ADS)

    Halim, N. H. Abd; Mashor, M. Y.; Nasir, A. S. Abdul; Mustafa, N.; Hassan, R.

    2015-05-01

    Colour image segmentation has become more popular in computer vision because it is an important step in most medical analysis tasks. This paper compares different colour components of the RGB (red, green, blue) and HSI (hue, saturation, intensity) colour models for segmenting acute leukemia images. First, partial contrast stretching is applied to the leukemia images to enhance the visual appearance of the blast cells. Then, an unsupervised moving k-means clustering algorithm is applied to the various colour components of the RGB and HSI colour models to segment the blast cells from the red blood cells and background regions in the leukemia image. The different colour components of the RGB and HSI colour models have been analyzed in order to identify the colour component that gives the best segmentation performance. The segmented images are then processed using a median filter and a region growing technique to reduce noise and smooth the images. The results show that segmentation using the saturation component of the HSI colour model is the best at segmenting the nucleus of the blast cells in acute leukemia images, compared to the other colour components of the RGB and HSI colour models.

  7. An Unsupervised Online Spike-Sorting Framework.

    PubMed

    Knieling, Simeon; Sridharan, Kousik S; Belardinelli, Paolo; Naros, Georgios; Weiss, Daniel; Mormann, Florian; Gharabaghi, Alireza

    2016-08-01

    Extracellular neuronal microelectrode recordings can include action potentials from multiple neurons. To separate spikes from different neurons, they can be sorted according to their shape, a procedure referred to as spike-sorting. Several algorithms have been reported to solve this task. However, when clustering outcomes are unsatisfactory, most of them are difficult to adjust to achieve the desired results. We present an online spike-sorting framework that uses feature normalization and weighting to maximize the distinctiveness between different spike shapes. Furthermore, multiple criteria are applied to either facilitate or prevent cluster fusion, thereby enabling experimenters to fine-tune the sorting process. We compare our method to established unsupervised offline (Wave_Clus (WC)) and online (OSort (OS)) algorithms by examining their performance in sorting various test datasets using two different scoring systems (AMI and the Adamos metric). Furthermore, we evaluate sorting capabilities on intra-operative recordings using established quality metrics. Compared to WC and OS, our algorithm achieved comparable or higher scores on average and produced more convincing sorting results for intra-operative datasets. Thus, the presented framework is suitable for both online and offline analysis and could substantially improve the quality of microelectrode-based data evaluation for research and clinical application.

  8. Unsupervised classification of variable stars

    NASA Astrophysics Data System (ADS)

    Valenzuela, Lucas; Pichara, Karim

    2018-03-01

    During the past 10 years, a considerable amount of effort has been made to develop algorithms for automatic classification of variable stars. That has been primarily achieved by applying machine learning methods to photometric data sets where objects are represented as light curves. Classifiers require training sets to learn the underlying patterns that allow the separation among classes. Unfortunately, building training sets is an expensive process that demands a lot of human effort. Every time data come from new surveys, the only available training instances are the ones that have a cross-match with previously labelled objects, consequently generating insufficient training sets compared with the large amounts of unlabelled sources. In this work, we present an algorithm that performs unsupervised classification of variable stars, relying only on the similarity among light curves. We tackle the unsupervised classification problem by proposing an untraditional approach. Instead of trying to match classes of stars with clusters found by a clustering algorithm, we propose a query-based method where astronomers can find groups of variable stars ranked by similarity. We also develop a fast similarity function specific for light curves, based on a novel data structure that allows scaling the search over the entire data set of unlabelled objects. Experiments show that our unsupervised model achieves high accuracy in the classification of different types of variable stars and that the proposed algorithm scales up to massive amounts of light curves.

  9. Hierarchical Aligned Cluster Analysis for Temporal Clustering of Human Motion.

    PubMed

    Zhou, Feng; De la Torre, Fernando; Hodgins, Jessica K

    2013-03-01

    Temporal segmentation of human motion into plausible motion primitives is central to understanding and building computational models of human motion. Several issues contribute to the challenge of discovering motion primitives: the exponential nature of all possible movement combinations, the variability in the temporal scale of human actions, and the complexity of representing articulated motion. We pose the problem of learning motion primitives as one of temporal clustering, and derive an unsupervised hierarchical bottom-up framework called hierarchical aligned cluster analysis (HACA). HACA finds a partition of a given multidimensional time series into m disjoint segments such that each segment belongs to one of k clusters. HACA combines kernel k-means with the generalized dynamic time alignment kernel to cluster time series data. Moreover, it provides a natural framework to find a low-dimensional embedding for time series. HACA is efficiently optimized with a coordinate descent strategy and dynamic programming. Experimental results on motion capture and video data demonstrate the effectiveness of HACA for segmenting complex motions and as a visualization tool. We also compare the performance of HACA to state-of-the-art algorithms for temporal clustering on data of a honey bee dance. The HACA code is available online.

  10. Systematic exploration of unsupervised methods for mapping behavior

    NASA Astrophysics Data System (ADS)

    Todd, Jeremy G.; Kain, Jamey S.; de Bivort, Benjamin L.

    2017-02-01

    To fully understand the mechanisms giving rise to behavior, we need to be able to precisely measure it. When coupled with large behavioral data sets, unsupervised clustering methods offer the potential of unbiased mapping of behavioral spaces. However, unsupervised techniques to map behavioral spaces are in their infancy, and there have been few systematic considerations of all the methodological options. We compared the performance of seven distinct mapping methods in clustering a wavelet-transformed data set consisting of the x- and y-positions of the six legs of individual flies. Legs were automatically tracked by small pieces of fluorescent dye, while the fly was tethered and walking on an air-suspended ball. We find that there is considerable variation in the performance of these mapping methods, and that better performance is attained when clustering is done in higher dimensional spaces (which are otherwise less preferable because they are hard to visualize). High dimensionality means that some algorithms, including the non-parametric watershed cluster assignment algorithm, cannot be used. We developed an alternative watershed algorithm which can be used in high-dimensional spaces when a probability density estimate can be computed directly. With these tools in hand, we examined the behavioral space of fly leg postural dynamics and locomotion. We find a striking division of behavior into modes involving the fore legs and modes involving the hind legs, with few direct transitions between them. By computing behavioral clusters using the data from all flies simultaneously, we show that this division appears to be common to all flies. We also identify individual-to-individual differences in behavior and behavioral transitions. Lastly, we suggest a computational pipeline that can achieve satisfactory levels of performance without the taxing computational demands of a systematic combinatorial approach.

  11. Hydrometeor classification through statistical clustering of polarimetric radar measurements: a semi-supervised approach

    NASA Astrophysics Data System (ADS)

    Besic, Nikola; Ventura, Jordi Figueras i.; Grazioli, Jacopo; Gabella, Marco; Germann, Urs; Berne, Alexis

    2016-09-01

    Polarimetric radar-based hydrometeor classification is the procedure of identifying different types of hydrometeors by exploiting polarimetric radar observations. The main drawback of the existing supervised classification methods, mostly based on fuzzy logic, is a significant dependency on a presumed electromagnetic behaviour of different hydrometeor types. Namely, the results of the classification largely rely upon the quality of scattering simulations. When it comes to the unsupervised approach, it lacks the constraints related to the hydrometeor microphysics. The idea of the proposed method is to compensate for these drawbacks by combining the two approaches in a way that microphysical hypotheses can, to a degree, adjust the content of the classes obtained statistically from the observations. This is done by means of an iterative approach, performed offline, which, in a statistical framework, examines clustered representative polarimetric observations by comparing them to the presumed polarimetric properties of each hydrometeor class. Aside from comparing, a routine alters the content of clusters by encouraging further statistical clustering in case of non-identification. By merging all identified clusters, the multi-dimensional polarimetric signatures of various hydrometeor types are obtained for each of the studied representative datasets, i.e. for each radar system of interest. These are depicted by sets of centroids which are then employed in operational labelling of different hydrometeors. The method has been applied on three C-band datasets, each acquired by different operational radar from the MeteoSwiss Rad4Alp network, as well as on two X-band datasets acquired by two research mobile radars. The results are discussed through a comparative analysis which includes a corresponding supervised and unsupervised approach, emphasising the operational potential of the proposed method.

  12. Using Unsupervised Learning to Unlock the Potential of Hydrologic Similarity

    NASA Astrophysics Data System (ADS)

    Chaney, N.; Newman, A. J.

    2017-12-01

    By clustering environmental data into representative hydrologic response units (HRUs), hydrologic similarity aims to harness the covariance between a system's physical environment and its hydrologic response to create reduced-order models. This is the primary approach through which sub-grid hydrologic processes are represented in large-scale models (e.g., Earth System Models). Although the possibilities of hydrologic similarity are extensive, its practical implementations have been limited to 1-d bins of oversimplistic metrics of hydrologic response (e.g., topographic index)—this is a missed opportunity. In this presentation we will show how unsupervised learning is unlocking the potential of hydrologic similarity; clustering methods enable generalized frameworks to effectively and efficiently harness the petabytes of global environmental data to robustly characterize sub-grid heterogeneity in large-scale models. To illustrate the potential that unsupervised learning has towards advancing hydrologic similarity, we introduce a hierarchical clustering algorithm (HCA) that clusters very high resolution (30-100 meters) elevation, soil, climate, and land cover data to assemble a domain's representative HRUs. These HRUs are then used to parameterize the sub-grid heterogeneity in land surface models; for this study we use the GFDL LM4 model—the land component of the GFDL Earth System Model. To explore HCA and its impacts on the hydrologic system we use a ¼ grid cell in southeastern California as a test site. HCA is used to construct an ensemble of 9 different HRU configurations—each configuration has a different number of HRUs; for each ensemble member LM4 is run between 2002 and 2014 with a 26 year spinup. 
    The analysis of the ensemble of model simulations shows that: 1) clustering the high-dimensional environmental data space leads to a robust representation of the role of the physical environment in the coupled water, energy, and carbon cycles at a relatively low number of HRUs; 2) the reduced-order model with around 300 HRUs effectively reproduces the fully distributed model simulation (30 meters) with less than 1/1000 of computational expense; 3) assigning each grid cell of the fully distributed grid to an HRU via HCA enables novel visualization methods for large-scale models, which has significant implications for how these models are applied and evaluated. We will conclude by outlining the potential that this work has within operational prediction systems including numerical weather prediction, Earth System models, and Early Warning systems.

  13. GO-PCA: An Unsupervised Method to Explore Gene Expression Data Using Prior Knowledge

    PubMed Central

    Wagner, Florian

    2015-01-01

    Method: Genome-wide expression profiling is a widely used approach for characterizing heterogeneous populations of cells, tissues, biopsies, or other biological specimens. The exploratory analysis of such data typically relies on generic unsupervised methods, e.g. principal component analysis (PCA) or hierarchical clustering. However, generic methods fail to exploit prior knowledge about the molecular functions of genes. Here, I introduce GO-PCA, an unsupervised method that combines PCA with nonparametric GO enrichment analysis, in order to systematically search for sets of genes that are both strongly correlated and closely functionally related. These gene sets are then used to automatically generate expression signatures with functional labels, which collectively aim to provide a readily interpretable representation of biologically relevant similarities and differences. The robustness of the results obtained can be assessed by bootstrapping. Results: I first applied GO-PCA to datasets containing diverse hematopoietic cell types from human and mouse, respectively. In both cases, GO-PCA generated a small number of signatures that represented the majority of lineages present, and whose labels reflected their respective biological characteristics. I then applied GO-PCA to human glioblastoma (GBM) data, and recovered signatures associated with four out of five previously defined GBM subtypes. My results demonstrate that GO-PCA is a powerful and versatile exploratory method that reduces an expression matrix containing thousands of genes to a much smaller set of interpretable signatures. In this way, GO-PCA aims to facilitate hypothesis generation, design of further analyses, and functional comparisons across datasets. PMID:26575370

  14. GO-PCA: An Unsupervised Method to Explore Gene Expression Data Using Prior Knowledge.

    PubMed

    Wagner, Florian

    2015-01-01

    Genome-wide expression profiling is a widely used approach for characterizing heterogeneous populations of cells, tissues, biopsies, or other biological specimens. The exploratory analysis of such data typically relies on generic unsupervised methods, e.g. principal component analysis (PCA) or hierarchical clustering. However, generic methods fail to exploit prior knowledge about the molecular functions of genes. Here, I introduce GO-PCA, an unsupervised method that combines PCA with nonparametric GO enrichment analysis, in order to systematically search for sets of genes that are both strongly correlated and closely functionally related. These gene sets are then used to automatically generate expression signatures with functional labels, which collectively aim to provide a readily interpretable representation of biologically relevant similarities and differences. The robustness of the results obtained can be assessed by bootstrapping. I first applied GO-PCA to datasets containing diverse hematopoietic cell types from human and mouse, respectively. In both cases, GO-PCA generated a small number of signatures that represented the majority of lineages present, and whose labels reflected their respective biological characteristics. I then applied GO-PCA to human glioblastoma (GBM) data, and recovered signatures associated with four out of five previously defined GBM subtypes. My results demonstrate that GO-PCA is a powerful and versatile exploratory method that reduces an expression matrix containing thousands of genes to a much smaller set of interpretable signatures. In this way, GO-PCA aims to facilitate hypothesis generation, design of further analyses, and functional comparisons across datasets.

  15. Visualization of multiple influences on ocellar flight control in giant honeybees with the data-mining tool Viscovery SOMine.

    PubMed

    Kastberger, G; Kranner, G

    2000-02-01

    Viscovery SOMine is a software tool for advanced analysis and monitoring of numerical data sets. It was developed for professional use in business, industry, and science and to support dependency analysis, deviation detection, unsupervised clustering, nonlinear regression, data association, pattern recognition, and animated monitoring. Based on the concept of self-organizing maps (SOMs), it employs a robust variant of unsupervised neural networks--namely, Kohonen's Batch-SOM, which is further enhanced with a new scaling technique for speeding up the learning process. This tool provides a powerful means by which to analyze complex data sets without prior statistical knowledge. The data representation contained in the trained SOM is systematically converted to be used in a spectrum of visualization techniques, such as evaluating dependencies between components, investigating geometric properties of the data distribution, searching for clusters, or monitoring new data. We have used this software tool to analyze and visualize multiple influences of the ocellar system on free-flight behavior in giant honeybees. Occlusion of ocelli will affect orienting reactivities in relation to flight target, level of disturbance, and position of the bee in the flight chamber; it will induce phototaxis and make orienting imprecise and dependent on motivational settings. Ocelli permit the adjustment of orienting strategies to environmental demands by enforcing abilities such as centering or flight kinetics and by providing independent control of posture and flight course.

  16. An artificial intelligence approach fit for tRNA gene studies in the era of big sequence data.

    PubMed

    Iwasaki, Yuki; Abe, Takashi; Wada, Kennosuke; Wada, Yoshiko; Ikemura, Toshimichi

    2017-09-12

    Unsupervised data mining capable of extracting a wide range of knowledge from big data without prior knowledge or particular models is a timely application in the era of big sequence data accumulation in genome research. By handling oligonucleotide compositions as high-dimensional data, we have previously modified the conventional self-organizing map (SOM) for genome informatics and established BLSOM, which can analyze more than ten million sequences simultaneously. Here, we develop BLSOM specialized for tRNA genes (tDNAs) that can cluster (self-organize) more than one million microbial tDNAs according to their cognate amino acid solely depending on tetra- and pentanucleotide compositions. This unsupervised clustering can reveal combinatorial oligonucleotide motifs that are responsible for the amino acid-dependent clustering, as well as other functionally and structurally important consensus motifs, which have been evolutionarily conserved. BLSOM is also useful for identifying tDNAs as phylogenetic markers for special phylotypes. When we constructed BLSOM with 'species-unknown' tDNAs from metagenomic sequences plus 'species-known' microbial tDNAs, a large portion of metagenomic tDNAs self-organized with species-known tDNAs, yielding information on microbial communities in environmental samples. BLSOM can also enhance accuracy in the tDNA database obtained from big sequence data. This unsupervised data mining should become important for studying numerous functionally unclear RNAs obtained from a wide range of organisms.
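
    Treating oligonucleotide composition as high-dimensional data means turning each sequence into a fixed-length frequency vector before it is fed to a clustering method such as a SOM. The following is a minimal sketch of that feature-extraction step only, assuming plain ACGT sequences; the function name `kmer_composition` is hypothetical, and any normalization or strand handling specific to BLSOM is not reproduced here.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k=4):
    """Return normalized frequencies of all 4**k k-mers over the alphabet ACGT.

    k=4 gives a 256-dimensional tetranucleotide vector, k=5 a
    1024-dimensional pentanucleotide vector.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are left out of the total.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]
```

    Every sequence maps to a vector of the same length regardless of how long the sequence is, which is what allows millions of tDNAs to be compared in one feature space.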

  17. Weighted Distance Functions Improve Analysis of High-Dimensional Data: Application to Molecular Dynamics Simulations.

    PubMed

    Blöchliger, Nicolas; Caflisch, Amedeo; Vitalis, Andreas

    2015-11-10

    Data mining techniques depend strongly on how the data are represented and how distance between samples is measured. High-dimensional data often contain a large number of irrelevant dimensions (features) for a given query. These features act as noise and obfuscate relevant information. Unsupervised approaches to mine such data require distance measures that can account for feature relevance. Molecular dynamics simulations produce high-dimensional data sets describing molecules observed in time. Here, we propose to globally or locally weight simulation features based on effective rates. This emphasizes, in a data-driven manner, slow degrees of freedom that often report on the metastable states sampled by the molecular system. We couple this idea to several unsupervised learning protocols. Our approach unmasks slow side chain dynamics within the native state of a miniprotein and reveals additional metastable conformations of a protein. The approach can be combined with most algorithms for clustering or dimensionality reduction.
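
    The effect of feature weighting on a distance measure can be sketched with a weighted Euclidean distance. In this sketch the weights are simply passed in; the abstract derives them from effective rates, a step that is not reproduced here, and the function name `weighted_distance` is hypothetical.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance between two feature vectors.

    A weight of 0 removes a feature from the comparison; larger
    weights emphasize the corresponding feature.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return math.sqrt(sum(w * (xi - yi) ** 2 for xi, yi, w in zip(x, y, weights)))
```

    With unit weights this reduces to the ordinary Euclidean distance, so the weighting can be dropped into most clustering or dimensionality-reduction pipelines that accept a custom metric.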

  18. Unsupervised active learning based on hierarchical graph-theoretic clustering.

    PubMed

    Hu, Weiming; Hu, Wei; Xie, Nianhua; Maybank, Steve

    2009-10-01

    Most existing active learning approaches are supervised. Supervised active learning has the following problems: inefficiency in dealing with the semantic gap between the distribution of samples in the feature space and their labels, lack of ability in selecting new samples that belong to new categories that have not yet appeared in the training samples, and lack of adaptability to changes in the semantic interpretation of sample categories. To tackle these problems, we propose an unsupervised active learning framework based on hierarchical graph-theoretic clustering. In the framework, two promising graph-theoretic clustering algorithms, namely, dominant-set clustering and spectral clustering, are combined in a hierarchical fashion. Our framework has some advantages, such as ease of implementation, flexibility in architecture, and adaptability to changes in the labeling. Evaluations on data sets for network intrusion detection, image classification, and video classification have demonstrated that our active learning framework can effectively reduce the workload of manual classification while maintaining a high accuracy of automatic classification. It is shown that, overall, our framework outperforms the support-vector-machine-based supervised active learning, particularly in terms of dealing much more efficiently with new samples whose categories have not yet appeared in the training samples.

  19. Comparative study of feature selection with ensemble learning using SOM variants

    NASA Astrophysics Data System (ADS)

    Filali, Ameni; Jlassi, Chiraz; Arous, Najet

    2017-03-01

    Ensemble learning has succeeded in improving stability and clustering accuracy, but its runtime prohibits it from scaling up to real-world applications. This study deals with the problem of selecting a subset of the most pertinent features for every cluster from a dataset. The proposed method is another extension of the Random Forests approach, using self-organizing map (SOM) variants, to unlabeled data; it estimates the out-of-bag feature importance from a set of partitions. Every partition is created using a different bootstrap sample and a random subset of the features. Then, we show that the internal estimates used to measure variable pertinence in Random Forests are also applicable to feature selection in unsupervised learning. This approach aims at dimensionality reduction, visualization, and cluster characterization at the same time. Hence, we provide empirical results on nineteen benchmark data sets indicating that RFS can lead to significant improvement in terms of clustering accuracy, over several state-of-the-art unsupervised methods, with a very limited subset of features. The approach shows promise for application to very broad domains.

  20. A Granular Self-Organizing Map for Clustering and Gene Selection in Microarray Data.

    PubMed

    Ray, Shubhra Sankar; Ganivada, Avatharam; Pal, Sankar K

    2016-09-01

    A new granular self-organizing map (GSOM) is developed by integrating the concept of a fuzzy rough set with the SOM. While training the GSOM, the weights of a winning neuron and the neighborhood neurons are updated through a modified learning procedure. The neighborhood is newly defined using the fuzzy rough sets. The clusters (granules) evolved by the GSOM are presented to a decision table as its decision classes. Based on the decision table, a method of gene selection is developed. The effectiveness of the GSOM is shown in both clustering samples and developing an unsupervised fuzzy rough feature selection (UFRFS) method for gene selection in microarray data. While the superior results of the GSOM, as compared with the related clustering methods, are provided in terms of β-index, DB-index, Dunn-index, and fuzzy rough entropy, the genes selected by the UFRFS are not only better in terms of classification accuracy and a feature evaluation index, but also statistically more significant than the related unsupervised methods. The C-codes of the GSOM and UFRFS are available online at http://avatharamg.webs.com/software-code.

  1. Unsupervised learning in probabilistic neural networks with multi-state metal-oxide memristive synapses

    NASA Astrophysics Data System (ADS)

    Serb, Alexander; Bill, Johannes; Khiat, Ali; Berdan, Radu; Legenstein, Robert; Prodromakis, Themis

    2016-09-01

    In an increasingly data-rich world the need for developing computing systems that can not only process, but ideally also interpret big data is becoming continuously more pressing. Brain-inspired concepts have shown great promise towards addressing this need. Here we demonstrate unsupervised learning in a probabilistic neural network that utilizes metal-oxide memristive devices as multi-state synapses. Our approach can be exploited for processing unlabelled data and can adapt to time-varying clusters that underlie incoming data by supporting the capability of reversible unsupervised learning. The potential of this work is showcased through the demonstration of successful learning in the presence of corrupted input data and probabilistic neurons, thus paving the way towards robust big-data processors.

  2. Effect of scene illumination conditions on digital enhancement techniques of multispectral scanner LANDSAT images

    NASA Technical Reports Server (NTRS)

    Parada, N. D. J.; Novo, E. M. L. M.

    1983-01-01

    Two sets of MSS/LANDSAT data with solar elevation ranging from 22 deg to 41 deg were used at the Image-100 System to implement the Eliason et al. technique for extracting the topographic modulation component. An unsupervised cluster analysis was used to obtain an average brightness image for each channel. Analysis of the enhanced images shows that the technique for extracting the topographic modulation component is more appropriate for MSS data obtained under high sun elevation angles. Low sun elevation increases the variance of each cluster so that the average brightness doesn't represent its albedo properties. The topographic modulation component applied to low sun elevation angle damages rather than enhances topographic information. Better results were produced for channels 4 and 5 than for channels 6 and 7.

  3. SUSTAIN: a network model of category learning.

    PubMed

    Love, Bradley C; Medin, Douglas L; Gureckis, Todd M

    2004-04-01

    SUSTAIN (Supervised and Unsupervised STratified Adaptive Incremental Network) is a model of how humans learn categories from examples. SUSTAIN initially assumes a simple category structure. If simple solutions prove inadequate and SUSTAIN is confronted with a surprising event (e.g., it is told that a bat is a mammal instead of a bird), SUSTAIN recruits an additional cluster to represent the surprising event. Newly recruited clusters are available to explain future events and can themselves evolve into prototypes-attractors-rules. SUSTAIN's discovery of category substructure is affected not only by the structure of the world but by the nature of the learning task and the learner's goals. SUSTAIN successfully extends category learning models to studies of inference learning, unsupervised learning, category construction, and contexts in which identification learning is faster than classification learning.

  4. Unsupervised, Robust Estimation-based Clustering for Multispectral Images

    NASA Technical Reports Server (NTRS)

    Netanyahu, Nathan S.

    1997-01-01

    To prepare for the challenge of handling the archiving and querying of terabyte-sized scientific spatial databases, the NASA Goddard Space Flight Center's Applied Information Sciences Branch (AISB, Code 935) developed a number of characterization algorithms that rely on supervised clustering techniques. The research reported upon here has been aimed at continuing the evolution of some of these supervised techniques, namely the neural network and decision tree-based classifiers, plus extending the approach to incorporating unsupervised clustering algorithms, such as those based on robust estimation (RE) techniques. The algorithms developed under this task should be suited for use by the Intelligent Information Fusion System (IIFS) metadata extraction modules, and as such these algorithms must be fast, robust, and anytime in nature. Finally, so that the planner/scheduler module of the IIFS can oversee the use and execution of these algorithms, all information required by the planner/scheduler must be provided to the IIFS development team to ensure the timely integration of these algorithms into the overall system.

  5. 2-Way k-Means as a Model for Microbiome Samples.

    PubMed

    Jackson, Weston J; Agarwal, Ipsita; Pe'er, Itsik

    2017-01-01

    Motivation. Microbiome sequencing allows defining clusters of samples with shared composition. However, this paradigm poorly accounts for samples whose composition is a mixture of cluster-characterizing ones and which therefore lie in between them in the cluster space. This paper addresses unsupervised learning of 2-way clusters. It defines a mixture model that allows 2-way cluster assignment and describes a variant of generalized k-means for learning such a model. We demonstrate applicability to microbial 16S rDNA sequencing data from the Human Vaginal Microbiome Project.
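
    One way to picture a 2-way cluster assignment is to let each sample be approximated by a convex mixture of two centroids instead of a single one. The sketch below shows only such an assignment step and rests on assumptions not stated in the abstract: squared Euclidean error, a single mixing weight a in [0, 1], and exhaustive search over centroid pairs. It is not the authors' model, and `two_way_assign` is a hypothetical name.

```python
from itertools import combinations


def two_way_assign(x, centroids):
    """Return (error, i, j, a) so that x ~ a*centroids[i] + (1-a)*centroids[j].

    Assumes at least two centroids; returns None otherwise.
    """
    best = None
    for i, j in combinations(range(len(centroids)), 2):
        ci, cj = centroids[i], centroids[j]
        diff = [p - q for p, q in zip(ci, cj)]
        denom = sum(d * d for d in diff)
        if denom == 0:
            a = 1.0  # identical centroids: every weight gives the same point
        else:
            # Project x onto the segment between the two centroids.
            a = sum((xk - q) * d for xk, q, d in zip(x, cj, diff)) / denom
            a = min(1.0, max(0.0, a))  # clip to a valid mixing weight
        err = sum((xk - (a * p + (1 - a) * q)) ** 2 for xk, p, q in zip(x, ci, cj))
        if best is None or err < best[0]:
            best = (err, i, j, a)
    return best
```

    A weight of 1 or 0 recovers ordinary single-cluster assignment, so samples that sit on a centroid are handled the same way as in standard k-means, while samples lying between two centroids get an intermediate weight.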

  6. 2-Way k-Means as a Model for Microbiome Samples

    PubMed Central

    2017-01-01

    Motivation. Microbiome sequencing allows defining clusters of samples with shared composition. However, this paradigm poorly accounts for samples whose composition is a mixture of cluster-characterizing ones and which therefore lie in between them in the cluster space. This paper addresses unsupervised learning of 2-way clusters. It defines a mixture model that allows 2-way cluster assignment and describes a variant of generalized k-means for learning such a model. We demonstrate applicability to microbial 16S rDNA sequencing data from the Human Vaginal Microbiome Project. PMID:29177026

  7. Sputum neutrophil counts are associated with more severe asthma phenotypes using cluster analysis.

    PubMed

    Moore, Wendy C; Hastie, Annette T; Li, Xingnan; Li, Huashi; Busse, William W; Jarjour, Nizar N; Wenzel, Sally E; Peters, Stephen P; Meyers, Deborah A; Bleecker, Eugene R

    2014-06-01

    Clinical cluster analysis from the Severe Asthma Research Program (SARP) identified 5 asthma subphenotypes that represent the severity spectrum of early-onset allergic asthma, late-onset severe asthma, and severe asthma with chronic obstructive pulmonary disease characteristics. Analysis of induced sputum from a subset of SARP subjects showed 4 sputum inflammatory cellular patterns. Subjects with concurrent increases in eosinophil (≥2%) and neutrophil (≥40%) percentages had characteristics of very severe asthma. To better understand interactions between inflammation and clinical subphenotypes, we integrated inflammatory cellular measures and clinical variables in a new cluster analysis. Participants in SARP who underwent sputum induction at 3 clinical sites were included in this analysis (n = 423). Fifteen variables, including clinical characteristics and blood and sputum inflammatory cell assessments, were selected using factor analysis for unsupervised cluster analysis. Four phenotypic clusters were identified. Cluster A (n = 132) and B (n = 127) subjects had mild-to-moderate early-onset allergic asthma with paucigranulocytic or eosinophilic sputum inflammatory cell patterns. In contrast, these inflammatory patterns were present in only 7% of cluster C (n = 117) and D (n = 47) subjects who had moderate-to-severe asthma with frequent health care use despite treatment with high doses of inhaled or oral corticosteroids and, in cluster D, reduced lung function. The majority of these subjects (>83%) had sputum neutrophilia either alone or with concurrent sputum eosinophilia. Baseline lung function and sputum neutrophil percentages were the most important variables determining cluster assignment. 
    This multivariate approach identified 4 asthma subphenotypes representing the severity spectrum from mild-to-moderate allergic asthma with minimal or eosinophil-predominant sputum inflammation to moderate-to-severe asthma with neutrophil-predominant or mixed granulocytic inflammation. Published by Mosby, Inc.

  8. Sputum neutrophils are associated with more severe asthma phenotypes using cluster analysis

    PubMed Central

    Moore, Wendy C.; Hastie, Annette T.; Li, Xingnan; Li, Huashi; Busse, William W.; Jarjour, Nizar N.; Wenzel, Sally E.; Peters, Stephen P.; Meyers, Deborah A.; Bleecker, Eugene R.

    2013-01-01

    Background: Clinical cluster analysis from the Severe Asthma Research Program (SARP) identified five asthma subphenotypes that represent the severity spectrum of early onset allergic asthma, late onset severe asthma and severe asthma with COPD characteristics. Analysis of induced sputum from a subset of SARP subjects showed four sputum inflammatory cellular patterns. Subjects with concurrent increases in eosinophils (≥2%) and neutrophils (≥40%) had characteristics of very severe asthma. Objective: To better understand interactions between inflammation and clinical subphenotypes we integrated inflammatory cellular measures and clinical variables in a new cluster analysis. Methods: Participants in SARP at three clinical sites who underwent sputum induction were included in this analysis (n=423). Fifteen variables including clinical characteristics and blood and sputum inflammatory cell assessments were selected by factor analysis for unsupervised cluster analysis. Results: Four phenotypic clusters were identified. Cluster A (n=132) and B (n=127) subjects had mild-moderate early onset allergic asthma with paucigranulocytic or eosinophilic sputum inflammatory cell patterns. In contrast, these inflammatory patterns were present in only 7% of Cluster C (n=117) and D (n=47) subjects who had moderate-severe asthma with frequent health care utilization despite treatment with high doses of inhaled or oral corticosteroids, and in Cluster D, reduced lung function. The majority of these subjects (>83%) had sputum neutrophilia either alone or with concurrent sputum eosinophilia. Baseline lung function and sputum neutrophils were the most important variables determining cluster assignment. Conclusion: This multivariate approach identified four asthma subphenotypes representing the severity spectrum from mild-moderate allergic asthma with minimal or eosinophilic predominant sputum inflammation to moderate-severe asthma with neutrophilic predominant or mixed granulocytic inflammation. PMID:24332216

  9. Surface mapping via unsupervised classification of remote sensing: application to MESSENGER/MASCS and DAWN/VIRS data.

    NASA Astrophysics Data System (ADS)

    D'Amore, M.; Le Scaon, R.; Helbert, J.; Maturilli, A.

    2017-12-01

    Machine learning has achieved unprecedented results in high-dimensional data processing tasks with wide applications in various fields. Due to the growing number of complex nonlinear systems that have to be investigated in science and the sheer size of the data now available, ML offers the unique ability to extract knowledge, regardless of the specific application field. Examples are image segmentation, supervised/unsupervised/semi-supervised classification, feature extraction, and data dimensionality analysis/reduction. The MASCS instrument has mapped Mercury's surface in the 400-1145 nm wavelength range during orbital observations by the MESSENGER spacecraft. We have conducted k-means unsupervised hierarchical clustering to identify and characterize spectral units from MASCS observations. The results display a dichotomy: polar and equatorial units, possibly linked to compositional differences or weathering due to irradiation. To explore possible relations between composition and spectral behavior, we have compared the spectral provinces with elemental abundance maps derived from MESSENGER's X-Ray Spectrometer (XRS). For the Vesta application on DAWN Visible and infrared spectrometer (VIR) data, we explored several machine learning techniques: an image segmentation method, a stream algorithm, and hierarchical clustering. The algorithm successfully separates the olivine outcrops around two craters on Vesta's surface [1]. New maps summarizing the spectral and chemical signature of the surface could be automatically produced. We conclude that instead of hand digging in data, scientists could choose a subset of algorithms with well-known features (i.e. efficacy on the particular problem, speed, accuracy) and focus their effort on understanding what the important characteristics of the groups found in the data mean. [1] E Ammannito et al. "Olivine in an unexpected location on Vesta's surface". In: Nature 504.7478 (2013), pp. 122-125.

  10. Comparing digital data processing techniques for surface mine and reclamation monitoring

    NASA Technical Reports Server (NTRS)

    Witt, R. G.; Bly, B. G.; Campbell, W. J.; Bloemer, H. H. L.; Brumfield, J. O.

    1982-01-01

    The results of three techniques used for processing Landsat digital data are compared for their utility in delineating areas of surface mining and subsequent reclamation. An unsupervised clustering algorithm (ISOCLS), a maximum-likelihood classifier (CLASFY), and a hybrid approach utilizing canonical analysis (ISOCLS/KLTRANS/ISOCLS) were compared by means of a detailed accuracy assessment with aerial photography at NASA's Goddard Space Flight Center. Results show that the hybrid approach was superior to the traditional techniques in distinguishing strip mined and reclaimed areas.

  11. A Novel Clustering Methodology Based on Modularity Optimisation for Detecting Authorship Affinities in Shakespearean Era Plays

    PubMed Central

    Craig, Hugh; Berretta, Regina; Moscato, Pablo

    2016-01-01

    In this study we propose a novel, unsupervised clustering methodology for analyzing large datasets. This new, efficient methodology converts the general clustering problem into the community detection problem in graphs by using the Jensen-Shannon distance, a dissimilarity measure originating in Information Theory. Moreover, we use graph theoretic concepts for the generation and analysis of proximity graphs. Our methodology is based on a newly proposed memetic algorithm (iMA-Net) for discovering clusters of data elements by maximizing the modularity function in proximity graphs of literary works. To test the effectiveness of this general methodology, we apply it to a text corpus dataset, which contains frequencies of approximately 55,114 unique words across all 168 plays written in the Shakespearean era (16th and 17th centuries), to analyze and detect clusters of similar plays. Experimental results and comparison with state-of-the-art clustering methods demonstrate the remarkable performance of our new method for identifying high quality clusters which reflect the commonalities in the literary style of the plays. PMID:27571416
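
    The Jensen-Shannon distance named in this abstract can be illustrated with a minimal sketch. It is not the authors' iMA-Net pipeline: the function name, the base-2 logarithm, and the toy word counts are assumptions made for illustration only.

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two frequency vectors."""
    # Normalise raw counts to probability distributions.
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # Mixture distribution halfway between p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # The distance is the square root of the divergence.
    return math.sqrt(max(divergence, 0.0))

# Hypothetical word counts for three plays over a four-word vocabulary.
plays = {"A": [10, 0, 5, 5], "B": [9, 1, 5, 5], "C": [0, 12, 1, 7]}
d_ab = jensen_shannon_distance(plays["A"], plays["B"])
d_ac = jensen_shannon_distance(plays["A"], plays["C"])
```

    Pairwise distances of this kind could serve as edge weights of a proximity graph, on which a community detection method would then operate.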

  12. Semi-supervised and unsupervised extreme learning machines.

    PubMed

    Huang, Gao; Song, Shiji; Gupta, Jatinder N D; Wu, Cheng

    2014-12-01

    Extreme learning machines (ELMs) have proven to be efficient and effective learning mechanisms for pattern classification and regression. However, ELMs are primarily applied to supervised learning problems. Only a few existing research papers have used ELMs to explore unlabeled data. In this paper, we extend ELMs for both semi-supervised and unsupervised tasks based on the manifold regularization, thus greatly expanding the applicability of ELMs. The key advantages of the proposed algorithms are as follows: 1) both the semi-supervised ELM (SS-ELM) and the unsupervised ELM (US-ELM) exhibit learning capability and computational efficiency of ELMs; 2) both algorithms naturally handle multiclass classification or multicluster clustering; and 3) both algorithms are inductive and can handle unseen data at test time directly. Moreover, it is shown in this paper that all the supervised, semi-supervised, and unsupervised ELMs can actually be put into a unified framework. This provides new perspectives for understanding the mechanism of random feature mapping, which is the key concept in ELM theory. Empirical study on a wide range of data sets demonstrates that the proposed algorithms are competitive with the state-of-the-art semi-supervised or unsupervised learning algorithms in terms of accuracy and efficiency.
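
    The random feature mapping mentioned above can be sketched with a basic supervised ELM. This is only the standard building block, not the SS-ELM or US-ELM with manifold regularization proposed in the paper; the function names, the tanh activation, the ridge term, and the toy data are assumptions.

```python
import numpy as np

def train_elm(X, T, n_hidden=50, ridge=1e-3, seed=0):
    """Basic ELM: random hidden layer, output weights by ridge least squares."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden biases, never trained
    H = np.tanh(X @ W + b)                       # random feature mapping
    # Only the output weights are learned, in closed form.
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ T)
    return W, b, beta

def predict_elm(X, model):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

# Hypothetical toy data: two Gaussian blobs with one-hot targets.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.repeat([0, 1], 50)
model = train_elm(X, np.eye(2)[y])
accuracy = (predict_elm(X, model).argmax(axis=1) == y).mean()
```

    Because the hidden layer is fixed after random initialization, training reduces to a single linear solve, which is the source of the computational efficiency the abstract refers to.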

  13. Noninvasive analysis of the sputum transcriptome discriminates clinical phenotypes of asthma.

    PubMed

    Yan, Xiting; Chu, Jen-Hwa; Gomez, Jose; Koenigs, Maria; Holm, Carole; He, Xiaoxuan; Perez, Mario F; Zhao, Hongyu; Mane, Shrikant; Martinez, Fernando D; Ober, Carole; Nicolae, Dan L; Barnes, Kathleen C; London, Stephanie J; Gilliland, Frank; Weiss, Scott T; Raby, Benjamin A; Cohn, Lauren; Chupp, Geoffrey L

    2015-05-15

    The airway transcriptome includes genes that contribute to the pathophysiologic heterogeneity seen in individuals with asthma. We analyzed sputum gene expression for transcriptomic endotypes of asthma (TEA), gene signatures that discriminate phenotypes of disease. Gene expression in the sputum and blood of patients with asthma was measured using Affymetrix microarrays. Unsupervised clustering analysis based on pathways from the Kyoto Encyclopedia of Genes and Genomes was used to identify TEA clusters. Logistic regression analysis of matched blood samples defined an expression profile in the circulation to determine the TEA cluster assignment in a cohort of children with asthma to replicate clinical phenotypes. Three TEA clusters were identified. TEA cluster 1 had the most subjects with a history of intubation (P = 0.05), a lower prebronchodilator FEV1 (P = 0.006), a higher bronchodilator response (P = 0.03), and higher exhaled nitric oxide levels (P = 0.04) compared with the other TEA clusters. TEA cluster 2, the smallest cluster, had the most subjects that were hospitalized for asthma (P = 0.04). TEA cluster 3, the largest cluster, had normal lung function, low exhaled nitric oxide levels, and lower inhaled steroid requirements. Evaluation of TEA clusters in children confirmed that TEA clusters 1 and 2 are associated with a history of intubation (P = 5.58 × 10^-6) and hospitalization (P = 0.01), respectively. There are common patterns of gene expression in the sputum and blood of children and adults that are associated with near-fatal, severe, and milder asthma.

  14. Noninvasive Analysis of the Sputum Transcriptome Discriminates Clinical Phenotypes of Asthma

    PubMed Central

    Yan, Xiting; Chu, Jen-Hwa; Gomez, Jose; Koenigs, Maria; Holm, Carole; He, Xiaoxuan; Perez, Mario F.; Zhao, Hongyu; Mane, Shrikant; Martinez, Fernando D.; Ober, Carole; Nicolae, Dan L.; Barnes, Kathleen C.; London, Stephanie J.; Gilliland, Frank; Weiss, Scott T.; Raby, Benjamin A.; Cohn, Lauren

    2015-01-01

    Rationale: The airway transcriptome includes genes that contribute to the pathophysiologic heterogeneity seen in individuals with asthma. Objectives: We analyzed sputum gene expression for transcriptomic endotypes of asthma (TEA), gene signatures that discriminate phenotypes of disease. Methods: Gene expression in the sputum and blood of patients with asthma was measured using Affymetrix microarrays. Unsupervised clustering analysis based on pathways from the Kyoto Encyclopedia of Genes and Genomes was used to identify TEA clusters. Logistic regression analysis of matched blood samples defined an expression profile in the circulation to determine the TEA cluster assignment in a cohort of children with asthma to replicate clinical phenotypes. Measurements and Main Results: Three TEA clusters were identified. TEA cluster 1 had the most subjects with a history of intubation (P = 0.05), a lower prebronchodilator FEV1 (P = 0.006), a higher bronchodilator response (P = 0.03), and higher exhaled nitric oxide levels (P = 0.04) compared with the other TEA clusters. TEA cluster 2, the smallest cluster, had the most subjects that were hospitalized for asthma (P = 0.04). TEA cluster 3, the largest cluster, had normal lung function, low exhaled nitric oxide levels, and lower inhaled steroid requirements. Evaluation of TEA clusters in children confirmed that TEA clusters 1 and 2 are associated with a history of intubation (P = 5.58 × 10^-6) and hospitalization (P = 0.01), respectively. Conclusions: There are common patterns of gene expression in the sputum and blood of children and adults that are associated with near-fatal, severe, and milder asthma. PMID:25763605

  15. Cluster analysis and prediction of treatment outcomes for chronic rhinosinusitis.

    PubMed

    Soler, Zachary M; Hyer, J Madison; Rudmik, Luke; Ramakrishnan, Viswanathan; Smith, Timothy L; Schlosser, Rodney J

    2016-04-01

    Current clinical classifications of chronic rhinosinusitis (CRS) have weak prognostic utility regarding treatment outcomes. Simplified discriminant analysis based on unsupervised clustering has identified novel phenotypic subgroups of CRS, but prognostic utility is unknown. We sought to determine whether discriminant analysis allows prognostication in patients choosing surgery versus continued medical management. A multi-institutional prospective study of patients with CRS in whom initial medical therapy failed who then self-selected continued medical management or surgical treatment was used to separate patients into 5 clusters based on a previously described discriminant analysis using total Sino-Nasal Outcome Test-22 (SNOT-22) score, age, and missed productivity. Patients completed the SNOT-22 at baseline and for 18 months of follow-up. Baseline demographic and objective measures included olfactory testing, computed tomography, and endoscopy scoring. SNOT-22 outcomes for surgical versus continued medical treatment were compared across clusters. Data were available on 690 patients. Baseline differences in demographics, comorbidities, objective disease measures, and patient-reported outcomes were similar to previous clustering reports. Three of 5 clusters identified by means of discriminant analysis had improved SNOT-22 outcomes with surgical intervention when compared with continued medical management (surgery was a mean of 21.2 points better across these 3 clusters at 6 months, P < .05). These differences were sustained at 18 months of follow-up. Two of 5 clusters had similar outcomes when comparing surgery with continued medical management. A simplified discriminant analysis based on 3 common clinical variables is able to cluster patients and provide prognostic information regarding surgical treatment versus continued medical management in patients with CRS. Copyright © 2015 American Academy of Allergy, Asthma & Immunology. Published by Elsevier Inc. 

  16. Unsupervised daily routine and activity discovery in smart homes.

    PubMed

    Jie Yin; Qing Zhang; Karunanithi, Mohan

    2015-08-01

    The ability to accurately recognize daily activities of residents is a core premise of smart homes to assist with remote health monitoring. Most of the existing methods rely on a supervised model trained from a preselected and manually labeled set of activities, which are often time-consuming and costly to obtain in practice. In contrast, this paper presents an unsupervised method for discovering daily routines and activities for smart home residents. Our proposed method first uses a Markov chain to model a resident's locomotion patterns at different times of day and discover clusters of daily routines at the macro level. For each routine cluster, it then drills down to further discover room-level activities at the micro level. The automatic identification of daily routines and activities is useful for understanding indicators of functional decline of elderly people and suggesting timely interventions.
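
    The Markov chain step described above can be sketched as a first-order transition model estimated from a sequence of room visits. This is a simplified illustration rather than the authors' method; the function name and the sensor sequence are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequence):
    """Estimate first-order Markov transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    # Count each observed transition current -> next.
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    # Normalise the counts of each state into a probability row.
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }

# Hypothetical room-level location sequence for one morning.
morning = ["bedroom", "bathroom", "kitchen", "living", "kitchen", "living"]
P = transition_probabilities(morning)
```

    Transition models estimated per time-of-day window could then be compared across days to group similar routines, which is one plausible reading of the macro-level clustering in the abstract.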

  17. Graph-based unsupervised segmentation algorithm for cultured neuronal networks' structure characterization and modeling.

    PubMed

    de Santos-Sierra, Daniel; Sendiña-Nadal, Irene; Leyva, Inmaculada; Almendral, Juan A; Ayali, Amir; Anava, Sarit; Sánchez-Ávila, Carmen; Boccaletti, Stefano

    2015-06-01

    Large-scale phase-contrast images taken at high resolution through the life of a cultured neuronal network are analyzed by a graph-based unsupervised segmentation algorithm with a very low computational cost, scaling linearly with the image size. The processing automatically retrieves the whole network structure, an object whose mathematical representation is a matrix in which nodes are identified neurons or neurons' clusters, and links are the reconstructed connections between them. The algorithm is also able to extract any other relevant morphological information characterizing neurons and neurites. More importantly, and at variance with other segmentation methods that require fluorescence imaging from immunocytochemistry techniques, our noninvasive measurements enable us to perform a longitudinal analysis during the maturation of a single culture. Such an analysis provides a way of identifying the main physical processes underlying the self-organization of the neurons' ensemble into a complex network, and drives the formulation of a phenomenological model able to describe qualitatively the overall scenario observed during the culture growth. © 2014 International Society for Advancement of Cytometry.

  18. Audio-based, unsupervised machine learning reveals cyclic changes in earthquake mechanisms in the Geysers geothermal field, California

    NASA Astrophysics Data System (ADS)

    Holtzman, B. K.; Paté, A.; Paisley, J.; Waldhauser, F.; Repetto, D.; Boschi, L.

    2017-12-01

    The earthquake process reflects complex interactions of stress, fracture and frictional properties. New machine learning methods reveal patterns in time-dependent spectral properties of seismic signals and enable identification of changes in faulting processes. Our methods are based closely on those developed for music information retrieval and voice recognition, using the spectrogram instead of the waveform directly. Unsupervised learning involves identification of patterns based on differences among signals without any additional information provided to the algorithm. Clustering of 46,000 earthquakes of $0.3

  19. Systematic Association of Genes to Phenotypes by Genome and Literature Mining

    PubMed Central

    Jensen, Lars J; Perez-Iratxeta, Carolina; Kaczanowski, Szymon; Hooper, Sean D; Andrade, Miguel A

    2005-01-01

    One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far no global analysis has attempted to explore those connections in the light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes we retrieve 323 clusters containing a total of 2,700 significant gene–phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases. PMID:15799710

  20. Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts

    PubMed Central

    Duan, Weisi; Song, Min; Yates, Alexander

    2009-01-01

    Background We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort. Methods We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the principle of finding a maximum margin between clusters, determining a split of the data that maximizes the minimum distance between pairs of data points belonging to two different clusters. Results On a test set of 21 ambiguous keywords from PubMed abstracts, our system has an average accuracy of 78%, outperforming a state-of-the-art unsupervised system by 2% and a baseline technique by 23%. On a standard data set from the National Library of Medicine, our system outperforms the baseline by 6% and comes within 5% of the accuracy of a supervised system. Conclusion Our system is a novel, state-of-the-art technique for efficiently finding word sense clusters, and does not require training data or human effort for each new word to be disambiguated. PMID:19344480
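
    The objective stated above, a split that maximizes the minimum distance between points in different clusters, can be illustrated for the two-cluster case. The sketch below uses a standard result (cutting the heaviest edge of a minimum spanning tree) and is not the authors' graph-based algorithm; the function name and the sample points are assumptions.

```python
import math

def max_margin_split(points):
    """Two-way split maximising the minimum distance between the two clusters."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    # Prim's algorithm: grow a minimum spanning tree starting from point 0.
    best = {j: (dist(0, j), 0) for j in range(1, n)}  # j -> (distance, nearest tree node)
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        d, i = best.pop(j)
        edges.append((d, i, j))
        for k in best:
            dk = dist(j, k)
            if dk < best[k][0]:
                best[k] = (dk, j)

    # Dropping the heaviest tree edge leaves two components; its length is the margin.
    edges.sort()
    margin = edges[-1][0]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, i, j in edges[:-1]:
        parent[find(i)] = find(j)
    # Label 0 for the component containing point 0, label 1 for the other.
    labels = [int(find(i) != find(0)) for i in range(n)]
    return labels, margin

# Hypothetical 2-D context vectors for occurrences of one ambiguous word.
labels, margin = max_margin_split([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)])
```

    This version needs all pairwise distances, so it is quadratic in the number of points; it shows the objective, not an efficient implementation.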

  1. Unsupervised Clustering of Subcellular Protein Expression Patterns in High-Throughput Microscopy Images Reveals Protein Complexes and Functional Relationships between Proteins

    PubMed Central

    Handfield, Louis-François; Chong, Yolanda T.; Simmons, Jibril; Andrews, Brenda J.; Moses, Alan M.

    2013-01-01

    Protein subcellular localization has been systematically characterized in budding yeast using fluorescently tagged proteins. Based on the fluorescence microscopy images, subcellular localization of many proteins can be classified automatically using supervised machine learning approaches that have been trained to recognize predefined image classes based on statistical features. Here, we present an unsupervised analysis of protein expression patterns in a set of high-resolution, high-throughput microscope images. Our analysis is based on 7 biologically interpretable features which are evaluated on automatically identified cells, and whose cell-stage dependency is captured by a continuous model for cell growth. We show that it is possible to identify most previously identified localization patterns in a cluster analysis based on these features and that similarities between the inferred expression patterns contain more information about protein function than can be explained by a previous manual categorization of subcellular localization. Furthermore, the inferred cell-stage associated to each fluorescence measurement allows us to visualize large groups of proteins entering the bud at specific stages of bud growth. These correspond to proteins localized to organelles, revealing that the organelles must be entering the bud in a stereotypical order. We also identify and organize a smaller group of proteins that show subtle differences in the way they move around the bud during growth. Our results suggest that biologically interpretable features based on explicit models of cell morphology will yield unprecedented power for pattern discovery in high-resolution, high-throughput microscopy images. PMID:23785265

  2. Land cover classification in multispectral imagery using clustering of sparse approximations over learned feature dictionaries

    DOE PAGES

    Moody, Daniela I.; Brumby, Steven P.; Rowland, Joel C.; ...

    2014-12-09

    We present results from an ongoing effort to extend neuromimetic machine vision algorithms to multispectral data using adaptive signal processing combined with compressive sensing and machine learning techniques. Our goal is to develop a robust classification methodology that will allow for automated discretization of the landscape into distinct units based on attributes such as vegetation, surface hydrological properties, and topographic/geomorphic characteristics. We use a Hebbian learning rule to build spectral-textural dictionaries that are tailored for classification. We learn our dictionaries from millions of overlapping multispectral image patches and then use a pursuit search to generate classification features. Land cover labels are automatically generated using unsupervised clustering of sparse approximations (CoSA). We demonstrate our method on multispectral WorldView-2 data from a coastal plain ecosystem in Barrow, Alaska. We explore learning from both raw multispectral imagery and normalized band difference indices. We explore a quantitative metric to evaluate the spectral properties of the clusters in order to potentially aid in assigning land cover categories to the cluster labels. In this study, our results suggest CoSA is a promising approach to unsupervised land cover classification in high-resolution satellite imagery.

  3. Classification of damage in structural systems using time series analysis and supervised and unsupervised pattern recognition techniques

    NASA Astrophysics Data System (ADS)

    Omenzetter, Piotr; de Lautour, Oliver R.

    2010-04-01

    Developed for studying long, periodic records of various measured quantities, time series analysis methods are inherently suited and offer interesting possibilities for Structural Health Monitoring (SHM) applications. However, their use in SHM can still be regarded as an emerging application and deserves more studies. In this research, Autoregressive (AR) models were used to fit experimental acceleration time histories from two experimental structural systems, a 3-storey bookshelf-type laboratory structure and the ASCE Phase II SHM Benchmark Structure, in healthy and several damaged states. The coefficients of the AR models were chosen as damage sensitive features. Preliminary visual inspection of the large, multidimensional sets of AR coefficients to check the presence of clusters corresponding to different damage severities was achieved using Sammon mapping - an efficient nonlinear data compression technique. Systematic classification of damage into states based on the analysis of the AR coefficients was achieved using two supervised classification techniques: Nearest Neighbor Classification (NNC) and Learning Vector Quantization (LVQ), and one unsupervised technique: Self-organizing Maps (SOM). This paper discusses the performance of AR coefficients as damage sensitive features and compares the efficiency of the three classification techniques using experimental data.
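
    Extracting AR coefficients as features, as described above, can be sketched with an ordinary least-squares fit. The estimator, the model order, and the simulated signal are assumptions for illustration; the abstract does not state how the authors fitted their AR models.

```python
import numpy as np

def ar_coefficients(x, order):
    """Least-squares fit of x[t] = a1*x[t-1] + ... + ap*x[t-p] + e[t]."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()  # remove the mean before fitting
    # Column k-1 of the design matrix holds the samples lagged by k steps.
    lags = np.column_stack([x[order - k : len(x) - k] for k in range(1, order + 1)])
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(lags, target, rcond=None)
    return coeffs

# Hypothetical acceleration record, simulated here as an AR(2) process.
rng = np.random.default_rng(0)
signal = np.zeros(2000)
for t in range(2, len(signal)):
    signal[t] = 0.6 * signal[t - 1] - 0.3 * signal[t - 2] + rng.normal()
features = ar_coefficients(signal, order=2)  # feature vector for one record
```

    One such coefficient vector per record would form the multidimensional feature set that the mapping and classification techniques then operate on.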

  4. Multi-exemplar affinity propagation.

    PubMed

    Wang, Chang-Dong; Lai, Jian-Huang; Suen, Ching Y; Zhu, Jun-Yong

    2013-09-01

    The affinity propagation (AP) clustering algorithm has received much attention in the past few years. AP is appealing because it is efficient, insensitive to initialization, and it produces clusters at a lower error rate than other exemplar-based methods. However, its single-exemplar model becomes inadequate when applied to model multisubclasses in some situations such as scene analysis and character recognition. To remedy this deficiency, we have extended the single-exemplar model to a multi-exemplar one to create a new multi-exemplar affinity propagation (MEAP) algorithm. This new model automatically determines the number of exemplars in each cluster associated with a super exemplar to approximate the subclasses in the category. Solving the model is NP-hard and we tackle it with the max-sum belief propagation to produce neighborhood maximum clusters, with no need to specify beforehand the number of clusters, multi-exemplars, and superexemplars. Also, utilizing the sparsity in the data, we are able to reduce substantially the computational time and storage. Experimental studies have shown MEAP's significant improvements over other algorithms on unsupervised image categorization and the clustering of handwritten digits.

  5. An evaluation of unsupervised and supervised learning algorithms for clustering landscape types in the United States

    USGS Publications Warehouse

    Wendel, Jochen; Buttenfield, Barbara P.; Stanislawski, Larry V.

    2016-01-01

    Knowledge of landscape type can inform cartographic generalization of hydrographic features, because landscape characteristics provide an important geographic context that affects variation in channel geometry, flow pattern, and network configuration. Landscape types are characterized by expansive spatial gradients, lacking abrupt changes between adjacent classes; and as having a limited number of outliers that might confound classification. The US Geological Survey (USGS) is exploring methods to automate generalization of features in the National Hydrography Data set (NHD), to associate specific sequences of processing operations and parameters with specific landscape characteristics, thus obviating manual selection of a unique processing strategy for every NHD watershed unit. A chronology of methods to delineate physiographic regions for the United States is described, including a recent maximum likelihood classification based on seven input variables. This research compares unsupervised and supervised algorithms applied to these seven input variables, to evaluate and possibly refine the recent classification. Evaluation metrics for unsupervised methods include the Davies–Bouldin index, the Silhouette index, and the Dunn index as well as quantization and topographic error metrics. Cross validation and misclassification rate analysis are used to evaluate supervised classification methods. The paper reports the comparative analysis and its impact on the selection of landscape regions. The compared solutions show problems in areas of high landscape diversity. There is some indication that additional input variables, additional classes, or more sophisticated methods can refine the existing classification.
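
    Of the evaluation metrics listed above, the Silhouette index is the simplest to show in code. The sketch below computes the mean silhouette with Euclidean distances; the function name, the singleton convention, and the toy points are assumptions, and the paper's exact variant may differ.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette width of a labelling (needs at least two clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for k, other in clusters.items()
            if k != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical two-variable samples with a sensible and a poor labelling.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(pts, [0, 0, 1, 1])
bad = mean_silhouette(pts, [0, 1, 0, 1])
```

    Values near 1 indicate compact, well-separated clusters, while negative values suggest points assigned to the wrong cluster.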

  6. Classification and analysis of the Rudaki's Area

    NASA Astrophysics Data System (ADS)

    Zambon, F.; De sanctis, M.; Capaccioni, F.; Filacchione, G.; Carli, C.; Ammannito, E.; Frigeri, A.

    2011-12-01

    During the first two MESSENGER flybys, the Mercury Dual Imaging System (MDIS) mapped 90% of Mercury's surface. An effective way to study the different terrains on planetary surfaces is to apply classification methods. These are based on clustering algorithms and can be divided into two categories: unsupervised and supervised. Unsupervised classifiers do not require analyst feedback; the algorithm automatically organizes pixel values into classes. In the supervised method, instead, the analyst must choose the "training area" that defines the pixel values of a given class. We applied an unsupervised classifier, ISODATA, to the WAC filter images of the Rudaki's area, where several kinds of terrain have been identified, showing differences in albedo, topography and crater density. The ISODATA classifier divides this region into four classes: 1) shadow regions, 2) rough regions, 3) smooth plains, 4) highest-reflectance areas. ISODATA cannot distinguish the high-albedo regions from the highly reflective illuminated edges of the craters; however, the algorithm identifies four classes that can be considered different units mainly on the basis of their reflectances at the various wavelengths. It is not possible, instead, to extrapolate compositional information because of the absence of clear spectral features. An additional analysis was made using ISODATA to choose the "training area" for further supervised classifications. This approach would allow, for example, the edges of the craters to be separated more accurately from the high-reflectance areas, and the low-reflectance regions from the shadow areas.

  7. Training strategy for convolutional neural networks in pedestrian gender classification

    NASA Astrophysics Data System (ADS)

    Ng, Choon-Boon; Tay, Yong-Haur; Goi, Bok-Min

    2017-06-01

    In this work, we studied a strategy for training a convolutional neural network in pedestrian gender classification with a limited amount of labeled training data. Unsupervised learning by k-means clustering on pedestrian images was used to learn the filters to initialize the first layer of the network. As a form of pre-training, supervised learning for the related task of pedestrian classification was performed. Finally, the network was fine-tuned for gender classification. We found that this strategy improved the network's generalization ability in gender classification, achieving better test results than random weight initialization and proving slightly more beneficial than merely initializing the first layer filters by unsupervised learning. This shows that unsupervised learning followed by pre-training with pedestrian images is an effective strategy to learn useful features for pedestrian gender classification.

  8. Analysis of the mutations induced by conazole fungicides in vivo.

    PubMed

    Ross, Jeffrey A; Leavitt, Sharon A

    2010-05-01

    The mouse liver tumorigenic conazole fungicides triadimefon and propiconazole have previously been shown to be in vivo mouse liver mutagens in the Big Blue transgenic mutation assay when administered in feed at tumorigenic doses, whereas the non-tumorigenic conazole myclobutanil was not mutagenic. DNA sequencing of the mutants recovered from each treatment group as well as from animals receiving control diet was conducted to gain additional insight into the mode of action by which tumorigenic conazoles induce mutations. Relative dinucleotide mutabilities (RDMs) were calculated for each possible dinucleotide in each treatment group and then examined by multivariate statistical analysis techniques. Unsupervised hierarchical clustering analysis of RDM values segregated two independent control groups together, along with the non-tumorigen myclobutanil. The two tumorigenic conazoles clustered together in a distinct grouping. Partitioning around medoids of RDM values into two clusters also groups triadimefon and propiconazole together in one cluster and the two control groups and myclobutanil together in a second cluster. Principal component analysis of these results identifies two components that account for 88.3% of the variability in the points. Taken together, these results are consistent with the hypothesis that propiconazole- and triadimefon-induced mutations do not represent clonal expansion of background mutations and support the hypothesis that they arise from the accumulation of reactive electrophilic metabolic intermediates within the liver in vivo.

  9. Unsupervised Anomaly Detection Based on Clustering and Multiple One-Class SVM

    NASA Astrophysics Data System (ADS)

    Song, Jungsuk; Takakura, Hiroki; Okabe, Yasuo; Kwon, Yongjin

    Intrusion detection systems (IDSs) have played an important role in defending our networks from cyber attacks. However, since they are unable to detect unknown attacks, i.e., 0-day attacks, the ultimate challenge in the intrusion detection field is how to identify such attacks accurately in an automated manner. Over the past few years, several studies have addressed this problem with anomaly detection based on unsupervised learning techniques such as clustering, one-class support vector machines (SVMs), etc. Although these techniques make it possible to construct intrusion detection models at low cost and effort, and are capable of detecting unforeseen attacks, they still have two main problems in intrusion detection: a low detection rate and a high false positive rate. In this paper, we propose a new anomaly detection method based on clustering and multiple one-class SVMs in order to improve the detection rate while maintaining a low false positive rate. We evaluated our method using the KDD Cup 1999 data set. Evaluation results show that our approach outperforms the existing algorithms reported in the literature, especially in the detection of unknown attacks.

  10. Clustering for unsupervised fault diagnosis in nuclear turbine shut-down transients

    NASA Astrophysics Data System (ADS)

    Baraldi, Piero; Di Maio, Francesco; Rigamonti, Marco; Zio, Enrico; Seraoui, Redouane

    2015-06-01

    Empirical methods for fault diagnosis usually entail a process of supervised training based on a set of examples of signal evolutions "labeled" with the corresponding, known classes of fault. However, in practice, the signals collected during plant operation may be, very often, "unlabeled", i.e., the information on the corresponding type of occurred fault is not available. To cope with this practical situation, in this paper we develop a methodology for the identification of transient signals showing similar characteristics, under the conjecture that operational/faulty transient conditions of the same type lead to similar behavior in the measured signals evolution. The methodology is founded on a feature extraction procedure, which feeds a spectral clustering technique, embedding the unsupervised fuzzy C-means (FCM) algorithm, which evaluates the functional similarity among the different operational/faulty transients. A procedure for validating the plausibility of the obtained clusters is also propounded based on physical considerations. The methodology is applied to a real industrial case, on the basis of 148 shut-down transients of a Nuclear Power Plant (NPP) steam turbine.
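
    The fuzzy C-means (FCM) step named above can be sketched on its own. This is plain FCM, without the feature extraction or the spectral embedding used in the paper; the function name, the fuzzifier m = 2, the iteration count, and the toy features are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy C-means: returns cluster centres and soft memberships."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centres are membership-weighted means of the samples.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance of every sample to every centre (small offset avoids division by zero).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)), normalised per sample.
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

# Hypothetical 2-D features extracted from two kinds of transients.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
centers, U = fuzzy_c_means(X, n_clusters=2)
hard_labels = U.argmax(axis=1)
```

    Unlike k-means, each transient receives a degree of membership in every cluster, which can flag ambiguous cases instead of forcing a hard assignment.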

  11. Unsupervised Decoding of Long-Term, Naturalistic Human Neural Recordings with Automated Video and Audio Annotations

    PubMed Central

    Wang, Nancy X. R.; Olson, Jared D.; Ojemann, Jeffrey G.; Rao, Rajesh P. N.; Brunton, Bingni W.

    2016-01-01

    Fully automated decoding of human activities and intentions from direct neural recordings is a tantalizing challenge in brain-computer interfacing. Implementing Brain Computer Interfaces (BCIs) outside carefully controlled experiments in laboratory settings requires adaptive and scalable strategies with minimal supervision. Here we describe an unsupervised approach to decoding neural states from naturalistic human brain recordings. We analyzed continuous, long-term electrocorticography (ECoG) data recorded over many days from the brain of subjects in a hospital room, with simultaneous audio and video recordings. We discovered coherent clusters in high-dimensional ECoG recordings using hierarchical clustering and automatically annotated them using speech and movement labels extracted from audio and video. To our knowledge, this represents the first time techniques from computer vision and speech processing have been used for natural ECoG decoding. Interpretable behaviors were decoded from ECoG data, including moving, speaking and resting; the results were assessed by comparison with manual annotation. Discovered clusters were projected back onto the brain revealing features consistent with known functional areas, opening the door to automated functional brain mapping in natural settings. PMID:27148018

  12. Unsupervised user similarity mining in GSM sensor networks.

    PubMed

    Shad, Shafqat Ali; Chen, Enhong

    2013-01-01

    Mobility data has attracted researchers for the past few years because of its rich context and spatiotemporal nature; this information can be used for potential applications such as early warning systems, route prediction, traffic management, advertisement, social networking, and community finding. All of these applications are based on mobility profile building and user trend analysis, where mobility profile building is done through significant places extraction, prediction of the user's actual movement, and context awareness. However, significant places extraction and prediction of the user's actual movement for mobility profile building are non-trivial tasks. In this paper, we present a user similarity mining methodology based on user mobility profile building, using the semantic tagging information provided by the user and basic GSM network architecture properties with an unsupervised clustering approach. As the mobility information is in low-level raw form, our proposed methodology converts it into high-level meaningful information by using cell-Id location information rather than previously used location capturing methods such as GPS, infrared, and Wi-Fi for profile mining and user similarity mining.

  13. Self-Organizing Hidden Markov Model Map (SOHMMM).

    PubMed

    Ferles, Christos; Stafylopatis, Andreas

    2013-12-01

    A hybrid approach combining the Self-Organizing Map (SOM) and the Hidden Markov Model (HMM) is presented. The Self-Organizing Hidden Markov Model Map (SOHMMM) establishes a cross-section between the theoretic foundations and algorithmic realizations of its constituents. The respective architectures and learning methodologies are fused in an attempt to meet the increasing requirements imposed by the properties of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein chain molecules. The fusion and synergy of the SOM unsupervised training and the HMM dynamic programming algorithms bring forth a novel on-line gradient descent unsupervised learning algorithm, which is fully integrated into the SOHMMM. Since the SOHMMM carries out probabilistic sequence analysis with little or no prior knowledge, it can have a variety of applications in clustering, dimensionality reduction and visualization of large-scale sequence spaces, and also, in sequence discrimination, search and classification. Two series of experiments based on artificial sequence data and splice junction gene sequences demonstrate the SOHMMM's characteristics and capabilities. Copyright © 2013 Elsevier Ltd. All rights reserved.

  14. Unsupervised Learning (Clustering) of Odontocete Echolocation Clicks

    DTIC Science & Technology

    2015-09-30

    of their bandwidth. Results on Risso’s dolphins (Grampus griseus), Pacific white-sided dolphins (Lagenorhynchus obliquidens), and Cuvier’s beaked...acoustic encounters to see which ones appeared to be closely related to one another. We noted that some of the Pacific white-sided and Risso’s dolphin ...should be clusterable. The group of odontocetes that we cannot label reliably by their acoustic features, primarily common dolphins (Delphinus spp

  15. ClusterTAD: an unsupervised machine learning approach to detecting topologically associated domains of chromosomes from Hi-C data.

    PubMed

    Oluwadare, Oluwatosin; Cheng, Jianlin

    2017-11-14

    With the development of chromosomal conformation capturing techniques, particularly, the Hi-C technique, the study of the spatial conformation of a genome is becoming an important topic in bioinformatics and computational biology. The Hi-C technique can generate genome-wide chromosomal interaction (contact) data, which can be used to investigate the higher-level organization of chromosomes, such as Topologically Associated Domains (TAD), i.e., locally packed chromosome regions bounded together by intra chromosomal contacts. The identification of the TADs for a genome is useful for studying gene regulation, genomic interaction, and genome function. Here, we formulate the TAD identification problem as an unsupervised machine learning (clustering) problem, and develop a new TAD identification method called ClusterTAD. We introduce a novel method to represent chromosomal contacts as features to be used by the clustering algorithm. Our results show that ClusterTAD can accurately predict the TADs on a simulated Hi-C data. Our method is also largely complementary and consistent with existing methods on the real Hi-C datasets of two mouse cells. The validation with the chromatin immunoprecipitation (ChIP) sequencing (ChIP-Seq) data shows that the domain boundaries identified by ClusterTAD have a high enrichment of CTCF binding sites, promoter-related marks, and enhancer-related histone modifications. As ClusterTAD is based on a proven clustering approach, it opens a new avenue to apply a large array of clustering methods developed in the machine learning field to the TAD identification problem. The source code, the results, and the TADs generated for the simulated and real Hi-C datasets are available here: https://github.com/BDM-Lab/ClusterTAD .

  16. Unsupervised classification of cirrhotic livers using MRI data

    NASA Astrophysics Data System (ADS)

    Lee, Gobert; Kanematsu, Masayuki; Kato, Hiroki; Kondo, Hiroshi; Zhou, Xiangrong; Hara, Takeshi; Fujita, Hiroshi; Hoshi, Hiroaki

    2008-03-01

    Cirrhosis of the liver is a chronic disease. It is characterized by the presence of widespread nodules and fibrosis in the liver which results in characteristic texture patterns. Computerized analysis of hepatic texture patterns is usually based on regions-of-interest (ROIs). However, not all ROIs are typical representatives of the disease stage of the liver from which the ROIs originated. This leads to uncertainties in the ROI labels (diseased or non-diseased). On the other hand, supervised classifiers are commonly used in determining the assignment rule. This presents a problem as the training of a supervised classifier requires the correct labels of the ROIs. The main purpose of this paper is to investigate the use of an unsupervised classifier, the k-means clustering, in classifying ROI based data. In addition, a procedure for generating a receiver operating characteristic (ROC) curve depicting the classification performance of k-means clustering is also reported. Hepatic MRI images of 44 patients (16 cirrhotic; 28 non-cirrhotic) are used in this study. The MRI data are derived from gadolinium-enhanced equilibrium phase images. For each patient, 10 ROIs selected by an experienced radiologist and 7 texture features measured on each ROI are included in the MRI data. Results of the k-means classifier are depicted using an ROC curve. The area under the curve (AUC) has a value of 0.704. This is slightly lower than but comparable to that of LDA and ANN classifiers which have values 0.781 and 0.801, respectively. Methods in constructing ROC curve in relation to k-means clustering have not been previously reported in the literature.

  17. Nonlinear dimension reduction and clustering by Minimum Curvilinearity unfold neuropathic pain and tissue embryological classes.

    PubMed

    Cannistraci, Carlo Vittorio; Ravasi, Timothy; Montevecchi, Franco Maria; Ideker, Trey; Alessio, Massimo

    2010-09-15

    Nonlinear small datasets, which are characterized by low numbers of samples and very high numbers of measures, occur frequently in computational biology, and pose problems in their investigation. Unsupervised hybrid-two-phase (H2P) procedures (specifically dimension reduction (DR) coupled with clustering) provide valuable assistance, not only for unsupervised data classification, but also for visualization of the patterns hidden in high-dimensional feature space. 'Minimum Curvilinearity' (MC) is a principle that, for small datasets, suggests the approximation of curvilinear sample distances in the feature space by pair-wise distances over their minimum spanning tree (MST), and thus avoids the introduction of any tuning parameter. MC is used to design two novel forms of nonlinear machine learning (NML): Minimum Curvilinear embedding (MCE) for DR, and Minimum Curvilinear affinity propagation (MCAP) for clustering. Compared with several other unsupervised and supervised algorithms, MCE and MCAP, whether individually or combined in H2P, overcome the limits of classical approaches. High performance was attained in the visualization and classification of: (i) pain patients (proteomic measurements) in peripheral neuropathy; (ii) human organ tissues (genomic transcription factor measurements) on the basis of their embryological origin. MC provides a valuable framework to estimate nonlinear distances in small datasets. Its extension to large datasets is prefigured for novel NMLs. Classification of neuropathic pain by proteomic profiles offers new insights for future molecular and systems biology characterization of pain. Improvements in tissue embryological classification refine results obtained in an earlier study, and suggest a possible reinterpretation of skin attribution as mesodermal. https://sites.google.com/site/carlovittoriocannistraci/home.
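
    The core idea stated above, replacing straight-line sample distances with distances measured along the minimum spanning tree, can be sketched as follows. This is a minimal illustration only, not the authors' MCE or MCAP code: the function names, the Euclidean base metric and the toy points are assumptions made for the example.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, weight) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection of each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def mst_path_distances(points):
    """All-pairs distances measured along the MST instead of along straight lines."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    for i, j, w in mst_edges(points):
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = [[0.0] * n for _ in range(n)]
    for src in range(n):
        stack = [(src, -1, 0.0)]          # depth-first walk over the tree
        while stack:
            node, prev, acc = stack.pop()
            dist[src][node] = acc
            for nxt, w in adj[node]:
                if nxt != prev:
                    stack.append((nxt, node, acc + w))
    return dist

# Hypothetical samples lying on a bent curve: the two end points are close in
# a straight line but far apart when the distance follows the tree.
pts = [(0.0, 0.0), (0.0, 2.0), (2.0, 2.0), (2.0, 0.5)]
D = mst_path_distances(pts)
```

    In this toy case the tree distance between the first and last point is the sum of the three tree edges, which is larger than their direct Euclidean distance; no tuning parameter is involved.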

  18. Accuracy of un-supervised versus provider-supervised self-administered HIV testing in Uganda: A randomized implementation trial.

    PubMed

    Asiimwe, Stephen; Oloya, James; Song, Xiao; Whalen, Christopher C

    2014-12-01

    Unsupervised HIV self-testing (HST) has potential to increase knowledge of HIV status; however, its accuracy is unknown. To estimate the accuracy of unsupervised HST in field settings in Uganda, we performed a non-blinded, randomized controlled, non-inferiority trial of unsupervised compared with supervised HST among selected high HIV risk fisherfolk (22.1 % HIV Prevalence) in three fishing villages in Uganda between July and September 2013. The study enrolled 246 participants and randomized them in a 1:1 ratio to unsupervised HST or provider-supervised HST. In an intent-to-treat analysis, the HST sensitivity was 90 % in the unsupervised arm and 100 % among the provider-supervised, yielding a difference of -10 % (90 % CI -21, 1 %); non-inferiority was not shown. In a per protocol analysis, the difference in sensitivity was -5.6 % (90 % CI -14.4, 3.3 %) and did show non-inferiority. We conclude that unsupervised HST is feasible in rural Africa and may be non-inferior to provider-supervised HST.

  19. Subspace K-means clustering.

    PubMed

    Timmerman, Marieke E; Ceulemans, Eva; De Roover, Kim; Van Leeuwen, Karla

    2013-12-01

    To achieve an insightful clustering of multivariate data, we propose subspace K-means. Its central idea is to model the centroids and cluster residuals in reduced spaces, which allows for dealing with a wide range of cluster types and yields rich interpretations of the clusters. We review the existing related clustering methods, including deterministic, stochastic, and unsupervised learning approaches. To evaluate subspace K-means, we performed a comparative simulation study, in which we manipulated the overlap of subspaces, the between-cluster variance, and the error variance. The study shows that the subspace K-means algorithm is sensitive to local minima but that the problem can be reasonably dealt with by using partitions of various cluster procedures as a starting point for the algorithm. Subspace K-means performs very well in recovering the true clustering across all conditions considered and appears to be superior to its competitor methods: K-means, reduced K-means, factorial K-means, mixtures of factor analyzers (MFA), and MCLUST. The best competitor method, MFA, showed a performance similar to that of subspace K-means in easy conditions but deteriorated in more difficult ones. Using data from a study on parental behavior, we show that subspace K-means analysis provides a rich insight into the cluster characteristics, in terms of both the relative positions of the clusters (via the centroids) and the shape of the clusters (via the within-cluster residuals).

  20. Interactive knowledge discovery and data mining on genomic expression data with numeric formal concept analysis.

    PubMed

    González-Calabozo, Jose M; Valverde-Albacete, Francisco J; Peláez-Moreno, Carmen

    2016-09-15

    Gene Expression Data (GED) analysis poses a great challenge to the scientific community that can be framed into the Knowledge Discovery in Databases (KDD) and Data Mining (DM) paradigm. Biclustering has emerged as the machine learning method of choice to solve this task, but its unsupervised nature makes result assessment problematic. This is often addressed by means of Gene Set Enrichment Analysis (GSEA). We put forward a framework in which GED analysis is understood as an Exploratory Data Analysis (EDA) process where we provide support for continuous human interaction with data aiming at improving the step of hypothesis abduction and assessment. We focus on the adaptation to human cognition of data interpretation and visualization of the output of EDA. First, we give a proper theoretical background to bi-clustering using Lattice Theory and provide a set of analysis tools revolving around [Formula: see text]-Formal Concept Analysis ([Formula: see text]-FCA), a lattice-theoretic unsupervised learning technique for real-valued matrices. By using different kinds of cost structures to quantify expression we obtain different sequences of hierarchical bi-clusterings for gene under- and over-expression using thresholds. Consequently, we provide a method with interleaved analysis steps and visualization devices so that the sequences of lattices for a particular experiment summarize the researcher's vision of the data. This also allows us to define measures of persistence and robustness of biclusters to assess them. Second, the resulting biclusters are used to index external omics databases (for instance, Gene Ontology (GO)), thus offering a new way of accessing publicly available resources. This provides different flavors of gene set enrichment against which to assess the biclusters, by obtaining their p-values according to the terminology of those resources. We illustrate the exploration procedure on a real data example confirming results previously published.
    The GED analysis problem gets transformed into the exploration of a sequence of lattices enabling the visualization of the hierarchical structure of the biclusters with a certain degree of granularity. The ability of FCA-based bi-clustering methods to index external databases such as GO allows us to obtain a quality measure of the biclusters, to observe the evolution of a gene throughout the different biclusters it appears in, and to look for relevant biclusters (by observing their genes and what their persistence is) in order to infer, for instance, hypotheses on their function.

  1. Raman spectroscopy of normal oral buccal mucosa tissues: study on intact and incised biopsies

    NASA Astrophysics Data System (ADS)

    Deshmukh, Atul; Singh, S. P.; Chaturvedi, Pankaj; Krishna, C. Murali

    2011-12-01

    Oral squamous cell carcinoma is among the top 10 malignancies. Optical spectroscopy, including Raman, is being actively pursued as an alternative/adjunct for cancer diagnosis. Earlier studies have demonstrated the feasibility of classifying normal, premalignant, and malignant oral ex vivo tissues. Spectral features showed predominance of lipids and proteins in normal and cancer conditions, respectively, which were attributed to membrane lipids and surface proteins. In view of recent developments in deep tissue Raman spectroscopy, we have recorded Raman spectra from superior and inferior surfaces of 10 normal oral tissues on intact, as well as incised, biopsies after separation of epithelium from connective tissue. Spectral variations and similarities among different groups were explored by unsupervised (principal component analysis) and supervised (linear discriminant analysis, factorial discriminant analysis) methodologies. Clusters of spectra from superior and inferior surfaces of intact tissues show a high overlap, whereas spectra from separated epithelium and connective tissue sections yielded clear clusters, though they also overlap with clusters of intact tissues. Spectra of all four groups of normal tissues gave exclusive clusters when tested against malignant spectra. Thus, this study demonstrates that spectra recorded from the superior surface of an intact tissue may have contributions from deeper layers, but this has no bearing on the classification of malignant tissues.

  2. Differentiation of Uterine Leiomyosarcoma from Atypical Leiomyoma: Diagnostic Accuracy of Qualitative MR Imaging Features and Feasibility of Texture Analysis.

    PubMed

    Lakhman, Yulia; Veeraraghavan, Harini; Chaim, Joshua; Feier, Diana; Goldman, Debra A; Moskowitz, Chaya S; Nougaret, Stephanie; Sosa, Ramon E; Vargas, Hebert Alberto; Soslow, Robert A; Abu-Rustum, Nadeem R; Hricak, Hedvig; Sala, Evis

    2017-07-01

    To investigate whether qualitative magnetic resonance (MR) features can distinguish leiomyosarcoma (LMS) from atypical leiomyoma (ALM) and assess the feasibility of texture analysis (TA). This retrospective study included 41 women (ALM = 22, LMS = 19) imaged with MRI prior to surgery. Two readers (R1, R2) evaluated each lesion for qualitative MR features. Associations between MR features and LMS were evaluated with Fisher's exact test. Accuracy measures were calculated for the four most significant features. TA was performed for 24 patients (ALM = 14, LMS = 10) with uniform imaging following lesion segmentation on axial T2-weighted images. Texture features were pre-selected using Wilcoxon signed-rank test with Bonferroni correction and analyzed with unsupervised clustering to separate LMS from ALM. Four qualitative MR features most strongly associated with LMS were nodular borders, haemorrhage, "T2 dark" area(s), and central unenhanced area(s) (p ≤ 0.0001 each feature/reader). The highest sensitivity [1.00 (95%CI:0.82-1.00)/0.95 (95%CI: 0.74-1.00)] and specificity [0.95 (95%CI:0.77-1.00)/1.00 (95%CI:0.85-1.00)] were achieved for R1/R2, respectively, when a lesion had ≥3 of these four features. Sixteen texture features differed significantly between LMS and ALM (p-values: <0.001-0.036). Unsupervised clustering achieved accuracy of 0.75 (sensitivity: 0.70; specificity: 0.79). Combination of ≥3 qualitative MR features accurately distinguished LMS from ALM. TA was feasible. • Four qualitative MR features demonstrated the strongest statistical association with LMS. • Combination of ≥3 of these features could accurately differentiate LMS from ALM. • Texture analysis was a feasible semi-automated approach for lesion categorization.
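
    As a reading aid for the clustering figures above, the short sketch below shows how sensitivity, specificity and accuracy follow from a confusion table. The per-cell counts are not given in the abstract; the values used here are hypothetical counts chosen only because they are consistent with the reported group sizes (10 LMS, 14 ALM) and the reported rates.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-table counts."""
    sensitivity = tp / (tp + fn)          # fraction of LMS cases placed in the LMS cluster
    specificity = tn / (tn + fp)          # fraction of ALM cases placed in the ALM cluster
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical counts (an assumption, not taken from the study):
# 10 LMS split as 7 correct / 3 missed, 14 ALM split as 11 correct / 3 wrong.
sens, spec, acc = rates(tp=7, fn=3, tn=11, fp=3)
print(round(sens, 2), round(spec, 2), round(acc, 2))  # prints 0.7 0.79 0.75
```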

  3. Guasom Analysis Of The Alhambra Survey

    NASA Astrophysics Data System (ADS)

    Garabato, Daniel; Manteiga, Minia; Dafonte, Carlos; Álvarez, Marco A.

    2017-10-01

    GUASOM is a data mining tool designed for knowledge discovery in large astronomical spectrophotometric archives developed in the framework of Gaia DPAC (Data Processing and Analysis Consortium). Our tool is based on a type of unsupervised learning Artificial Neural Network named Self-organizing maps (SOMs). SOMs permit the grouping and visualization of large amounts of data for which there is no a priori knowledge, and hence they are very useful for analyzing the huge amount of information present in modern spectrophotometric surveys. SOMs are used to organize the information in clusters of objects, as homogeneously as possible according to their spectral energy distributions, and to project them onto a 2D grid where the data structure can be visualized. Each cluster has a representative, called a prototype, which is a virtual pattern that best represents or resembles the set of input patterns belonging to that cluster. Prototypes make it easier to determine the physical nature and properties of the objects populating each cluster. Our algorithm has been tested on the ALHAMBRA survey spectrophotometric observations. Here we present our results concerning the survey segmentation, visualization of the data structure, separation between types of objects (stars and galaxies), data homogeneity of neurons, cluster prototypes, redshift distribution and crossmatch with other databases (Simbad).

  4. Influenza vaccine response profiles are affected by vaccine preparation and preexisting immunity, but not HIV infection.

    PubMed

    Berger, Christoph T; Greiff, Victor; Mehling, Matthias; Fritz, Stefanie; Meier, Marc A; Hoenger, Gideon; Conen, Anna; Recher, Mike; Battegay, Manuel; Reddy, Sai T; Hess, Christoph

    2015-01-01

    Vaccines dramatically reduce infection-related morbidity and mortality. Determining factors that modulate the host response is key to rational vaccine design and demands unsupervised analysis. To longitudinally resolve influenza-specific humoral immune response dynamics we constructed vaccine response profiles of influenza A- and B-specific IgM and IgG levels from 42 healthy and 31 HIV infected influenza-vaccinated individuals. Pre-vaccination antibody levels and levels at 3 predefined time points after vaccination were included in each profile. We performed hierarchical clustering on these profiles to study the extent to which HIV infection associated immune dysfunction, adaptive immune factors (pre-existing influenza-specific antibodies, T cell responses), an innate immune factor (Mannose Binding Lectin, MBL), demographic characteristics (gender, age), or the vaccine preparation (split vs. virosomal) impacted the immune response to influenza vaccination. Hierarchical clustering associated vaccine preparation and pre-existing IgG levels with the profiles of healthy individuals. In contrast to previous in vitro and animal data, MBL levels had no impact on the adaptive vaccine response. Importantly, while HIV infected subjects with low CD4 T cell counts showed a reduced magnitude of their vaccine response, their response profiles were indistinguishable from those of healthy controls, suggesting quantitative but not qualitative deficits. Unsupervised profile-based analysis ranks factors impacting the vaccine-response by relative importance, with substantial implications for comparing, designing and improving vaccine preparations and strategies. Profile similarity between HIV infected and HIV negative individuals suggests merely quantitative differences in the vaccine response in these individuals, offering a rationale for boosting strategies in the HIV infected population.

  5. A cost-function approach to rival penalized competitive learning (RPCL).

    PubMed

    Ma, Jinwen; Wang, Taijun

    2006-08-01

    Rival penalized competitive learning (RPCL) has been shown to be a useful tool for clustering on a set of sample data in which the number of clusters is unknown. However, the RPCL algorithm was proposed heuristically and still lacks a mathematical theory to describe its convergence behavior. In order to solve the convergence problem, we investigate it via a cost-function approach. By theoretical analysis, we prove that a general form of RPCL, called distance-sensitive RPCL (DSRPCL), is associated with the minimization of a cost function on the weight vectors of a competitive learning network. As a DSRPCL process decreases the cost to a local minimum, a number of weight vectors eventually fall into a hypersphere surrounding the sample data, while the other weight vectors diverge to infinity. Moreover, it is shown by the theoretical analysis and simulation experiments that if the cost is reduced to the global minimum, a correct number of weight vectors is automatically selected and located around the centers of the actual clusters, respectively. Finally, we apply the DSRPCL algorithms to unsupervised color image segmentation and classification of the wine data.
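
    For orientation, a single update of the classical heuristic RPCL rule can be sketched as below. This is not the DSRPCL cost-function algorithm of the abstract; it omits the frequency-based "conscience" weighting of the original RPCL, and the learning rates and toy weight vectors are assumptions chosen for illustration.

```python
import math

def rpcl_step(weights, x, lr_win=0.05, lr_rival=0.002):
    """One simplified RPCL update: pull the winner toward x, push the rival away."""
    order = sorted(range(len(weights)), key=lambda j: math.dist(weights[j], x))
    winner, rival = order[0], order[1]    # closest and second-closest weight vectors
    new = [list(w) for w in weights]
    new[winner] = [w + lr_win * (xi - w) for w, xi in zip(weights[winner], x)]
    new[rival] = [w - lr_rival * (xi - w) for w, xi in zip(weights[rival], x)]
    return new, winner, rival

# Hypothetical weight vectors and one sample.
weights = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
sample = [0.2, 0.1]
updated, winner, rival = rpcl_step(weights, sample)
```

    Repeating this step over many samples is what lets surplus weight vectors drift away from the data, which is the behavior the abstract analyzes through its cost function.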

  6. Impact of the Choice of Normalization Method on Molecular Cancer Class Discovery Using Nonnegative Matrix Factorization.

    PubMed

    Yang, Haixuan; Seoighe, Cathal

    2016-01-01

    Nonnegative Matrix Factorization (NMF) has proved to be an effective method for unsupervised clustering analysis of gene expression data. By the nonnegativity constraint, NMF provides a decomposition of the data matrix into two matrices that have been used for clustering analysis. However, the decomposition is not unique. This allows different clustering results to be obtained, resulting in different interpretations of the decomposition. To alleviate this problem, some existing methods directly enforce uniqueness to some extent by adding regularization terms in the NMF objective function. Alternatively, various normalization methods have been applied to the factor matrices; however, the effects of the choice of normalization have not been carefully investigated. Here we investigate the performance of NMF for the task of cancer class discovery, under a wide range of normalization choices. After extensive evaluations, we observe that the maximum norm showed the best performance, although the maximum norm has not previously been used for NMF. Matlab codes are freely available from: http://maths.nuigalway.ie/~haixuanyang/pNMF/pNMF.htm.
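
    The non-uniqueness mentioned above can be made concrete with a small sketch: rescaling the columns of one factor and compensating in the other leaves the product unchanged, yet it can change an argmax-based cluster assignment. The matrices, the choice to normalize the columns of W by their maximum, and the argmax assignment rule are assumptions for illustration and are not taken from the authors' Matlab code.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def normalize_max(W, H):
    """Rescale so each column of W has maximum 1; H absorbs the scale, so W @ H is unchanged."""
    scales = [max(row[k] for row in W) for k in range(len(H))]
    Wn = [[row[k] / scales[k] for k in range(len(H))] for row in W]
    Hn = [[scales[k] * h for h in H[k]] for k in range(len(H))]
    return Wn, Hn

def assign(H):
    """Assign each sample (a column of H) to the factor with the largest coefficient."""
    return [max(range(len(H)), key=lambda k: H[k][j]) for j in range(len(H[0]))]

# Hypothetical factors: 3 genes x 2 factors and 2 factors x 3 samples.
W = [[2.0, 0.1], [4.0, 0.2], [0.0, 0.5]]
H = [[0.3, 0.1, 0.2], [0.5, 0.9, 0.25]]
Wn, Hn = normalize_max(W, H)

before, after = assign(H), assign(Hn)   # same product W @ H, different assignments
```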

  7. Cluster Analysis Identifies 3 Phenotypes within Allergic Asthma.

    PubMed

    Sendín-Hernández, María Paz; Ávila-Zarza, Carmelo; Sanz, Catalina; García-Sánchez, Asunción; Marcos-Vadillo, Elena; Muñoz-Bellido, Francisco J; Laffond, Elena; Domingo, Christian; Isidoro-García, María; Dávila, Ignacio

    Asthma is a heterogeneous chronic disease with different clinical expressions and responses to treatment. In recent years, several unbiased approaches based on clinical, physiological, and molecular features have described several phenotypes of asthma. Some phenotypes are allergic, but little is known about whether these phenotypes can be further subdivided. We aimed to phenotype patients with allergic asthma using an unbiased approach based on multivariate classification techniques (unsupervised hierarchical cluster analysis). From a total of 54 variables of 225 patients with well-characterized allergic asthma diagnosed following American Thoracic Society (ATS) recommendation, positive skin prick test to aeroallergens, and concordant symptoms, we finally selected 19 variables by multiple correspondence analyses. Then a cluster analysis was performed. Three groups were identified. Cluster 1 was constituted by patients with intermittent or mild persistent asthma, without family antecedents of atopy, asthma, or rhinitis. This group showed the lowest total IgE levels. Cluster 2 was constituted by patients with mild asthma with a family history of atopy, asthma, or rhinitis. Total IgE levels were intermediate. Cluster 3 included patients with moderate or severe persistent asthma that needed treatment with corticosteroids and long-acting β-agonists. This group showed the highest total IgE levels. We identified 3 phenotypes of allergic asthma in our population. Furthermore, we described 2 phenotypes of mild atopic asthma mainly differentiated by a family history of allergy. Copyright © 2017 American Academy of Allergy, Asthma & Immunology. Published by Elsevier Inc. All rights reserved.

  8. Image segmentation using fuzzy LVQ clustering networks

    NASA Technical Reports Server (NTRS)

    Tsao, Eric Chen-Kuo; Bezdek, James C.; Pal, Nikhil R.

    1992-01-01

    In this note we formulate image segmentation as a clustering problem. Feature vectors extracted from a raw image are clustered into subregions, thereby segmenting the image. A fuzzy generalization of a Kohonen learning vector quantization (LVQ) which integrates the Fuzzy c-Means (FCM) model with the learning rate and updating strategies of the LVQ is used for this task. This network, which segments images in an unsupervised manner, is thus related to the FCM optimization problem. Numerical examples on photographic and magnetic resonance images are given to illustrate this approach to image segmentation.
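
    The Fuzzy c-Means model that this network builds on alternates two standard updates, memberships and centers. The sketch below shows those batch FCM updates on toy grey-level features; it is not the fuzzy LVQ network itself, and the fuzzifier m = 2, the starting centers and the data are assumptions for illustration.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """u[i][k]: degree to which point k belongs to cluster i (each column sums to 1)."""
    u = [[0.0] * len(points) for _ in centers]
    for k, x in enumerate(points):
        d = [math.dist(x, c) for c in centers]
        if 0.0 in d:                       # point lies exactly on one or more centers
            hits = [i for i, di in enumerate(d) if di == 0.0]
            for i in hits:
                u[i][k] = 1.0 / len(hits)
            continue
        for i in range(len(centers)):
            u[i][k] = 1.0 / sum((d[i] / dj) ** (2.0 / (m - 1.0)) for dj in d)
    return u

def fcm_centers(points, u, m=2.0):
    """Each center is the mean of the points weighted by membership**m."""
    centers = []
    for row in u:
        w = [uik ** m for uik in row]
        total = sum(w)
        centers.append([sum(wk * x[dim] for wk, x in zip(w, points)) / total
                        for dim in range(len(points[0]))])
    return centers

# Toy "pixel features": two dark and two bright grey levels (hypothetical data).
pixels = [[0.1], [0.2], [0.8], [0.9]]
centers = [[0.0], [1.0]]
for _ in range(20):
    u = fcm_memberships(pixels, centers)
    centers = fcm_centers(pixels, u)

# Hardening the fuzzy memberships gives a segmentation label per pixel.
labels = [max(range(len(centers)), key=lambda i: u[i][k]) for k in range(len(pixels))]
```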

  9. Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification.

    PubMed

    Mehryary, Farrokh; Kaewphan, Suwisa; Hakala, Kai; Ginter, Filip

    2016-01-01

    Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classification produced by the unsupervised clustering method and manual annotation. The method is evaluated on the official test set of BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of the state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not solely bound to the EVEX resource and can thus be used to improve the quality of any event extraction system or database.
    The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.

  10. Machine learning in APOGEE. Unsupervised spectral classification with K-means

    NASA Astrophysics Data System (ADS)

    Garcia-Dias, Rafael; Allende Prieto, Carlos; Sánchez Almeida, Jorge; Ordovás-Pascual, Ignacio

    2018-05-01

    Context. The volume of data generated by astronomical surveys is growing rapidly. Traditional analysis techniques in spectroscopy either demand intensive human interaction or are computationally expensive. In this scenario, machine learning, and unsupervised clustering algorithms in particular, offer interesting alternatives. The Apache Point Observatory Galactic Evolution Experiment (APOGEE) offers a vast data set of near-infrared stellar spectra, which is perfect for testing such alternatives. Aims: Our research applies an unsupervised classification scheme based on K-means to the massive APOGEE data set. We explore whether the data are amenable to classification into discrete classes. Methods: We apply the K-means algorithm to 153 847 high resolution spectra (R ≈ 22 500). We discuss the main virtues and weaknesses of the algorithm, as well as our choice of parameters. Results: We show that a classification based on normalised spectra captures the variations in stellar atmospheric parameters, chemical abundances, and rotational velocity, among other factors. The algorithm is able to separate the bulge and halo populations, and distinguish dwarfs, sub-giants, RC, and RGB stars. However, a discrete classification in flux space does not result in a neat organisation in the parameters' space. Furthermore, the lack of obvious groups in flux space causes the results to be fairly sensitive to the initialisation, and disrupts the efficiency of commonly-used methods to select the optimal number of clusters. Our classification is publicly available, including extensive online material associated with the APOGEE Data Release 12 (DR12). Conclusions: Our description of the APOGEE database can help greatly with the identification of specific types of targets for various applications. We find a lack of obvious groups in flux space, and identify limitations of the K-means algorithm in dealing with this kind of data. 
Full Tables B.1-B.4 are only available at the CDS via anonymous ftp to http://cdsarc.u-strasbg.fr (http://130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/612/A98
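
    The sensitivity to initialisation mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' pipeline: it is a plain Lloyd's-algorithm K-means written with the Python standard library, run on synthetic points that deliberately lack obvious groups. The data, the choice k=4, and the function names are all assumptions made for illustration.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, n_iter=100):
    """Plain Lloyd's algorithm. Returns (labels, centers, inertia)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # random initialisation
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [min(range(k), key=lambda j: sqdist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, inertia

# Synthetic stand-in for spectra (assumption): 300 points spread uniformly
# over a 5-dimensional "flux space", i.e. without obvious groups.
data_rng = random.Random(42)
spectra = [[data_rng.random() for _ in range(5)] for _ in range(300)]

inertias = []
for seed in range(5):
    labels, centers, inertia = kmeans(spectra, k=4, seed=seed)
    inertias.append(inertia)
    print(f"seed={seed}  inertia={inertia:.3f}")
```

    Comparing the printed inertia values across seeds gives a rough feel for how much the final partition depends on the starting centres when the data form a continuum rather than separated groups.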

  11. Unsupervised deep learning reveals prognostically relevant subtypes of glioblastoma.

    PubMed

    Young, Jonathan D; Cai, Chunhui; Lu, Xinghua

    2017-10-03

    One approach to improving the personalized treatment of cancer is to understand the cellular signaling transduction pathways that cause cancer at the level of the individual patient. In this study, we used unsupervised deep learning to learn the hierarchical structure within cancer gene expression data. Deep learning is a group of machine learning algorithms that use multiple layers of hidden units to capture hierarchically related, alternative representations of the input data. We hypothesize that this hierarchical structure learned by deep learning will be related to the cellular signaling system. Robust deep learning model selection identified a network architecture that is biologically plausible. Our model selection results indicated that the 1st hidden layer of our deep learning model should contain about 1300 hidden units to most effectively capture the covariance structure of the input data. This agrees with the estimated number of human transcription factors, which is approximately 1400. This result lends support to our hypothesis that the 1st hidden layer of a deep learning model trained on gene expression data may represent signals related to transcription factor activation. Using the 3rd hidden layer representation of each tumor as learned by our unsupervised deep learning model, we performed consensus clustering on all tumor samples-leading to the discovery of clusters of glioblastoma multiforme with differential survival. One of these clusters contained all of the glioblastoma samples with G-CIMP, a known methylation phenotype driven by the IDH1 mutation and associated with favorable prognosis, suggesting that the hidden units in the 3rd hidden layer representations captured a methylation signal without explicitly using methylation data as input. We also found differentially expressed genes and well-known mutations (NF1, IDH1, EGFR) that were uniquely correlated with each of these clusters. 
Exploring these unique genes and mutations will allow us to further investigate the disease mechanisms underlying each of these clusters. In summary, we show that a deep learning model can be trained to represent biologically and clinically meaningful abstractions of cancer gene expression data. Understanding what additional relationships these hidden layer abstractions have with the cancer cellular signaling system could have a significant impact on the understanding and treatment of cancer.

  12. An unsupervised two-stage clustering approach for forest structure classification based on X-band InSAR data - A case study in complex temperate forest stands

    NASA Astrophysics Data System (ADS)

    Abdullahi, Sahra; Schardt, Mathias; Pretzsch, Hans

    2017-05-01

    Forest structure at stand level plays a key role for sustainable forest management, since the biodiversity, productivity, growth and stability of the forest can be positively influenced by managing its structural diversity. In contrast to field-based measurements, remote sensing techniques offer a cost-efficient opportunity to collect area-wide information about forest stand structure with high spatial and temporal resolution. Interferometric Synthetic Aperture Radar (InSAR) in particular, which facilitates worldwide acquisition of 3D information independent of weather conditions and illumination, is well suited to capturing forest stand structure. This study proposes an unsupervised two-stage clustering approach for forest structure classification based on height information derived from interferometric X-band SAR data, applied to complex temperate forest stands of Traunstein forest (South Germany). In particular, a four-dimensional input data set composed of first-order height statistics was non-linearly projected onto a two-dimensional Self-Organizing Map, spatially ordered according to similarity (based on the Euclidean distance), in the first stage and classified using the k-means algorithm in the second stage. The study demonstrated that X-band InSAR data exhibit considerable capabilities for forest structure classification. Moreover, the unsupervised classification approach achieved meaningful and reasonable results when compared with aerial imagery and LiDAR data.
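
    The two-stage idea described above (project the samples onto a Self-Organizing Map, then run k-means on the map) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the 4x4 grid, the learning-rate and neighbourhood schedules, the two synthetic groups of 4-D "height statistics", and all function names are hypothetical.

```python
import math
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(data, rows, cols, n_epochs=20, seed=0):
    """Stage 1: fit a small rectangular SOM; returns {(row, col): weights}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per grid node, initialised from random samples.
    nodes = {(r, c): list(rng.choice(data))
             for r in range(rows) for c in range(cols)}
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / n_steps
            lr = 0.5 * (1.0 - frac)                          # decaying rate
            radius = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5
            # Best-matching unit: node whose weights are closest to x.
            bmu = min(nodes, key=lambda n: sqdist(x, nodes[n]))
            for (r, c), w in nodes.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # neighbourhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return nodes

def kmeans(points, k, n_iter=50, seed=0):
    """Stage 2: plain k-means on the node weights; returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sqdist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 4-D height statistics for two kinds of stand (synthetic).
rng = random.Random(1)
low = [[rng.gauss(5, 1) for _ in range(4)] for _ in range(40)]
tall = [[rng.gauss(25, 1) for _ in range(4)] for _ in range(40)]
data = low + tall

nodes = train_som(data, rows=4, cols=4)
keys = list(nodes)
node_labels = dict(zip(keys, kmeans([nodes[n] for n in keys], k=2)))
# Each sample inherits the cluster of its best-matching SOM node.
sample_labels = [node_labels[min(keys, key=lambda n: sqdist(x, nodes[n]))]
                 for x in data]
print(sample_labels)
```

    Clustering the map nodes rather than the raw samples is what makes the procedure two-stage: k-means only ever sees the 16 node weight vectors, and the samples are labelled indirectly through their best-matching node.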

  13. Classification of high-resolution multi-swath hyperspectral data using Landsat 8 surface reflectance data as a calibration target and a novel histogram based unsupervised classification technique to determine natural classes from biophysically relevant fit parameters

    NASA Astrophysics Data System (ADS)

    McCann, C.; Repasky, K. S.; Morin, M.; Lawrence, R. L.; Powell, S. L.

    2016-12-01

    Compact, cost-effective, flight-based hyperspectral imaging systems can provide scientifically relevant data over large areas for a variety of applications such as ecosystem studies, precision agriculture, and land management. To fully realize this capability, unsupervised classification techniques based on radiometrically-calibrated data that cluster based on biophysical similarity rather than simply spectral similarity are needed. An automated technique to produce high-resolution, large-area, radiometrically-calibrated hyperspectral data sets based on the Landsat surface reflectance data product as a calibration target was developed and applied to three subsequent years of data covering approximately 1850 hectares. The radiometrically-calibrated data allows inter-comparison of the temporal series. Advantages of the radiometric calibration technique include the need for minimal site access, no ancillary instrumentation, and automated processing. Fitting the reflectance spectra of each pixel using a set of biophysically relevant basis functions reduces the data from 80 spectral bands to 9 parameters providing noise reduction and data compression. Examination of histograms of these parameters allows for determination of natural splitting into biophysical similar clusters. This method creates clusters that are similar in terms of biophysical parameters, not simply spectral proximity. Furthermore, this method can be applied to other data sets, such as urban scenes, by developing other physically meaningful basis functions. The ability to use hyperspectral imaging for a variety of important applications requires the development of data processing techniques that can be automated. The radiometric-calibration combined with the histogram based unsupervised classification technique presented here provide one potential avenue for managing big-data associated with hyperspectral imaging.
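
    The histogram-based splitting described above can be illustrated for a single fit parameter. The sketch below is an assumption-laden simplification, not the authors' technique: it builds an equal-width histogram, takes the tallest bin in each half of the range as a peak, and places a class boundary at the emptiest bin between the two peaks. The bin count, the peak-finding rule, and the synthetic parameter values are hypothetical.

```python
import random

def histogram(values, n_bins):
    """Return (counts, lo, width) for equal-width bins spanning the data.

    Assumes the values are not all identical (width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # maximum goes in last bin
        counts[idx] += 1
    return counts, lo, width

def valley_threshold(values, n_bins=20):
    """Place a split at the emptiest bin between the two dominant peaks."""
    counts, lo, width = histogram(values, n_bins)
    half = n_bins // 2
    # Crude peak finding (assumption): tallest bin in each half of the range.
    peak_lo = max(range(half), key=lambda i: counts[i])
    peak_hi = max(range(half, n_bins), key=lambda i: counts[i])
    between = range(peak_lo + 1, peak_hi)
    if not between:  # peaks in adjacent bins: split at their shared edge
        return lo + peak_hi * width
    valley = min(between, key=lambda i: counts[i])
    return lo + (valley + 0.5) * width  # centre of the valley bin

# Synthetic values of one fit parameter drawn from two populations.
rng = random.Random(0)
param = ([rng.gauss(2.0, 0.3) for _ in range(200)]
         + [rng.gauss(6.0, 0.3) for _ in range(200)])

threshold = valley_threshold(param)
classes = [0 if v < threshold else 1 for v in param]
print(f"split at {threshold:.2f}; class sizes: "
      f"{classes.count(0)}, {classes.count(1)}")
```

    Applying the same one-dimensional split to each fit parameter in turn would yield classes defined by parameter ranges, which is one way to read "natural splitting" in the abstract; the actual procedure may differ.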

  14. Enhanced HMAX model with feedforward feature learning for multiclass categorization.

    PubMed

    Li, Yinlin; Wu, Wei; Zhang, Bo; Li, Fengfu

    2015-01-01

    In recent years, the interdisciplinary research between neuroscience and computer vision has promoted development in both fields. Many biologically inspired visual models have been proposed, and among them, the Hierarchical Max-pooling model (HMAX) is a feedforward model mimicking the structures and functions of V1 to posterior inferotemporal (PIT) layer of the primate visual cortex, which could generate a series of position- and scale-invariant features. However, it could be improved with attention modulation and memory processing, which are two important properties of the primate visual cortex. Thus, in this paper, based on recent biological research on the primate visual cortex, we still mimic the first 100-150 ms of visual cognition to enhance the HMAX model, which mainly focuses on the unsupervised feedforward feature learning process. The main modifications are as follows: (1) To mimic the attention modulation mechanism of V1 layer, a bottom-up saliency map is computed in the S1 layer of the HMAX model, which can support the initial feature extraction for memory processing; (2) To mimic the learning, clustering and short-term memory to long-term memory conversion abilities of V2 and IT, an unsupervised iterative clustering method is used to learn clusters with multiscale middle level patches, which are taken as long-term memory; (3) Inspired by the multiple feature encoding mode of the primate visual cortex, information including color, orientation, and spatial position is encoded in different layers of the HMAX model progressively. By adding a softmax layer at the top of the model, multiclass categorization experiments can be conducted, and the results on Caltech101 show that the enhanced model with a smaller memory size exhibits higher accuracy than the original HMAX model, and could also achieve better accuracy than other unsupervised feature learning methods in the multiclass categorization task.

  15. Enhancement of Tropical Land Cover Mapping with Wavelet-Based Fusion and Unsupervised Clustering of SAR and Landsat Image Data

    NASA Technical Reports Server (NTRS)

    LeMoigne, Jacqueline; Laporte, Nadine; Netanyahuy, Nathan S.; Zukor, Dorothy (Technical Monitor)

    2001-01-01

    The characterization and the mapping of land cover/land use of forest areas, such as the Central African rainforest, is a very complex task. This complexity is mainly due to the extent of such areas and, as a consequence, to the lack of full and continuous cloud-free coverage of those large regions by one single remote sensing instrument. In order to provide improved vegetation maps of Central Africa and to develop forest monitoring techniques for applications at the local and regional scales, we propose to utilize multi-sensor remote sensing observations coupled with in-situ data. Fusion and clustering of multi-sensor data are the first steps towards the development of such a forest monitoring system. In this paper, we will describe some preliminary experiments involving the fusion of SAR and Landsat image data of the Lope Reserve in Gabon. Similarly to previous fusion studies, our fusion method is wavelet-based. The fusion provides a new image data set which contains more detailed texture features and preserves the large homogeneous regions that are observed by the Thematic Mapper sensor. The fusion step is followed by unsupervised clustering and provides a vegetation map of the area.

  16. Design of partially supervised classifiers for multispectral image data

    NASA Technical Reports Server (NTRS)

    Jeon, Byeungwoo; Landgrebe, David

    1993-01-01

    A partially supervised classification problem is addressed, especially when the class definition and corresponding training samples are provided a priori only for just one particular class. In practical applications of pattern classification techniques, a frequently observed characteristic is the heavy, often nearly impossible requirements on representative prior statistical class characteristics of all classes in a given data set. Considering the effort in both time and man-power required to have a well-defined, exhaustive list of classes with a corresponding representative set of training samples, this 'partially' supervised capability would be very desirable, assuming adequate classifier performance can be obtained. Two different classification algorithms are developed to achieve simplicity in classifier design by reducing the requirement of prior statistical information without sacrificing significant classifying capability. The first one is based on optimal significance testing, where the optimal acceptance probability is estimated directly from the data set. In the second approach, the partially supervised classification is considered as a problem of unsupervised clustering with initially one known cluster or class. A weighted unsupervised clustering procedure is developed to automatically define other classes and estimate their class statistics. The operational simplicity thus realized should make these partially supervised classification schemes very viable tools in pattern classification.

  17. Recapitulation of Ayurveda constitution types by machine learning of phenotypic traits.

    PubMed

    Tiwari, Pradeep; Kutum, Rintu; Sethi, Tavpritesh; Shrivastava, Ankita; Girase, Bhushan; Aggarwal, Shilpi; Patil, Rutuja; Agarwal, Dhiraj; Gautam, Pramod; Agrawal, Anurag; Dash, Debasis; Ghosh, Saurabh; Juvekar, Sanjay; Mukerji, Mitali; Prasher, Bhavana

    2017-01-01

    In the Ayurveda system of medicine, individuals are classified into seven constitution types, "Prakriti", for assessing disease susceptibility and drug responsiveness. Prakriti evaluation involves clinical examination including questions about physiological and behavioural traits. A need was felt to develop models for accurately predicting Prakriti classes that have been shown to exhibit molecular differences. The present study was carried out on data of phenotypic attributes in 147 healthy individuals of three extreme Prakriti types, from a genetically homogeneous population of Western India. Unsupervised and supervised machine learning approaches were used to infer inherent structure of the data, and for feature selection and building classification models for Prakriti respectively. These models were validated in a North Indian population. Unsupervised clustering led to the emergence of three natural clusters corresponding to three extreme Prakriti classes. The supervised modelling approaches could classify individuals, with distinct Prakriti types, in the training and validation sets. This study is the first to demonstrate that Prakriti types are distinct verifiable clusters within a multidimensional space of multiple interrelated phenotypic traits. It also provides a computational framework for predicting Prakriti classes from phenotypic attributes. This approach may be useful in precision medicine for stratification of endophenotypes in healthy and diseased populations.

  18. Comparison Between Supervised and Unsupervised Classifications of Neuronal Cell Types: A Case Study

    PubMed Central

    Guerra, Luis; McGarry, Laura M; Robles, Víctor; Bielza, Concha; Larrañaga, Pedro; Yuste, Rafael

    2011-01-01

    In the study of neural circuits, it becomes essential to discern the different neuronal cell types that build the circuit. Traditionally, neuronal cell types have been classified using qualitative descriptors. More recently, several attempts have been made to classify neurons quantitatively, using unsupervised clustering methods. While useful, these algorithms do not take advantage of previous information known to the investigator, which could improve the classification task. For neocortical GABAergic interneurons, the problem to discern among different cell types is particularly difficult and better methods are needed to perform objective classifications. Here we explore the use of supervised classification algorithms to classify neurons based on their morphological features, using a database of 128 pyramidal cells and 199 interneurons from mouse neocortex. To evaluate the performance of different algorithms we used, as a “benchmark,” the test to automatically distinguish between pyramidal cells and interneurons, defining “ground truth” by the presence or absence of an apical dendrite. We compared hierarchical clustering with a battery of different supervised classification algorithms, finding that supervised classifications outperformed hierarchical clustering. In addition, the selection of subsets of distinguishing features enhanced the classification accuracy for both sets of algorithms. The analysis of selected variables indicates that dendritic features were most useful to distinguish pyramidal cells from interneurons when compared with somatic and axonal morphological variables. We conclude that supervised classification algorithms are better matched to the general problem of distinguishing neuronal cell types when some information on these cell groups, in our case being pyramidal or interneuron, is known a priori. 
As a spin-off of this methodological study, we provide several methods to automatically distinguish neocortical pyramidal cells from interneurons, based on their morphologies. © 2010 Wiley Periodicals, Inc. Develop Neurobiol 71: 71–82, 2011 PMID:21154911

  19. On the clustering of multidimensional pictorial data

    NASA Technical Reports Server (NTRS)

    Bryant, J. D. (Principal Investigator)

    1979-01-01

    Obvious approaches to reducing the cost (in computer resources) of applying current clustering techniques to the problem of remote sensing are discussed. The use of spatial information in finding fields and in classifying mixture pixels is examined, and the AMOEBA clustering program is described. Internally a pattern recognition program, AMOEBA appears from without to be an unsupervised clustering program. It is fast and automatic. No choices (such as arbitrary thresholds to set split/combine sequences) need be made. The problem of finding the number of clusters is solved automatically. At the conclusion of the program, all points in the scene are classified; however, a provision is included for a reject classification of some points which, within the theoretical framework, cannot rationally be assigned to any cluster.

  20. Compositional Variability Associated with Stickney Crater on Phobos

    NASA Technical Reports Server (NTRS)

    Roush, T. L.; Hogan, R. C.

    2001-01-01

    Unsupervised clustering techniques identified four regions in and near Stickney crater on Phobos having unique spectral properties. These spectra are best matched by spectra of naturally occurring materials, e.g., lunar soils, meteorites, and rocks. Additional information is contained in the original extended abstract.

  1. Methods for automatically analyzing humpback song units.

    PubMed

    Rickwood, Peter; Taylor, Andrew

    2008-03-01

    This paper presents mathematical techniques for automatically extracting and analyzing bioacoustic signals. Automatic techniques are described for isolation of target signals from background noise, extraction of features from target signals and unsupervised classification (clustering) of the target signals based on these features. The only user-provided input, other than raw sound, is an initial set of signal processing and control parameters. Of particular note is that the number of signal categories is determined automatically. The techniques, applied to hydrophone recordings of humpback whales (Megaptera novaeangliae), produce promising initial results, suggesting that they may be of use in automated analysis not only of humpbacks but possibly also in other bioacoustic settings where automated analysis is desirable.

  2. Modified vegetation indices for Ganoderma disease detection in oil palm from field spectroradiometer data

    NASA Astrophysics Data System (ADS)

    Shafri, Helmi Z. M.; Anuar, M. Izzuddin; Saripan, M. Iqbal

    2009-10-01

    High resolution field spectroradiometers are important for spectral analysis and mobile inspection of vegetation disease. The biggest challenges in using this technology for automated vegetation disease detection are in spectral signature pre-processing, band selection and generating reflectance indices to improve the ability of hyperspectral data for early detection of disease. In this paper, new indices for oil palm Ganoderma disease detection were generated using band ratio and different band combination techniques. An unsupervised clustering method was used to cluster the values of each class resulting from each index. The quality of band combinations was assessed using the Optimum Index Factor (OIF), while cluster validation was executed using Average Silhouette Width (ASW). Eleven modified reflectance indices were generated in this study and the indices were ranked according to the values of their ASW. These modified indices were also compared to several existing and new indices. The results showed that the combination of spectral values at 610.5 nm and 738 nm was the best for clustering the three classes of infection levels in the determination of the best spectral index for early detection of Ganoderma disease.
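
    The Average Silhouette Width used above to rank indices can be computed from its standard definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the silhouette is (b - a) / max(a, b). The sketch below applies that definition to two made-up one-dimensional "index" columns; the values, the labels, and the function name are assumptions for illustration, not data from the study.

```python
import math

def average_silhouette_width(points, labels):
    """Mean silhouette over all points.

    Assumes at least two clusters and that the points are not all identical.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Three hypothetical infection classes, three samples each.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
# Index A separates the classes cleanly; index B mixes them.
index_a = [(0.10,), (0.12,), (0.11,), (0.50,), (0.52,), (0.51,),
           (0.90,), (0.91,), (0.92,)]
index_b = [(0.10,), (0.50,), (0.90,), (0.12,), (0.52,), (0.91,),
           (0.11,), (0.51,), (0.92,)]

asw_a = average_silhouette_width(index_a, labels)
asw_b = average_silhouette_width(index_b, labels)
print(f"ASW index A: {asw_a:.3f}, ASW index B: {asw_b:.3f}")
```

    Ranking candidate indices by this score favours those whose values keep each class compact and far from the others, which is the role ASW plays in the abstract.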

  3. Accelerating atomic structure search with cluster regularization

    NASA Astrophysics Data System (ADS)

    Sørensen, K. H.; Jørgensen, M. S.; Bruix, A.; Hammer, B.

    2018-06-01

    We present a method for accelerating the global structure optimization of atomic compounds. The method is demonstrated to speed up the finding of the anatase TiO2(001)-(1 × 4) surface reconstruction within a density functional tight-binding theory framework using an evolutionary algorithm. As a key element of the method, we use unsupervised machine learning techniques to categorize atoms present in a diverse set of partially disordered surface structures into clusters of atoms having similar local atomic environments. Analysis of more than 1000 different structures shows that the total energy of the structures correlates with the summed distances of the atomic environments to their respective cluster centers in feature space, where the sum runs over all atoms in each structure. Our method is formulated as a gradient based minimization of this summed cluster distance for a given structure and alternates with a standard gradient based energy minimization. While the latter minimization ensures local relaxation within a given energy basin, the former enables escapes from meta-stable basins and hence increases the overall performance of the global optimization.
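
    The summed cluster distance described above can be illustrated with a toy gradient descent. This sketch makes strong simplifying assumptions that are not in the abstract: each atom's feature vector is treated as directly adjustable (in the real method the gradient would pass through the map from atomic coordinates to local-environment features), the cluster centres are fixed, squared distances are used so the gradient is smooth, and the alternating energy minimization is omitted. All names and numbers are hypothetical.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(x, centers):
    """Cluster centre closest to feature vector x."""
    return min(centers, key=lambda c: sqdist(x, c))

def summed_cluster_distance(features, centers):
    """Sum over atoms of the squared distance to the nearest centre."""
    return sum(sqdist(x, nearest_center(x, centers)) for x in features)

def descent_step(features, centers, step=0.1):
    """One gradient step on the summed squared cluster distance."""
    moved = []
    for x in features:
        c = nearest_center(x, centers)
        # The gradient of |x - c|^2 with respect to x is 2 (x - c).
        moved.append([xi - step * 2.0 * (xi - ci) for xi, ci in zip(x, c)])
    return moved

# Two fixed cluster centres and four "atomic environments" in a 2-D
# feature space (toy values).
centers = [[0.0, 0.0], [4.0, 0.0]]
features = [[0.9, 0.7], [1.5, -0.8], [3.1, 0.6], [4.8, -0.5]]

costs = [summed_cluster_distance(features, centers)]
for _ in range(10):
    features = descent_step(features, centers)
    costs.append(summed_cluster_distance(features, centers))

print(" -> ".join(f"{c:.3f}" for c in costs))
```

    Each step pulls every environment toward its nearest centre, so the cost falls monotonically in this toy setting; in the method described in the abstract, such steps alternate with ordinary energy relaxation.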

  4. Source Apportionment and Risk Assessment of Emerging Contaminants: An Approach of Pharmaco-Signature in Water Systems

    PubMed Central

    Jiang, Jheng Jie; Lee, Chon Lin; Fang, Meng Der; Boyd, Kenneth G.; Gibb, Stuart W.

    2015-01-01

    This paper presents a methodology based on multivariate data analysis for characterizing potential source contributions of emerging contaminants (ECs) detected in 26 river water samples across multi-scape regions during dry and wet seasons. Based on this methodology, we unveil an approach toward potential source contributions of ECs, a concept we refer to as the “Pharmaco-signature.” Exploratory analysis of data points has been carried out by unsupervised pattern recognition (hierarchical cluster analysis, HCA) and receptor model (principal component analysis-multiple linear regression, PCA-MLR) in an attempt to demonstrate significant source contributions of ECs in different land-use zone. Robust cluster solutions grouped the database according to different EC profiles. PCA-MLR identified that 58.9% of the mean summed ECs were contributed by domestic impact, 9.7% by antibiotics application, and 31.4% by drug abuse. Diclofenac, ibuprofen, codeine, ampicillin, tetracycline, and erythromycin-H2O have significant pollution risk quotients (RQ>1), indicating potentially high risk to aquatic organisms in Taiwan. PMID:25874375

  5. Post-processing interstitialcy diffusion from molecular dynamics simulations

    NASA Astrophysics Data System (ADS)

    Bhardwaj, U.; Bukkuru, S.; Warrier, M.

    2016-01-01

    An algorithm to rigorously trace the interstitialcy diffusion trajectory in crystals is developed. The algorithm incorporates unsupervised learning and graph optimization which obviate the need to input extra domain specific information depending on crystal or temperature of the simulation. The algorithm is implemented in a flexible framework as a post-processor to molecular dynamics (MD) simulations. We describe in detail the reduction of interstitialcy diffusion into known computational problems of unsupervised clustering and graph optimization. We also discuss the steps, computational efficiency and key components of the algorithm. Using the algorithm, thermal interstitialcy diffusion from low to near-melting point temperatures is studied. We encapsulate the algorithms in a modular framework with functionality to calculate diffusion coefficients, migration energies and other trajectory properties. The study validates the algorithm by establishing the conformity of output parameters with experimental values and provides detailed insights for the interstitialcy diffusion mechanism. The algorithm along with the help of supporting visualizations and analysis gives convincing details and a new approach to quantifying diffusion jumps, jump-lengths, time between jumps and to identify interstitials from lattice atoms.

  6. Unsupervised User Similarity Mining in GSM Sensor Networks

    PubMed Central

    Shad, Shafqat Ali; Chen, Enhong

    2013-01-01

    Mobility data has attracted researchers for the past few years because of its rich context and spatiotemporal nature, where this information can be used for potential applications like early warning systems, route prediction, traffic management, advertisement, social networking, and community finding. All the mentioned applications are based on mobility profile building and user trend analysis, where mobility profile building is done through significant places extraction, user's actual movement prediction, and context awareness. However, significant places extraction and user's actual movement prediction for mobility profile building are a trivial task. In this paper, we present the user similarity mining-based methodology through user mobility profile building by using the semantic tagging information provided by user and basic GSM network architecture properties based on unsupervised clustering approach. As the mobility information is in low-level raw form, our proposed methodology successfully converts it to high-level meaningful information by using the cell-Id location information rather than previously used location capturing methods like GPS, Infrared, and Wifi for profile mining and user similarity mining. PMID:23576905

  7. Post-processing interstitialcy diffusion from molecular dynamics simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bhardwaj, U., E-mail: haptork@gmail.com; Bukkuru, S.; Warrier, M.

    2016-01-15

    An algorithm to rigorously trace the interstitialcy diffusion trajectory in crystals is developed. The algorithm incorporates unsupervised learning and graph optimization which obviate the need to input extra domain specific information depending on crystal or temperature of the simulation. The algorithm is implemented in a flexible framework as a post-processor to molecular dynamics (MD) simulations. We describe in detail the reduction of interstitialcy diffusion into known computational problems of unsupervised clustering and graph optimization. We also discuss the steps, computational efficiency and key components of the algorithm. Using the algorithm, thermal interstitialcy diffusion from low to near-melting point temperatures is studied. We encapsulate the algorithms in a modular framework with functionality to calculate diffusion coefficients, migration energies and other trajectory properties. The study validates the algorithm by establishing the conformity of output parameters with experimental values and provides detailed insights for the interstitialcy diffusion mechanism. The algorithm along with the help of supporting visualizations and analysis gives convincing details and a new approach to quantifying diffusion jumps, jump-lengths, time between jumps and to identify interstitials from lattice atoms.

  8. Source Apportionment of Atmospheric Particles by Electron Probe X-Ray Microanalysis and Receptor Models.

    NASA Astrophysics Data System (ADS)

    van Borm, Werner August

    Electron probe X-ray microanalysis (EPXMA) in combination with an automation system and an energy-dispersive X-ray detection system was used to analyse thousands of microscopical particles, originating from the ambient atmosphere. The huge amount of data was processed by a newly developed X-ray correction method and a number of data reduction procedures. A standardless ZAF procedure for EPXMA was developed for quick semi-quantitative analysis of particles starting from simple corrections, valid for bulk samples and modified taking into account the finite particle diameter, assuming a spherical shape. Tested on a limited database of bulk and particulate samples, the compromise between calculation speed and accuracy yielded for elements with Z > 14 accuracies on concentrations less than 10% while absolute deviations remained below 4 weight%, thus being only important for low concentrations. Next, the possibilities for the use of supervised and unsupervised multivariate particle classification were investigated for source apportionment of individual particles. In a detailed study of the unsupervised cluster analysis technique several aspects were considered that have a severe influence on the final cluster analysis results, i.e. data acquisition, X-ray peak identification, data normalization, scaling, variable selection, similarity measure, cluster strategy, cluster significance and error propagation. A supervised approach was developed using an expert system-like approach in which identification rules are built to describe the particle classes in a unique manner. Applications are presented for particles sampled (1) near a zinc smelter (Vieille-Montagne, Balen, Belgium), analyzed for heavy metals, (2) in an urban aerosol (Antwerp, Belgium), analyzed for over 20 elements and (3) in a rural aerosol originating from a Swiss mountain area (Bern). Thus it was possible to pinpoint a number of known and unknown sources and characterize their emissions in terms of particle abundance and particle composition. Alternatively, the bulk analysis of filters (total, fine and coarse mode) using Particle Induced X-Ray Emission (PIXE) and the application of a receptor modeling approach provided complementary information on a macroscopical level. A computer program was developed incorporating an absolute factor analysis based receptor modeling procedure. Source profiles and contributions are described by elemental concentrations and an atmospheric mass balance is put forward. The latter method was applied in a two year study of the Antwerp urban aerosol and for the Swiss aerosol, revealing a number of previously known and unknown sources. Both methods were successfully combined to increase the source resolution.

  9. Phenotypic mapping of metabolic profiles using self-organizing maps of high-dimensional mass spectrometry data.

    PubMed

    Goodwin, Cody R; Sherrod, Stacy D; Marasco, Christina C; Bachmann, Brian O; Schramm-Sapyta, Nicole; Wikswo, John P; McLean, John A

    2014-07-01

    A metabolic system is composed of inherently interconnected metabolic precursors, intermediates, and products. The analysis of untargeted metabolomics data has conventionally been performed through the use of comparative statistics or multivariate statistical analysis-based approaches; however, each falls short in representing the related nature of metabolic perturbations. Herein, we describe a complementary method for the analysis of large metabolite inventories using a data-driven approach based upon a self-organizing map algorithm. This workflow allows for the unsupervised clustering, and subsequent prioritization of, correlated features through Gestalt comparisons of metabolic heat maps. We describe this methodology in detail, including a comparison to conventional metabolomics approaches, and demonstrate the application of this method to the analysis of the metabolic repercussions of prolonged cocaine exposure in rat sera profiles.

  10. Machine-learned cluster identification in high-dimensional data.

    PubMed

    Ultsch, Alfred; Lötsch, Jörn

    2017-02-01

    High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the used cluster algorithm works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogenously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM). Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM the distance structure in the high dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms including single linkage, Ward and k-means. Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. However, ESOM/U-matrix was correct in identifying clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high dimensional biomedical data. The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. 
By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.

  11. Species collapse via hybridization in Darwin's tree finches.

    PubMed

    Kleindorfer, Sonia; O'Connor, Jody A; Dudaniec, Rachael Y; Myers, Steven A; Robertson, Jeremy; Sulloway, Frank J

    2014-03-01

    Species hybridization can lead to fitness costs, species collapse, and novel evolutionary trajectories in changing environments. Hybridization is predicted to be more common when environmental conditions change rapidly. Here, we test patterns of hybridization in three sympatric tree finch species (small tree finch Camarhynchus parvulus, medium tree finch Camarhynchus pauper, and large tree finch: Camarhynchus psittacula) that are currently recognized on Floreana Island, Galápagos Archipelago. Genetic analysis of microsatellite data from contemporary samples showed two genetic populations and one hybrid cluster in both 2005 and 2010; hybrid individuals were derived from genetic population 1 (small morph) and genetic population 2 (large morph). Females of the large and rare species were more likely to pair with males of the small common species. Finch populations differed in morphology in 1852-1906 compared with 2005/2010. An unsupervised clustering method showed (a) support for three morphological clusters in the historical tree finch sample (1852-1906), which is consistent with current species recognition; (b) support for two or three morphological clusters in 2005 with some (19%) hybridization; and (c) support for just two morphological clusters in 2010 with frequent (41%) hybridization. We discuss these findings in relation to species demarcations of Camarhynchus tree finches on Floreana Island.

  12. Infrared vehicle recognition using unsupervised feature learning based on K-feature

    NASA Astrophysics Data System (ADS)

    Lin, Jin; Tan, Yihua; Xia, Haijiao; Tian, Jinwen

    2018-02-01

    Owing to the complex battlefield environment, it is difficult to establish a complete knowledge base for practical applications of vehicle recognition algorithms. Infrared vehicle recognition, which plays an important role in remote sensing, remains difficult and challenging. In this paper we propose a new unsupervised feature learning method based on K-feature to recognize vehicles in infrared images. First, a saliency-based target detection algorithm is applied to the initial image. Then, the K-feature, generated by a k-means clustering algorithm that learns a visual dictionary from a large number of unlabeled samples, is computed to suppress false alarms and improve accuracy. Finally, the vehicle recognition result is obtained after some post-processing. Extensive experiments demonstrate that the proposed method achieves satisfactory recognition effectiveness and robustness for vehicle recognition in infrared images under complex backgrounds, and that it also improves reliability.

  13. Investigating intracranial tumour growth patterns with multiparametric MRI incorporating Gd‐DTPA and USPIO‐enhanced imaging

    PubMed Central

    Borri, Marco; Jury, Alexa; Popov, Sergey; Box, Gary; Perryman, Lara; Eccles, Suzanne A.; Jones, Chris; Robinson, Simon P.

    2016-01-01

    High grade and metastatic brain tumours exhibit considerable spatial variations in proliferation, angiogenesis, invasion, necrosis and oedema. Vascular heterogeneity arising from vascular co‐option in regions of invasive growth (in which the blood–brain barrier remains intact) and neoangiogenesis is a major challenge faced in the assessment of brain tumours by conventional MRI. A multiparametric MRI approach, incorporating native measurements and both Gd‐DTPA (Magnevist) and ultrasmall superparamagnetic iron oxide (P904)‐enhanced imaging, was used in combination with histogram and unsupervised cluster analysis using a k‐means algorithm to examine the spatial distribution of vascular parameters, water diffusion characteristics and invasion in intracranially propagated rat RG2 gliomas and human MDA‐MB‐231 LM2–4 breast adenocarcinomas in mice. Both tumour models presented with higher ΔR1 (the change in longitudinal relaxation rate R1 induced by Gd‐DTPA), fractional blood volume (fBV) and apparent diffusion coefficient than uninvolved regions of the brain. MDA‐MB‐231 LM2–4 tumours were less densely cellular than RG2 tumours and exhibited substantial local invasion, associated with oedema, whereas invasion in RG2 tumours was minimal. These additional features were reflected in the more heterogeneous appearance of MDA‐MB‐231 LM2–4 tumours on T2‐weighted images and maps of functional MRI parameters. Unsupervised cluster analysis separated subregions with distinct functional properties; areas with a low fBV and relatively impermeable blood vessels (low ΔR1) were predominantly located at the tumour margins, regions of MDA‐MB‐231 LM2–4 tumours with relatively high levels of water diffusion and low vascular permeability and/or fBV corresponded to histologically identified regions of invasion and oedema, and areas of mismatch between vascular permeability and blood volume were identified. 
We demonstrate that dual contrast MRI and evaluation of tissue diffusion properties, coupled with cluster analysis, allows for the assessment of heterogeneity within invasive brain tumours and the designation of functionally diverse subregions that may provide more informative predictive biomarkers. PMID:27671990

  14. Investigating intracranial tumour growth patterns with multiparametric MRI incorporating Gd-DTPA and USPIO-enhanced imaging.

    PubMed

    Boult, Jessica K R; Borri, Marco; Jury, Alexa; Popov, Sergey; Box, Gary; Perryman, Lara; Eccles, Suzanne A; Jones, Chris; Robinson, Simon P

    2016-11-01

    High grade and metastatic brain tumours exhibit considerable spatial variations in proliferation, angiogenesis, invasion, necrosis and oedema. Vascular heterogeneity arising from vascular co-option in regions of invasive growth (in which the blood-brain barrier remains intact) and neoangiogenesis is a major challenge faced in the assessment of brain tumours by conventional MRI. A multiparametric MRI approach, incorporating native measurements and both Gd-DTPA (Magnevist) and ultrasmall superparamagnetic iron oxide (P904)-enhanced imaging, was used in combination with histogram and unsupervised cluster analysis using a k-means algorithm to examine the spatial distribution of vascular parameters, water diffusion characteristics and invasion in intracranially propagated rat RG2 gliomas and human MDA-MB-231 LM2-4 breast adenocarcinomas in mice. Both tumour models presented with higher ΔR1 (the change in longitudinal relaxation rate R1 induced by Gd-DTPA), fractional blood volume (fBV) and apparent diffusion coefficient than uninvolved regions of the brain. MDA-MB-231 LM2-4 tumours were less densely cellular than RG2 tumours and exhibited substantial local invasion, associated with oedema, whereas invasion in RG2 tumours was minimal. These additional features were reflected in the more heterogeneous appearance of MDA-MB-231 LM2-4 tumours on T2-weighted images and maps of functional MRI parameters. Unsupervised cluster analysis separated subregions with distinct functional properties; areas with a low fBV and relatively impermeable blood vessels (low ΔR1) were predominantly located at the tumour margins, regions of MDA-MB-231 LM2-4 tumours with relatively high levels of water diffusion and low vascular permeability and/or fBV corresponded to histologically identified regions of invasion and oedema, and areas of mismatch between vascular permeability and blood volume were identified. 
We demonstrate that dual contrast MRI and evaluation of tissue diffusion properties, coupled with cluster analysis, allows for the assessment of heterogeneity within invasive brain tumours and the designation of functionally diverse subregions that may provide more informative predictive biomarkers. © 2016 The Authors. NMR in Biomedicine published by John Wiley & Sons Ltd.
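
    The k-means step named in this abstract can be illustrated with a minimal sketch of Lloyd's algorithm. It is not the authors' pipeline: the function name `kmeans`, the caller-supplied initial centroids, and the treatment of each voxel as a plain tuple of parameter values are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, n_iter=100):
    """Lloyd's algorithm on tuples of floats (illustrative sketch).

    `centroids` holds the caller-supplied starting positions; n_iter must be >= 1.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(n_iter):
        # Assignment step: each point takes the label of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(
                    tuple(sum(p[d] for p in members) / len(members)
                          for d in range(len(old)))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return labels, centroids
```

    In practice the result depends on the starting centroids, so several initializations are usually compared.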

  15. Clustering analysis of line indices for LAMOST spectra with AstroStat

    NASA Astrophysics Data System (ADS)

    Chen, Shu-Xin; Sun, Wei-Min; Yan, Qi

    2018-06-01

    The application of data mining in astronomical surveys, such as the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) survey, provides an effective approach to automatically analyze a large amount of complex survey data. Unsupervised clustering could help astronomers find the associations and outliers in a big data set. In this paper, we employ the k-means method to perform clustering for the line index of LAMOST spectra with the powerful software AstroStat. Implementing the line index approach for analyzing astronomical spectra is an effective way to extract spectral features for low resolution spectra, which can represent the main spectral characteristics of stars. A total of 144 340 line indices for A type stars is analyzed through calculating their intra and inter distances between pairs of stars. For intra distance, we use the definition of Mahalanobis distance to explore the degree of clustering for each class, while for outlier detection, we define a local outlier factor for each spectrum. AstroStat furnishes a set of visualization tools for illustrating the analysis results. Checking the spectra detected as outliers, we find that most of them are problematic data and only a few correspond to rare astronomical objects. We show two examples of these outliers, a spectrum with abnormal continuum and a spectrum with emission lines. Our work demonstrates that line index clustering is a good method for examining data quality and identifying rare objects.
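
    The Mahalanobis distance used here for intra-class distances can be sketched as follows. This is a two-feature illustration only, assuming each spectrum is reduced to a pair of line indices; the function names are hypothetical and AstroStat's own implementation is not shown in the abstract.

```python
import math

def mean_and_cov_2d(points):
    """Return the mean and 2x2 sample covariance of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance sqrt((x - m)^T C^-1 (x - m)) for 2-D data."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)
```

    With an identity covariance the result reduces to the ordinary Euclidean distance; a larger variance along one axis shrinks distances measured along that axis.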

  16. Phosphotyrosine-based-phosphoproteomics scaled-down to biopsy level for analysis of individual tumor biology and treatment selection.

    PubMed

    Labots, Mariette; van der Mijn, Johannes C; Beekhof, Robin; Piersma, Sander R; de Goeij-de Haas, Richard R; Pham, Thang V; Knol, Jaco C; Dekker, Henk; van Grieken, Nicole C T; Verheul, Henk M W; Jiménez, Connie R

    2017-06-06

    Mass spectrometry-based phosphoproteomics of cancer cell and tissue lysates provides insight in aberrantly activated signaling pathways and potential drug targets. For improved understanding of individual patient's tumor biology and to allow selection of tyrosine kinase inhibitors in individual patients, phosphoproteomics of small clinical samples should be feasible and reproducible. We aimed to scale down a pTyr-phosphopeptide enrichment protocol to biopsy-level protein input and assess reproducibility and applicability to tumor needle biopsies. To this end, phosphopeptide immunoprecipitation using anti-phosphotyrosine beads was performed using 10, 5 and 1mg protein input from lysates of colorectal cancer (CRC) cell line HCT116. Multiple needle biopsies from 7 human CRC resection specimens were analyzed at the 1mg-level. The total number of phosphopeptides captured and detected by LC-MS/MS ranged from 681 at 10mg input to 471 at 1mg HCT116 protein. ID-reproducibility ranged from 60.5% at 10mg to 43.9% at 1mg. Per 1mg-level biopsy sample, >200 phosphopeptides were identified with 57% ID-reproducibility between paired tumor biopsies. Unsupervised analysis clustered biopsies from individual patients together and revealed known and potential therapeutic targets. This study demonstrates the feasibility of label-free pTyr-phosphoproteomics at the tumor biopsy level based on reproducible analyses using 1mg of protein input. The considerable number of identified phosphopeptides at this level is attributed to an effective down-scaled immuno-affinity protocol as well as to the application of ID propagation in the data processing and analysis steps. Unsupervised cluster analysis reveals patient-specific profiles. Together, these findings pave the way for clinical trials in which pTyr-phosphoproteomics will be performed on pre- and on-treatment biopsies. 
Such studies will improve our understanding of individual tumor biology and may enable future pTyr-phosphoproteomics-based personalized medicine. Copyright © 2017. Published by Elsevier B.V.

  17. Unsupervised Gaussian Mixture-Model With Expectation Maximization for Detecting Glaucomatous Progression in Standard Automated Perimetry Visual Fields.

    PubMed

    Yousefi, Siamak; Balasubramanian, Madhusudhanan; Goldbaum, Michael H; Medeiros, Felipe A; Zangwill, Linda M; Weinreb, Robert N; Liebmann, Jeffrey M; Girkin, Christopher A; Bowd, Christopher

    2016-05-01

    To validate Gaussian mixture-model with expectation maximization (GEM) and variational Bayesian independent component analysis mixture-models (VIM) for detecting glaucomatous progression along visual field (VF) defect patterns (GEM-progression of patterns (POP) and VIM-POP). To compare GEM-POP and VIM-POP with other methods. GEM and VIM models separated cross-sectional abnormal VFs from 859 eyes and normal VFs from 1117 eyes into abnormal and normal clusters. Clusters were decomposed into independent axes. The confidence limit (CL) of stability was established for each axis with a set of 84 stable eyes. Sensitivity for detecting progression was assessed in a sample of 83 eyes with known progressive glaucomatous optic neuropathy (PGON). Eyes were classified as progressed if any defect pattern progressed beyond the CL of stability. Performance of GEM-POP and VIM-POP was compared to point-wise linear regression (PLR), permutation analysis of PLR (PoPLR), and linear regression (LR) of mean deviation (MD), and visual field index (VFI). Sensitivity and specificity for detecting glaucomatous VFs were 89.9% and 93.8%, respectively, for GEM and 93.0% and 97.0%, respectively, for VIM. Receiver operating characteristic (ROC) curve areas for classifying progressed eyes were 0.82 for VIM-POP, 0.86 for GEM-POP, 0.81 for PoPLR, 0.69 for LR of MD, and 0.76 for LR of VFI. GEM-POP was significantly more sensitive to PGON than PoPLR and linear regression of MD and VFI in our sample, while providing localized progression information. Detection of glaucomatous progression can be improved by assessing longitudinal changes in localized patterns of glaucomatous defect identified by unsupervised machine learning.
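
    As a rough illustration of a Gaussian mixture model fitted by expectation maximization, the sketch below runs EM for two components on one-dimensional data. It is a deliberately simplified stand-in, not the GEM model of the study, which operates on visual field data; the function names, the min/max initialization, and the variance floor are assumptions.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM (illustrative sketch)."""
    # Crude initialization (an assumption): means at the data extremes,
    # both variances set to the overall variance, equal mixing weights.
    mu = [min(data), max(data)]
    mean_all = sum(data) / len(data)
    var0 = sum((x - mean_all) ** 2 for x in data) / len(data) or 1.0
    var = [var0, var0]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [weight[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an effectively empty component unchanged
            weight[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk,
                1e-6,  # variance floor keeps the density finite
            )
    return weight, mu, var
```

    Each point's final responsibilities give a soft cluster assignment; taking the larger one yields a hard split into two clusters.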

  18. Sequence-structure relationship study in all-α transmembrane proteins using an unsupervised learning approach.

    PubMed

    Esque, Jérémy; Urbain, Aurélie; Etchebest, Catherine; de Brevern, Alexandre G

    2015-11-01

    Transmembrane proteins (TMPs) are major drug targets, but the knowledge of their precise topology structure remains highly limited compared with globular proteins. In spite of the difficulties in obtaining their structures, an important effort has been made these last years to increase their number from an experimental and computational point of view. In view of this emerging challenge, the development of computational methods to extract knowledge from these data is crucial for the better understanding of their functions and in improving the quality of structural models. Here, we revisit an efficient unsupervised learning procedure, called Hybrid Protein Model (HPM), which is applied to the analysis of transmembrane proteins belonging to the all-α structural class. HPM method is an original classification procedure that efficiently combines sequence and structure learning. The procedure was initially applied to the analysis of globular proteins. In the present case, HPM classifies a set of overlapping protein fragments, extracted from a non-redundant databank of TMP 3D structure. After fine-tuning of the learning parameters, the optimal classification results in 65 clusters. They represent at best similar relationships between sequence and local structure properties of TMPs. Interestingly, HPM distinguishes among the resulting clusters two helical regions with distinct hydrophobic patterns. This underlines the complexity of the topology of these proteins. The HPM classification enlightens unusual relationship between amino acids in TMP fragments, which can be useful to elaborate new amino acids substitution matrices. Finally, two challenging applications are described: the first one aims at annotating protein functions (channel or not), the second one intends to assess the quality of the structures (X-ray or models) via a new scoring function deduced from the HPM classification.

  19. Cluster Analysis Identifies Distinct Pathogenetic Patterns in C3 Glomerulopathies/Immune Complex-Mediated Membranoproliferative GN.

    PubMed

    Iatropoulos, Paraskevas; Daina, Erica; Curreri, Manuela; Piras, Rossella; Valoti, Elisabetta; Mele, Caterina; Bresin, Elena; Gamba, Sara; Alberti, Marta; Breno, Matteo; Perna, Annalisa; Bettoni, Serena; Sabadini, Ettore; Murer, Luisa; Vivarelli, Marina; Noris, Marina; Remuzzi, Giuseppe

    2018-01-01

    Membranoproliferative GN (MPGN) was recently reclassified as alternative pathway complement-mediated C3 glomerulopathy (C3G) and immune complex-mediated membranoproliferative GN (IC-MPGN). However, genetic and acquired alternative pathway abnormalities are also observed in IC-MPGN. Here, we explored the presence of distinct disease entities characterized by specific pathophysiologic mechanisms. We performed unsupervised hierarchical clustering, a data-driven statistical approach, on histologic, genetic, and clinical data and data regarding serum/plasma complement parameters from 173 patients with C3G/IC-MPGN. This approach divided patients into four clusters, indicating the existence of four different pathogenetic patterns. Specifically, this analysis separated patients with fluid-phase complement activation (clusters 1-3) who had low serum C3 levels and a high prevalence of genetic and acquired alternative pathway abnormalities from patients with solid-phase complement activation (cluster 4) who had normal or mildly altered serum C3, late disease onset, and poor renal survival. In patients with fluid-phase complement activation, those in clusters 1 and 2 had massive activation of the alternative pathway, including activation of the terminal pathway, and the highest prevalence of subendothelial deposits, but those in cluster 2 had additional activation of the classic pathway and the highest prevalence of nephrotic syndrome at disease onset. Patients in cluster 3 had prevalent activation of C3 convertase and highly electron-dense intramembranous deposits. In addition, we provide a simple algorithm to assign patients with C3G/IC-MPGN to specific clusters. These distinct clusters may facilitate clarification of disease etiology, improve risk assessment for ESRD, and pave the way for personalized treatment. Copyright © 2018 by the American Society of Nephrology.

  20. Rapid detection of Listeria monocytogenes in milk using confocal micro-Raman spectroscopy and chemometric analysis.

    PubMed

    Wang, Junping; Xie, Xinfang; Feng, Jinsong; Chen, Jessica C; Du, Xin-jun; Luo, Jiangzhao; Lu, Xiaonan; Wang, Shuo

    2015-07-02

    Listeria monocytogenes is a facultatively anaerobic, Gram-positive, rod-shaped foodborne bacterium causing invasive infection, listeriosis, in susceptible populations. Rapid and high-throughput detection of this pathogen in dairy products is critical as milk and other dairy products have been implicated as food vehicles in several outbreaks. Here we evaluated confocal micro-Raman spectroscopy (785 nm laser) coupled with chemometric analysis to distinguish six closely related Listeria species, including L. monocytogenes, in both liquid media and milk. Raman spectra of different Listeria species and other bacteria (i.e., Staphylococcus aureus, Salmonella enterica and Escherichia coli) were collected to create two independent databases for detection in media and milk, respectively. Unsupervised chemometric models including principal component analysis and hierarchical cluster analysis were applied to differentiate L. monocytogenes from Listeria and other bacteria. To further evaluate the performance and reliability of unsupervised chemometric analyses, supervised chemometrics were performed, including two discriminant analyses (DA) and soft independent modeling of class analogies (SIMCA). By analyzing Raman spectra via two DA-based chemometric models, average identification accuracies of 97.78% and 98.33% for L. monocytogenes in media, and 95.28% and 96.11% in milk were obtained, respectively. SIMCA analysis also resulted in satisfactory average classification accuracies (over 93% in both media and milk). This Raman spectroscopic-based detection of L. monocytogenes in media and milk can be finished within a few hours and requires no extensive sample preparation. Copyright © 2015 Elsevier B.V. All rights reserved.

  1. Defining functioning levels in patients with schizophrenia: A combination of a novel clustering method and brain SPECT analysis.

    PubMed

    Catherine, Faget-Agius; Aurélie, Vincenti; Eric, Guedj; Pierre, Michel; Raphaëlle, Richieri; Marine, Alessandrini; Pascal, Auquier; Christophe, Lançon; Laurent, Boyer

    2017-12-30

    This study aims to define functioning levels of patients with schizophrenia by using a method of interpretable clustering based on a specific functioning scale, the Functional Remission Of General Schizophrenia (FROGS) scale, and to test their validity regarding clinical and neuroimaging characterization. In this observational study, patients with schizophrenia have been classified using a hierarchical top-down method called clustering using unsupervised binary trees (CUBT). Socio-demographic, clinical, and neuroimaging SPECT perfusion data were compared between the different clusters to ensure their clinical relevance. A total of 242 patients were analyzed. A four-group functioning level structure has been identified: 54 are classified as "minimal", 81 as "low", 64 as "moderate", and 43 as "high". The clustering shows satisfactory statistical properties, including reproducibility and discriminancy. The 4 clusters consistently differentiate patients. "High" functioning level patients reported significantly the lowest scores on the PANSS and the CDSS, and the highest scores on the GAF, the MARS and S-QoL 18. Functioning levels were significantly associated with cerebral perfusion of two relevant areas: the left inferior parietal cortex and the anterior cingulate. Our study provides relevant functioning levels in schizophrenia, and may enhance the use of functioning scale. Copyright © 2017 Elsevier B.V. All rights reserved.

  2. Adaptive fuzzy leader clustering of complex data sets in pattern recognition

    NASA Technical Reports Server (NTRS)

    Newton, Scott C.; Pemmaraju, Surya; Mitra, Sunanda

    1992-01-01

    A modular, unsupervised neural network architecture for clustering and classification of complex data sets is presented. The adaptive fuzzy leader clustering (AFLC) architecture is a hybrid neural-fuzzy system that learns on-line in a stable and efficient manner. The initial classification is performed in two stages: a simple competitive stage and a distance metric comparison stage. The cluster prototypes are then incrementally updated by relocating the centroid positions from fuzzy C-means system equations for the centroids and the membership values. The AFLC algorithm is applied to the Anderson Iris data and laser-luminescent fingerprint image data. It is concluded that the AFLC algorithm successfully classifies features extracted from real data, discrete or continuous.
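
    The fuzzy C-means equations mentioned for the centroid and membership updates can be sketched in their standard textbook form. This is not the AFLC architecture itself (the competitive stage and the distance metric comparison stage are omitted); the function names and the default fuzzifier m = 2 are assumptions.

```python
import math

def fcm_memberships(points, centroids, m=2.0):
    """Membership of each point in each cluster (standard fuzzy C-means rule)."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [math.dist(x, c) for c in centroids]
        if any(d == 0.0 for d in dists):
            # A point lying exactly on one or more centroids belongs fully to them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
            continue
        # u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1))
        memberships.append(
            [1.0 / sum((di / dj) ** exponent for dj in dists) for di in dists]
        )
    return memberships

def fcm_centroids(points, memberships, m=2.0):
    """Centroids as means of the points weighted by membership^m."""
    dim = len(points[0])
    centroids = []
    for i in range(len(memberships[0])):
        w = [u[i] ** m for u in memberships]
        total = sum(w)
        centroids.append(
            tuple(sum(wk * x[d] for wk, x in zip(w, points)) / total
                  for d in range(dim))
        )
    return centroids
```

    Alternating the two functions until the centroids stop moving gives the usual batch fuzzy C-means iteration; an on-line scheme such as the one described would apply the updates incrementally instead.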

  3. Change detection and classification in brain MR images using change vector analysis.

    PubMed

    Simões, Rita; Slump, Cornelis

    2011-01-01

    The automatic detection of longitudinal changes in brain images is valuable in the assessment of disease evolution and treatment efficacy. Most existing change detection methods that are currently used in clinical research to monitor patients suffering from neurodegenerative diseases--such as Alzheimer's--focus on large-scale brain deformations. However, such patients often have other brain impairments, such as infarcts, white matter lesions and hemorrhages, which are typically overlooked by the deformation-based methods. Other unsupervised change detection algorithms have been proposed to detect tissue intensity changes. The outcome of these methods is typically a binary change map, which identifies changed brain regions. However, understanding what types of changes these regions underwent is likely to provide equally important information about lesion evolution. In this paper, we present an unsupervised 3D change detection method based on Change Vector Analysis. We compute and automatically threshold the Generalized Likelihood Ratio map to obtain a binary change map. Subsequently, we perform histogram-based clustering to classify the change vectors. We obtain a Kappa Index of 0.82 using various types of simulated lesions. The classification error is 2%. Finally, we are able to detect and discriminate both small changes and ventricle expansions in datasets from Mild Cognitive Impairment patients.
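
    The idea behind Change Vector Analysis can be sketched as below: each voxel's change between two scans is a vector whose magnitude indicates whether a change occurred and whose direction indicates what kind. The fixed `threshold` and the quadrant-based labelling are hypothetical simplifications; the paper instead thresholds a Generalized Likelihood Ratio map automatically and clusters the change vectors by histogram.

```python
import math

def change_vector(before, after):
    """Per-channel intensity difference between two time points."""
    return tuple(a - b for b, a in zip(before, after))

def magnitude(vec):
    """Euclidean length of a change vector."""
    return math.sqrt(sum(c * c for c in vec))

def direction_deg(vec):
    """Angle of a two-channel change vector, in degrees within [0, 360)."""
    return math.degrees(math.atan2(vec[1], vec[0])) % 360.0

def classify_changes(before_img, after_img, threshold):
    """Label each voxel None (unchanged) or with a direction quadrant 0-3.

    `threshold` is a hypothetical fixed magnitude cut-off, used here only
    to keep the example short.
    """
    labels = []
    for b, a in zip(before_img, after_img):
        v = change_vector(b, a)
        if magnitude(v) <= threshold:
            labels.append(None)
        else:
            labels.append(int(direction_deg(v) // 90) % 4)
    return labels
```

    For two-channel data, quadrant 0 corresponds to an intensity increase in both channels and quadrant 2 to a decrease in both.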

  4. Vegetation spatial variability and its effect on vegetation indices

    NASA Technical Reports Server (NTRS)

    Ormsby, J. P.; Choudhury, B. J.; Owe, M.

    1987-01-01

    Landsat MSS data were used to simulate low resolution satellite data, such as NOAA AVHRR, to quantify the fractional vegetation cover within a pixel and relate the fractional cover to the normalized difference vegetation index (NDVI) and the simple ratio (SR). The MSS data were converted to radiances from which the NDVI and SR values for the simulated pixels were determined. Each simulated pixel was divided into clusters using an unsupervised classification program. Spatial and spectral analysis provided a means of combining clusters representing similar surface characteristics into vegetated and non-vegetated areas. Analysis showed an average error of 12.7 per cent in determining these areas. NDVI values less than 0.3 represented fractional vegetated areas of 5 per cent or less, while a value of 0.7 or higher represented fractional vegetated areas greater than 80 per cent. Regression analysis showed a strong linear relation between fractional vegetation area and the NDVI and SR values; correlation values were 0.89 and 0.95 respectively. The range of NDVI values calculated from the MSS data agrees well with field studies.
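
    The two vegetation indices in this abstract have standard definitions, NDVI = (NIR - Red) / (NIR + Red) and SR = NIR / Red, shown in the short sketch below. The function names and the use of generic near-infrared and red radiance values are assumptions; the abstract does not state which MSS bands were used.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red radiances."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + Red must be non-zero")
    return (nir - red) / total

def simple_ratio(nir, red):
    """Simple ratio vegetation index, NIR / Red."""
    if red == 0:
        raise ValueError("Red radiance must be non-zero")
    return nir / red
```

    The two indices carry the same information, since SR = (1 + NDVI) / (1 - NDVI) whenever NDVI is below 1.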

  5. Gene expression analysis of hypersensitivity to mosquito bite, chronic active EBV infection and NK/T-lymphoma/leukemia.

    PubMed

    Washio, Kana; Oka, Takashi; Abdalkader, Lamia; Muraoka, Michiko; Shimada, Akira; Oda, Megumi; Sato, Hiaki; Takata, Katsuyoshi; Kagami, Yoshitoyo; Shimizu, Norio; Kato, Seiichi; Kimura, Hiroshi; Nishizaki, Kazunori; Yoshino, Tadashi; Tsukahara, Hirokazu

    2017-11-01

    The human herpes virus, Epstein-Barr virus (EBV), is a known oncogenic virus and plays important roles in life-threatening T/NK-cell lymphoproliferative disorders (T/NK-cell LPD) such as hypersensitivity to mosquito bite (HMB), chronic active EBV infection (CAEBV), and NK/T-cell lymphoma/leukemia. During the clinical courses of HMB and CAEBV, patients frequently develop malignant lymphomas and the diseases passively progress sequentially. In the present study, gene expression of CD16(-)CD56(+)-, EBV(+) HMB, CAEBV, NK-lymphoma, and NK-leukemia cell lines, which were established from patients, was analyzed using oligonucleotide microarrays and compared to that of CD56 bright CD16 dim/- NK cells from healthy donors. Principal components analysis showed that CAEBV and NK-lymphoma cells were relatively closely located, indicating that they had similar expression profiles. Unsupervised hierarchical clustering analyses of microarray data and gene ontology analysis revealed specific gene clusters and identified several candidate genes responsible for disease that can be used to discriminate each category of NK-LPD and NK-cell lymphoma/leukemia.

  6. Rapid analysis of microbial systems using vibrational spectroscopy and supervised learning methods: application to the discrimination between methicillin-resistant and methicillin-susceptible Staphy

    NASA Astrophysics Data System (ADS)

    Goodacre, Royston; Rooney, Paul J.; Kell, Douglas B.

    1998-04-01

    FTIR spectra were obtained from 15 methicillin-resistant and 22 methicillin-susceptible Staphylococcus aureus strains using our DRASTIC approach. Cluster analysis showed that the major source of variation between the IR spectra was not due to their resistance or susceptibility to methicillin; indeed, early studies using pyrolysis mass spectrometry had shown that this unsupervised analysis gave information on the phage group of the bacteria. By contrast, artificial neural networks, based on supervised learning, could be trained to recognize those aspects of the IR spectra which differentiated methicillin-resistant from methicillin-susceptible strains. These results give the first demonstration that the combination of FTIR with neural networks can provide a very rapid and accurate antibiotic susceptibility testing technique.

  7. Maximum Margin Clustering of Hyperspectral Data

    NASA Astrophysics Data System (ADS)

    Niazmardi, S.; Safari, A.; Homayouni, S.

    2013-09-01

    In recent decades, large margin methods such as Support Vector Machines (SVMs) have been regarded as the state of the art among supervised learning methods for classification of hyperspectral data. However, the results of these algorithms depend mainly on the quality and quantity of available training data. To tackle the problems associated with training data, researchers have put effort into extending the capability of large margin algorithms to unsupervised learning. One of the recently proposed algorithms is Maximum Margin Clustering (MMC). MMC is an unsupervised SVM algorithm that simultaneously estimates both the labels and the hyperplane parameters. Nevertheless, the optimization of the MMC algorithm is a non-convex problem. Most existing MMC methods rely on reformulating and relaxing the non-convex optimization problem as semi-definite programs (SDP), which are computationally very expensive and can handle only small data sets. Moreover, most of these algorithms perform two-class classification, which cannot be used for classification of remotely sensed data. In this paper, a new MMC algorithm is used that solves the original non-convex problem using the Alternative Optimization method. This algorithm is also extended to multi-class classification and its performance is evaluated. The results show that the proposed algorithm performs acceptably for hyperspectral data clustering.

  8. Phenotype in combination with genotype improves outcome prediction in acute myeloid leukemia: a report from Children’s Oncology Group protocol AAML0531

    PubMed Central

    Voigt, Andrew P.; Brodersen, Lisa Eidenschink; Alonzo, Todd A.; Gerbing, Robert B.; Menssen, Andrew J.; Wilson, Elisabeth R.; Kahwash, Samir; Raimondi, Susana C.; Hirsch, Betsy A.; Gamis, Alan S.; Meshinchi, Soheil; Wells, Denise A.; Loken, Michael R.

    2017-01-01

    Diagnostic biomarkers can be used to determine relapse risk in acute myeloid leukemia, and certain genetic aberrancies have prognostic relevance. A diagnostic immunophenotypic expression profile, which quantifies the amounts of distinct gene products, not just their presence or absence, was established in order to improve outcome prediction for patients with acute myeloid leukemia. The immunophenotypic expression profile, which defines each patient’s leukemia as a location in 15-dimensional space, was generated for 769 patients enrolled in the Children’s Oncology Group AAML0531 protocol. Unsupervised hierarchical clustering grouped patients with similar immunophenotypic expression profiles into eleven patient cohorts, demonstrating high associations among phenotype, genotype, morphology, and outcome. Of 95 patients with inv(16), 79% segregated in Cluster A. Of 109 patients with t(8;21), 92% segregated in Clusters A and B. Of 152 patients with 11q23 alterations, 78% segregated in Clusters D, E, F, G, or H. For both inv(16) and 11q23 abnormalities, differential phenotypic expression identified patient groups with different survival characteristics (P<0.05). Clinical outcome analysis revealed that Cluster B (predominantly t(8;21)) was associated with favorable outcome (P<0.001) and Clusters E, G, H, and K were associated with adverse outcomes (P<0.05). Multivariable regression analysis revealed that Clusters E, G, H, and K were independently associated with worse survival (P range <0.001 to 0.008). The Children’s Oncology Group AAML0531 trial: clinicaltrials.gov Identifier: 00372593. PMID:28883080

  9. Spike sorting using locality preserving projection with gap statistics and landmark-based spectral clustering.

    PubMed

    Nguyen, Thanh; Khosravi, Abbas; Creighton, Douglas; Nahavandi, Saeid

    2014-12-30

    Understanding neural functions requires knowledge from analysing electrophysiological data. The process of assigning spikes of a multichannel signal into clusters, called spike sorting, is one of the important problems in such analysis. There have been various automated spike sorting techniques with both advantages and disadvantages regarding accuracy and computational costs. Therefore, developing spike sorting methods that are highly accurate and computationally inexpensive is always a challenge in biomedical engineering practice. An automatic unsupervised spike sorting method is proposed in this paper. The method uses features extracted by the locality preserving projection (LPP) algorithm. These features then serve as inputs for the landmark-based spectral clustering (LSC) method. Gap statistics (GS) is employed to evaluate the number of clusters before the LSC can be performed. The proposed LPP-LSC is a highly accurate and computationally inexpensive spike sorting approach. LPP spike features are very discriminative and thereby boost the performance of clustering methods. Furthermore, the LSC method exhibits its efficiency when integrated with the cluster evaluator GS. The proposed method's accuracy is approximately 13% superior to that of the benchmark combination of wavelet transformation and superparamagnetic clustering (WT-SPC). Additionally, LPP-LSC computing time is six times less than that of the WT-SPC. LPP-LSC thus demonstrates a win-win spike sorting solution meeting both accuracy and computational cost criteria. LPP and LSC are linear algorithms that help reduce computational burden, and thus their combination can be applied to real-time spike analysis. Copyright © 2014 Elsevier B.V. All rights reserved.

  10. On application of image analysis and natural language processing for music search

    NASA Astrophysics Data System (ADS)

    Gwardys, Grzegorz

    2013-10-01

    In this paper, I investigate the problem of finding the most similar music tracks using techniques popular in natural language processing, such as TF-IDF and LDA. I defined a document as a music track. Each music track is transformed into a spectrogram; thanks to that, I can use well-known techniques to get words from images. I used the SURF operation to detect characteristic points and a novel approach for their description. Standard k-means was used for clusterization. Clusterization here is identical to dictionary making, so after that I can transform spectrograms into text documents and perform TF-IDF and LDA. Finally, I can make a query in the obtained vector space. The research was done on 16 music tracks for training and 336 for testing, which are split into four categories: Hiphop, Jazz, Metal and Pop. Although the technique used is completely unsupervised, the results are satisfactory and encourage further research.
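
    The last stage of the pipeline in this record (cluster ids treated as words, TF-IDF weighting, then a similarity query) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the author's implementation: each track is assumed to be already reduced to a list of integer visual-word ids (the cluster index of each descriptor), the function names are hypothetical, the IDF is the plain log(N / df) variant, and cosine similarity is an assumed choice of query metric. The SURF and LDA steps are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocab_size):
    """Turn documents (lists of visual-word ids) into TF-IDF vectors."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: in how many documents each word occurs.
    df = [sum(1 for c in counts if w in c) for w in range(vocab_size)]
    idf = [math.log(n_docs / f) if f else 0.0 for f in df]
    vectors = []
    for doc, c in zip(docs, counts):
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([c[w] / length * idf[w] for w in range(vocab_size)])
    return vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Index of the document closest to the query document."""
    scores = [(cosine(vectors[query_index], v), i)
              for i, v in enumerate(vectors) if i != query_index]
    return max(scores)[1]
```

    A word that occurs in every document gets zero weight under this IDF variant, which is why very common visual words do not influence the query.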

  11. Identification of sea ice types in spaceborne synthetic aperture radar data

    NASA Technical Reports Server (NTRS)

    Kwok, Ronald; Rignot, Eric; Holt, Benjamin; Onstott, R.

    1992-01-01

    This study presents an approach for identification of sea ice types in spaceborne SAR image data. The unsupervised classification approach involves cluster analysis for segmentation of the image data followed by cluster labeling based on previously defined look-up tables containing the expected backscatter signatures of different ice types measured by a land-based scatterometer. Extensive scatterometer observations and experience accumulated in field campaigns during the last 10 yr were used to construct these look-up tables. The classification approach, its expected performance, the dependence of this performance on radar system performance, and expected ice scattering characteristics are discussed. Results using both aircraft and simulated ERS-1 SAR data are presented and compared to limited field ice property measurements and coincident passive microwave imagery. The importance of an integrated postlaunch program for the validation and improvement of this approach is discussed.

  12. The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments

    PubMed Central

    2009-01-01

    Background: The characterisation, or binning, of metagenome fragments is an important first step to further downstream analysis of microbial consortia. Here, we propose a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and show that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences. The one-dimensional signal is essentially a compact representation of higher dimensional feature spaces of greater complexity and is intended to improve on the tetranucleotide frequency feature space preferred by current compositional binning methods. Results: We compare the fidelity of OFDEG against tetranucleotide frequency in both an unsupervised and semi-supervised setting on simulated metagenome benchmark data. Four tests were conducted using assembler output of Arachne and phrap, and for each, performance was evaluated on contigs which are greater than or equal to 8 kbp in length and contigs which are composed of at least 10 reads. Using G-C content in conjunction with OFDEG gave an average accuracy of 96.75% (semi-supervised) and 95.19% (unsupervised), versus 94.25% (semi-supervised) and 82.35% (unsupervised) for tetranucleotide frequency. Conclusion: We have presented an observation of an alternative characteristic of DNA sequences. The proposed feature representation has proven to be more beneficial than the existing tetranucleotide frequency space to the metagenome binning problem. We do note, however, that our observation of OFDEG deserves further analysis and investigation. Unsupervised clustering revealed that OFDEG-related features performed better than standard tetranucleotide frequency in representing a relevant organism-specific signal. Further improvement in binning accuracy is given by semi-supervised classification using OFDEG. 
The emphasis on a feature-driven, bottom-up approach to the problem of binning reveals promising avenues for future development of techniques to characterise short environmental sequences without bias toward cultivable organisms. PMID:19958473

  13. Evaluation of solar angle variation over digital processing of LANDSAT imagery. [Brazil

    NASA Technical Reports Server (NTRS)

    Parada, N. D. J. (Principal Investigator); Novo, E. M. L. M.

    1984-01-01

    The effects of the seasonal variation of illumination over digital processing of LANDSAT images are evaluated. Original images are transformed by means of digital filtering to enhance their spatial features. The resulting images are used to obtain an unsupervised classification of relief units. After defining relief classes, which are supposed to be spectrally different, topographic variables (declivity, altitude, relief range and slope length) are used to identify the true relief units existing on the ground. The samples are also clustered by means of an unsupervised classification option. The results obtained for each LANDSAT overpass are compared. Digital processing is highly affected by illumination geometry. There is no correspondence between relief units as defined by spectral features and those resulting from topographic features.

  14. Parallel Multivariate Spatio-Temporal Clustering of Large Ecological Datasets on Hybrid Supercomputers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sreepathi, Sarat; Kumar, Jitendra; Mills, Richard T.

    A proliferation of data from vast networks of remote sensing platforms (satellites, unmanned aircraft systems (UAS), airborne etc.), observational facilities (meteorological, eddy covariance etc.), state-of-the-art sensors, and simulation models offers unprecedented opportunities for scientific discovery. Unsupervised classification is a widely applied data mining approach to derive insights from such data. However, classification of very large data sets is a complex computational problem that requires efficient numerical algorithms and implementations on high performance computing (HPC) platforms. Additionally, increasing power, space, cooling and efficiency requirements have led to the deployment of hybrid supercomputing platforms with complex architectures and memory hierarchies like the Titan system at Oak Ridge National Laboratory. The advent of such accelerated computing architectures offers new challenges and opportunities for big data analytics in general and specifically, large scale cluster analysis in our case. Although there is an existing body of work on parallel cluster analysis, those approaches do not fully meet the needs imposed by the nature and size of our large data sets. Moreover, they had scaling limitations and were mostly limited to traditional distributed memory computing platforms. We present a parallel Multivariate Spatio-Temporal Clustering (MSTC) technique based on k-means cluster analysis that can target hybrid supercomputers like Titan. We developed a hybrid MPI, CUDA and OpenACC implementation that can utilize both CPU and GPU resources on computational nodes. We describe performance results on Titan that demonstrate the scalability and efficacy of our approach in processing large ecological data sets.
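
    For orientation, the k-means core that this record builds on can be sketched serially as below. This is only an illustrative single-process sketch of Lloyd's iteration, not the MSTC code: the hybrid MPI, CUDA and OpenACC parallelization described in the abstract is not shown, the function names are hypothetical, and initializing the centroids from the first k points is an assumption made to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's k-means; returns (labels, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumed initialization: the first k points serve as starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids
```

    The assignment step is independent per point, which is the part that data-parallel implementations typically distribute across processes or GPU threads.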

  15. Quantitative radiomic profiling of glioblastoma represents transcriptomic expression.

    PubMed

    Kong, Doo-Sik; Kim, Junhyung; Ryu, Gyuha; You, Hye-Jin; Sung, Joon Kyung; Han, Yong Hee; Shin, Hye-Mi; Lee, In-Hee; Kim, Sung-Tae; Park, Chul-Kee; Choi, Seung Hong; Choi, Jeong Won; Seol, Ho Jun; Lee, Jung-Il; Nam, Do-Hyun

    2018-01-19

    Quantitative imaging biomarkers have increasingly emerged in the field of research utilizing available imaging modalities. We aimed to identify good surrogate radiomic features that can represent genetic changes of tumors, thereby establishing noninvasive means for predicting treatment outcome. From May 2012 to June 2014, we retrospectively identified 65 patients with treatment-naïve glioblastoma with available clinical information from the Samsung Medical Center data registry. Preoperative MR imaging data were obtained for all 65 patients with primary glioblastoma. A total of 82 imaging features, including first-order statistics, volume, and size features, were semi-automatically extracted from structural and physiologic images such as apparent diffusion coefficient and perfusion images. Using commercially available software, NordicICE, we performed quantitative imaging analysis and collected the dataset composed of radiophenotypic parameters. Unsupervised clustering methods revealed that the radiophenotypic dataset was composed of three clusters. Each cluster represented a distinct molecular classification of glioblastoma: classical type, proneural and neural types, and mesenchymal type. These clusters also reflected differential clinical outcomes. We found that extracted imaging signatures do not represent copy number variation and somatic mutation. Quantitative radiomic features provide potential evidence to predict molecular phenotype and treatment outcome. Radiomic profiles better represent transcriptomic phenotypes.

  16. Widespread Micropollutant Monitoring in the Hudson River Estuary Reveals Spatiotemporal Micropollutant Clusters and Their Sources.

    PubMed

    Carpenter, Corey M G; Helbling, Damian E

    2018-06-05

    The objective of this study was to identify sources of micropollutants in the Hudson River Estuary (HRE). We collected 127 grab samples at 17 sites along the HRE over 2 years and screened for up to 200 micropollutants. We quantified 168 of the micropollutants in at least one of the samples. Atrazine, gabapentin, metolachlor, and sucralose were measured in every sample. We used data-driven unsupervised methods to cluster the micropollutants on the basis of their spatiotemporal occurrence and normalized-concentration patterns. Three major clusters of micropollutants were identified: ubiquitous and mixed-use (core micropollutants), sourced from sewage treatment plant outfalls (STP micropollutants), and derived from diffuse upstream sources (diffuse micropollutants). Each of these clusters was further refined into subclusters that were linked to specific sources on the basis of relationships identified through geospatial analysis of watershed features. Evaluation of cumulative loadings of each subcluster revealed that the Mohawk River and Rondout Creek are major contributors of most core micropollutants and STP micropollutants and the upper HRE is a major contributor of diffuse micropollutants. These data provide the first comprehensive evaluation of micropollutants in the HRE and define distinct spatiotemporal micropollutant clusters that are linked to sources and conserved across surface water systems around the world.

  17. Shape component analysis: structure-preserving dimension reduction on biological shape spaces.

    PubMed

    Lee, Hao-Chih; Liao, Tao; Zhang, Yongjie Jessica; Yang, Ge

    2016-03-01

    Quantitative shape analysis is required by a wide range of biological studies across diverse scales, ranging from molecules to cells and organisms. In particular, high-throughput and systems-level studies of biological structures and functions have started to produce large volumes of complex high-dimensional shape data. Analysis and understanding of high-dimensional biological shape data require dimension-reduction techniques. We have developed a technique for non-linear dimension reduction of 2D and 3D biological shape representations on their Riemannian spaces. A key feature of this technique is that it preserves distances between different shapes in an embedded low-dimensional shape space. We demonstrate an application of this technique by combining it with non-linear mean-shift clustering on the Riemannian spaces for unsupervised clustering of shapes of cellular organelles and proteins. Source code and data for reproducing results of this article are freely available at https://github.com/ccdlcmu/shape_component_analysis_Matlab. The implementation was made in MATLAB and supported on MS Windows, Linux and Mac OS. © The Author 2015. Published by Oxford University Press. All rights reserved.

  18. A Study on Regional Frequency Analysis using Artificial Neural Network - the Sumjin River Basin

    NASA Astrophysics Data System (ADS)

    Jeong, C.; Ahn, J.; Ahn, H.; Heo, J. H.

    2017-12-01

    Regional frequency analysis compensates for the main shortcoming of at-site frequency analysis, namely a small sample size, by means of the regional concept. Regional rainfall quantiles depend on the identification of hydrologically homogeneous regions; hence regional classification based on the assumption of hydrological homogeneity is very important. For regional clustering of rainfall, multidimensional variables and factors related to geographical features and meteorological characteristics are considered, such as mean annual precipitation, number of days with precipitation in a year, and average maximum daily precipitation in a month. The Self-Organizing Feature Map (SOM) method, an artificial neural network algorithm among the unsupervised learning techniques, handles N-dimensional and nonlinear problems and presents its results simply, as a data visualization technique. In this study, for the Sumjin river basin in South Korea, cluster analysis was performed with the SOM method using high-dimensional geographical features and meteorological factors as input data. Then, to evaluate the homogeneity of the resulting regions, the L-moment based discordancy and heterogeneity measures were used. Rainfall quantiles were estimated by the index flood method, which is one method of regional rainfall frequency analysis. The clustering obtained with the SOM method and the consequent variation in rainfall quantiles were analyzed. This research was supported by a grant (2017-MPSS31-001) from the Supporting Technology Development Program for Disaster Management funded by the Ministry of Public Safety and Security (MPSS) of the Korean government.

  19. Performance Assessment of Kernel Density Clustering for Gene Expression Profile Data

    PubMed Central

    Zeng, Beiyan; Chen, Yiping P.; Smith, Oscar H.

    2003-01-01

    Kernel density smoothing techniques have been used in classification or supervised learning of gene expression profile (GEP) data, but their applications to clustering or unsupervised learning of those data have not been explored and assessed. Here we report a kernel density clustering method for analysing GEP data and compare its performance with the three most widely-used clustering methods: hierarchical clustering, K-means clustering, and multivariate mixture model-based clustering. Using several methods to measure agreement, between-cluster isolation, and within-cluster coherence, such as the Adjusted Rand Index, the Pseudo F test, the r2 test, and the profile plot, we have assessed the effectiveness of kernel density clustering for recovering clusters, and its robustness against noise on clustering both simulated and real GEP data. Our results show that the kernel density clustering method has excellent performance in recovering clusters from simulated data and in grouping large real expression profile data sets into compact and well-isolated clusters, and that it is the most robust clustering method for analysing noisy expression profile data compared to the other three methods assessed. PMID:18629292
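
    Of the agreement measures listed in this record, the Adjusted Rand Index has a compact closed form based on the contingency table of two labelings, sketched below. This is a minimal sketch of the standard formula, not the authors' code; the function name is hypothetical, and returning 1.0 in the degenerate case where the index is undefined is an assumed convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    # Pairs of items placed together in both clusterings (contingency cells).
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs placed together within each clustering separately (marginals).
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    expected = sum_a * sum_b / total_pairs if total_pairs else 0.0
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # assumed convention for the degenerate (undefined) case
    return (sum_cells - expected) / (max_index - expected)
```

    The index is 1 for identical partitions regardless of how the clusters are named, close to 0 for agreement at chance level, and can be negative when agreement is worse than chance.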

  20. Unsupervised learning of structure in spectroscopic cubes

    NASA Astrophysics Data System (ADS)

    Araya, M.; Mendoza, M.; Solar, M.; Mardones, D.; Bayo, A.

    2018-07-01

    We consider the problem of analyzing the structure of spectroscopic cubes using unsupervised machine learning techniques. We propose representing the target's signal as a homogeneous set of volumes through an iterative algorithm that separates the structured emission from the background while not overestimating the flux. Besides verifying some basic theoretical properties, the algorithm is designed to be tuned by domain experts, because its parameters have meaningful values in the astronomical context. Nevertheless, we propose a heuristic to automatically estimate the signal-to-noise ratio parameter of the algorithm directly from data. The resulting light-weighted set of samples (≤ 1% compared to the original data) offers several advantages. For instance, it is statistically correct and computationally inexpensive to apply well-established techniques of the pattern recognition and machine learning domains, such as clustering and dimensionality reduction algorithms. We use ALMA science verification data to validate our method, and present examples of the operations that can be performed by using the proposed representation. Even though this approach is focused on providing faster and better analysis tools for the end-user astronomer, it also opens the possibility of content-aware data discovery by applying our algorithm to big data.

  1. Copy number alterations in small intestinal neuroendocrine tumors determined by array comparative genomic hybridization.

    PubMed

    Hashemi, Jamileh; Fotouhi, Omid; Sulaiman, Luqman; Kjellman, Magnus; Höög, Anders; Zedenius, Jan; Larsson, Catharina

    2013-10-29

    Small intestinal neuroendocrine tumors (SI-NETs) are typically slow-growing tumors that have metastasized already at the time of diagnosis. The purpose of the present study was to further refine and define regions of recurrent copy number (CN) alterations (CNA) in SI-NETs. Genome-wide CNAs were determined by applying array CGH (a-CGH) on SI-NETs including 18 primary tumors and 12 metastases. Quantitative PCR analysis (qPCR) was used to confirm CNAs detected by a-CGH as well as to detect CNAs in an extended panel of SI-NETs. Unsupervised hierarchical clustering was used to detect tumor groups with similar patterns of chromosomal alterations based on recurrent regions of CN loss or gain. The log rank test was used to calculate overall survival. Mann-Whitney U test or Fisher's exact test was used to evaluate associations between tumor groups and recurrent CNAs or clinical parameters. The most frequent abnormality was loss of chromosome 18, observed in 70% of the cases. CN losses were also frequently found on chromosomes 11 (23%), 16 (20%), and 9 (20%), with regions of recurrent CN loss identified in 11q23.1-qter, 16q12.2-qter, 9pter-p13.2 and 9p13.1-11.2. Gains were most frequently detected in chromosomes 14 (43%), 20 (37%), 4 (27%), and 5 (23%) with recurrent regions of CN gain located to 14q11.2, 14q32.2-32.31, 20pter-p11.21, 20q11.1-11.21, 20q12-qter, 4 and 5. qPCR analysis confirmed most CNAs detected by a-CGH as well as revealed CNAs in an extended panel of SI-NETs. Unsupervised hierarchical clustering of recurrent regions of CNAs revealed two separate tumor groups and 5 chromosomal clusters. Loss of chromosomes 18, 16 and 11 and gain of chromosome 20 were found in both tumor groups. Tumor group II was enriched for alterations in chromosome cluster-d, including gain of chromosomes 4, 5, 7, 14 and gain of 20 in chromosome cluster-b. Gain in 20pter-p11.21 was associated with short survival. 
Statistically significant differences were observed between primary tumors and metastases for loss of 16q and gain of 7. Our results revealed recurrent CNAs in several candidate regions with a potential role in SI-NET development. Distinct genetic alterations and pathways are involved in tumorigenesis of SI-NETs.

  2. Cluster ensemble based on Random Forests for genetic data.

    PubMed

    Alhusain, Luluah; Hafez, Alaaeldin M

    2017-01-01

    Clustering plays a crucial role in several application domains, such as bioinformatics. In bioinformatics, clustering has been extensively used as an approach for detecting interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have facilitated the obtainment of genetic datasets with exceptional sizes. Genetic data usually contain hundreds of thousands of genetic markers genotyped for thousands of individuals, making an efficient means for handling such data desirable. Random Forests (RFs) has emerged as an efficient algorithm capable of handling high-dimensional data. RFs provides a proximity measure that can capture different levels of co-occurring relationships between variables. RFs has been widely considered a supervised learning method, although it can be converted into an unsupervised learning method. Therefore, an RF-derived proximity measure combined with a clustering technique may be well suited for determining the underlying structure of unlabeled data. This paper proposes RFcluE, a cluster ensemble approach for determining the underlying structure of genetic data based on RFs. The approach comprises a cluster ensemble framework to combine multiple runs of RF clustering. Experiments were conducted on a high-dimensional, real genetic dataset to evaluate the proposed approach. The experiments included an examination of the impact of parameter changes, a comparison of RFcluE performance against other clustering methods, and an assessment of the relationship between the diversity and quality of the ensemble and its effect on RFcluE performance. This paper proposes RFcluE, a cluster ensemble approach based on RF clustering, to address the problem of population structure analysis and demonstrates the effectiveness of the approach. 
The paper also illustrates that applying a cluster ensemble approach, combining multiple RF clusterings, produces more robust and higher-quality results as a consequence of feeding the ensemble with diverse views of high-dimensional genetic data obtained through bagging and random subspace, the two key features of the RF algorithm.

  3. Spike sorting based upon machine learning algorithms (SOMA).

    PubMed

    Horton, P M; Nicol, A U; Kendrick, K M; Feng, J F

    2007-02-15

    We have developed a spike sorting method, using a combination of various machine learning algorithms, to analyse electrophysiological data and automatically determine the number of sampled neurons from an individual electrode, and discriminate their activities. We discuss extensions to a standard unsupervised learning algorithm (Kohonen), as using a simple application of this technique would only identify a known number of clusters. Our extra techniques automatically identify the number of clusters within the dataset, and their sizes, thereby reducing the chance of misclassification. We also discuss a new pre-processing technique, which transforms the data into a higher dimensional feature space revealing separable clusters. Using principal component analysis (PCA) alone may not achieve this. Our new approach appends the features acquired using PCA with features describing the geometric shapes that constitute a spike waveform. To validate our new spike sorting approach, we have applied it to multi-electrode array datasets acquired from the rat olfactory bulb and from the sheep infero-temporal cortex, and to simulated data. The SOMA software is available at http://www.sussex.ac.uk/Users/pmh20/spikes.

  4. acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data

    DOE PAGES

    Lux, Markus; Kruger, Jan; Rinke, Christian; ...

    2016-12-20

    A major obstacle in single-cell sequencing is sample contamination with foreign DNA. To guarantee clean genome assemblies and to prevent the introduction of contamination into public databases, considerable quality control efforts are put into post-sequencing analysis. Contamination screening generally relies on reference-based methods such as database alignment or marker gene search, which limits the set of detectable contaminants to organisms with closely related reference species. As genomic coverage in the tree of life is highly fragmented, there is an urgent need for a reference-free methodology for contaminant identification in sequence data. We present acdc, a tool specifically developed to aid the quality control process of genomic sequence data. By combining supervised and unsupervised methods, it reliably detects both known and de novo contaminants. First, 16S rRNA gene prediction and the inclusion of ultrafast exact alignment techniques allow sequence classification using existing knowledge from databases. Second, reference-free inspection is enabled by the use of state-of-the-art machine learning techniques that include fast, non-linear dimensionality reduction of oligonucleotide signatures and subsequent clustering algorithms that automatically estimate the number of clusters. The latter also enables the removal of any contaminant, yielding a clean sample. Furthermore, given the data complexity and the ill-posedness of clustering, acdc employs bootstrapping techniques to provide statistically profound confidence values. Tested on a large number of samples from diverse sequencing projects, our software is able to quickly and accurately identify contamination. Results are displayed in an interactive user interface. Acdc can be run from the web as well as a dedicated command line application, which allows easy integration into large sequencing project analysis workflows. Acdc can reliably detect contamination in single-cell genome data. 
In addition to database-driven detection, it complements existing tools by its unsupervised techniques, which allow for the detection of de novo contaminants. Our contribution has the potential to drastically reduce the amount of resources put into these processes, particularly in the context of limited availability of reference species. As single-cell genome data continues to grow rapidly, acdc adds to the toolkit of crucial quality assurance tools.« less

  5. acdc – Automated Contamination Detection and Confidence estimation for single-cell genome data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lux, Markus; Kruger, Jan; Rinke, Christian

    A major obstacle in single-cell sequencing is sample contamination with foreign DNA. To guarantee clean genome assemblies and to prevent the introduction of contamination into public databases, considerable quality control efforts are put into post-sequencing analysis. Contamination screening generally relies on reference-based methods such as database alignment or marker gene search, which limits the set of detectable contaminants to organisms with closely related reference species. As genomic coverage in the tree of life is highly fragmented, there is an urgent need for a reference-free methodology for contaminant identification in sequence data. We present acdc, a tool specifically developed to aid the quality control process of genomic sequence data. By combining supervised and unsupervised methods, it reliably detects both known and de novo contaminants. First, 16S rRNA gene prediction and the inclusion of ultrafast exact alignment techniques allow sequence classification using existing knowledge from databases. Second, reference-free inspection is enabled by the use of state-of-the-art machine learning techniques that include fast, non-linear dimensionality reduction of oligonucleotide signatures and subsequent clustering algorithms that automatically estimate the number of clusters. The latter also enables the removal of any contaminant, yielding a clean sample. Furthermore, given the data complexity and the ill-posedness of clustering, acdc employs bootstrapping techniques to provide statistically profound confidence values. Tested on a large number of samples from diverse sequencing projects, our software is able to quickly and accurately identify contamination. Results are displayed in an interactive user interface. Acdc can be run from the web as well as a dedicated command line application, which allows easy integration into large sequencing project analysis workflows. Acdc can reliably detect contamination in single-cell genome data. In addition to database-driven detection, it complements existing tools by its unsupervised techniques, which allow for the detection of de novo contaminants. Our contribution has the potential to drastically reduce the amount of resources put into these processes, particularly in the context of limited availability of reference species. As single-cell genome data continues to grow rapidly, acdc adds to the toolkit of crucial quality assurance tools.

  6. Precision assessment of some supervised and unsupervised algorithms for genotype discrimination in the genus Pisum using SSR molecular data.

    PubMed

    Nasiri, Jaber; Naghavi, Mohammad Reza; Kayvanjoo, Amir Hossein; Nasiri, Mojtaba; Ebrahimi, Mansour

    2015-03-07

    For the first time, prediction accuracies of some supervised and unsupervised algorithms were evaluated in an SSR-based DNA fingerprinting study of a pea collection containing 20 cultivars and 57 wild samples. In general, according to the 10 attribute weighting models, the SSR alleles of PEAPHTAP-2 and PSBLOX13.2-1 were the two most important attributes for discriminating among eight different species and subspecies of the genus Pisum. In addition, K-Medoids unsupervised clustering run on the Chi squared dataset exhibited the best prediction accuracy (83.12%), while the lowest accuracy (25.97%) was obtained when the K-Means model was run on the FCdb database. Irrespective of some fluctuations, the overall accuracies of tree induction models were significantly high for many algorithms, and the attributes PSBLOX13.2-3 and PEAPHTAP could successfully separate Pisum fulvum accessions and cultivars from the others when two selected decision trees were taken into account. Meanwhile, the other supervised algorithms used exhibited reliable overall accuracies, even though in some rare cases they yielded low accuracies. Our results, altogether, demonstrate promising applications of both supervised and unsupervised algorithms as suitable data mining tools for accurate fingerprinting of different species and subspecies of the genus Pisum, a fundamental priority task in breeding programs of the crop. Copyright © 2015 Elsevier Ltd. All rights reserved.

  7. A Hybrid Supervised/Unsupervised Machine Learning Approach to Solar Flare Prediction

    NASA Astrophysics Data System (ADS)

    Benvenuto, Federico; Piana, Michele; Campi, Cristina; Massone, Anna Maria

    2018-01-01

    This paper introduces a novel method for flare forecasting, combining prediction accuracy with the ability to identify the most relevant predictive variables. This result is obtained by means of a two-step approach: first, a supervised regularization method for regression, namely, LASSO is applied, where a sparsity-enhancing penalty term allows the identification of the significance with which each data feature contributes to the prediction; then, an unsupervised fuzzy clustering technique for classification, namely, Fuzzy C-Means, is applied, where the regression outcome is partitioned through the minimization of a cost function and without focusing on the optimization of a specific skill score. This approach is therefore hybrid, since it combines supervised and unsupervised learning; realizes classification in an automatic, skill-score-independent way; and provides effective prediction performances even in the case of imbalanced data sets. Its prediction power is verified against NOAA Space Weather Prediction Center data, using as a test set, data in the range between 1996 August and 2010 December and as training set, data in the range between 1988 December and 1996 June. To validate the method, we computed several skill scores typically utilized in flare prediction and compared the values provided by the hybrid approach with the ones provided by several standard (non-hybrid) machine learning methods. The results showed that the hybrid approach performs classification better than all other supervised methods and with an effectiveness comparable to the one of clustering methods; but, in addition, it provides a reliable ranking of the weights with which the data properties contribute to the forecast.
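
    Illustrative sketch (not taken from the cited paper): the second, unsupervised step described above partitions a set of scalar regression outcomes with Fuzzy C-Means. The minimal version below assumes a fuzzifier m = 2, two clusters, and evenly spread initial centers; the function name fuzzy_c_means_1d and the sample values are hypothetical.

```python
def fuzzy_c_means_1d(values, n_clusters=2, m=2.0, n_iter=100, tol=1e-9):
    """Fuzzy C-Means on scalar values; returns (centers, memberships)."""
    lo, hi = min(values), max(values)
    # Deterministic initialisation (an assumption): spread centers over the data range.
    centers = [lo + (hi - lo) * (k + 0.5) / n_clusters for k in range(n_clusters)]
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ik is proportional to d_ik^(-2/(m-1)), normalised per point.
        memberships = []
        for x in values:
            dists = [abs(x - c) for c in centers]
            if any(d == 0.0 for d in dists):
                # The point coincides with a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(hits)
                memberships.append([h / total for h in hits])
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                memberships.append([v / total for v in inv])
        # Center update: mean of the data weighted by u_ik^m.
        new_centers = []
        for k in range(n_clusters):
            weights = [u[k] ** m for u in memberships]
            new_centers.append(
                sum(w * x for w, x in zip(weights, values)) / sum(weights)
            )
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


# Hypothetical regression outcomes (e.g. LASSO outputs for six events).
outcomes = [0.05, 0.1, 0.15, 0.8, 0.9, 0.95]
centers, memberships = fuzzy_c_means_1d(outcomes)
# Hard labels: assign each outcome to the cluster with the largest membership.
labels = [max(range(len(centers)), key=lambda k: u[k]) for u in memberships]
```

    Because the partition comes from minimizing the clustering cost function, no threshold has to be tuned against a particular skill score, which is the property the abstract emphasizes.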

  8. Algorithms of maximum likelihood data clustering with applications

    NASA Astrophysics Data System (ADS)

    Giada, Lorenzo; Marsili, Matteo

    2002-12-01

    We address the problem of data clustering by introducing an unsupervised, parameter-free approach based on the maximum likelihood principle. Starting from the observation that data sets belonging to the same cluster share common information, we construct an expression for the likelihood of any possible cluster structure. The likelihood in turn depends only on the Pearson correlation coefficient of the data. We discuss clustering algorithms that provide a fast and reliable approximation to maximum likelihood configurations. Compared to standard clustering methods, our approach has the advantages that (i) it is parameter free, (ii) the number of clusters need not be fixed in advance and (iii) the interpretation of the results is transparent. In order to test our approach and compare it with standard clustering algorithms, we analyze two very different data sets: time series of financial market returns and gene expression data. We find that different maximization algorithms produce similar cluster structures whereas the outcome of standard algorithms has a much wider variability.

  9. Unsupervised classification of multivariate geostatistical data: Two algorithms

    NASA Astrophysics Data System (ADS)

    Romary, Thomas; Ors, Fabien; Rivoirard, Jacques; Deraisme, Jacques

    2015-12-01

    With the increasing development of remote sensing platforms and the evolution of sampling facilities in the mining and oil industries, spatial datasets are becoming increasingly large, describe a growing number of variables and cover wider and wider areas. Therefore, it is often necessary to split the domain of study to account for radically different behaviors of the natural phenomenon over the domain and to simplify the subsequent modeling step. The definition of these areas can be seen as a problem of unsupervised classification, or clustering, where we try to divide the domain into homogeneous domains with respect to the values taken by the variables at hand. The application of classical clustering methods, designed for independent observations, does not ensure the spatial coherence of the resulting classes. Image segmentation methods, based on e.g. Markov random fields, are not adapted to irregularly sampled data. Other existing approaches, based on mixtures of Gaussian random functions estimated via the expectation-maximization algorithm, are limited to reasonable sample sizes and a small number of variables. In this work, we propose two algorithms based on adaptations of classical algorithms to multivariate geostatistical data. Both algorithms are model free and can handle large volumes of multivariate, irregularly spaced data. The first one proceeds by agglomerative hierarchical clustering. The spatial coherence is ensured by a proximity condition imposed for two clusters to merge. This proximity condition relies on a graph organizing the data in the coordinates space. The hierarchical algorithm can then be seen as a graph-partitioning algorithm. Following this interpretation, a spatial version of the spectral clustering algorithm is also proposed. The performance of both algorithms is assessed on toy examples and a mining dataset.

  10. Gastrointestinal Fibroblasts Have Specialized, Diverse Transcriptional Phenotypes: A Comprehensive Gene Expression Analysis of Human Fibroblasts

    PubMed Central

    Ishii, Genichiro; Aoyagi, Kazuhiko; Sasaki, Hiroki; Ochiai, Atsushi

    2015-01-01

    Background: Fibroblasts are the principal stromal cells that exist in whole organs and play vital roles in many biological processes. Although the functional diversity of fibroblasts has been estimated, a comprehensive analysis of fibroblasts from the whole body has not been performed and their transcriptional diversity has not been sufficiently explored. The aim of this study was to elucidate the transcriptional diversity of human fibroblasts within the whole body. Methods: Global gene expression analysis was performed on 63 human primary fibroblasts from 13 organs. Of these, 32 fibroblasts from gastrointestinal organs (gastrointestinal fibroblasts: GIFs) were obtained from a pair of 2 anatomical sites: the submucosal layer (submucosal fibroblasts: SMFs) and the subperitoneal layer (subperitoneal fibroblasts: SPFs). Using hierarchical clustering analysis, we elucidated identifiable subgroups of fibroblasts and analyzed the transcriptional character of each subgroup. Results: In unsupervised clustering, 2 major clusters that separate GIFs and non-GIFs were observed. Organ- and anatomical site-dependent clusters within GIFs were also observed. The signature genes that discriminated GIFs from non-GIFs, SMFs from SPFs, and the fibroblasts of one organ from another organ consisted of genes associated with transcriptional regulation, signaling ligands, and extracellular matrix remodeling. Conclusions: GIFs are characteristic fibroblasts with specific gene expressions from transcriptional regulation, signaling ligands, and extracellular matrix remodeling related genes. In addition, the anatomical site- and organ-dependent diversity of GIFs was also discovered. These features of GIFs contribute to their specific physiological function and homeostatic maintenance, and create a functional diversity of the gastrointestinal tract. PMID:26046848

  11. Recognizing patterns of visual field loss using unsupervised machine learning

    NASA Astrophysics Data System (ADS)

    Yousefi, Siamak; Goldbaum, Michael H.; Zangwill, Linda M.; Medeiros, Felipe A.; Bowd, Christopher

    2014-03-01

    Glaucoma is a potentially blinding optic neuropathy that results in a decrease in visual sensitivity. Visual field abnormalities (decreased visual sensitivity on psychophysical tests) are the primary means of glaucoma diagnosis. One form of visual field testing is Frequency Doubling Technology (FDT), which tests sensitivity at 52 points within the visual field. Like other psychophysical tests used in clinical practice, FDT results yield specific patterns of defect indicative of the disease. We used a Gaussian Mixture Model with Expectation Maximization (GEM; EM is used to estimate the model parameters) to automatically separate FDT data into clusters of normal and abnormal eyes. Principal component analysis (PCA) was used to decompose each cluster into different axes (patterns). FDT measurements were obtained from 1,190 eyes with normal FDT results and 786 eyes with abnormal (i.e., glaucomatous) FDT results, recruited from a university-based, longitudinal, multi-center, clinical study on glaucoma. The GEM input was the 52-point FDT threshold sensitivities for all eyes. The optimal GEM model separated the FDT fields into 3 clusters. Cluster 1 contained 94% normal fields (94% specificity), and clusters 2 and 3 combined contained 77% abnormal fields (77% sensitivity). For clusters 1, 2 and 3 the optimal number of PCA-identified axes was 2, 2 and 5, respectively. GEM with PCA successfully separated FDT fields from healthy and glaucoma eyes and identified familiar glaucomatous patterns of loss.
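
    Illustrative sketch (not the authors' implementation): the EM procedure used to estimate Gaussian mixture parameters alternates between computing cluster responsibilities (E-step) and re-estimating weights, means, and variances (M-step). The study above fits 52-dimensional FDT fields; the version below is deliberately reduced to one dimension, and the function names and sample sensitivities are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, n_components=2, n_iter=200, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances, resp)."""
    n = len(values)
    lo, hi = min(values), max(values)
    # Deterministic initialisation (an assumption): means spread over the data range,
    # every component starts with the overall variance and equal weight.
    means = [lo + (hi - lo) * (k + 0.5) / n_components for k in range(n_components)]
    overall_mean = sum(values) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in values) / n, min_var)
    variances = [overall_var] * n_components
    weights = [1.0 / n_components] * n_components
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in values:
            joint = [w * normal_pdf(x, mu, var)
                     for w, mu, var in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from the responsibility-weighted data.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(spread, min_var)  # guard against collapsing variance
    return weights, means, variances, resp


# Hypothetical mean sensitivities (dB) for ten eyes: five normal, five depressed.
sensitivities = [30.1, 29.5, 31.2, 30.8, 29.9, 12.3, 10.8, 13.5, 11.9, 12.6]
weights, means, variances, resp = fit_gmm_1d(sensitivities)
clusters = [max(range(len(means)), key=lambda k: r[k]) for r in resp]
```

    A production analysis would use full covariance matrices and a model-selection criterion to choose the number of components; this sketch fixes the number of components in advance.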

  12. Sequential Organization and Room Reverberation for Speech Segregation

    DTIC Science & Technology

    2012-02-28

    we have proposed two algorithms for sequential organization, an unsupervised clustering algorithm applicable to monaural recordings and a binaural ...algorithm that integrates monaural and binaural analyses. In addition, we have conducted speech intelligibility tests that firmly establish the...comprehensive version is currently under review for journal publication. A binaural approach in room reverberation Most existing approaches to binaural or

  13. Crater monitoring through social media observations

    NASA Astrophysics Data System (ADS)

    Gialampoukidis, I.; Vrochidis, S.; Kompatsiaris, I.

    2017-09-01

    We have collected more than one lunar image every two days from social media observations. The collected images have been clustered into two main groups of lunar images, with an additional (noise) cluster for pictures that have not been assigned to any cluster. The proposed lunar image clustering process provides two classes of lunar pictures at different zoom levels: the first shows a clear view of craters, and the second shows a complete view of the Moon at various phases that are correlated with the crawling date. The clustering stage is unsupervised, so new topics can be detected on-the-fly. We have provided additional sources of planetary images using crowdsourcing information, which is associated with metadata such as time, text, location, links to other users and other related posts. This content has crater information that can be fused with other planetary data to enhance crater monitoring.

  14. Cluster Method Analysis of K. S. C. Image

    NASA Technical Reports Server (NTRS)

    Rodriguez, Joe, Jr.; Desai, M.

    1997-01-01

    Information obtained from satellite-based systems has moved to the forefront as a method in the identification of many land cover types. Identification of different land features through remote sensing is an effective tool for regional and global assessment of geometric characteristics. Classification data acquired from remote sensing images have a wide variety of applications. In particular, analysis of remote sensing images has special applications in the classification of various types of vegetation. Results obtained from classification studies of a particular area or region serve towards a greater understanding of what parameters (ecological, temporal, etc.) affect the region being analyzed. In this paper, we distinguish between the two types of classification approaches, although the focus is on the unsupervised classification method using 1987 Thematic Mapper (TM) images of Kennedy Space Center.

  15. Evaluation of the environmental contamination at an abandoned mining site using multivariate statistical techniques--the Rodalquilar (Southern Spain) mining district.

    PubMed

    Bagur, M G; Morales, S; López-Chicano, M

    2009-11-15

    Unsupervised and supervised pattern recognition techniques such as hierarchical cluster analysis, principal component analysis, factor analysis and linear discriminant analysis have been applied to water samples collected in the Rodalquilar mining district (Southern Spain) in order to identify different sources of environmental pollution caused by the abandoned mining industry. The effect of the mining activity on waters was monitored by determining the concentration of eleven elements (Mn, Ba, Co, Cu, Zn, As, Cd, Sb, Hg, Au and Pb) by inductively coupled plasma mass spectrometry (ICP-MS). The Box-Cox transformation has been used to bring the data set closer to a normal distribution, in order to minimize the effect of the non-normal distribution of the geochemical data. The environmental impact is driven mainly by the mining activity carried out in the zone, by acid drainage and, finally, by the chemical treatment used for gold beneficiation.
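
    Illustrative sketch (not from the cited paper): the Box-Cox transformation maps a strictly positive value x to (x^lambda - 1) / lambda, or to ln(x) when lambda = 0, and lambda is commonly chosen by maximizing the profile log-likelihood under a normal model. The function names, the candidate grid, and the sample concentrations below are assumptions made for illustration.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def box_cox_log_likelihood(values, lam):
    """Profile log-likelihood of lambda, assuming normal transformed data."""
    n = len(values)
    transformed = box_cox(values, lam)
    mean = sum(transformed) / n
    var = sum((y - mean) ** 2 for y in transformed) / n
    # -n/2 * ln(variance) plus the Jacobian term (lambda - 1) * sum(ln x).
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)


def best_lambda(values, candidates):
    """Pick the candidate lambda with the highest profile log-likelihood."""
    return max(candidates, key=lambda lam: box_cox_log_likelihood(values, lam))


# Hypothetical element concentrations spanning several orders of magnitude.
concentrations = [1.0, 10.0, 100.0, 1000.0, 10000.0]
lam = best_lambda(concentrations, [-1.0, -0.5, 0.0, 0.5, 1.0])
normalised = box_cox(concentrations, lam)
```

    A coarse grid is used here only to keep the example short; a finer grid or a numerical optimizer would normally be used to locate lambda.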

  16. Quantitative Evaluation of Head and Neck Cancer Treatment-Related Dysphagia in the Development of a Personalized Treatment Deintensification Paradigm.

    PubMed

    Quon, Harry; Hui, Xuan; Cheng, Zhi; Robertson, Scott; Peng, Luke; Bowers, Michael; Moore, Joseph; Choflet, Amanda; Thompson, Alex; Muse, Mariah; Kiess, Ana; Page, Brandi; Fakhry, Carole; Gourin, Christine; O'Hare, Jolyne; Graham, Peter; Szczesniak, Michal; Maclean, Julia; Cook, Ian; McNutt, Todd

    2017-12-01

    To test the hypothesis that quantifying swallow function with multiple patient-reported outcome (PRO) instruments is an important strategy to yield insights in the development of personalized deintensified therapies seeking to reduce the risk of head and neck cancer (HNC) treatment-related dysphagia (HNCTD). Irradiated HNC subjects seen in follow-up care (April 2015 to December 2015) who prospectively completed the Sydney Swallow Questionnaire (SSQ) and the MD Anderson Dysphagia Inventory (MDADI) concurrently on the web interface to our Oncospace database were evaluated. A correlation matrix quantified the relationship between the SSQ and MDADI. Machine-learning unsupervised cluster analysis using the elbow criterion and CLUSPLOT analysis to establish its validity was performed. We identified 89 subjects. The MDADI and SSQ scores were moderately but significantly correlated (correlation coefficient -0.69). K-means cluster analysis demonstrated that 3 unique statistical cohorts (elbow criterion) could be identified with CLUSPLOT analysis, confirming that 100% of variances were accounted for. Correlation coefficients between the individual items in the SSQ and the MDADI demonstrated weak to moderate negative correlation, except for SSQ17 (quality of life question). Pilot analysis demonstrates that the MDADI and SSQ are complementary. Three unique clusters of patients can be defined, suggesting that a unique dysphagia signature for HNCTD may be definable. Longitudinal studies relying on only a single PRO, such as MDADI, may be inadequate for classifying HNCTD. Copyright © 2017 Elsevier Inc. All rights reserved.
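
    Illustrative sketch (not the authors' analysis): the elbow criterion mentioned above runs k-means for increasing k and looks for the k after which the within-cluster sum of squares stops dropping sharply. The version below uses a deterministic farthest-point initialization, which is an assumption made to keep the example reproducible; the function names and the sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, n_iter=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def within_cluster_ss(points, centers, labels):
    """Total squared distance of each point to its assigned center."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))


def elbow_curve(points, max_k):
    """Within-cluster sum of squares for k = 1..max_k."""
    return [within_cluster_ss(points, *k_means(points, k)) for k in range(1, max_k + 1)]


# Hypothetical pairs of questionnaire scores forming three groups.
scores = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]
curve = elbow_curve(scores, 5)
```

    Plotting the curve against k and reading off the bend is the usual way to apply the criterion; in this toy data the large drop ends at k = 3.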

  17. An Unsupervised Anomalous Event Detection and Interactive Analysis Framework for Large-scale Satellite Data

    NASA Astrophysics Data System (ADS)

    Liu, Q.; Lv, Q.; Klucik, R.; Chen, C.; Gallaher, D. W.; Grant, G.; Shang, L.

    2016-12-01

    Due to the high volume and complexity of satellite data, computer-aided tools for fast quality assessments and scientific discovery are indispensable for scientists in the era of Big Data. In this work, we have developed a framework for automated anomalous event detection in massive satellite data. The framework consists of a clustering-based anomaly detection algorithm and a cloud-based tool for interactive analysis of detected anomalies. The algorithm is unsupervised and requires no prior knowledge of the data (e.g., expected normal pattern or known anomalies). As such, it works for diverse data sets, and performs well even in the presence of missing and noisy data. The cloud-based tool provides an intuitive mapping interface that allows users to interactively analyze anomalies using multiple features. As a whole, our framework can (1) identify outliers in a spatio-temporal context, (2) recognize and distinguish meaningful anomalous events from individual outliers, (3) rank those events based on "interestingness" (e.g., rareness or total number of outliers) defined by users, and (4) enable interactive querying, exploration, and analysis of those anomalous events. In this presentation, we will demonstrate the effectiveness and efficiency of our framework in the application of detecting data quality issues and unusual natural events using two satellite datasets. The techniques and tools developed in this project are applicable to a diverse set of satellite data and will be made publicly available for scientists in early 2017.

  18. A scheme for racquet sports video analysis with the combination of audio-visual information

    NASA Astrophysics Data System (ADS)

    Xing, Liyuan; Ye, Qixiang; Zhang, Weigang; Huang, Qingming; Yu, Hua

    2005-07-01

    As a very important category in sports video, racquet sports video, e.g. table tennis, tennis and badminton, has received little attention in the past years. Considering the characteristics of this kind of sports video, we propose a new scheme for structure indexing and highlight generation based on the combination of audio and visual information. Firstly, a supervised classification method is employed to detect important audio symbols including impact (ball hit), audience cheers, commentator speech, etc. Meanwhile an unsupervised algorithm is proposed to group video shots into various clusters. Then, by taking advantage of the temporal relationship between audio and visual signals, we can specify the scene clusters with semantic labels including rally scenes and break scenes. Thirdly, a refinement procedure is developed to reduce false rally scenes by further audio analysis. Finally, an exciting model is proposed to rank the detected rally scenes from which many exciting video clips such as game (match) points can be correctly retrieved. Experiments on two types of representative racquet sports video, table tennis video and tennis video, demonstrate encouraging results.

  19. Application of classification methods for mapping Mercury's surface composition: analysis on Rudaki's Area

    NASA Astrophysics Data System (ADS)

    Zambon, F.; De Sanctis, M. C.; Capaccioni, F.; Filacchione, G.; Carli, C.; Ammanito, E.; Friggeri, A.

    2011-10-01

    During the first two MESSENGER flybys (14th January 2008 and 6th October 2008) the Mercury Dual Imaging System (MDIS) extended the coverage of the Mercury surface obtained by Mariner 10, and we now have images of about 90% of the Mercury surface [1]. MDIS is equipped with a Narrow Angle Camera (NAC) and a Wide Angle Camera (WAC). The NAC uses an off-axis reflective design with a 1.5° field of view (FOV) centered at 747 nm. The WAC has a refractive design with a 10.5° FOV and 12-position filters that cover a 395-1040 nm spectral range [2]. The color images can be used to infer information on the surface composition, and classification methods are an interesting technique for multispectral image analysis which can be applied to the study of planetary surfaces. Classification methods are based on clustering algorithms and can be divided into two categories: unsupervised and supervised. Unsupervised classifiers do not require analyst feedback, and the algorithm automatically organizes pixel values into classes. In the supervised method, instead, the analyst must choose the "training areas" that define the pixel values of a given class [3]. Here we will describe the classification into different compositional units of the region near the Rudaki Crater on Mercury.

  20. Unsupervised segmentation of lung fields in chest radiographs using multiresolution fractal feature vector and deformable models.

    PubMed

    Lee, Wen-Li; Chang, Koyin; Hsieh, Kai-Sheng

    2016-09-01

    Segmenting lung fields in a chest radiograph is essential for automatically analyzing an image. We present an unsupervised method based on multiresolution fractal feature vector. The feature vector characterizes the lung field region effectively. A fuzzy c-means clustering algorithm is then applied to obtain a satisfactory initial contour. The final contour is obtained by deformable models. The results show the feasibility and high performance of the proposed method. Furthermore, based on the segmentation of lung fields, the cardiothoracic ratio (CTR) can be measured. The CTR is a simple index for evaluating cardiac hypertrophy. After identifying a suspicious symptom based on the estimated CTR, a physician can suggest that the patient undergoes additional extensive tests before a treatment plan is finalized.

  1. Unsupervised tattoo segmentation combining bottom-up and top-down cues

    NASA Astrophysics Data System (ADS)

    Allen, Josef D.; Zhao, Nan; Yuan, Jiangbo; Liu, Xiuwen

    2011-06-01

    Tattoo segmentation is challenging due to the complexity and large variance in tattoo structures. We have developed a segmentation algorithm for finding tattoos in an image. Our basic idea is split-merge: split each tattoo image into clusters through a bottom-up process, learn to merge the clusters containing skin and then distinguish tattoo from the other skin via a top-down prior in the image itself. Tattoo segmentation with an unknown number of clusters is thereby transformed into a figure-ground segmentation. We have applied our segmentation algorithm to a tattoo dataset, and the results show that our tattoo segmentation system is efficient and suitable for further tattoo classification and retrieval purposes.

  2. An introduction to mass cytometry: fundamentals and applications.

    PubMed

    Tanner, Scott D; Baranov, Vladimir I; Ornatsky, Olga I; Bandura, Dmitry R; George, Thaddeus C

    2013-05-01

    Mass cytometry addresses the analytical challenges of polychromatic flow cytometry by using metal atoms as tags rather than fluorophores and atomic mass spectrometry as the detector rather than photon optics. The many available enriched stable isotopes of the transition elements can provide up to 100 distinguishable reporting tags, which can be measured simultaneously because of the essential independence of detection provided by the mass spectrometer. We discuss the adaptation of traditional inductively coupled plasma mass spectrometry to cytometry applications. We focus on the generation of cytometry-compatible data and on approaches to unsupervised multivariate clustering analysis. Finally, we provide a high-level review of some recent benchmark reports that highlight the potential for massively multi-parameter mass cytometry.

  3. Characterizing the spatial structure of endangered species habitat using geostatistical analysis of IKONOS imagery

    USGS Publications Warehouse

    Wallace, C.S.A.; Marsh, S.E.

    2005-01-01

    Our study used geostatistics to extract measures that characterize the spatial structure of vegetated landscapes from satellite imagery for mapping endangered Sonoran pronghorn habitat. Fine spatial resolution IKONOS data provided information at the scale of individual trees or shrubs that permitted analysis of vegetation structure and pattern. We derived images of landscape structure by calculating local estimates of the nugget, sill, and range variogram parameters within 25 × 25-m image windows. These variogram parameters, which describe the spatial autocorrelation of the 1-m image pixels, are shown in previous studies to discriminate between different species-specific vegetation associations. We constructed two independent models of pronghorn landscape preference by coupling the derived measures with Sonoran pronghorn sighting data: a distribution-based model and a cluster-based model. The distribution-based model used the descriptive statistics for variogram measures at pronghorn sightings, whereas the cluster-based model used the distribution of pronghorn sightings within clusters of an unsupervised classification of derived images. Both models define similar landscapes, and validation results confirm they effectively predict the locations of an independent set of pronghorn sightings. Such information, although not a substitute for field-based knowledge of the landscape and associated ecological processes, can provide valuable reconnaissance information to guide natural resource management efforts. © 2005 Taylor & Francis Group Ltd.
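
    Illustrative sketch (not the authors' workflow): nugget, sill, and range are read from a variogram, which is estimated from data as the empirical semivariance, i.e. half the mean squared difference between pixel values separated by a lag h. The study above works in two-dimensional image windows; the version below is reduced to a one-dimensional transect of pixel values, and the function name and sample transect are hypothetical.

```python
def empirical_semivariogram(values, max_lag):
    """Semivariance gamma(h) for lags h = 1..max_lag along a 1-D transect.

    max_lag must be smaller than len(values) so that every lag has at least one pair.
    """
    if not 1 <= max_lag < len(values):
        raise ValueError("max_lag must satisfy 1 <= max_lag < len(values)")
    gammas = []
    for h in range(1, max_lag + 1):
        pairs = [(values[i], values[i + h]) for i in range(len(values) - h)]
        # gamma(h) = sum of squared differences / (2 * number of pairs)
        gammas.append(sum((a - b) ** 2 for a, b in pairs) / (2.0 * len(pairs)))
    return gammas


# Hypothetical brightness values along a 1-m pixel transect.
transect = [12.0, 13.0, 15.0, 14.0, 18.0, 21.0, 20.0, 22.0, 19.0, 17.0]
gamma = empirical_semivariogram(transect, max_lag=4)
```

    Fitting a model curve (for example spherical or exponential) to these values is what yields the nugget (intercept), sill (plateau), and range (lag at which the plateau is reached); that fitting step is omitted here.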

  4. Individualized Functional Parcellation of the Human Amygdala Using a Semi-supervised Clustering Method: A 7T Resting State fMRI Study.

    PubMed

    Zhang, Xianchang; Cheng, Hewei; Zuo, Zhentao; Zhou, Ke; Cong, Fei; Wang, Bo; Zhuo, Yan; Chen, Lin; Xue, Rong; Fan, Yong

    2018-01-01

    The amygdala plays an important role in emotional functions and its dysfunction is considered to be associated with multiple psychiatric disorders in humans. Cytoarchitectonic mapping has demonstrated that the human amygdala complex comprises several subregions. However, it is difficult to delineate boundaries of these subregions in vivo even with state-of-the-art high resolution structural MRI. Previous attempts to parcellate this small structure using unsupervised clustering methods based on resting state fMRI data suffered from the low spatial resolution of typical fMRI data, and it remains challenging for the unsupervised methods to define subregions of the amygdala in vivo. In this study, we developed a novel brain parcellation method to segment the human amygdala into spatially contiguous subregions based on 7T high resolution fMRI data. The parcellation was implemented using a semi-supervised spectral clustering (SSC) algorithm at an individual subject level. Under guidance of prior information derived from the Julich cytoarchitectonic atlas, our method clustered voxels of the amygdala into subregions according to similarity measures of their functional signals. As a result, three distinct amygdala subregions can be obtained in each hemisphere for every individual subject. Compared with the cytoarchitectonic atlas, our method achieved better performance in terms of subregional functional homogeneity. Validation experiments have also demonstrated that the amygdala subregions obtained by our method have distinctive, lateralized functional connectivity (FC) patterns. Our study has demonstrated that the semi-supervised brain parcellation method is a powerful tool for exploring amygdala subregional functions.

  5. Assessment of self-organizing maps to analyze sole-carbon source utilization profiles.

    PubMed

    Leflaive, Joséphine; Céréghino, Régis; Danger, Michaël; Lacroix, Gérard; Ten-Hage, Loïc

    2005-07-01

    Community-level physiological profiles obtained with Biolog microplates are widely used to assess the functional diversity of bacterial communities. Biolog produces a great amount of data, the analysis of which has been the subject of many studies. In most cases, after some transformations, these data were investigated with classical multivariate analyses. Here we provide an alternative to this method: the use of an artificial intelligence technique, the Self-Organizing Map (SOM, an unsupervised neural network). We used data from a microcosm study of algae-associated bacterial communities placed in various nutritive conditions. Analyses were carried out on the net absorbances at two incubation times for each substrate and on the chemical guild categorization of the total bacterial activity. Compared to Principal Components Analysis and cluster analysis, SOM appeared to be a valuable tool for community classification and for establishing clear relationships between clusters of bacterial communities and sole-carbon source utilization. Specifically, SOMs offered a clear two-dimensional projection of a relatively large volume of data and were easier to interpret than plots commonly obtained with multivariate analyses. They can be recommended for patterning the temporal evolution of communities' functional diversity.

  6. Principal component analysis-based unsupervised feature extraction applied to in silico drug discovery for posttraumatic stress disorder-mediated heart disease.

    PubMed

    Taguchi, Y-h; Iwadate, Mitsuo; Umeyama, Hideaki

    2015-04-30

    Feature extraction (FE) is difficult, particularly if there are more features than samples, as small sample numbers often result in biased outcomes or overfitting. Furthermore, multiple sample classes often complicate FE because evaluating performance, which is usual in supervised FE, is generally harder than in the two-class problem. Developing sample classification independent unsupervised methods would solve many of these problems. Two principal component analysis (PCA)-based FE methods were tested as sample classification independent unsupervised FE methods: variational Bayes PCA (VBPCA), which was extended to perform unsupervised FE, and conventional PCA (CPCA)-based unsupervised FE. VBPCA- and CPCA-based unsupervised FE both performed well when applied to simulated data and to a posttraumatic stress disorder (PTSD)-mediated heart disease data set that had multiple categorical class observations in mRNA/microRNA expression of stressed mouse heart. A critical set of PTSD miRNAs/mRNAs was identified that show aberrant expression between treatment and control samples, and significant, negative correlation with one another. Moreover, greater stability and biological feasibility than conventional supervised FE was also demonstrated. Based on the results obtained, in silico drug discovery was performed as translational validation of the methods. Our two proposed unsupervised FE methods (CPCA- and VBPCA-based) worked well on simulated data, and outperformed two conventional supervised FE methods on a real data set. Thus, these two methods have suggested equivalence for FE on categorical multiclass data sets, with potential translational utility for in silico drug discovery.

  7. Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering

    DTIC Science & Technology

    2005-08-04

    describe a four-band magnetic resonance image (MRI) consisting of 23,712 pixels of a brain with a tumor 2. Because of the size of the dataset, it is not...the Royal Statistical Society, Series B 56, 363–375. Figueiredo, M. A. T. and A. K. Jain (2002). Unsupervised learning of finite mixture models. IEEE...20 5.4 Brain MRI

  8. Automatic microseismic event picking via unsupervised machine learning

    NASA Astrophysics Data System (ADS)

    Chen, Yangkang

    2018-01-01

    Effective and efficient arrival picking plays an important role in microseismic and earthquake data processing and imaging. Widely used arrival picking algorithms based on the short-term-average to long-term-average ratio (STA/LTA) are sensitive to moderate-to-strong random ambient noise. To make the state-of-the-art arrival picking approaches effective, microseismic data need to be first pre-processed, for example by removing a sufficient amount of noise, and then analysed by arrival pickers. To overcome the noise issue in arrival picking for weak microseismic or earthquake events, I use machine learning techniques to help recognize seismic waveforms in microseismic or earthquake data. Because supervised machine learning algorithms depend on a large volume of well-designed training data, I use an unsupervised machine learning algorithm to cluster the time samples into two groups, that is, waveform points and non-waveform points. The fuzzy clustering algorithm has been demonstrated to be effective for this purpose. Tests on a group of synthetic, real microseismic and earthquake data sets with different levels of complexity show that the proposed method is much more robust than the state-of-the-art STA/LTA method in picking microseismic events, even in the case of moderately strong background noise.
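
The two-group fuzzy clustering described above can be illustrated with a minimal sketch. The abstract does not say which features or which fuzzy algorithm variant were used, so everything here is an assumption made for illustration: the feature is the absolute amplitude of each time sample, the algorithm is textbook fuzzy c-means with fuzzifier m = 2, and the function name `fuzzy_c_means_1d` and the toy trace are hypothetical.

```python
def fuzzy_c_means_1d(values, n_clusters=2, m=2.0, n_iter=50, eps=1e-12):
    """Textbook fuzzy c-means on scalar features; returns (centers, memberships)."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the initial centers evenly over the data range.
    centers = [lo + (hi - lo) * k / (n_clusters - 1) for k in range(n_clusters)]
    memberships = []
    for _ in range(n_iter):
        memberships = []
        for x in values:
            # Distance to each center (eps avoids division by zero).
            d = [abs(x - c) + eps for c in centers]
            # Membership update: u_k = 1 / sum_j (d_k / d_j)^(2 / (m - 1)).
            u = [
                1.0 / sum((d[k] / d[j]) ** (2.0 / (m - 1.0)) for j in range(n_clusters))
                for k in range(n_clusters)
            ]
            memberships.append(u)
        # Center update: mean of the data weighted by membership^m.
        centers = [
            sum((u[k] ** m) * x for u, x in zip(memberships, values))
            / sum(u[k] ** m for u in memberships)
            for k in range(n_clusters)
        ]
    return centers, memberships


# Toy trace (made-up numbers): low-amplitude noise with a burst in the middle.
trace = [0.1, -0.2, 0.1, 0.0, 2.1, -2.4, 1.9, -2.0, 0.2, -0.1, 0.1, 0.0]
features = [abs(s) for s in trace]  # assumed feature: absolute amplitude
centers, memberships = fuzzy_c_means_1d(features)

# Call a sample a "waveform point" when its membership in the
# high-amplitude cluster exceeds 0.5.
high = centers.index(max(centers))
is_waveform = [u[high] > 0.5 for u in memberships]
```

Unlike hard k-means, each sample keeps a graded membership in both groups, so the 0.5 threshold is a design choice that could be tightened to trade missed picks against false ones.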

  9. Geological applications of machine learning on hyperspectral remote sensing data

    NASA Astrophysics Data System (ADS)

    Tse, C. H.; Li, Yi-liang; Lam, Edmund Y.

    2015-02-01

    The CRISM imaging spectrometer orbiting Mars has been producing a vast amount of data in the visible to infrared wavelengths in the form of hyperspectral data cubes. These data, compared with those obtained from previous remote sensing techniques, yield an unprecedented level of detailed spectral resolution in addition to an ever increasing level of spatial information. A major challenge brought about by the data is the burden of processing and interpreting these datasets and extracting the relevant information from them. This research aims at approaching the challenge by exploring machine learning methods, especially unsupervised learning, to achieve cluster density estimation and classification, and ultimately devising an efficient means leading to identification of minerals. A set of software tools has been constructed in Python to access and experiment with CRISM hyperspectral cubes selected from two specific Mars locations. A machine learning pipeline is proposed and unsupervised learning methods were applied to pre-processed datasets. The resulting data clusters are compared with the published ASTER spectral library and browse data products from the Planetary Data System (PDS). The results demonstrated that this approach is capable of processing the huge amount of hyperspectral data and potentially providing guidance to scientists for more detailed studies.

  10. Data-driven cluster reinforcement and visualization in sparsely-matched self-organizing maps.

    PubMed

    Manukyan, Narine; Eppstein, Margaret J; Rizzo, Donna M

    2012-05-01

    A self-organizing map (SOM) is a self-organized projection of high-dimensional data onto a typically 2-dimensional (2-D) feature map, wherein vector similarity is implicitly translated into topological closeness in the 2-D projection. However, when there are more neurons than input patterns, it can be challenging to interpret the results, due to diffuse cluster boundaries and limitations of current methods for displaying interneuron distances. In this brief, we introduce a new cluster reinforcement (CR) phase for sparsely-matched SOMs. The CR phase amplifies within-cluster similarity in an unsupervised, data-driven manner. Discontinuities in the resulting map correspond to between-cluster distances and are stored in a boundary (B) matrix. We describe a new hierarchical visualization of cluster boundaries displayed directly on feature maps, which requires no further clustering beyond what was implicitly accomplished during self-organization in SOM training. We use a synthetic benchmark problem and previously published microbial community profile data to demonstrate the benefits of the proposed methods.

  11. Image fusion using sparse overcomplete feature dictionaries

    DOEpatents

    Brumby, Steven P.; Bettencourt, Luis; Kenyon, Garrett T.; Chartrand, Rick; Wohlberg, Brendt

    2015-10-06

    Approaches for deciding what individuals in a population of visual system "neurons" are looking for using sparse overcomplete feature dictionaries are provided. A sparse overcomplete feature dictionary may be learned for an image dataset and a local sparse representation of the image dataset may be built using the learned feature dictionary. A local maximum pooling operation may be applied on the local sparse representation to produce a translation-tolerant representation of the image dataset. An object may then be classified and/or clustered within the translation-tolerant representation of the image dataset using a supervised classification algorithm and/or an unsupervised clustering algorithm.

  12. Wildlife management by habitat units: A preliminary plan of action

    NASA Technical Reports Server (NTRS)

    Frentress, C. D.; Frye, R. G.

    1975-01-01

    Procedures for yielding vegetation type maps were developed using LANDSAT data and a computer assisted classification analysis (LARSYS) to assist in managing populations of wildlife species by defined area units. Ground cover in Travis County, Texas was classified on two occasions using a modified version of the unsupervised approach to classification. The first classification produced a total of 17 classes. Examination revealed that further grouping was justified. A second analysis produced 10 classes which were displayed on printouts which were later color-coded. The final classification was 82 percent accurate. While the classification map appeared to satisfactorily depict the existing vegetation, two classes were determined to contain significant error. The major sources of error could have been eliminated by stratifying cluster sites more closely among previously mapped soil associations that are identified with particular plant associations and by precisely defining class nomenclature using established criteria early in the analysis.

  13. Anomaly Detection of Electromyographic Signals.

    PubMed

    Ijaz, Ahsan; Choi, Jongeun

    2018-04-01

    In this paper, we provide a robust framework to detect anomalous electromyographic (EMG) signals and identify contamination types. As a first step for feature selection, an optimally selected Lawton wavelet transform is applied. Robust principal component analysis (rPCA) is then performed on these wavelet coefficients to obtain features in a lower dimension. The rPCA based features are used for constructing a self-organizing map (SOM). Finally, hierarchical clustering is applied on the SOM that separates anomalous signals residing in the smaller clusters and breaks them into logical units for contamination identification. The proposed methodology is tested using synthetic and real world EMG signals. The synthetic EMG signals are generated using a heteroscedastic process mimicking desired experimental setups. Anomalies are introduced into a subset of these synthetic signals. These results are followed by tests on real EMG signals with synthetic anomalies introduced. Finally, a heterogeneous real world data set with known quality issues is used under an unsupervised setting. The framework provides recall of 90% (± 3.3) and precision of 99% (± 0.4).

  14. Partially supervised speaker clustering.

    PubMed

    Tang, Hao; Chu, Stephen Mingyu; Hasegawa-Johnson, Mark; Huang, Thomas S

    2012-05-01

    Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the Euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm, linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the "bag of acoustic features" representation and statistical model-based distance metrics, 2) our advocated use of the cosine distance metric yields consistent increases in the speaker clustering performance as compared to the commonly used Euclidean distance metric, 3) our partially supervised speaker clustering concept and strategies significantly improve the speaker clustering performance over the baselines, and 4) our proposed LSDA algorithm further leads to state-of-the-art speaker clustering performance.
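
The preference for cosine over Euclidean distance when data scatter directionally can be seen in a small sketch. The three-dimensional vectors below are hypothetical stand-ins, not GMM mean supervectors from the paper, and the helper names are introduced here only for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy vectors.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the length
w = [3.0, 2.0, 1.0]  # same length as u, different direction

# Euclidean distance ranks w as the closer neighbour of u,
# while cosine distance ranks v as the closer one.
closer_by_euclidean = "w" if euclidean_distance(u, w) < euclidean_distance(u, v) else "v"
closer_by_cosine = "v" if cosine_distance(u, v) < cosine_distance(u, w) else "w"
```

If same-speaker utterances differ mainly in vector length rather than direction, as in this toy case, a length-insensitive metric groups them together where a Euclidean metric would not.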

  15. Classification of multispectral or hyperspectral satellite imagery using clustering of sparse approximations on sparse representations in learned dictionaries obtained using efficient convolutional sparse coding

    DOEpatents

    Moody, Daniela; Wohlberg, Brendt

    2018-01-02

    An approach for land cover classification, seasonal and yearly change detection and monitoring, and identification of changes in man-made features may use a clustering of sparse approximations (CoSA) on sparse representations in learned dictionaries. The learned dictionaries may be derived using efficient convolutional sparse coding to build multispectral or hyperspectral, multiresolution dictionaries that are adapted to regional satellite image data. Sparse image representations of images over the learned dictionaries may be used to perform unsupervised k-means clustering into land cover categories. The clustering process behaves as a classifier in detecting real variability. This approach may combine spectral and spatial textural characteristics to detect geologic, vegetative, hydrologic, and man-made features, as well as changes in these features over time.

  16. Diagnostic index of three-dimensional osteoarthritic changes in temporomandibular joint condylar morphology

    PubMed Central

    Gomes, Liliane R.; Gomes, Marcelo; Jung, Bryan; Paniagua, Beatriz; Ruellas, Antonio C.; Gonçalves, João Roberto; Styner, Martin A.; Wolford, Larry; Cevidanes, Lucia

    2015-01-01

    This study aimed to investigate imaging statistical approaches for classifying three-dimensional (3-D) osteoarthritic morphological variations among 169 temporomandibular joint (TMJ) condyles. Cone-beam computed tomography scans were acquired from 69 subjects with long-term TMJ osteoarthritis (OA), 15 subjects at initial diagnosis of OA, and 7 healthy controls. Three-dimensional surface models of the condyles were constructed and SPHARM-PDM established correspondent points on each model. Multivariate analysis of covariance and direction-projection-permutation (DiProPerm) were used for testing statistical significance of the differences between the groups determined by clinical and radiographic diagnoses. Unsupervised classification using hierarchical agglomerative clustering was then conducted. Compared with healthy controls, OA average condyle was significantly smaller in all dimensions except its anterior surface. Significant flattening of the lateral pole was noticed at initial diagnosis. We observed areas of 3.88-mm bone resorption at the superior surface and 3.10-mm bone apposition at the anterior aspect of the long-term OA average model. DiProPerm supported a significant difference between the healthy control and OA group (p-value=0.001). Clinically meaningful unsupervised classification of TMJ condylar morphology determined a preliminary diagnostic index of 3-D osteoarthritic changes, which may be the first step towards a more targeted diagnosis of this condition. PMID:26158119

  17. Metals and organic compounds in the biosynthesis of cannabinoids: a chemometric approach to the analysis of Cannabis sativa samples.

    PubMed

    Radosavljevic-Stevanovic, Natasa; Markovic, Jelena; Agatonovic-Kustrin, Snezana; Razic, Slavica

    2014-01-01

    Illicit production and trade of Cannabis sativa affect many societies. This drug is the most popular and easy to produce. Important information for the authorities is the production locality and the indicators of a particular production. This work is an attempt to recognise correlations between the metal content in the different parts of C. sativa L., in soils where plants were cultivated and the cannabinoids content, as a potential indicator. The organic fraction of the leaves of Cannabis plants was investigated by GC-FID analysis. In addition, the determination of Cu, Fe, Cr, Mn, Zn, Ca and Mg was realised by spectroscopic techniques (FAAS and GFAAS). In this study, numerous correlations between metal content in plants and soil, already confirmed in previous publications, were analysed applying chemometric unsupervised methods, that is, principal component analysis, factor analysis and cluster analysis, in order to highlight their role in the biosynthesis of cannabinoids.

  18. SEURAT: visual analytics for the integrated analysis of microarray data.

    PubMed

    Gribov, Alexander; Sill, Martin; Lück, Sonja; Rücker, Frank; Döhner, Konstanze; Bullinger, Lars; Benner, Axel; Unwin, Antony

    2010-06-03

    In translational cancer research, gene expression data is collected together with clinical data and genomic data arising from other chip based high throughput technologies. Software tools for the joint analysis of such high dimensional data sets together with clinical data are required. We have developed an open source software tool which provides interactive visualization capability for the integrated analysis of high-dimensional gene expression data together with associated clinical data, array CGH data and SNP array data. The different data types are organized by a comprehensive data manager. Interactive tools are provided for all graphics: heatmaps, dendrograms, barcharts, histograms, eventcharts and a chromosome browser, which displays genetic variations along the genome. All graphics are dynamic and fully linked so that any object selected in a graphic will be highlighted in all other graphics. For exploratory data analysis the software provides unsupervised data analytics like clustering, seriation algorithms and biclustering algorithms. The SEURAT software meets the growing needs of researchers to perform joint analysis of gene expression, genomical and clinical data.

  19. Analysis of neoplastic lesions in magnetic resonance imaging using self-organizing maps.

    PubMed

    Mei, Paulo Afonso; de Carvalho Carneiro, Cleyton; Fraser, Stephen J; Min, Li Li; Reis, Fabiano

    2015-12-15

    To provide an improved method for the identification and analysis of brain tumors in MRI scans using a semi-automated computational approach that has the potential to provide a more objective, precise and quantitatively rigorous analysis, compared to human visual analysis. Self-Organizing Maps (SOM) is an unsupervised, exploratory data analysis tool, which can automatically domain an image into self-similar regions or clusters, based on measures of similarity. It can be used to perform image-domain of brain tissue on MR images, without prior knowledge. We used SOM to analyze T1, T2 and FLAIR acquisitions from two MRI machines in our service from 14 patients with brain tumors confirmed by biopsies--three lymphomas, six glioblastomas, one meningioma, one ganglioglioma, two oligoastrocytomas and one astrocytoma. The SOM software was used to analyze the data from the three image acquisitions from each patient and generated a self-organized map for each containing 25 clusters. Damaged tissue was separated from the normal tissue using the SOM technique. Furthermore, in some cases it allowed different areas within the tumor to be separated, such as edema/peritumoral infiltration and necrosis. In lesions with less precise boundaries in FLAIR, the estimated damaged tissue area in the resulting map appears bigger. Our results showed that SOM has the potential to be a powerful MR imaging analysis technique for the assessment of brain tumors. Copyright © 2015. Published by Elsevier B.V.

  20. Detection of sunn pest-damaged wheat samples using visible/near-infrared spectroscopy based on pattern recognition.

    PubMed

    Basati, Zahra; Jamshidi, Bahareh; Rasekh, Mansour; Abbaspour-Gilandeh, Yousef

    2018-05-30

    The presence of sunn pest-damaged grains in wheat mass reduces the quality of flour and bread produced from it. Therefore, it is essential to assess the quality of the samples in collecting and storage centers of wheat and flour mills. In this research, the capability of visible/near-infrared (Vis/NIR) spectroscopy combined with pattern recognition methods was investigated for discrimination of wheat samples with different percentages of sunn pest damage. To this end, various samples belonging to five classes (healthy and 5%, 10%, 15% and 20% unhealthy) were analyzed using Vis/NIR spectroscopy (wavelength range of 350-1000 nm) based on both supervised and unsupervised pattern recognition methods. Principal component analysis (PCA) and hierarchical cluster analysis (HCA) as the unsupervised techniques and soft independent modeling of class analogies (SIMCA) and partial least squares-discriminant analysis (PLS-DA) as supervised methods were used. The results showed that Vis/NIR spectra of healthy samples were correctly clustered using both PCA and HCA. Due to the high overlap between the four unhealthy classes (5%, 10%, 15% and 20%), it was not possible to discriminate all the unhealthy samples into individual classes. However, when considering only the two main categories of healthy and unhealthy, an acceptable degree of separation between the classes can be obtained after classification with the supervised pattern recognition methods SIMCA and PLS-DA. SIMCA based on PCA modeling correctly classified samples into the two classes of healthy and unhealthy with classification accuracy of 100%. Moreover, the wavelengths 839 nm, 918 nm and 995 nm had more power than other wavelengths to discriminate between the healthy and unhealthy classes. It was also concluded that PLS-DA provides excellent classification results for healthy and unhealthy samples (R² = 0.973 and RMSECV = 0.057). Therefore, Vis/NIR spectroscopy based on pattern recognition techniques can be useful for rapidly distinguishing healthy wheat samples from those damaged by sunn pest in maintenance and processing centers. Copyright © 2018 Elsevier B.V. All rights reserved.

  1. Predicting protein complexes from weighted protein-protein interaction graphs with a novel unsupervised methodology: Evolutionary enhanced Markov clustering.

    PubMed

    Theofilatos, Konstantinos; Pavlopoulou, Niki; Papasavvas, Christoforos; Likothanassis, Spiros; Dimitrakopoulos, Christos; Georgopoulos, Efstratios; Moschopoulos, Charalampos; Mavroudi, Seferina

    2015-03-01

    Proteins are considered to be the most important individual components of biological systems and they combine to form physical protein complexes which are responsible for certain molecular functions. Despite the large availability of protein-protein interaction (PPI) information, not much information is available about protein complexes. Experimental methods are limited in terms of time, efficiency, cost and performance constraints. Existing computational methods have provided encouraging preliminary results, but they face certain disadvantages as they require parameter tuning, some of them cannot handle weighted PPI data and others do not allow a protein to participate in more than one protein complex. In the present paper, we propose a new fully unsupervised methodology for predicting protein complexes from weighted PPI graphs. The proposed methodology is called evolutionary enhanced Markov clustering (EE-MC) and it is a hybrid combination of an adaptive evolutionary algorithm and a state-of-the-art clustering algorithm named enhanced Markov clustering. EE-MC was compared with state-of-the-art methodologies when applied to datasets from the human and the yeast Saccharomyces cerevisiae organisms. Using publicly available datasets, EE-MC outperformed existing methodologies (in some datasets the separation metric was increased by 10-20%). Moreover, when applied to new human datasets its performance was encouraging in the prediction of protein complexes which consist of proteins with high functional similarity. Specifically, 5737 protein complexes were predicted and 72.58% of them are enriched for at least one gene ontology (GO) function term. EE-MC is by design able to overcome intrinsic limitations of existing methodologies such as their inability to handle weighted PPI networks, their constraint of assigning every protein to exactly one cluster and the difficulties they face concerning parameter tuning. This fact was experimentally validated and moreover, new potentially true human protein complexes were suggested as candidates for further validation using experimental techniques. Copyright © 2015 Elsevier B.V. All rights reserved.

  2. VizieR Online Data Catalog: Redshift reliability flags (VVDS data) (Jamal+, 2018)

    NASA Astrophysics Data System (ADS)

    Jamal, S.; Le Brun, V.; Le Fevre, O.; Vibert, D.; Schmitt, A.; Surace, C.; Copin, Y.; Garilli, B.; Moresco, M.; Pozzetti, L.

    2017-09-01

    The VIMOS VLT Deep Survey (Le Fevre et al. 2013A&A...559A..14L) is a combination of 3 i-band magnitude limited surveys: Wide (17.5<=iAB<=22.5; 8.6deg2), Deep (17.5<=iAB<=24; 0.6deg2) and Ultra-Deep (23<=iAB<=24.75; 512arcmin2), that produced a total of 35526 spectroscopic galaxy redshifts between 0 and 6.7 (22434 in Wide, 12051 in Deep and 1041 in UDeep). We supplement spectra of the VIMOS VLT Deep Survey (VVDS) with newly-defined redshift reliability flags obtained from clustering (unsupervised classification in Machine Learning) a set of descriptors from individual zPDFs. In this paper, we exploit a set of 24519 spectra from the VVDS database. After computing zPDFs for each individual spectrum, a set of (8) descriptors of the zPDF are extracted to build a feature matrix X (dimension = 24519 rows, 8 columns). Then, we use a clustering (unsupervised algorithms in Machine Learning) algorithm to partition the feature space into distinct clusters (5 clusters: C1,C2,C3,C4,C5), each depicting a different level of confidence to associate with the measured redshift zMAP (Maximum-A-Posteriori estimate that corresponds to the maximum of the redshift PDF). The clustering results (C1,C2,C3,C4,C5) reported in the table are those used in the paper (Jamal et al, 2017) to present the new methodology of automating the zspec reliability assessment. In particular, we would like to point out that they were obtained from first tests conducted on the VVDS spectroscopic data (end of 2016). Therefore, the table does not depict immutable results (on-going improvements). Future updates of the VVDS redshift reliability flags can be expected. (1 data file).

  3. Report: Unsupervised identification of malaria parasites using computer vision.

    PubMed

    Khan, Najeed Ahmed; Pervaz, Hassan; Latif, Arsalan; Musharaff, Ayesha

    2017-01-01

    Malaria in humans is a serious and fatal tropical disease. The disease is transmitted by Anopheles mosquitoes that are infected by Plasmodium species. The clinical diagnosis of malaria based on the history, symptoms and clinical findings must always be confirmed by laboratory diagnosis. Laboratory diagnosis of malaria involves identification of the malaria parasite or its antigen / products in the blood of the patient. Manual identification of the malaria parasite by pathologists has proven cumbersome. Therefore, there is a need for automatic, efficient and accurate identification of the malaria parasite. In this paper, we propose a computer vision based approach to identify the malaria parasite in light microscopy images. This research deals with the challenges involved in the automatic detection of malaria parasite tissues. Our proposed method is pixel-based. We used K-means clustering (an unsupervised approach) for segmentation to identify malaria parasite tissues.
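
Pixel-based K-means segmentation of the kind mentioned above can be sketched as follows. The abstract gives no implementation details, so this is only an assumed minimal form: pixels are grayscale intensities (real stained smear images would normally be in color), k = 2, and the function name `kmeans_1d` and the 4x4 image are made up for illustration.

```python
def kmeans_1d(values, k=2, n_iter=20):
    """Plain k-means on scalar intensities; returns (centers, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the initial centers evenly over the intensity range.
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: nearest center for every pixel.
        labels = [min(range(k), key=lambda i: abs(v - centers[i])) for v in values]
        # Update step: each center becomes the mean of its assigned pixels.
        for i in range(k):
            members = [v for v, lab in zip(values, labels) if lab == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = sum(members) / len(members)
    return centers, labels


# Made-up 4x4 grayscale image (0-255): a dark blob on a bright background.
image = [
    [200, 210, 205, 198],
    [202, 60, 55, 207],
    [199, 58, 62, 203],
    [204, 201, 206, 200],
]
pixels = [p for row in image for p in row]
centers, labels = kmeans_1d(pixels, k=2)

# Pixels assigned to the darker cluster form the candidate-object mask.
dark = centers.index(min(centers))
mask = [[labels[r * 4 + c] == dark for c in range(4)] for r in range(4)]
```

Clustering on intensity alone ignores spatial context, so a practical pipeline would likely need extra steps to separate parasites from other dark objects in the image.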

  4. Generalized Wishart Mixtures for Unsupervised Classification of PolSAR Data

    NASA Astrophysics Data System (ADS)

    Li, Lan; Chen, Erxue; Li, Zengyuan

    2013-01-01

    This paper presents an unsupervised clustering algorithm based upon the expectation maximization (EM) algorithm for finite mixture modelling, using the complex Wishart probability density function (PDF) for the probabilities. The mixture model makes it possible to consider heterogeneous thematic classes that are not well fitted by the unimodal Wishart distribution. To make the calculation fast and robust, we use the recently proposed generalized gamma distribution (GΓD) for the single polarization intensity data to make the initial partition. Then we use the Wishart probability density function for the corresponding sample covariance matrix to calculate the posterior class probabilities for each pixel. The posterior class probabilities are used for the prior probability estimates of each class and as weights for all class parameter updates. The proposed method is evaluated and compared with the Wishart H-Alpha-A classification. Preliminary results show that the proposed method has better performance.
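
For orientation, the generic EM updates for a finite mixture can be written as below. This is the textbook form, not the paper's exact formulation, and the notation is assumed here: N pixels with sample covariance matrices C_i, K classes with mixing proportions \pi_k and parameters \theta_k, and a component density p(C_i | \theta_k), which is the complex Wishart PDF in the paper and is not written out.

```latex
% E-step: posterior probability that pixel i belongs to class k
\gamma_{ik} = \frac{\pi_k \, p(C_i \mid \theta_k)}{\sum_{j=1}^{K} \pi_j \, p(C_i \mid \theta_j)}

% M-step: the posteriors give the new prior (mixing) estimates ...
\pi_k^{\text{new}} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}

% ... and act as weights in the class parameter update
\theta_k^{\text{new}} = \arg\max_{\theta_k} \sum_{i=1}^{N} \gamma_{ik} \, \log p(C_i \mid \theta_k)
```

The second and third lines correspond to the abstract's statement that the posterior class probabilities serve both as prior probability estimates and as weights for the parameter updates.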

  5. Characterizing Interference in Radio Astronomy Observations through Active and Unsupervised Learning

    NASA Technical Reports Server (NTRS)

    Doran, G.

    2013-01-01

    In the process of observing signals from astronomical sources, radio astronomers must mitigate the effects of manmade radio sources such as cell phones, satellites, aircraft, and observatory equipment. Radio frequency interference (RFI) often occurs as short bursts (< 1 ms) across a broad range of frequencies, and can be confused with signals from sources of interest such as pulsars. With ever-increasing volumes of data being produced by observatories, automated strategies are required to detect, classify, and characterize these short "transient" RFI events. We investigate an active learning approach in which an astronomer labels events that are most confusing to a classifier, minimizing the human effort required for classification. We also explore the use of unsupervised clustering techniques, which automatically group events into classes without user input. We apply these techniques to data from the Parkes Multibeam Pulsar Survey to characterize several million detected RFI events from over a thousand hours of observation.

  6. Unsupervised fuzzy segmentation of 3D magnetic resonance brain images

    NASA Astrophysics Data System (ADS)

    Velthuizen, Robert P.; Hall, Lawrence O.; Clarke, Laurence P.; Bensaid, Amine M.; Arrington, J. A.; Silbiger, Martin L.

    1993-07-01

    Unsupervised fuzzy methods are proposed for segmentation of 3D Magnetic Resonance images of the brain. Fuzzy c-means (FCM) has shown promising results for segmentation of single slices. FCM has been investigated for volume segmentations, both by combining results of single slices and by segmenting the full volume. Different strategies and initializations have been tried. In particular, two approaches have been used: (1) a method by which, iteratively, the furthest sample is split off to form a new cluster center, and (2) the traditional FCM in which the membership grade matrix is initialized in some way. Results have been compared with volume segmentations by k-means and with two supervised methods, k-nearest neighbors and region growing. Results of individual segmentations are presented as well as comparisons on the application of the different methods to a number of tumor patient data sets.
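
    For readers unfamiliar with fuzzy c-means, the sketch below shows the standard alternation between membership and center updates. It is a minimal illustration on toy two-dimensional points, not the segmentation pipeline from the abstract; the starting centers, the fuzzifier m = 2, and the function names are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def update_memberships(points, centers, m=2.0):
    """Membership grade of every point in every cluster; each row sums to 1."""
    exponent = 1.0 / (m - 1.0)  # applied to squared distances
    memberships = []
    for p in points:
        d2 = [sq_dist(p, c) for c in centers]
        if any(d == 0.0 for d in d2):
            # The point coincides with a center: full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in d2]
            memberships.append([h / sum(hits) for h in hits])
            continue
        inv = [(1.0 / d) ** exponent for d in d2]
        total = sum(inv)
        memberships.append([v / total for v in inv])
    return memberships

def update_centers(points, memberships, n_clusters, m=2.0):
    """Each center is the mean of all points weighted by membership**m."""
    dim = len(points[0])
    centers = []
    for i in range(n_clusters):
        weights = [row[i] ** m for row in memberships]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)))
    return centers

def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate the two updates for a fixed number of iterations."""
    for _ in range(iterations):
        memberships = update_memberships(points, centers, m)
        centers = update_centers(points, memberships, len(centers), m)
    return centers, update_memberships(points, centers, m)

# Toy feature vectors forming two groups (illustrative values only).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, memberships = fuzzy_c_means(points, centers=[(0.0, 0.0), (1.0, 1.0)])
```

    The two initialization strategies mentioned in the abstract differ only in how the first centers or the first membership matrix are obtained; the update loop itself is the same.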

  7. Signature extension: An approach to operational multispectral surveys

    NASA Technical Reports Server (NTRS)

    Nalepka, R. F.; Morgenstern, J. P.

    1973-01-01

    Two data processing techniques were suggested as applicable to the large area survey problem. One approach was to use unsupervised classification (clustering) techniques. Investigation of this method showed that since the method did nothing to reduce the signal variability, the use of this method would be very time consuming and possibly inaccurate as well. The conclusion is that unsupervised classification techniques of themselves are not a solution to the large area survey problem. The other method investigated was the use of signature extension techniques. Such techniques function by normalizing the data to some reference condition. Thus signatures from an isolated area could be used to process large quantities of data. In this manner, ground information requirements and computer training are minimized. Several signature extension techniques were tested. The best of these allowed signatures to be extended between data sets collected four days and 80 miles apart with an average accuracy of better than 90%.

  8. Multilayer Extreme Learning Machine With Subnetwork Nodes for Representation Learning.

    PubMed

    Yang, Yimin; Wu, Q M Jonathan

    2016-11-01

    The extreme learning machine (ELM), which was originally proposed for "generalized" single-hidden layer feedforward neural networks, provides efficient unified learning solutions for the applications of clustering, regression, and classification. It presents competitive accuracy with superb efficiency in many applications. However, the ELM with subnetwork nodes architecture has not attracted much research attention. Recently, many methods have been proposed for supervised/unsupervised dimension reduction or representation learning, but these methods normally only work for one type of problem. This paper studies the general architecture of multilayer ELM (ML-ELM) with subnetwork nodes, showing that: 1) the proposed method provides a representation learning platform with unsupervised/supervised and compressed/sparse representation learning and 2) experimental results on ten image datasets and 16 classification datasets show that the proposed ML-ELM with subnetwork nodes performs competitively with or much better than other conventional feature learning methods.

  9. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Churchill, R. Michael

    Apache Spark is explored as a tool for analyzing large data sets from the magnetic fusion simulation code XGCI. Implementation details of Apache Spark on the NERSC Edison supercomputer are discussed, including binary file reading, and parameter setup. Here, an unsupervised machine learning algorithm, k-means clustering, is applied to XGCI particle distribution function data, showing that highly turbulent spatial regions do not have common coherent structures, but rather broad, ring-like structures in velocity space.

  10. Implementation of novel statistical procedures and other advanced approaches to improve analysis of CASA data.

    PubMed

    Ramón, M; Martínez-Pastor, F

    2018-04-23

    Computer-aided sperm analysis (CASA) produces a wealth of data that is frequently ignored. The use of multiparametric statistical methods can help explore these datasets, unveiling the subpopulation structure of sperm samples. In this review we analyse the significance of the internal heterogeneity of sperm samples and its relevance. We also provide a brief description of the statistical tools used for extracting sperm subpopulations from the datasets, namely unsupervised clustering (with non-hierarchical, hierarchical and two-step methods) and the most advanced supervised methods, based on machine learning. The former has allowed exploration of subpopulation patterns in many species, whereas the latter offers further possibilities, especially for functional studies and the practical use of subpopulation analysis. We also consider novel approaches, such as the use of geometric morphometrics or imaging flow cytometry. Finally, although applying clustering analyses to the data provided by CASA systems yields valuable information on sperm samples, there are several caveats. Protocols for capturing and analysing motility or morphometry should be standardised and adapted to each experiment, and the algorithms should be open in order to allow comparison of results between laboratories. Moreover, we must be aware of new technology that could change the paradigm for studying sperm motility and morphology.

  11. Change detection and change monitoring of natural and man-made features in multispectral and hyperspectral satellite imagery

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moody, Daniela Irina

    An approach for land cover classification, seasonal and yearly change detection and monitoring, and identification of changes in man-made features may use a clustering of sparse approximations (CoSA) on sparse representations in learned dictionaries. A Hebbian learning rule may be used to build multispectral or hyperspectral, multiresolution dictionaries that are adapted to regional satellite image data. Sparse image representations of pixel patches over the learned dictionaries may be used to perform unsupervised k-means clustering into land cover categories. The clustering process behaves as a classifier in detecting real variability. This approach may combine spectral and spatial textural characteristics to detect geologic, vegetative, hydrologic, and man-made features, as well as changes in these features over time.

  12. The influence of unsupervised time on elementary school children at high risk for inattention and problem behaviors.

    PubMed

    Na, Kyoung-Sae; Lee, Soyoung Irene; Hong, Hyun Ju; Oh, Myoung-Ja; Bahn, Geon Ho; Ha, Kyunghee; Shin, Yun Mi; Song, Jungeun; Park, Eun Jin; Yoo, Heejung; Kim, Hyunsoo; Kyung, Yun-Mi

    2014-06-01

    In the last few decades, changing socioeconomic and family structures have increasingly left children alone without adult supervision. Carefully prepared and limited periods of unsupervised time are not harmful for children. However, long unsupervised periods have harmful effects, particularly for those children at high risk for inattention and problem behaviors. In this study, we examined the influence of unsupervised time on behavior problems by studying a sample of elementary school children at high risk for inattention and problem behaviors. The study analyzed data from the Children's Mental Health Promotion Project, which was conducted in collaboration with education, government, and mental health professionals. The child behavior checklist (CBCL) was administered to assess problem behaviors among first- and fourth-grade children. Multivariate logistic regression analysis was used to evaluate the influence of unsupervised time on children's behavior. A total of 3,270 elementary school children (1,340 first-graders and 1,930 fourth-graders) were available for this study; 1,876 of the 3,270 children (57.4%) reportedly spent a significant amount of time unsupervised during the day. Unsupervised time exceeding 2 h per day increased the risk of delinquency, aggressive behaviors, and somatic complaints, as well as externalizing and internalizing problems. Carefully planned afterschool programming and care should be provided to children at high risk for inattention and problem behaviors. Also, a more comprehensive approach is needed to identify the possible mechanisms by which unsupervised time aggravates behavior problems in children predisposed to these behaviors. Copyright © 2013 Elsevier Ltd. All rights reserved.

  13. DISCRIMINATION OF GRANITOIDS AND MINERALIZED GRANITOIDS IN THE MIDYAN REGION, NORTHWESTERN ARABIAN SHIELD, SAUDI ARABIA, BY LANDSAT MSS DATA-ANALYSIS.

    USGS Publications Warehouse

    Davis, Philip A.; Grolier, Maurice J.

    1984-01-01

    Landsat multispectral scanner (MSS) band and band-ratio databases of two scenes covering the Midyan region of northwestern Saudi Arabia were examined quantitatively and qualitatively to determine which databases best discriminate the geologic units of this semi-arid and arid region. Unsupervised, linear-discriminant cluster-analysis was performed on these two band-ratio combinations and on the MSS bands for both scenes. The results for granitoid-rock discrimination indicated that the classification images using the MSS bands are superior to the band-ratio classification images for two reasons, discussed in the paper. Yet, the effects of topography and material type (including desert varnish) on the MSS-band data produced ambiguities in the MSS-band classification results. However, these ambiguities were clarified by using a simulated natural-color image in conjunction with the MSS-band classification image.

  14. Deep Unsupervised Learning on a Desktop PC: A Primer for Cognitive Scientists.

    PubMed

    Testolin, Alberto; Stoianov, Ivilin; De Filippo De Grazia, Michele; Zorzi, Marco

    2013-01-01

    Deep belief networks hold great promise for the simulation of human cognition because they show how structured and abstract representations may emerge from probabilistic unsupervised learning. These networks build a hierarchy of progressively more complex distributed representations of the sensory data by fitting a hierarchical generative model. However, learning in deep networks typically requires big datasets and it can involve millions of connection weights, which implies that simulations on standard computers are unfeasible. Developing realistic, medium-to-large-scale learning models of cognition would therefore seem to require expertise in programing parallel-computing hardware, and this might explain why the use of this promising approach is still largely confined to the machine learning community. Here we show how simulations of deep unsupervised learning can be easily performed on a desktop PC by exploiting the processors of low cost graphic cards (graphic processor units) without any specific programing effort, thanks to the use of high-level programming routines (available in MATLAB or Python). We also show that even an entry-level graphic card can outperform a small high-performance computing cluster in terms of learning time and with no loss of learning quality. We therefore conclude that graphic card implementations pave the way for a widespread use of deep learning among cognitive scientists for modeling cognition and behavior.

  15. Deep Unsupervised Learning on a Desktop PC: A Primer for Cognitive Scientists

    PubMed Central

    Testolin, Alberto; Stoianov, Ivilin; De Filippo De Grazia, Michele; Zorzi, Marco

    2013-01-01

    Deep belief networks hold great promise for the simulation of human cognition because they show how structured and abstract representations may emerge from probabilistic unsupervised learning. These networks build a hierarchy of progressively more complex distributed representations of the sensory data by fitting a hierarchical generative model. However, learning in deep networks typically requires big datasets and it can involve millions of connection weights, which implies that simulations on standard computers are unfeasible. Developing realistic, medium-to-large-scale learning models of cognition would therefore seem to require expertise in programing parallel-computing hardware, and this might explain why the use of this promising approach is still largely confined to the machine learning community. Here we show how simulations of deep unsupervised learning can be easily performed on a desktop PC by exploiting the processors of low cost graphic cards (graphic processor units) without any specific programing effort, thanks to the use of high-level programming routines (available in MATLAB or Python). We also show that even an entry-level graphic card can outperform a small high-performance computing cluster in terms of learning time and with no loss of learning quality. We therefore conclude that graphic card implementations pave the way for a widespread use of deep learning among cognitive scientists for modeling cognition and behavior. PMID:23653617

  16. Canonical PSO Based K-Means Clustering Approach for Real Datasets.

    PubMed

    Dey, Lopamudra; Chakraborty, Sanjay

    2014-01-01

    Clustering is a technique whose significance and applications are spread over various fields. Clustering is an unsupervised process in data mining, which is why the proper evaluation of the results and the measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indices are used to solve different types of problems, and index selection depends on the kind of available data. This paper first proposes a Canonical PSO based K-means clustering algorithm, then analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on real-time air pollution, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performance of these clustering algorithms according to the validity assessment. It also identifies which of these algorithms is most desirable for producing properly compact clusters on these particular real-life datasets. It deals with the behaviour of these clustering algorithms with respect to validation indices and presents their evaluation results in mathematical and graphical forms.
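
    The compactness and separability measures mentioned in this abstract can be illustrated with a short sketch. The abstract does not give the exact index definitions used in the paper, so the two functions below are assumed, commonly used forms (mean point-to-centroid distance and minimum centroid-to-centroid distance), and their names are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def intracluster_distance(clusters):
    """Mean distance from each point to its own centroid (lower = more compact)."""
    total, count = 0.0, 0
    for pts in clusters:
        c = centroid(pts)
        total += sum(math.dist(p, c) for p in pts)
        count += len(pts)
    return total / count

def intercluster_distance(clusters):
    """Smallest distance between any two centroids (higher = better separated)."""
    cents = [centroid(pts) for pts in clusters]
    return min(math.dist(a, b) for i, a in enumerate(cents) for b in cents[i + 1:])

# Two toy clusters (illustrative values only).
clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
intra = intracluster_distance(clusters)
inter = intercluster_distance(clusters)
```

    Comparing algorithms on the same dataset then amounts to computing both values for each algorithm's partition and preferring low intracluster and high intercluster distance.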

  17. Applying reconfigurable hardware to the analysis of multispectral and hyperspectral imagery

    NASA Astrophysics Data System (ADS)

    Leeser, Miriam E.; Belanovic, Pavle; Estlick, Michael; Gokhale, Maya; Szymanski, John J.; Theiler, James P.

    2002-01-01

    Unsupervised clustering is a powerful technique for processing multispectral and hyperspectral images. Last year, we reported on an implementation of k-means clustering for multispectral images. Our implementation in reconfigurable hardware processed 10 channel multispectral images two orders of magnitude faster than a software implementation of the same algorithm. The advantage of using reconfigurable hardware to accelerate k-means clustering is clear; the disadvantage is the hardware implementation worked for one specific dataset. It is a non-trivial task to change this implementation to handle a dataset with different number of spectral channels, bits per spectral channel, or number of pixels; or to change the number of clusters. These changes required knowledge of the hardware design process and could take several days of a designer's time. Since multispectral data sets come in many shapes and sizes, being able to easily change the k-means implementation for these different data sets is important. For this reason, we have developed a parameterized implementation of the k-means algorithm. Our design is parameterized by the number of pixels in an image, the number of channels per pixel, and the number of bits per channel as well as the number of clusters. These parameters can easily be changed in a few minutes by someone not familiar with the design process. The resulting implementation is very close in performance to the original hardware implementation. It has the added advantage that the parameterized design compiles approximately three times faster than the original.
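
    The parameters named in this abstract (number of pixels, channels per pixel, and clusters) map directly onto the inputs of a software k-means. The sketch below is a plain software illustration under that reading, not the hardware design: the evenly spaced initialization and the function name are assumptions, and bits per channel has no counterpart here because Python numbers are not fixed-width.

```python
def kmeans(pixels, n_clusters, n_iterations=10):
    """Plain k-means over pixels, each a tuple of channel values.

    Image size and channel count are read from the data rather than fixed.
    """
    if len(pixels) < n_clusters:
        raise ValueError("need at least as many pixels as clusters")
    n_channels = len(pixels[0])
    # Assumed initialization: evenly spaced input pixels as starting centers.
    step = len(pixels) // n_clusters
    centers = [pixels[i * step] for i in range(n_clusters)]
    labels = [0] * len(pixels)
    for _ in range(n_iterations):
        # Assignment: nearest center by squared Euclidean distance.
        for idx, px in enumerate(pixels):
            labels[idx] = min(
                range(n_clusters),
                key=lambda c: sum((px[ch] - centers[c][ch]) ** 2
                                  for ch in range(n_channels)))
        # Update: each center becomes the mean of its assigned pixels.
        for c in range(n_clusters):
            members = [px for px, lab in zip(pixels, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(
                    sum(px[ch] for px in members) / len(members)
                    for ch in range(n_channels))
    return labels, centers

# Six toy pixels with three channels each (illustrative values only).
pixels = [(10, 12, 11), (11, 10, 12), (200, 198, 205),
          (202, 201, 199), (12, 11, 10), (199, 203, 200)]
labels, centers = kmeans(pixels, n_clusters=2)
```

    Changing the number of clusters or feeding pixels with a different channel count requires no code change in this sketch, which is the kind of flexibility the parameterized design aims for.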

  18. Spatiotemporal Analysis of Corn Phenoregions in the Continental United States

    NASA Astrophysics Data System (ADS)

    Konduri, V. S.; Kumar, J.; Hoffman, F. M.; Ganguly, A. R.; Hargrove, W. W.

    2017-12-01

    The delineation of regions exhibiting similar crop performance has potential benefits for agricultural planning and management, policymaking and natural resource conservation. Studies of natural ecosystems have used multivariate clustering algorithms based on environmental characteristics to identify ecoregions for species range prediction and habitat conservation. However, few studies have used clustering to delineate regions based on crop phenology. The aim of this study was to perform a spatiotemporal analysis of phenologically self-similar clusters, or phenoregions, for the major corn growing areas in the Continental United States (CONUS) for the period 2008-2016. Annual trajectories of remotely sensed normalized difference vegetation index (NDVI), a useful proxy for land surface phenology, derived from Moderate Resolution Imaging Spectroradiometer (MODIS) instruments at 8-day intervals and 250 m resolution, were used as the phenological metric. Because of the large data volumes involved, the phenoregion delineation was performed using a highly scalable, unsupervised clustering technique with the help of high performance computing. These phenoregions capture the spatial variability in the timing of important crop phenological stages (like emergence and maturity dates) and thus could be used to develop more accurate parameterizations for crop models applied at regional to global scales. Moreover, historical crop performance from phenoregions, in combination with climate and soils data, could be used to improve production forecasts. The temporal variability in NDVI at each location could also be used to develop an early warning system to identify locations where the crop deviates from its expected phenological behavior. Such deviations may indicate a need for irrigation or fertilization or suggest where pest outbreaks or other disturbances have occurred.
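
    For reference, NDVI is computed from near-infrared and red reflectance as (NIR - Red) / (NIR + Red). The sketch below builds a short annual trajectory of the kind that would feed a clustering step; the reflectance values are made up for illustration, and returning 0 for a pixel with no signal is an assumed convention.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        return 0.0  # assumed convention for pixels with no signal
    return (nir - red) / total

# Hypothetical (NIR, red) reflectance pairs for one location over a season.
observations = [(0.20, 0.15), (0.35, 0.10), (0.50, 0.10), (0.30, 0.12)]
trajectory = [ndvi(nir, red) for nir, red in observations]
```

    Each location's trajectory is one feature vector, so locations with similar seasonal curves fall into the same cluster.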

  19. Macula segmentation and fovea localization employing image processing and heuristic based clustering for automated retinal screening.

    PubMed

    R, GeethaRamani; Balasubramanian, Lakshmi

    2018-07-01

    Macula segmentation and fovea localization are among the primary tasks in retinal analysis, as the macula and fovea are responsible for detailed vision. Existing approaches require segmentation of other retinal structures, viz. the optic disc and blood vessels, for this purpose. This work avoids relying on knowledge of other retinal structures and applies data mining techniques to segment the macula. An unsupervised clustering algorithm is exploited for this purpose. Selection of initial cluster centres has a great impact on the performance of clustering algorithms. A heuristic based clustering in which initial centres are selected based on measures defining the statistical distribution of the data is incorporated in the proposed methodology. The initial phase of the proposed framework includes image cropping, green channel extraction, contrast enhancement and application of mathematical closing. Then, the pre-processed image is subjected to heuristic based clustering, yielding a binary map. The binary image is post-processed to eliminate unwanted components. Finally, the component with the minimum intensity is taken as the macula and its centre constitutes the fovea. The proposed approach outperforms existing works, with 100% of HRF, 100% of DRIVE, 96.92% of DIARETDB0, 97.75% of DIARETDB1, 98.81% of HEI-MED, 90% of STARE and 99.33% of MESSIDOR images satisfying the 1R criterion, a standard adopted for evaluating performance of macula and fovea identification. The proposed system thus helps ophthalmologists identify the macula, making it easier to determine whether any abnormality is present within the macula region. Copyright © 2018 Elsevier B.V. All rights reserved.

  20. Low-dimensional dynamical characterization of human performance of cancer patients using motion data.

    PubMed

    Hasnain, Zaki; Li, Ming; Dorff, Tanya; Quinn, David; Ueno, Naoto T; Yennu, Sriram; Kolatkar, Anand; Shahabi, Cyrus; Nocera, Luciano; Nieva, Jorge; Kuhn, Peter; Newton, Paul K

    2018-05-18

    Biomechanical characterization of human performance with respect to fatigue and fitness is relevant in many settings, but is usually limited to either fully qualitative assessments or invasive methods which require a significant experimental setup consisting of numerous sensors, force plates, and motion detectors. Qualitative assessments are difficult to standardize due to their intrinsically subjective nature; invasive methods, on the other hand, provide reliable metrics but are not feasible for large scale applications. Presented here is a dynamical toolset for detecting performance groups using a non-invasive system based on the Microsoft Kinect motion capture sensor, and a case study of 37 cancer patients performing two clinically monitored tasks before and after therapy regimens. Dynamical features are extracted from the motion time series data and evaluated based on their ability to i) cluster patients into coherent fitness groups using unsupervised learning algorithms and to ii) predict Eastern Cooperative Oncology Group performance status via supervised learning. The unsupervised patient clustering is comparable to clustering based on physician assigned Eastern Cooperative Oncology Group status in that they both have similar concordance with change in weight before and after therapy as well as unexpected hospitalizations throughout the study. The extracted dynamical features can predict physician, coordinator, and patient Eastern Cooperative Oncology Group status with an accuracy of approximately 80%. The non-invasive Microsoft Kinect sensor and the proposed dynamical toolset, comprising data preprocessing, feature extraction, dimensionality reduction, and machine learning, offer a low-cost and general method for performance segregation and can complement existing qualitative clinical assessments. Copyright © 2018 Elsevier Ltd. All rights reserved.

  1. Canonical PSO Based K-Means Clustering Approach for Real Datasets

    PubMed Central

    Dey, Lopamudra; Chakraborty, Sanjay

    2014-01-01

    Clustering is a technique whose significance and applications are spread over various fields. Clustering is an unsupervised process in data mining, which is why the proper evaluation of the results and the measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indices are used to solve different types of problems, and index selection depends on the kind of available data. This paper first proposes a Canonical PSO based K-means clustering algorithm, then analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on real-time air pollution, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performance of these clustering algorithms according to the validity assessment. It also identifies which of these algorithms is most desirable for producing properly compact clusters on these particular real-life datasets. It deals with the behaviour of these clustering algorithms with respect to validation indices and presents their evaluation results in mathematical and graphical forms. PMID:27355083

  2. Analysis of the Tanana River Basin using LANDSAT data

    NASA Technical Reports Server (NTRS)

    Morrissey, L. A.; Ambrosia, V. G.; Carson-Henry, C.

    1981-01-01

    Digital image classification techniques were used to classify land cover/resource information in the Tanana River Basin of Alaska. Portions of four scenes of LANDSAT digital data were analyzed using computer systems at Ames Research Center in an unsupervised approach to derive cluster statistics. The spectral classes were identified using the IDIMS display and color infrared photography. Classification errors were corrected using stratification procedures. The classification scheme resulted in the following eleven categories; sedimented/shallow water, clear/deep water, coniferous forest, mixed forest, deciduous forest, shrub and grass, bog, alpine tundra, barrens, snow and ice, and cultural features. Color coded maps and acreage summaries of the major land cover categories were generated for selected USGS quadrangles (1:250,000) which lie within the drainage basin. The project was completed within six months.

  3. Multiple Imputation based Clustering Validation (MIV) for Big Longitudinal Trial Data with Missing Values in eHealth.

    PubMed

    Zhang, Zhaoyang; Fang, Hua; Wang, Honggang

    2016-06-01

    Web-delivered trials are an important component in eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high dimensional, with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Built upon our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we proposed a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services.
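
    The Xie and Beni index named in this abstract divides the membership-weighted within-cluster scatter by the number of points times the smallest squared distance between cluster centers, so lower values indicate compact, well separated clusters. The sketch below computes the index for a single complete dataset; the multiple imputation step of the MIV framework is not reproduced, and the function name and toy values are hypothetical.

```python
def xie_beni(points, centers, memberships, m=2.0):
    """Xie-Beni validity index for a fuzzy partition (lower is better).

    memberships[k][i] is the grade of point k in cluster i.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Fuzzy within-cluster scatter.
    compactness = sum(
        memberships[k][i] ** m * sq_dist(points[k], centers[i])
        for k in range(len(points)) for i in range(len(centers)))
    # Smallest squared distance between two distinct centers.
    separation = min(
        sq_dist(centers[i], centers[j])
        for i in range(len(centers)) for j in range(i + 1, len(centers)))
    return compactness / (len(points) * separation)

# Toy 1-D example with crisp memberships (illustrative values only).
points = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers = [(0.5,), (9.5,)]
memberships = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
xb = xie_beni(points, centers, memberships)
```

    To choose the number of clusters with this index, one would fit the fuzzy clustering for each candidate count and keep the count with the lowest value.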

  4. Multiple Imputation based Clustering Validation (MIV) for Big Longitudinal Trial Data with Missing Values in eHealth

    PubMed Central

    Zhang, Zhaoyang; Wang, Honggang

    2016-01-01

    Web-delivered trials are an important component in eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal, high dimensional with missing values. Unsupervised learning methods have been widely applied in this area, however, validating the optimal number of clusters has been challenging. Built upon our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we proposed a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate MI-based Xie and Beni index for fuzzy-clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services. PMID:27126063

  5. A Fast Projection-Based Algorithm for Clustering Big Data.

    PubMed

    Wu, Yun; He, Zhiquan; Lin, Hao; Zheng, Yufei; Zhang, Jingfen; Xu, Dong

    2018-06-07

    With the fast development of various techniques, more and more data have been accumulated with the unique properties of large size (tall) and high dimension (wide). The era of big data is coming. How to understand and discover new knowledge from these data has attracted more and more scholars' attention and has become the most important task in data mining. As one of the most important techniques in data mining, clustering analysis, a kind of unsupervised learning, can group a set of data into objectives (clusters) that are meaningful, useful, or both. Thus, the technique has played a very important role in knowledge discovery in big data. However, when facing large-sized and high-dimensional data, most current clustering methods exhibit poor computational efficiency and high demands on computational resources, which prevents us from clarifying the intrinsic properties of the data and discovering the new knowledge behind them. Based on this consideration, we developed a powerful clustering method, called MUFOLD-CL. The principle of the method is to project the data points to the centroid, and then to measure the similarity between any two points by calculating their projections on the centroid. The proposed method could achieve linear time complexity with respect to the sample size. Comparison with the K-Means method on very large data showed that our method could produce better accuracy and require less computational time, demonstrating that MUFOLD-CL can serve as a valuable tool, or at least play a complementary role to other existing methods, for big data clustering. Further comparisons with state-of-the-art clustering methods on smaller datasets showed that our method was fastest and achieved comparable accuracy. For the convenience of most scholars, a free software package was constructed.

  6. Best friends' interactions and substance use: The role of friend pressure and unsupervised co-deviancy.

    PubMed

    Tsakpinoglou, Florence; Poulin, François

    2017-10-01

    Best friends exert a substantial influence on rising alcohol and marijuana use during adolescence. Two mechanisms occurring within friendship - friend pressure and unsupervised co-deviancy - may partially capture the way friends influence one another. The current study aims to: (1) examine the psychometric properties of a new instrument designed to assess pressure from a youth's best friend and unsupervised co-deviancy; (2) investigate the relative contribution of these processes to alcohol and marijuana use; and (3) determine whether gender moderates these associations. Data were collected through self-report questionnaires completed by 294 Canadian youths (62% female) across two time points (ages 15-16). Principal component analysis yielded a two-factor solution corresponding to friend pressure and unsupervised co-deviancy. Logistic regressions subsequently showed that unsupervised co-deviancy was predictive of an increase in marijuana use one year later. Neither process predicted an increase in alcohol use. Results did not differ as a function of gender. Copyright © 2017 The Foundation for Professionals in Services for Adolescents. Published by Elsevier Ltd. All rights reserved.

  7. [Analysis on traditional Chinese medicine prescriptions treating cancer based on traditional Chinese medicine inheritance assistance system and discovery of new prescriptions].

    PubMed

    Yu, Ming; Cao, Qi-chen; Su, Yu-xi; Sui, Xin; Yang, Hong-jun; Huang, Lu-qi; Wang, Wen-ping

    2015-08-01

    Malignant tumors are one of the main causes of death in the world at present, as well as a major disease seriously harming human health and life and restricting social and economic development. There are many reports on traditional Chinese medicine patent prescriptions, empirical prescriptions and self-made prescriptions for treating cancer, and prescription rules have often been analyzed based on medication frequency. Such methods are suitable for discovering dominant experience but rarely yield innovative discoveries or new knowledge. In this paper, based on the traditional Chinese medicine inheritance assistance system, software integrating the improved mutual information method, complex system entropy clustering and unsupervised entropy-level clustering data mining methods was adopted to analyze the rules of traditional Chinese medicine prescriptions for cancer. A total of 114 prescriptions were selected, the frequency of herbs in the prescriptions was determined, and 85 core combinations and 13 new prescriptions were identified. The traditional Chinese medicine inheritance assistance system, as a valuable traditional Chinese medicine research-supporting tool, can be used to record, manage, query and analyze prescription data.

  8. Population analysis of the cingulum bundle using the tubular surface model for schizophrenia detection

    NASA Astrophysics Data System (ADS)

    Mohan, Vandana; Sundaramoorthi, Ganesh; Kubicki, Marek; Terry, Douglas; Tannenbaum, Allen

    2010-03-01

    We propose a novel framework for population analysis of DW-MRI data using the Tubular Surface Model. We focus on the Cingulum Bundle (CB) - a major tract for the Limbic System and the main connection of the Cingulate Gyrus, which has been associated with several aspects of Schizophrenia symptomatology. The Tubular Surface Model represents a tubular surface as a center-line with an associated radius function. It provides a natural way to sample statistics along the length of the fiber bundle and reduces the registration of fiber bundle surfaces to that of 4D curves. We apply our framework to a population of 20 subjects (10 normal, 10 schizophrenic) and obtain excellent results with neural network based classification (90% sensitivity, 95% specificity) as well as unsupervised clustering (k-means). Further, we apply statistical analysis to the feature data and characterize the discrimination ability of local regions of the CB, as a step towards localizing CB regions most relevant to Schizophrenia.

  9. Unsupervised Learning —A Novel Clustering Method for Rolling Bearing Faults Identification

    NASA Astrophysics Data System (ADS)

    Kai, Li; Bo, Luo; Tao, Ma; Xuefeng, Yang; Guangming, Wang

    2017-12-01

    To promptly process massive fault data and automatically provide accurate diagnosis results, numerous studies have been conducted on intelligent fault diagnosis of rolling bearings. In these studies, supervised learning methods such as artificial neural networks, support vector machines and decision trees are commonly used. These methods can detect rolling bearing failures effectively, but achieving better detection results often requires a large number of training samples. To address this, a novel clustering method is proposed in this paper. The method is able to find the correct number of clusters automatically. Its effectiveness is validated using datasets from rolling element bearings. The diagnosis results show that the proposed method can accurately detect the fault types of small samples. The diagnosis results also show relatively high accuracy even for massive samples.

  10. Leveraging unsupervised training sets for multi-scale compartmentalization in renal pathology

    NASA Astrophysics Data System (ADS)

    Lutnick, Brendon; Tomaszewski, John E.; Sarder, Pinaki

    2017-03-01

    Clinical pathology relies on manual compartmentalization and quantification of biological structures, which is time consuming and often error-prone. Application of computer vision segmentation algorithms to histopathological image analysis, in contrast, can offer fast, reproducible, and accurate quantitative analysis to aid pathologists. Algorithms tunable to different biologically relevant structures can allow accurate, precise, and reproducible estimates of disease states. In this direction, we have developed a fast, unsupervised computational method for simultaneously separating all biologically relevant structures from histopathological images in multi-scale. Segmentation is achieved by solving an energy optimization problem. Representing the image as a graph, nodes (pixels) are grouped by minimizing a Potts model Hamiltonian, adopted from theoretical physics, modeling interacting electron spins. Pixel relationships (modeled as edges) are used to update the energy of the partitioned graph. By iteratively improving the clustering, the optimal number of segments is revealed. To reduce computational time, the graph is simplified using a Cantor pairing function to intelligently reduce the number of included nodes. The classified nodes are then used to train a multiclass support vector machine to apply the segmentation over the full image. Accurate segmentations of images with as many as 10^6 pixels can be completed in only 5 sec, allowing for attainable multi-scale visualization. To establish clinical potential, we employed our method in renal biopsies to quantitatively visualize for the first time scale variant compartments of heterogeneous intra- and extraglomerular structures simultaneously. Implications of the utility of our method extend to fields such as oncology, genomics, and non-biological problems.
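
    The Cantor pairing function mentioned in this abstract has a standard closed form, sketched below in Python. Only the function itself is shown; the abstract does not say how it is applied to the pixel graph, so the function names `cantor_pair` and `cantor_unpair` and the idea of encoding a pair of non-negative integers (for example two pixel attributes) as a single key are assumptions for illustration.

```python
import math
from typing import Tuple


def cantor_pair(a: int, b: int) -> int:
    """Map a pair of non-negative integers to one unique non-negative integer."""
    if a < 0 or b < 0:
        raise ValueError("Cantor pairing is defined for non-negative integers")
    s = a + b
    return s * (s + 1) // 2 + b


def cantor_unpair(z: int) -> Tuple[int, int]:
    """Invert cantor_pair, recovering the original pair (a, b)."""
    if z < 0:
        raise ValueError("z must be non-negative")
    w = (math.isqrt(8 * z + 1) - 1) // 2  # index of the diagonal containing z
    b = z - w * (w + 1) // 2
    return w - b, b


# Hypothetical use: pixels sharing the same pair of values share one key,
# so they could be collapsed into a single graph node.
print(cantor_pair(1, 1))
```

    Because the mapping is a bijection, distinct pairs never collide, which is what makes it usable as a compact grouping key in a sketch like this one.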

  11. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome.

    PubMed

    Tothill, Richard W; Tinker, Anna V; George, Joshy; Brown, Robert; Fox, Stephen B; Lade, Stephen; Johnson, Daryl S; Trivett, Melanie K; Etemadmoghadam, Dariush; Locandro, Bianca; Traficante, Nadia; Fereday, Sian; Hung, Jillian A; Chiew, Yoke-Eng; Haviv, Izhak; Gertig, Dorota; DeFazio, Anna; Bowtell, David D L

    2008-08-15

    This study aimed to identify novel molecular subtypes of ovarian cancer by gene expression profiling with linkage to clinical and pathologic features. Microarray gene expression profiling was done on 285 serous and endometrioid tumors of the ovary, peritoneum, and fallopian tube. K-means clustering was applied to identify robust molecular subtypes. Statistical analysis identified differentially expressed genes, pathways, and gene ontologies. Laser capture microdissection, pathology review, and immunohistochemistry validated the array-based findings. Patient survival within k-means groups was evaluated using Cox proportional hazards models. Class prediction validated k-means groups in an independent dataset. A semisupervised survival analysis of the array data was used to compare against unsupervised clustering results. Optimal clustering of array data identified six molecular subtypes. Two subtypes represented predominantly serous low malignant potential and low-grade endometrioid subtypes, respectively. The remaining four subtypes represented higher grade and advanced stage cancers of serous and endometrioid morphology. A novel subtype of high-grade serous cancers reflected a mesenchymal cell type, characterized by overexpression of N-cadherin and P-cadherin and low expression of differentiation markers, including CA125 and MUC1. A poor prognosis subtype was defined by a reactive stroma gene expression signature, correlating with extensive desmoplasia in such samples. A similar poor prognosis signature could be found using a semisupervised analysis. Each subtype displayed distinct levels and patterns of immune cell infiltration. Class prediction identified similar subtypes in an independent ovarian dataset with similar prognostic trends. Gene expression profiling identified molecular subtypes of ovarian cancer of biological and clinical importance.

  12. MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering

    PubMed Central

    Kim, Eun-Youn; Kim, Seon-Young; Ashlock, Daniel; Nam, Dougu

    2009-01-01

    Background Uncovering subtypes of disease from microarray samples has important clinical implications such as survival time and sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on clusters with compact shapes and do not reflect the geometric complexity of the high dimensional microarray clusters, which limits their performance. Results We present a cluster-number-based ensemble clustering algorithm, called MULTI-K, for microarray sample classification, which demonstrates remarkable accuracy. The method amalgamates multiple k-means runs by varying the number of clusters and identifies clusters that manifest the most robust co-memberships of elements. In addition to the original algorithm, we newly devised the entropy-plot to control the separation of singletons or small clusters. MULTI-K, unlike the simple k-means or other widely used methods, was able to capture clusters with complex and high-dimensional structures accurately. MULTI-K outperformed other methods including a recently developed ensemble clustering algorithm in tests with five simulated and eight real gene-expression data sets. Conclusion The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering applied to the number of clusters tackles the problem very well. The C++ code and the data sets tested are available from the authors. PMID:19698124
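
    The ensemble idea described in this abstract (running k-means for several cluster numbers and keeping groups whose elements are robustly co-clustered) can be sketched in Python as follows. This is not the authors' C++ implementation and omits their entropy-plot. The function names, the co-membership threshold and the connected-components grouping are all assumptions chosen to keep the example short.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def _dist2(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: Sequence[Point], k: int, rng: random.Random, iters: int = 50) -> List[int]:
    """Plain Lloyd's k-means with random initial centers; returns one label per point."""
    centers = [list(p) for p in rng.sample(list(points), k)]
    labels: List[int] = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: _dist2(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def co_membership(points: Sequence[Point], ks: Sequence[int],
                  runs_per_k: int = 5, seed: int = 0) -> List[List[float]]:
    """Fraction of k-means runs (over all k in ks) in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for k in ks:
        for _ in range(runs_per_k):
            labels = kmeans(points, k, rng)
            total += 1
            for i in range(n):
                for j in range(n):
                    if labels[i] == labels[j]:
                        counts[i][j] += 1
    return [[c / total for c in row] for row in counts]


def robust_clusters(co: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Assumed grouping rule: connected components of pairs with co-membership >= threshold."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy data: two small groups of points, clustered with k = 2 and k = 3.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(robust_clusters(co_membership(toy, ks=[2, 3])))
```

    Varying k rather than only the initialization means that a pair of samples keeps a high co-membership score only if it stays together at several resolutions, which is the intuition the abstract gives for robustness.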

  13. Analytic Steering: Inserting Context into the Information Dialog

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bohn, Shawn J.; Calapristi, Augustin J.; Brown, Shyretha D.

    2011-10-23

    An analyst’s intrinsic domain knowledge is a primary asset in almost any analysis task. Unstructured text analysis systems that apply unsupervised content analysis approaches can be more effective if they can leverage this domain knowledge in a manner that augments the information discovery process without obfuscating new or unexpected content. Current unsupervised approaches rely upon the prowess of the analyst to submit the right queries or observe generalized document and term relationships from ranked or visual results. We propose a new approach which allows the user to control or steer the analytic view within the unsupervised space. This process is controlled through the data characterization process via user supplied context in the form of a collection of key terms. We show that steering with an appropriate choice of key terms can provide better relevance to the analytic domain and still enable the analyst to uncover unexpected relationships; this paper discusses cases where various analytic steering approaches can provide enhanced analysis results and cases where analytic steering can have a negative impact on the analysis process.

  14. TU-CD-BRB-12: Radiogenomics of MRI-Guided Prostate Cancer Biopsy Habitats

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Stoyanova, R; Lynne, C; Abraham, S

    2015-06-15

    Purpose: Diagnostic prostate biopsies are subject to sampling bias. We hypothesize that quantitative imaging with multiparametric (MP)-MRI can more accurately direct targeted biopsies to index lesions associated with highest risk clinical and genomic features. Methods: Regionally distinct prostate habitats were delineated on MP-MRI (T2-weighted, perfusion and diffusion imaging). Directed biopsies were performed on 17 habitats from 6 patients using MRI-ultrasound fusion. Biopsy location was characterized with 52 radiographic features. Transcriptome-wide analysis of 1.4 million RNA probes was performed on RNA from each habitat. Genomics features with insignificant expression values (<0.25) and interquartile range <0.5 were filtered, leaving a total of 212 genes. Correlation between imaging features, genes and a 22 feature genomic classifier (GC), developed as a prognostic assay for metastasis after radical prostatectomy, was investigated. Results: High quality genomic data was derived from 17 (100%) biopsies. Using the 212 ‘unbiased’ genes, the samples clustered by patient origin in unsupervised analysis. When only prostate cancer related genomic features were used, hierarchical clustering revealed samples clustered by needle-biopsy Gleason score (GS). Similarly, principal component analysis of the imaging features found that the primary source of variance segregated the samples into high (≥7) and low (6) GS. Pearson’s correlation analysis of genes with significant expression showed two main patterns of gene expression clustering prostate peripheral and transitional zone MRI features. Two-way hierarchical clustering of GC with radiomics features resulted in the expected groupings of high and low expressed genes in this metastasis signature. Conclusions: MP-MRI-targeted diagnostic biopsies can potentially improve risk stratification by directing pathological and genomic analysis to clinically significant index lesions. As determinant lesions are more reliably identified, targeting with radiotherapy should improve outcome. This is the first demonstration of a link between quantitative imaging features (radiomics) with genomic features in MRI-directed prostate biopsies. The research was supported by NIH-NCI R01 CA 189295 and R01 CA 189295; E Davicioni is partial owner of GenomeDx Biosciences, Inc. M Takhar, N Erho, L Lam, C Buerki and E Davicioni are current employees at GenomeDx Biosciences, Inc.

  15. Bacterial biofilm composition in caries and caries-free subjects.

    PubMed

    Wolff, D; Frese, C; Maier-Kraus, T; Krueger, T; Wolff, B

    2013-01-01

    Certain major pathogens such as Streptococcus mutans, Lactobacillus spp. and others have been reported to be involved in caries initiation and progression. Yet, in addition to those leading pathogens, microbial communities seem to be much more diverse and individually differing. The aim of this study, therefore, was to analyze the bacterial composition of carious dentin and the plaque of caries-free patients by using a custom-made, real-time quantitative polymerase chain reaction assay (RQ-PCR). The study included 26 patients with caries and 28 caries-free controls. Decayed tooth substance and plaque samples were harvested. Bacterial DNA was extracted and tested for the presence of 43 bacterial species or species groups using RQ-PCR. Relative quantification revealed that Propionibacterium acidifaciens was significantly more abundant in caries samples than were other microorganisms (fold change 169.12, p = 0.023). In the caries-free samples, typical health-associated species were significantly more prevalent. Unsupervised hierarchical cluster analysis showed a high abundance of P. acidifaciens in caries subjects and distinct but individually differing bacterial clusters in the caries-free subjects. The distribution of 11 bacteria allowed full discrimination between caries and caries-free subjects. Within the investigated cohort, P. acidifaciens was the only pathogen significantly more abundant in caries subjects. Cluster analysis yielded a diverse flora in caries-free subjects, whereas it was narrowed down to a small range of a few outcompeting members in caries subjects. Copyright © 2012 S. Karger AG, Basel.

  16. SEURAT: Visual analytics for the integrated analysis of microarray data

    PubMed Central

    2010-01-01

    Background In translational cancer research, gene expression data is collected together with clinical data and genomic data arising from other chip based high throughput technologies. Software tools for the joint analysis of such high dimensional data sets together with clinical data are required. Results We have developed an open source software tool which provides interactive visualization capability for the integrated analysis of high-dimensional gene expression data together with associated clinical data, array CGH data and SNP array data. The different data types are organized by a comprehensive data manager. Interactive tools are provided for all graphics: heatmaps, dendrograms, barcharts, histograms, eventcharts and a chromosome browser, which displays genetic variations along the genome. All graphics are dynamic and fully linked so that any object selected in a graphic will be highlighted in all other graphics. For exploratory data analysis the software provides unsupervised data analytics like clustering, seriation algorithms and biclustering algorithms. Conclusions The SEURAT software meets the growing needs of researchers to perform joint analysis of gene expression, genomical and clinical data. PMID:20525257

  17. Broad DNA methylation changes of spermatogenesis, inflammation and immune response-related genes in a subgroup of sperm samples for assisted reproduction.

    PubMed

    Schütte, B; El Hajj, N; Kuhtz, J; Nanda, I; Gromoll, J; Hahn, T; Dittrich, M; Schorsch, M; Müller, T; Haaf, T

    2013-11-01

    Aberrant sperm DNA methylation patterns, mainly in imprinted genes, have been associated with male subfertility and oligospermia. Here, we performed a genome-wide methylation analysis in sperm samples representing a wide range of semen parameters. Sperm DNA samples of 38 males attending a fertility centre were analysed with Illumina HumanMethylation27 BeadChips, which quantify methylation of >27 000 CpG sites in cis-regulatory regions of almost 15 000 genes. In an unsupervised analysis of methylation of all analysed sites, the patient samples clustered into a major and a minor group. The major group clustered with samples from normozoospermic healthy volunteers and, thus, may more closely resemble the normal situation. When correlating the clusters with semen and clinical parameters, the sperm counts were significantly different between groups with the minor group exhibiting sperm counts in the low normal range. A linear model identified almost 3000 CpGs with significant methylation differences between groups. Functional analysis revealed a broad gain of methylation in spermatogenesis-related genes and a loss of methylation in inflammation- and immune response-related genes. Quantitative bisulfite pyrosequencing validated differential methylation in three of five significant candidate genes on the array. Collectively, we identified a subgroup of sperm samples for assisted reproduction with sperm counts in the low normal range and broad methylation changes (affecting approximately 10% of analysed CpG sites) in specific pathways, most importantly spermatogenesis-related genes. We propose that epigenetic analysis can supplement traditional semen parameters and has the potential to provide new insights into the aetiology of male subfertility. © 2013 American Society of Andrology and European Academy of Andrology.

  18. A Novel Unsupervised Segmentation Quality Evaluation Method for Remote Sensing Images

    PubMed Central

    Tang, Yunwei; Jing, Linhai; Ding, Haifeng

    2017-01-01

    The segmentation of a high spatial resolution remote sensing image is a critical step in geographic object-based image analysis (GEOBIA). Evaluating the performance of segmentation without ground truth data, i.e., unsupervised evaluation, is important for the comparison of segmentation algorithms and the automatic selection of optimal parameters. This unsupervised strategy currently faces several challenges in practice, such as difficulties in designing effective indicators and limitations of the spectral values in the feature representation. This study proposes a novel unsupervised evaluation method to quantitatively measure the quality of segmentation results to overcome these problems. In this method, multiple spectral and spatial features of images are first extracted simultaneously and then integrated into a feature set to improve the quality of the feature representation of ground objects. The indicators designed for spatial stratified heterogeneity and spatial autocorrelation are included to estimate the properties of the segments in this integrated feature set. These two indicators are then combined into a global assessment metric as the final quality score. The trade-offs of the combined indicators are accounted for using a strategy based on the Mahalanobis distance, which can be exhibited geometrically. The method is tested on two segmentation algorithms and three testing images. The proposed method is compared with two existing unsupervised methods and a supervised method to confirm its capabilities. Through comparison and visual analysis, the results verified the effectiveness of the proposed method and demonstrated the reliability and improvements of this method with respect to other methods. PMID:29064416

  19. Automated unsupervised multi-parametric classification of adipose tissue depots in skeletal muscle

    PubMed Central

    Valentinitsch, Alexander; Karampinos, Dimitrios C.; Alizai, Hamza; Subburaj, Karupppasamy; Kumar, Deepak; Link, Thomas M.; Majumdar, Sharmila

    2012-01-01

    Purpose To introduce and validate an automated unsupervised multi-parametric method for segmentation of the subcutaneous fat and muscle regions in order to determine subcutaneous adipose tissue (SAT) and intermuscular adipose tissue (IMAT) areas based on data from a quantitative chemical shift-based water-fat separation approach. Materials and Methods Unsupervised standard k-means clustering was employed to define sets of similar features (k = 2) within the whole multi-modal image after the water-fat separation. The automated image processing chain was composed of three primary stages including tissue, muscle and bone region segmentation. The algorithm was applied on calf and thigh datasets to compute SAT and IMAT areas and was compared to a manual segmentation. Results The IMAT area using the automatic segmentation had excellent agreement with the IMAT area using the manual segmentation for all the cases in the thigh (R²: 0.96) and for cases with up to moderate IMAT area in the calf (R²: 0.92). The group with the highest grade of muscle fat infiltration in the calf had the highest error in the inner SAT contour calculation. Conclusion The proposed multi-parametric segmentation approach combined with quantitative water-fat imaging provides an accurate and reliable method for an automated calculation of the SAT and IMAT areas reducing considerably the total post-processing time. PMID:23097409

  20. Incrementally learning objects by touch: online discriminative and generative models for tactile-based recognition.

    PubMed

    Soh, Harold; Demiris, Yiannis

    2014-01-01

    Human beings not only possess the remarkable ability to distinguish objects through tactile feedback but are further able to improve upon recognition competence through experience. In this work, we explore tactile-based object recognition with learners capable of incremental learning. Using the sparse online infinite Echo-State Gaussian process (OIESGP), we propose and compare two novel discriminative and generative tactile learners that produce probability distributions over objects during object grasping/palpation. To enable iterative improvement, our online methods incorporate training samples as they become available. We also describe incremental unsupervised learning mechanisms, based on novelty scores and extreme value theory, when teacher labels are not available. We present experimental results for both supervised and unsupervised learning tasks using the iCub humanoid, with tactile sensors on its five-fingered anthropomorphic hand, and 10 different object classes. Our classifiers perform comparably to state-of-the-art methods (C4.5 and SVM classifiers) and findings indicate that tactile signals are highly relevant for making accurate object classifications. We also show that accurate "early" classifications are possible using only 20-30 percent of the grasp sequence. For unsupervised learning, our methods generate high quality clusterings relative to the widely-used sequential k-means and self-organising map (SOM), and we present analyses into the differences between the approaches.

  1. Automated age-related macular degeneration classification in OCT using unsupervised feature learning

    NASA Astrophysics Data System (ADS)

    Venhuizen, Freerk G.; van Ginneken, Bram; Bloemen, Bart; van Grinsven, Mark J. J. P.; Philipsen, Rick; Hoyng, Carel; Theelen, Thomas; Sánchez, Clara I.

    2015-03-01

    Age-related Macular Degeneration (AMD) is a common eye disorder with high prevalence in elderly people. The disease mainly affects the central part of the retina, and could ultimately lead to permanent vision loss. Optical Coherence Tomography (OCT) is becoming the standard imaging modality in diagnosis of AMD and the assessment of its progression. However, the evaluation of the obtained volumetric scan is time consuming and expensive, and the signs of early AMD are easy to miss. In this paper we propose a classification method to automatically distinguish AMD patients from healthy subjects with high accuracy. The method is based on an unsupervised feature learning approach, and processes the complete image without the need for an accurate pre-segmentation of the retina. The method can be divided into two steps: an unsupervised clustering stage that extracts a set of small descriptive image patches from the training data, and a supervised training stage that uses these patches to create a patch occurrence histogram for every image on which a random forest classifier is trained. Experiments using 384 volume scans show that the proposed method is capable of identifying AMD patients with high accuracy, obtaining an area under the Receiver Operating Curve of 0.984. Our method allows for a quick and reliable assessment of the presence of AMD pathology in OCT volume scans without the need for accurate layer segmentation algorithms.
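
    The patch occurrence histogram described in this abstract can be sketched in Python as follows. The sketch assumes that a dictionary of descriptive patches has already been learned in the clustering stage and that patches are flattened into equal-length numeric vectors. The function names are hypothetical, and the random forest stage is omitted.

```python
from typing import List, Sequence

Patch = Sequence[float]


def nearest_patch(patch: Patch, dictionary: Sequence[Patch]) -> int:
    """Index of the dictionary patch closest to `patch` in squared Euclidean distance."""
    return min(
        range(len(dictionary)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(patch, dictionary[i])),
    )


def patch_histogram(patches: Sequence[Patch], dictionary: Sequence[Patch]) -> List[float]:
    """Normalized count of how often each dictionary patch is the nearest match."""
    counts = [0] * len(dictionary)
    for patch in patches:
        counts[nearest_patch(patch, dictionary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(dictionary)
    return [c / total for c in counts]


# Toy example: a two-entry dictionary and four patches from one image.
dictionary = [[0.0, 0.0], [1.0, 1.0]]
image_patches = [[0.1, 0.0], [0.9, 1.0], [1.0, 1.2], [0.0, 0.2]]
print(patch_histogram(image_patches, dictionary))  # [0.5, 0.5]
```

    Each image is thereby reduced to a fixed-length feature vector regardless of its size, which is the kind of input a downstream classifier such as a random forest can be trained on.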

  2. High Throughput Ambient Mass Spectrometric Approach to Species Identification and Classification from Chemical Fingerprint Signatures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Musah, Rabi A.; Espinoza, Edgard O.; Cody, Robert B.

    A high throughput method for species identification and classification through chemometric processing of direct analysis in real time (DART) mass spectrometry-derived fingerprint signatures has been developed. The method entails introduction of samples to the open air space between the DART ion source and the mass spectrometer inlet, with the entire observed mass spectral fingerprint subjected to unsupervised hierarchical clustering processing. Moreover, a range of both polar and non-polar chemotypes are instantaneously detected. The result is identification and species level classification based on the entire DART-MS spectrum. In this paper, we illustrate how the method can be used to: (1) distinguish between endangered woods regulated by the Convention for the International Trade of Endangered Flora and Fauna (CITES) treaty; (2) assess the origin and by extension the properties of biodiesel feedstocks; (3) determine insect species from analysis of puparial casings; (4) distinguish between psychoactive plant products; and (5) differentiate between Eucalyptus species. An advantage of the hierarchical clustering approach to processing of the DART-MS derived fingerprint is that it shows both similarities and differences between species based on their chemotypes. Furthermore, full knowledge of the identities of the constituents contained within the small molecule profile of analyzed samples is not required.

  3. High Throughput Ambient Mass Spectrometric Approach to Species Identification and Classification from Chemical Fingerprint Signatures

    DOE PAGES

    Musah, Rabi A.; Espinoza, Edgard O.; Cody, Robert B.; ...

    2015-07-09

    A high throughput method for species identification and classification through chemometric processing of direct analysis in real time (DART) mass spectrometry-derived fingerprint signatures has been developed. The method entails introduction of samples to the open air space between the DART ion source and the mass spectrometer inlet, with the entire observed mass spectral fingerprint subjected to unsupervised hierarchical clustering processing. Moreover, a range of both polar and non-polar chemotypes are instantaneously detected. The result is identification and species level classification based on the entire DART-MS spectrum. In this paper, we illustrate how the method can be used to: (1) distinguish between endangered woods regulated by the Convention for the International Trade of Endangered Flora and Fauna (CITES) treaty; (2) assess the origin and by extension the properties of biodiesel feedstocks; (3) determine insect species from analysis of puparial casings; (4) distinguish between psychoactive plant products; and (5) differentiate between Eucalyptus species. An advantage of the hierarchical clustering approach to processing of the DART-MS derived fingerprint is that it shows both similarities and differences between species based on their chemotypes. Furthermore, full knowledge of the identities of the constituents contained within the small molecule profile of analyzed samples is not required.

  4. A protein and mRNA expression-based classification of gastric cancer.

    PubMed

    Setia, Namrata; Agoston, Agoston T; Han, Hye S; Mullen, John T; Duda, Dan G; Clark, Jeffrey W; Deshpande, Vikram; Mino-Kenudson, Mari; Srivastava, Amitabh; Lennerz, Jochen K; Hong, Theodore S; Kwak, Eunice L; Lauwers, Gregory Y

    2016-07-01

    The overall survival of gastric carcinoma patients remains poor despite improved control over known risk factors and surveillance. This highlights the need for new classifications, driven towards identification of potential therapeutic targets. Using sophisticated molecular technologies and analysis, three groups recently provided genetic and epigenetic molecular classifications of gastric cancer (The Cancer Genome Atlas, 'Singapore-Duke' study, and Asian Cancer Research Group). Suggested by these classifications, here, we examined the expression of 14 biomarkers in a cohort of 146 gastric adenocarcinomas and performed unsupervised hierarchical clustering analysis using less expensive and widely available immunohistochemistry and in situ hybridization. Ultimately, we identified five groups of gastric cancers based on Epstein-Barr virus (EBV) positivity, microsatellite instability, aberrant E-cadherin, and p53 expression; the remaining cases constituted a group characterized by normal p53 expression. In addition, the five categories correspond to the reported molecular subgroups by virtue of clinicopathologic features. Furthermore, evaluation between these clusters and survival using the Cox proportional hazards model showed a trend for superior survival in the EBV and microsatellite-instable related adenocarcinomas. In conclusion, we offer as a proposal a simplified algorithm that is able to reproduce the recently proposed molecular subgroups of gastric adenocarcinoma, using immunohistochemical and in situ hybridization techniques.

  5. Identifying seasonal mobility profiles from anonymized and aggregated mobile phone data. Application in food security.

    PubMed

    Zufiria, Pedro J; Pastor-Escuredo, David; Úbeda-Medina, Luis; Hernandez-Medina, Miguel A; Barriales-Valbuena, Iker; Morales, Alfredo J; Jacques, Damien C; Nkwambi, Wilfred; Diop, M Bamba; Quinn, John; Hidalgo-Sanchís, Paula; Luengo-Oroz, Miguel

    2018-01-01

    We propose a framework for the systematic analysis of mobile phone data to identify relevant mobility profiles in a population. The proposed framework allows finding distinct human mobility profiles based on the digital trace of mobile phone users characterized by a Matrix of Individual Trajectories (IT-Matrix). This matrix gathers a consistent and regularized description of individual trajectories that enables multi-scale representations along time and space, which can be used to extract aggregated indicators such as a dynamic multi-scale population count. Unsupervised clustering of individual trajectories generates mobility profiles (clusters of similar individual trajectories) which characterize relevant group behaviors preserving optimal aggregation levels for detailed and privacy-secured mobility characterization. The application of the proposed framework is illustrated by analyzing fully anonymized data on human mobility from mobile phones in Senegal at the arrondissement level over a calendar year. The analysis of monthly mobility patterns at the livelihood zone resolution resulted in the discovery and characterization of seasonal mobility profiles related to economic activities, agricultural calendars and rainfalls. The use of these mobility profiles could support the timely identification of mobility changes in vulnerable populations in response to external shocks (such as natural disasters, civil conflicts or sudden increases of food prices) to monitor food security.

  6. A High Throughput Ambient Mass Spectrometric Approach to Species Identification and Classification from Chemical Fingerprint Signatures

    PubMed Central

    Musah, Rabi A.; Espinoza, Edgard O.; Cody, Robert B.; Lesiak, Ashton D.; Christensen, Earl D.; Moore, Hannah E.; Maleknia, Simin; Drijfhout, Falko P.

    2015-01-01

    A high throughput method for species identification and classification through chemometric processing of direct analysis in real time (DART) mass spectrometry-derived fingerprint signatures has been developed. The method entails introduction of samples to the open air space between the DART ion source and the mass spectrometer inlet, with the entire observed mass spectral fingerprint subjected to unsupervised hierarchical clustering processing. A range of both polar and non-polar chemotypes are instantaneously detected. The result is identification and species level classification based on the entire DART-MS spectrum. Here, we illustrate how the method can be used to: (1) distinguish between endangered woods regulated by the Convention for the International Trade of Endangered Flora and Fauna (CITES) treaty; (2) assess the origin and by extension the properties of biodiesel feedstocks; (3) determine insect species from analysis of puparial casings; (4) distinguish between psychoactive plant products; and (5) differentiate between Eucalyptus species. An advantage of the hierarchical clustering approach to processing of the DART-MS derived fingerprint is that it shows both similarities and differences between species based on their chemotypes. Furthermore, full knowledge of the identities of the constituents contained within the small molecule profile of analyzed samples is not required. PMID:26156000
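
    The hierarchical clustering step described in this record can be sketched as follows. This is a minimal illustration, assuming each spectrum has already been binned into a fixed-length intensity vector. The average-linkage rule, the Euclidean distance, and the toy fingerprints are assumptions made for the example; the abstract does not state which linkage or metric was used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length intensity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest mean pairwise distance until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return [sorted(c) for c in clusters]


# Hypothetical binned fingerprints (relative intensity per m/z bin).
fingerprints = [
    [0.9, 0.1, 0.0, 0.3],  # sample 0, hypothetical species A
    [0.8, 0.2, 0.1, 0.3],  # sample 1, hypothetical species A
    [0.1, 0.9, 0.7, 0.0],  # sample 2, hypothetical species B
    [0.2, 0.8, 0.8, 0.1],  # sample 3, hypothetical species B
]
groups = average_linkage(fingerprints, n_clusters=2)
print(groups)
```

    Because only distances between whole fingerprints are used, no individual peak needs to be assigned to a compound, which matches the point made at the end of the abstract.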

  7. A High Throughput Ambient Mass Spectrometric Approach to Species Identification and Classification from Chemical Fingerprint Signatures

    NASA Astrophysics Data System (ADS)

    Musah, Rabi A.; Espinoza, Edgard O.; Cody, Robert B.; Lesiak, Ashton D.; Christensen, Earl D.; Moore, Hannah E.; Maleknia, Simin; Drijfhout, Falko P.

    2015-07-01

    A high throughput method for species identification and classification through chemometric processing of direct analysis in real time (DART) mass spectrometry-derived fingerprint signatures has been developed. The method entails introduction of samples to the open air space between the DART ion source and the mass spectrometer inlet, with the entire observed mass spectral fingerprint subjected to unsupervised hierarchical clustering processing. A range of both polar and non-polar chemotypes are instantaneously detected. The result is identification and species level classification based on the entire DART-MS spectrum. Here, we illustrate how the method can be used to: (1) distinguish between endangered woods regulated by the Convention for the International Trade of Endangered Flora and Fauna (CITES) treaty; (2) assess the origin and by extension the properties of biodiesel feedstocks; (3) determine insect species from analysis of puparial casings; (4) distinguish between psychoactive plant products; and (5) differentiate between Eucalyptus species. An advantage of the hierarchical clustering approach to processing of the DART-MS derived fingerprint is that it shows both similarities and differences between species based on their chemotypes. Furthermore, full knowledge of the identities of the constituents contained within the small molecule profile of analyzed samples is not required.

  8. Towards a new classification of stable phase schizophrenia into major and simple neuro-cognitive psychosis: Results of unsupervised machine learning analysis.

    PubMed

    Kanchanatawan, Buranee; Sriswasdi, Sira; Thika, Supaksorn; Stoyanov, Drozdstoy; Sirivichayakul, Sunee; Carvalho, André F; Geffard, Michel; Maes, Michael

    2018-05-23

    Deficit schizophrenia, as defined by the Schedule for Deficit Syndrome, may represent a distinct diagnostic class defined by neurocognitive impairments coupled with changes in IgA/IgM responses to tryptophan catabolites (TRYCATs). Adequate classifications should be based on supervised and unsupervised learning rather than on consensus criteria. This study used machine learning as means to provide a more accurate classification of patients with stable phase schizophrenia. We found that using negative symptoms as discriminatory variables, schizophrenia patients may be divided into two distinct classes modelled by (A) impairments in IgA/IgM responses to noxious and generally more protective tryptophan catabolites, (B) impairments in episodic and semantic memory, paired associative learning and false memory creation, and (C) psychotic, excitation, hostility, mannerism, negative, and affective symptoms. The first cluster shows increased negative, psychotic, excitation, hostility, mannerism, depression and anxiety symptoms, and more neuroimmune and cognitive disorders and is therefore called "major neurocognitive psychosis" (MNP). The second cluster, called "simple neurocognitive psychosis" (SNP) is discriminated from normal controls by the same features although the impairments are less well developed than in MNP. The latter is additionally externally validated by lowered quality of life, body mass (reflecting a leptosome body type), and education (reflecting lower cognitive reserve). Previous distinctions including "type 1" (positive)/"type 2" (negative) and DSM-IV-TR (eg, paranoid) schizophrenia could not be validated using machine learning techniques. Previous names of the illness, including schizophrenia, are not very adequate because they do not describe the features of the illness, namely, interrelated neuroimmune, cognitive, and clinical features. 
    Stable-phase schizophrenia consists of 2 relevant qualitatively distinct categories or nosological entities with SNP being a less well-developed phenotype, while MNP is the full blown phenotype or core illness. Major neurocognitive psychosis and SNP should be added to the DSM-5 and incorporated into the Research Domain Criteria project. © 2018 John Wiley & Sons, Ltd.

  9. EG-09 EPIGENETIC PROFILING REVEALS A CpG HYPERMETHYLATION PHENOTYPE (CIMP) ASSOCIATED WITH WORSE PROGRESSION-FREE SURVIVAL IN MENINGIOMA

    PubMed Central

    Olar, Adriana; Wani, Khalida; Mansouri, Alireza; Zadeh, Gelareh; Wilson, Charmaine; DeMonte, Franco; Fuller, Gregory; Jones, David; Pfister, Stefan; von Deimling, Andreas; Sulman, Erik; Aldape, Kenneth

    2014-01-01

    BACKGROUND: Methylation profiling of solid tumors has revealed biologic subtypes, often with clinical implications. Methylation profiles of meningioma and their clinical implications are not well understood. METHODS: Ninety-two meningioma samples (n = 44 test set and n = 48 validation set) were profiled using the Illumina HumanMethylation450 BeadChip. Unsupervised clustering and analyses for recurrence-free survival (RFS) were performed. RESULTS: Unsupervised clustering of the test set using approximately 900 highly variable markers identified two clearly defined methylation subgroups. One of the groups (n = 19) showed global hypermethylation of a set of markers, analogous to CpG island methylator phenotype (CIMP). These findings were reproducible in the validation set, with 18/48 samples showing the CIMP-positive phenotype. Importantly, of 347 highly variable markers common to both the test and validation set analyses, 107 defined CIMP in the test set and 94 defined CIMP in the validation set, with an overlap of 83 markers between the two datasets. This number is much greater than expected by chance, indicating reproducibility of the hypermethylated markers that define CIMP in meningioma. With respect to clinical correlation, the 37 CIMP-positive cases displayed significantly shorter RFS compared to the 55 non-CIMP cases (hazard ratio 2.9, p = 0.013). In an effort to develop a preliminary outcome predictor, a 155-marker subset correlated with RFS was identified in the test dataset. When interrogated in the validation dataset, this 155-marker subset showed a statistical trend (p < 0.1) towards distinguishing survival groups. CONCLUSIONS: This study defines the existence of a CIMP phenotype in meningioma, which involves a substantial proportion (37/92, 40%) of samples with clinical implications.
    Ongoing work will expand this cohort and examine identification of additional biologic differences (mutational and DNA copy number analysis) to further characterize the aberrant methylation subtype in meningioma. CIMP-positivity with aberrant methylation in recurrent/malignant meningioma suggests a potential therapeutic target for clinically aggressive cases.
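
    The claim that an overlap of 83 markers is "much greater than expected by chance" can be checked with a short calculation. The sketch below assumes a hypergeometric null model (two marker sets of size 107 and 94 drawn independently from the 347 common markers); the abstract does not say which test the authors used, so this is one plausible way to quantify the statement rather than their method.

```python
from fractions import Fraction
from math import comb

N = 347     # highly variable markers common to both analyses
K = 107     # markers defining CIMP in the test set
n = 94      # markers defining CIMP in the validation set
k_obs = 83  # observed overlap between the two marker sets

# Expected overlap if the two sets were unrelated draws from the N markers.
expected = K * n / N

# One-sided tail probability P(overlap >= k_obs) under the hypergeometric model.
# Exact integer arithmetic avoids overflow; convert to float only at the end.
tail = sum(comb(K, k) * comb(N - K, n - k) for k in range(k_obs, min(K, n) + 1))
p_value = float(Fraction(tail, comb(N, n)))

print(f"expected overlap by chance: {expected:.1f}")
print(f"P(overlap >= {k_obs}): {p_value:.3g}")
```

    Under this assumption the chance expectation is about 29 shared markers, so the observed 83 is far out in the tail of the null distribution.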

  10. Embedded security system for multi-modal surveillance in a railway carriage

    NASA Astrophysics Data System (ADS)

    Zouaoui, Rhalem; Audigier, Romaric; Ambellouis, Sébastien; Capman, François; Benhadda, Hamid; Joudrier, Stéphanie; Sodoyer, David; Lamarque, Thierry

    2015-10-01

    Public transport security is one of the main priorities of the public authorities when fighting against crime and terrorism. In this context, there is a great demand for autonomous systems able to detect abnormal events such as violent acts aboard passenger cars and intrusions when the train is parked at the depot. To this end, we present an innovative approach which aims at providing efficient automatic event detection by fusing video and audio analytics and reducing the false alarm rate compared to classical stand-alone video detection. The multi-modal system is composed of two microphones and one camera and integrates onboard video and audio analytics and fusion capabilities. On the one hand, for detecting intrusion, the system relies on the fusion of "unusual" audio events detection with intrusion detections from video processing. The audio analysis consists in modeling the normal ambience and detecting deviation from the trained models during testing. This unsupervised approach is based on clustering of automatically extracted segments of acoustic features and statistical Gaussian Mixture Model (GMM) modeling of each cluster. The intrusion detection is based on the three-dimensional (3D) detection and tracking of individuals in the videos. On the other hand, for violent events detection, the system fuses unsupervised and supervised audio algorithms with video event detection. The supervised audio technique detects specific events such as shouts. A GMM is used to catch the formant structure of a shout signal. Video analytics use an original approach for detecting aggressive motion by focusing on erratic motion patterns specific to violent events. As data with violent events is not easily available, a normality model with structured motions from non-violent videos is learned for one-class classification. A fusion algorithm based on Dempster-Shafer's theory analyses the asynchronous detection outputs and computes the degree of belief of each probable event.
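
    The fusion step mentioned at the end of this record can be illustrated with Dempster's rule of combination. This is a minimal sketch: the two-hypothesis frame, the mass values, and the names `audio` and `video` are hypothetical, and the handling of asynchronous detections described in the abstract is not modelled.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise by the non-conflicting mass.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def belief(m, hypothesis):
    """Degree of belief: total mass of focal sets contained in the hypothesis."""
    return sum(w for s, w in m.items() if s <= hypothesis)


# Hypothetical detector outputs over the frame {"violent", "normal"}.
FRAME = frozenset({"violent", "normal"})
audio = {frozenset({"violent"}): 0.6, FRAME: 0.4}
video = {frozenset({"violent"}): 0.7, frozenset({"normal"}): 0.1, FRAME: 0.2}

fused = dempster_combine(audio, video)
bel_violent = belief(fused, frozenset({"violent"}))
print(f"belief in a violent event after fusion: {bel_violent:.3f}")
```

    Mass left on the whole frame expresses ignorance, which lets a detector that is unsure contribute little instead of voting against the other modality.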

  11. Molecular Subgroup of Primary Prostate Cancer Presenting with Metastatic Biology.

    PubMed

    Walker, Steven M; Knight, Laura A; McCavigan, Andrena M; Logan, Gemma E; Berge, Viktor; Sherif, Amir; Pandha, Hardev; Warren, Anne Y; Davidson, Catherine; Uprichard, Adam; Blayney, Jaine K; Price, Bethanie; Jellema, Gera L; Steele, Christopher J; Svindland, Aud; McDade, Simon S; Eden, Christopher G; Foster, Chris; Mills, Ian G; Neal, David E; Mason, Malcolm D; Kay, Elaine W; Waugh, David J; Harkin, D Paul; Watson, R William; Clarke, Noel W; Kennedy, Richard D

    2017-10-01

    Approximately 4-25% of patients with early prostate cancer develop disease recurrence following radical prostatectomy. To identify a molecular subgroup of prostate cancers with metastatic potential at presentation resulting in a high risk of recurrence following radical prostatectomy. Unsupervised hierarchical clustering was performed using gene expression data from 70 primary resections, 31 metastatic lymph nodes, and 25 normal prostate samples. Independent assay validation was performed using 322 radical prostatectomy samples from four sites with a mean follow-up of 50.3 months. Molecular subgroups were identified using unsupervised hierarchical clustering. A partial least squares approach was used to generate a gene expression assay. Relationships with outcome (time to biochemical and metastatic recurrence) were analysed using multivariable Cox regression and log-rank analysis. A molecular subgroup of primary prostate cancer with biology similar to metastatic disease was identified. A 70-transcript signature (metastatic assay) was developed and independently validated in the radical prostatectomy samples. Metastatic assay positive patients had increased risk of biochemical recurrence (multivariable hazard ratio [HR] 1.62 [1.13-2.33]; p=0.0092) and metastatic recurrence (multivariable HR=3.20 [1.76-5.80]; p=0.0001). A combined model with Cancer of the Prostate Risk Assessment post surgical (CAPRA-S) identified patients at an increased risk of biochemical and metastatic recurrence superior to either model alone (HR=2.67 [1.90-3.75]; p<0.0001 and HR=7.53 [4.13-13.73]; p<0.0001, respectively). The retrospective nature of the study is acknowledged as a potential limitation. The metastatic assay may identify a molecular subgroup of primary prostate cancers with metastatic potential. The metastatic assay may improve the ability to detect patients at risk of metastatic recurrence following radical prostatectomy. 
    The impact of adjuvant therapies should be assessed in this higher-risk population. Copyright © 2017 European Association of Urology. Published by Elsevier B.V. All rights reserved.
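
    The reported hazard ratios, intervals, and p-values can be cross-checked against each other. The sketch below assumes the bracketed ranges are 95% confidence intervals that are symmetric on the log scale (a Wald-type interval); the abstract does not state either point, so small differences from the published p-values are to be expected.

```python
import math


def p_from_hr_ci(hr, lower, upper, z_crit=1.959964):
    """Recover the z statistic and two-sided p-value from a hazard ratio
    and its confidence interval, assuming a Wald interval on log(HR)."""
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Values quoted in the abstract for the metastatic assay.
z_bcr, p_bcr = p_from_hr_ci(1.62, 1.13, 2.33)  # biochemical recurrence
z_met, p_met = p_from_hr_ci(3.20, 1.76, 5.80)  # metastatic recurrence

print(f"biochemical recurrence: z = {z_bcr:.2f}, p = {p_bcr:.4f}")
print(f"metastatic recurrence:  z = {z_met:.2f}, p = {p_met:.5f}")
```

    Both recovered values are close to the published p=0.0092 and p=0.0001, which suggests the quoted intervals and p-values are mutually consistent.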

  12. A Pattern Recognition Approach to Acoustic Emission Data Originating from Fatigue of Wind Turbine Blades

    PubMed Central

    Tang, Jialin; Soua, Slim; Mares, Cristinel; Gan, Tat-Hean

    2017-01-01

    The identification of particular types of damage in wind turbine blades using acoustic emission (AE) techniques is a significant emerging field. In this work, a 45.7-m turbine blade was subjected to flap-wise fatigue loading for 21 days, during which AE was measured by internally mounted piezoelectric sensors. This paper focuses on using unsupervised pattern recognition methods to characterize different AE activities corresponding to different fracture mechanisms. A sequential feature selection method based on a k-means clustering algorithm is used to achieve a fine classification accuracy. The visualization of clusters in peak frequency−frequency centroid features is used to correlate the clustering results with failure modes. The positions of these clusters in time domain features, average frequency−MARSE, and average frequency−peak amplitude are also presented in this paper (where MARSE represents the Measured Area under Rectified Signal Envelope). The results show that these parameters are representative for the classification of the failure modes. PMID:29104245
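
    The k-means step underlying this approach can be sketched as below. The two features stand in for peak frequency and frequency centroid; the numeric values, the choice of k = 2, and the random initialisation are assumptions made for illustration, and the sequential feature selection described in the abstract is not shown.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each hit goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(pt, centroids[c]))
                  for pt in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical AE hits: (peak frequency, frequency centroid), both in kHz.
hits = [(100, 150), (105, 155), (98, 148),
        (300, 320), (310, 330), (295, 315)]
labels, centroids = kmeans(hits, k=2)
print(labels, centroids)
```

    In practice each resulting cluster would then be inspected against known failure modes, as the abstract describes for the peak frequency and frequency centroid plane.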

  13. Clustering of tethered satellite system simulation data by an adaptive neuro-fuzzy algorithm

    NASA Technical Reports Server (NTRS)

    Mitra, Sunanda; Pemmaraju, Surya

    1992-01-01

    Recent developments in neuro-fuzzy systems indicate that the concepts of adaptive pattern recognition, when used to identify appropriate control actions corresponding to clusters of patterns representing system states in dynamic nonlinear control systems, may result in innovative designs. A modular, unsupervised neural network architecture, in which fuzzy learning rules have been embedded, is used for on-line identification of similar states. The architecture and control rules involved in Adaptive Fuzzy Leader Clustering (AFLC) allow this system to be incorporated in control systems for identification of system states corresponding to specific control actions. We have used this algorithm to cluster the simulation data of the Tethered Satellite System (TSS) to estimate the range of delta voltages necessary to maintain the desired length rate of the tether. The AFLC algorithm is capable of on-line estimation of the appropriate control voltages from the corresponding length error and length rate error without a priori knowledge of their membership functions and familiarity with the behavior of the Tethered Satellite System.

  14. A Pattern Recognition Approach to Acoustic Emission Data Originating from Fatigue of Wind Turbine Blades.

    PubMed

    Tang, Jialin; Soua, Slim; Mares, Cristinel; Gan, Tat-Hean

    2017-11-01

    The identification of particular types of damage in wind turbine blades using acoustic emission (AE) techniques is a significant emerging field. In this work, a 45.7-m turbine blade was subjected to flap-wise fatigue loading for 21 days, during which AE was measured by internally mounted piezoelectric sensors. This paper focuses on using unsupervised pattern recognition methods to characterize different AE activities corresponding to different fracture mechanisms. A sequential feature selection method based on a k-means clustering algorithm is used to achieve a fine classification accuracy. The visualization of clusters in peak frequency-frequency centroid features is used to correlate the clustering results with failure modes. The positions of these clusters in time domain features, average frequency-MARSE, and average frequency-peak amplitude are also presented in this paper (where MARSE represents the Measured Area under Rectified Signal Envelope). The results show that these parameters are representative for the classification of the failure modes.

  15. Segmentation methodology for automated classification and differentiation of soft tissues in multiband images of high-resolution ultrasonic transmission tomography.

    PubMed

    Jeong, Jeong-Won; Shin, Dae C; Do, Synho; Marmarelis, Vasilis Z

    2006-08-01

    This paper presents a novel segmentation methodology for automated classification and differentiation of soft tissues using multiband data obtained with the newly developed system of high-resolution ultrasonic transmission tomography (HUTT) for imaging biological organs. This methodology extends and combines two existing approaches: the L-level set active contour (AC) segmentation approach and the agglomerative hierarchical k-means approach for unsupervised clustering (UC). To prevent the trapping of the current iterative minimization AC algorithm in a local minimum, we introduce a multiresolution approach that applies the level set functions at successively increasing resolutions of the image data. The resulting AC clusters are subsequently rearranged by the UC algorithm that seeks the optimal set of clusters yielding the minimum within-cluster distances in the feature space. The presented results from Monte Carlo simulations and experimental animal-tissue data demonstrate that the proposed methodology outperforms other existing methods without depending on heuristic parameters and provides a reliable means for soft tissue differentiation in HUTT images.

  16. Phenotyping asthma, rhinitis and eczema in MeDALL population-based birth cohorts: an allergic comorbidity cluster.

    PubMed

    Garcia-Aymerich, J; Benet, M; Saeys, Y; Pinart, M; Basagaña, X; Smit, H A; Siroux, V; Just, J; Momas, I; Rancière, F; Keil, T; Hohmann, C; Lau, S; Wahn, U; Heinrich, J; Tischer, C G; Fantini, M P; Lenzi, J; Porta, D; Koppelman, G H; Postma, D S; Berdel, D; Koletzko, S; Kerkhof, M; Gehring, U; Wickman, M; Melén, E; Hallberg, J; Bindslev-Jensen, C; Eller, E; Kull, I; Lødrup Carlsen, K C; Carlsen, K-H; Lambrecht, B N; Kogevinas, M; Sunyer, J; Kauffmann, F; Bousquet, J; Antó, J M

    2015-08-01

    Asthma, rhinitis and eczema often co-occur in children, but their interrelationships at the population level have been poorly addressed. We assessed co-occurrence of childhood asthma, rhinitis and eczema using unsupervised statistical techniques. We included 17 209 children at 4 years and 14 585 at 8 years from seven European population-based birth cohorts (MeDALL project). At each age period, children were grouped, using partitioning cluster analysis, according to the distribution of 23 variables covering symptoms 'ever' and 'in the last 12 months', doctor diagnosis, age of onset and treatments of asthma, rhinitis and eczema; immunoglobulin E sensitization; weight; and height. We tested the sensitivity of our estimates to subject and variable selections, and to different statistical approaches, including latent class analysis and self-organizing maps. Two groups were identified as the optimal way to cluster the data at both age periods and in all sensitivity analyses. The first (reference) group at 4 and 8 years (including 70% and 79% of children, respectively) was characterized by a low prevalence of symptoms and sensitization, whereas the second (symptomatic) group exhibited more frequent symptoms and sensitization. Ninety-nine percent of children with comorbidities (co-occurrence of asthma, rhinitis and/or eczema) were included in the symptomatic group at both ages. The children's characteristics in both groups were consistent in all sensitivity analyses. At 4 and 8 years, at the population level, asthma, rhinitis and eczema can be classified together as an allergic comorbidity cluster. Future research including time-repeated assessments and biological data will help in understanding the interrelationships between these diseases. © 2015 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  17. Unsupervised image matching based on manifold alignment.

    PubMed

    Pei, Yuru; Huang, Fengchun; Shi, Fuhao; Zha, Hongbin

    2012-08-01

    This paper addresses the problem of automatic matching between two image sets with similar intrinsic structures and different appearances, especially when there is no prior correspondence. An unsupervised manifold alignment framework is proposed to establish correspondence between data sets by a mapping function in the mutual embedding space. We introduce a local similarity metric based on parameterized distance curves to represent the connection of one point with the rest of the manifold. A small set of valid feature pairs can be found without manual interactions by matching the distance curve of one manifold with the curve cluster of the other manifold. To avoid potential confusions in image matching, we propose an extended affine transformation to solve the nonrigid alignment in the embedding space. The comparatively tight alignments and the structure preservation can be obtained simultaneously. The point pairs with the minimum distance after alignment are viewed as the matchings. We apply manifold alignment to image set matching problems. The correspondence between image sets of different poses, illuminations, and identities can be established effectively by our approach.

  18. Globally maximizing, locally minimizing: unsupervised discriminant projection with applications to face and palm biometrics.

    PubMed

    Yang, Jian; Zhang, David; Yang, Jing-Yu; Niu, Ben

    2007-04-01

    This paper develops an unsupervised discriminant projection (UDP) technique for dimensionality reduction of high-dimensional data in small sample size cases. UDP can be seen as a linear approximation of a multimanifolds-based learning framework which takes into account both the local and nonlocal quantities. UDP characterizes the local scatter as well as the nonlocal scatter, seeking to find a projection that simultaneously maximizes the nonlocal scatter and minimizes the local scatter. This characteristic makes UDP more intuitive and more powerful than the most up-to-date method, Locality Preserving Projection (LPP), which considers only the local scatter for clustering or classification tasks. The proposed method is applied to face and palm biometrics and is examined using the Yale, FERET, and AR face image databases and the PolyU palmprint database. The experimental results show that UDP consistently outperforms LPP and PCA and outperforms LDA when the training sample size per class is small. This demonstrates that UDP is a good choice for real-world biometrics applications.

  19. Identification of temporal variations in mental workload using locally-linear-embedding-based EEG feature reduction and support-vector-machine-based clustering and classification techniques.

    PubMed

    Yin, Zhong; Zhang, Jianhua

    2014-07-01

    Identifying the abnormal changes of mental workload (MWL) over time is quite crucial for preventing accidents due to cognitive overload and inattention of human operators in safety-critical human-machine systems. It is known that various neuroimaging technologies can be used to identify the MWL variations. In order to classify MWL into a few discrete levels using representative MWL indicators and small-sized training samples, a novel EEG-based approach by combining locally linear embedding (LLE), support vector clustering (SVC) and support vector data description (SVDD) techniques is proposed and evaluated by using the experimentally measured data. The MWL indicators from different cortical regions are first elicited by using the LLE technique. Then, the SVC approach is used to find the clusters of these MWL indicators and thereby to detect MWL variations. It is shown that the clusters can be interpreted as the binary class MWL. Furthermore, a trained binary SVDD classifier is shown to be capable of detecting slight variations of those indicators. By combining the two schemes, an SVC-SVDD framework is proposed, where the clear-cut (smaller) cluster is detected by SVC first and then a subsequent SVDD model is utilized to divide the overlapped (larger) cluster into two classes. Finally, three-class MWL levels (low, normal and high) can be identified automatically. The experimental data analysis results are compared with those of several existing methods. It has been demonstrated that the proposed framework can lead to acceptable computational accuracy and has the advantages of both unsupervised and supervised training strategies. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.

  20. Multiplexed immunofluorescence delineates proteomic cancer cell states associated with metabolism

    PubMed Central

    Sood, Anup; Miller, Alexandra M.; Brogi, Edi; Sui, Yunxia; Armenia, Joshua; McDonough, Elizabeth; Santamaria-Pang, Alberto; Stamper, Aleksandra; Campos, Carl; Pang, Zhengyu; Li, Qing; Port, Elisa; Graeber, Thomas G.; Schultz, Nikolaus; Ginty, Fiona; Larson, Steven M.

    2016-01-01

    The phenotypic diversity of cancer results from genetic and nongenetic factors. Most studies of cancer heterogeneity have focused on DNA alterations, as technologies for proteomic measurements in clinical specimens are currently less advanced. Here, we used a multiplexed immunofluorescence staining platform to measure the expression of 27 proteins at the single-cell level in formalin-fixed and paraffin-embedded samples from treatment-naive stage II/III human breast cancer. Unsupervised clustering of protein expression data from 638,577 tumor cells in 26 breast cancers identified 8 clusters of protein coexpression. In about one-third of breast cancers, over 95% of all neoplastic cells expressed a single protein coexpression cluster. The remaining tumors harbored tumor cells representing multiple protein coexpression clusters, either in a regional distribution or intermingled throughout the tumor. Tumor uptake of the radiotracer 18F-fluorodeoxyglucose was associated with protein expression clusters characterized by hormone receptor loss, PTEN alteration, and HER2 gene amplification. Our study demonstrates an approach to generate cellular heterogeneity metrics in routinely collected solid tumor specimens and integrate them with in vivo cancer phenotypes. PMID:27182557
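
    One simple heterogeneity metric implied by this record is the share of a tumor's cells that fall in its most common coexpression cluster. The sketch below assumes per-cell cluster labels are already available; the tumor names, label values, and cell counts are hypothetical, and only the 95% threshold comes from the abstract.

```python
from collections import Counter


def dominant_cluster_fraction(cell_labels):
    """Return the most common cluster label and the fraction of cells carrying it."""
    counts = Counter(cell_labels)
    cluster, n_cells = counts.most_common(1)[0]
    return cluster, n_cells / len(cell_labels)


# Hypothetical per-cell cluster assignments for two tumors.
tumors = {
    "tumor_A": [3] * 97 + [5] * 3,               # nearly uniform
    "tumor_B": [1] * 40 + [2] * 35 + [7] * 25,   # mixed
}

summary = {name: dominant_cluster_fraction(labels)
           for name, labels in tumors.items()}

# Tumors in which over 95% of cells share a single coexpression cluster.
homogeneous = [name for name, (_, frac) in summary.items() if frac > 0.95]
print(summary, homogeneous)
```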

  1. Models in search of a brain.

    PubMed

    Love, Bradley C; Gureckis, Todd M

    2007-06-01

    Mental localization efforts tend to stress the where more than the what. We argue that the proper targets for localization are well-specified cognitive models. We make this case by relating an existing cognitive model of category learning to a learning circuit involving the hippocampus, perirhinal, and prefrontal cortices. Results from groups varying in function along this circuit (e.g., infants, amnesics, and older adults) are successfully simulated by reducing the model's ability to form new clusters in response to surprising events, such as an error in supervised learning or an unfamiliar stimulus in unsupervised learning. Clusters in the model are akin to conjunctive codes that are rooted in an episodic experience (the surprising event) yet can develop to resemble abstract codes as they are updated by subsequent experiences. Thus, the model holds that the line separating episodic and semantic information can become blurred. Dissociations (categorization vs. recognition) are explained in terms of cluster recruitment demands.

  2. An image analysis of TLC patterns for quality control of saffron based on soil salinity effect: A strategy for data (pre)-processing.

    PubMed

    Sereshti, Hassan; Poursorkh, Zahra; Aliakbarzadeh, Ghazaleh; Zarre, Shahin; Ataolahi, Sahar

    2018-01-15

    Quality of saffron, a valuable food additive, could considerably affect the consumers' health. In this work, a novel preprocessing strategy for image analysis of saffron thin layer chromatographic (TLC) patterns was introduced. This includes performing a series of image pre-processing techniques on TLC images such as compression, inversion, elimination of general baseline (using asymmetric least squares (AsLS)), removing spots shift and concavity (by correlation optimization warping (COW)), and finally conversion to RGB chromatograms. Subsequently, an unsupervised multivariate data analysis including principal component analysis (PCA) and k-means clustering was utilized to investigate the soil salinity effect, as a cultivation parameter, on saffron TLC patterns. This method was used as a rapid and simple technique to obtain the chemical fingerprints of saffron TLC images. Finally, the separated TLC spots were chemically identified using high-performance liquid chromatography-diode array detection (HPLC-DAD). Accordingly, the saffron quality from different areas of Iran was evaluated and classified. Copyright © 2017 Elsevier Ltd. All rights reserved.
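
    The PCA part of the multivariate analysis can be sketched with a power iteration on the covariance matrix. This is an illustration only: the three-column feature rows stand in for summary values extracted from RGB chromatograms and are invented for the example, and power iteration is used here simply to keep the code dependency-free, not because the authors used it.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return the leading principal axis and the sample scores along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the mean-centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical chromatogram features: three low-salinity, then three high-salinity samples.
samples = [
    [1.0, 2.0, 0.5], [1.2, 2.3, 0.4], [0.9, 1.9, 0.6],
    [3.0, 5.9, 0.5], [3.2, 6.4, 0.6], [2.9, 6.0, 0.4],
]
axis, scores = first_principal_component(samples)
print([round(s, 2) for s in scores])
```

    The scores along the first component could then be passed to a clustering step such as k-means to group samples by cultivation condition.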

  3. Robust demarcation of basal cell carcinoma by dependent component analysis-based segmentation of multi-spectral fluorescence images.

    PubMed

    Kopriva, Ivica; Persin, Antun; Puizina-Ivić, Neira; Mirić, Lina

    2010-07-02

    This study was designed to demonstrate robust performance of the novel dependent component analysis (DCA)-based approach to demarcation of the basal cell carcinoma (BCC) through unsupervised decomposition of the red-green-blue (RGB) fluorescent image of the BCC. Robustness to intensity fluctuation is due to the scale invariance property of DCA algorithms, which exploit spectral and spatial diversities between the BCC and the surrounding tissue. The filtering-based DCA approach used here represents an extension of the independent component analysis (ICA) and is necessary in order to account for statistical dependence that is induced by spectral similarity between the BCC and surrounding tissue. This generates weak edges, which represents a challenge for other segmentation methods as well. By comparative performance analysis with state-of-the-art image segmentation methods such as active contours (level set), K-means clustering, non-negative matrix factorization, ICA and ratio imaging, we experimentally demonstrate good performance of DCA-based BCC demarcation in two demanding scenarios where the intensity of the fluorescent image was varied by almost two orders of magnitude. Copyright 2010 Elsevier B.V. All rights reserved.

  4. Self-Organizing Maps for In Silico Screening and Data Visualization.

    PubMed

    Digles, Daniela; Ecker, Gerhard F

    2011-10-01

    Self-organizing maps, which are unsupervised artificial neural networks, have become a very useful tool in a wide area of disciplines, including medicinal chemistry. Here, we will focus on two applications of self-organizing maps: the use of self-organizing maps for in silico screening and for clustering and visualisation of large datasets. Additionally, the importance of parameter selection is discussed and some modifications to the original algorithm are summarised. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  5. Genome-wide analysis of the genetic regulation of gene expression in human neutrophils

    PubMed Central

    Andiappan, Anand Kumar; Melchiotti, Rossella; Poh, Tuang Yeow; Nah, Michelle; Puan, Kia Joo; Vigano, Elena; Haase, Doreen; Yusof, Nurhashikin; San Luis, Boris; Lum, Josephine; Kumar, Dilip; Foo, Shihui; Zhuang, Li; Vasudev, Anusha; Irwanto, Astrid; Lee, Bernett; Nardin, Alessandra; Liu, Hong; Zhang, Furen; Connolly, John; Liu, Jianjun; Mortellaro, Alessandra; Wang, De Yun; Poidinger, Michael; Larbi, Anis; Zolezzi, Francesca; Rotzschke, Olaf

    2015-01-01

    Neutrophils are an abundant immune cell type involved in both antimicrobial defence and autoimmunity. The regulation of their gene expression, however, is still largely unknown. Here we report an eQTL study on isolated neutrophils from 114 healthy individuals of Chinese ethnicity, identifying 21,210 eQTLs on 832 unique genes. Unsupervised clustering analysis of these eQTLs confirms their role in inflammatory responses and immunological diseases but also indicates strong involvement in dermatological pathologies. One of the strongest eQTL identified (rs2058660) is also the tagSNP of a linkage block reported to affect leprosy and Crohn's disease in opposite directions. In a functional study, we can link the C allele with low expression of the β-chain of IL18-receptor (IL18RAP). In neutrophils, this results in a reduced responsiveness to IL-18, detected both on the RNA and protein level. Thus, the polymorphic regulation of human neutrophils can impact beneficial as well as pathological inflammatory responses. PMID:26259071

  6. The relationship between unsupervised time after school and physical activity in adolescent girls.

    PubMed

    Rushovich, Berenice R; Voorhees, Carolyn C; Davis, C E; Neumark-Sztainer, Dianne; Pfeiffer, Karin A; Elder, John P; Going, Scott; Marino, Vivian G

    2006-07-31

    Rising obesity and declining physical activity levels are of great concern because of the associated health risks. Many children are left unsupervised after the school day ends, but little is known about the association between unsupervised time and physical activity levels. This paper seeks to determine whether adolescent girls who are without adult supervision after school are more or less active than their peers who have a caregiver at home. A random sample of girls from 36 middle schools at 6 field sites across the U.S. was selected during the fall of the 2002-2003 school year to participate in the baseline measurement activities of the Trial of Activity for Adolescent Girls (TAAG). Information was collected using six-day objectively measured physical activity, self-reported physical activity using a three-day recall, and socioeconomic and psychosocial measures. Complete information was available for 1422 out of a total of 1596 respondents. Categorical variables were analyzed using chi square and continuous variables were analyzed by t-tests. The four categories of time alone were compared using a mixed linear model controlling for clustering effects by study center. Girls who spent more time after school (≥2 hours per day, ≥2 days per week) without adult supervision were more active than those with adult supervision (p = 0.01). Girls alone for ≥2 hours after school, ≥2 days a week, on average accrue 7.55 minutes more moderate to vigorous physical activity (MVPA) per day than do girls who are supervised (95% confidence interval ([C.I]). These results adjusted for ethnicity, parent's education, participation in the free/reduced lunch program, neighborhood resources, or available transportation. Unsupervised girls (n = 279) did less homework (53.1% vs. 63.3%), spent less time riding in a car or bus (48.0% vs. 56.6%), talked on the phone more (35.5% vs. 21.1%), and watched more television (59.9% vs. 52.6%) than supervised girls (n = 569). However, unsupervised girls also were more likely to be dancing (14.0% vs. 9.3%) and listening to music (20.8% vs. 12.0%) (p < .05). Girls in an unsupervised environment engaged in fewer structured activities and did not immediately do their homework, but they were more likely to be physically active than supervised girls. These results may have implications for parents, school, and community agencies as to how to structure activities in order to encourage teenage girls to be more physically active.

  7. The relationship between unsupervised time after school and physical activity in adolescent girls

    PubMed Central

    Rushovich, Berenice R; Voorhees, Carolyn C; Davis, CE; Neumark-Sztainer, Dianne; Pfeiffer, Karin A; Elder, John P; Going, Scott; Marino, Vivian G

    2006-01-01

    Background Rising obesity and declining physical activity levels are of great concern because of the associated health risks. Many children are left unsupervised after the school day ends, but little is known about the association between unsupervised time and physical activity levels. This paper seeks to determine whether adolescent girls who are without adult supervision after school are more or less active than their peers who have a caregiver at home. Methods A random sample of girls from 36 middle schools at 6 field sites across the U.S. was selected during the fall of the 2002–2003 school year to participate in the baseline measurement activities of the Trial of Activity for Adolescent Girls (TAAG). Information was collected using six-day objectively measured physical activity, self-reported physical activity using a three-day recall, and socioeconomic and psychosocial measures. Complete information was available for 1422 out of a total of 1596 respondents. Categorical variables were analyzed using chi square and continuous variables were analyzed by t-tests. The four categories of time alone were compared using a mixed linear model controlling for clustering effects by study center. Results Girls who spent more time after school (≥2 hours per day, ≥2 days per week) without adult supervision were more active than those with adult supervision (p = 0.01). Girls alone for ≥2 hours after school, ≥2 days a week, on average accrue 7.55 minutes more moderate to vigorous physical activity (MVPA) per day than do girls who are supervised (95% confidence interval ([C.I]). These results adjusted for ethnicity, parent's education, participation in the free/reduced lunch program, neighborhood resources, or available transportation. Unsupervised girls (n = 279) did less homework (53.1% vs. 63.3%), spent less time riding in a car or bus (48.0% vs. 56.6%), talked on the phone more (35.5% vs. 21.1%), and watched more television (59.9% vs. 
    52.6%) than supervised girls (n = 569). However, unsupervised girls also were more likely to be dancing (14.0% vs. 9.3%) and listening to music (20.8% vs. 12.0%) (p < .05). Conclusion Girls in an unsupervised environment engaged in fewer structured activities and did not immediately do their homework, but they were more likely to be physically active than supervised girls. These results may have implications for parents, school, and community agencies as to how to structure activities in order to encourage teenage girls to be more physically active. PMID:16879750

  8. Integrative Data Analysis of Multi-Platform Cancer Data with a Multimodal Deep Learning Approach.

    PubMed

    Liang, Muxuan; Li, Zhizhong; Chen, Ting; Zeng, Jianyang

    2015-01-01

    Identification of cancer subtypes plays an important role in revealing useful insights into disease pathogenesis and advancing personalized therapy. The recent development of high-throughput sequencing technologies has enabled the rapid collection of multi-platform genomic data (e.g., gene expression, miRNA expression, and DNA methylation) for the same set of tumor samples. Although numerous integrative clustering approaches have been developed to analyze cancer data, few of them are particularly designed to exploit both deep intrinsic statistical properties of each input modality and complex cross-modality correlations among multi-platform input data. In this paper, we propose a new machine learning model, called multimodal deep belief network (DBN), to cluster cancer patients from multi-platform observation data. In our integrative clustering framework, relationships among inherent features of each single modality are first encoded into multiple layers of hidden variables, and then a joint latent model is employed to fuse common features derived from multiple input modalities. A practical learning algorithm, called contrastive divergence (CD), is applied to infer the parameters of our multimodal DBN model in an unsupervised manner. Tests on two available cancer datasets show that our integrative data analysis approach can effectively extract a unified representation of latent features to capture both intra- and cross-modality correlations, and identify meaningful disease subtypes from multi-platform cancer data. In addition, our approach can identify key genes and miRNAs that may play distinct roles in the pathogenesis of different cancer subtypes. Among those key miRNAs, we found that the expression level of miR-29a is highly correlated with survival time in ovarian cancer patients. 
    These results indicate that our multimodal DBN based data analysis approach may have practical applications in cancer pathogenesis studies and provide useful guidelines for personalized cancer therapy.

  9. Inflammatory Mediator Profiles Differ in Sepsis Patients With and Without Bacteremia.

    PubMed

    Mosevoll, Knut Anders; Skrede, Steinar; Markussen, Dagfinn Lunde; Fanebust, Hans Rune; Flaatten, Hans Kristian; Aßmus, Jörg; Reikvam, Håkon; Bruserud, Øystein

    2018-01-01

    Systemic levels of cytokines are altered during infection and sepsis. This prospective observational study aimed to investigate whether plasma levels of multiple inflammatory mediators differed between sepsis patients with and those without bacteremia during the initial phase of hospitalization. A total of 80 sepsis patients with proven bacterial infection and no immunosuppression were included in the study. Plasma samples were collected within 24 h of hospitalization, and Luminex® analysis was performed on 35 mediators: 16 cytokines, six growth factors, four adhesion molecules, and nine matrix metalloproteases (MMPs)/tissue inhibitors of metalloproteinases (TIMPs). Forty-two (52.5%) and 38 (47.5%) patients showed positive and negative blood cultures, respectively. There were significant differences in plasma levels of six soluble mediators between the "bacteremia" and "non-bacteremia" groups, using Mann-Whitney U test (p < 0.0014): tumor necrosis factor alpha (TNFα), CCL4, E-selectin, vascular cell adhesion molecule-1 (VCAM-1), intercellular adhesion molecule-1 (ICAM-1), and TIMP-1. Ten soluble mediators also significantly differed in plasma levels between the two groups, with p-values ranging between 0.05 and 0.0014: interleukin (IL)-1ra, IL-10, CCL2, CCL5, CXCL8, CXCL11, hepatocyte growth factor, MMP-8, TIMP-2, and TIMP-4. VCAM-1 showed the most robust results using univariate and multivariate logistic regression. Using unsupervised hierarchical clustering, we found that TNFα, CCL4, E-selectin, VCAM-1, ICAM-1, and TIMP-1 could be used to discriminate between patients with and those without bacteremia. Patients with bacteremia were mainly clustered in two separate groups (two upper clusters, 41/42, 98%), with higher levels of the mediators. One (2%) patient with bacteremia was clustered in the lower cluster, which comprised most of the patients without bacteremia (23/38, 61%) (χ² test, p < 0.0001). Our study showed that analysis of the plasma inflammatory mediator profile could represent a potential strategy for early identification of patients with bacteremia.
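
    The cluster-by-bacteremia comparison reported in this abstract can be checked approximately with a 2x2 chi-square calculation. The sketch below makes several assumptions. The two upper clusters are collapsed into one column. The count of 15 non-bacteremia patients in the upper clusters is inferred as 38 minus 23 rather than stated in the abstract. No continuity correction is applied. The authors' exact test setup is not given, so this is only a plausibility check.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With one degree of freedom the upper-tail probability equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: bacteremia, non-bacteremia. Columns: upper clusters, lower cluster.
# 41, 1 and 23 come from the abstract; 15 is inferred (38 - 23) and is an assumption.
table = [[41, 1], [15, 23]]
stat, p_value = chi_square_2x2(table)
print(f"chi-square = {stat:.2f}, p = {p_value:.2e}")
```

    Under these assumptions the p-value falls well below 0.0001, in line with the significance level quoted in the abstract.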

  10. Joint Clustering and Component Analysis of Correspondenceless Point Sets: Application to Cardiac Statistical Modeling.

    PubMed

    Gooya, Ali; Lekadir, Karim; Alba, Xenia; Swift, Andrew J; Wild, Jim M; Frangi, Alejandro F

    2015-01-01

    Construction of Statistical Shape Models (SSMs) from arbitrary point sets is a challenging problem due to significant shape variation and lack of explicit point correspondence across the training data set. In medical imaging, point sets can generally represent different shape classes that span healthy and pathological exemplars. In such cases, the constructed SSM may not generalize well, largely because the probability density function (pdf) of the point sets deviates from the underlying assumption of Gaussian statistics. To this end, we propose a generative model for unsupervised learning of the pdf of point sets as a mixture of distinctive classes. A Variational Bayesian (VB) method is proposed for making joint inferences on the labels of point sets, and the principal modes of variations in each cluster. The method provides a flexible framework to handle point sets with no explicit point-to-point correspondences. We also show that by maximizing the marginalized likelihood of the model, the optimal number of clusters of point sets can be determined. We illustrate this work in the context of understanding the anatomical phenotype of the left and right ventricles in heart. To this end, we use a database containing hearts of healthy subjects, patients with Pulmonary Hypertension (PH), and patients with Hypertrophic Cardiomyopathy (HCM). We demonstrate that our method can outperform traditional PCA in both generalization and specificity measures.

  11. A comparative evaluation of supervised and unsupervised representation learning approaches for anaplastic medulloblastoma differentiation

    NASA Astrophysics Data System (ADS)

    Cruz-Roa, Angel; Arevalo, John; Basavanhally, Ajay; Madabhushi, Anant; González, Fabio

    2015-01-01

    Learning data representations directly from the data itself is an approach that has shown great success in different pattern recognition problems, outperforming state-of-the-art feature extraction schemes for different tasks in computer vision, speech recognition and natural language processing. Representation learning applies unsupervised and supervised machine learning methods to large amounts of data to find building-blocks that better represent the information in it. Digitized histopathology images represent a very good testbed for representation learning since they involve large amounts of highly complex visual data. This paper presents a comparative evaluation of different supervised and unsupervised representation learning architectures to specifically address open questions on which type of learning architecture (deep or shallow) and which type of learning (unsupervised or supervised) is optimal. In this paper we limit ourselves to addressing these questions in the context of distinguishing between anaplastic and non-anaplastic medulloblastomas from routine haematoxylin and eosin stained images. The unsupervised approaches evaluated were sparse autoencoders and topographic reconstruct independent component analysis, and the supervised approach was convolutional neural networks. Experimental results show that shallow architectures with more neurons are better than deeper architectures without taking into account local space invariances and that topographic constraints provide useful invariant features in scale and rotations for efficient tumor differentiation.

  12. Integrated genetic and epigenetic analysis identifies three different subclasses of colon cancer

    PubMed Central

    Shen, Lanlan; Toyota, Minoru; Kondo, Yutaka; Lin, E; Zhang, Li; Guo, Yi; Hernandez, Natalie Supunpong; Chen, Xinli; Ahmed, Saira; Konishi, Kazuo; Hamilton, Stanley R.; Issa, Jean-Pierre J.

    2007-01-01

    Colon cancer has been viewed as the result of progressive accumulation of genetic and epigenetic abnormalities. However, this view does not fully reflect the molecular heterogeneity of the disease. We have analyzed both genetic (mutations of BRAF, KRAS, and p53 and microsatellite instability) and epigenetic alterations (DNA methylation of 27 CpG island promoter regions) in 97 primary colorectal cancer patients. Two clustering analyses on the basis of either epigenetic profiling or a combination of genetic and epigenetic profiling were performed to identify subclasses with distinct molecular signatures. Unsupervised hierarchical clustering of the DNA methylation data identified three distinct groups of colon cancers named CpG island methylator phenotype (CIMP) 1, CIMP2, and CIMP negative. Genetically, these three groups correspond to very distinct profiles. CIMP1 are characterized by MSI (80%) and BRAF mutations (53%) and rare KRAS and p53 mutations (16% and 11%, respectively). CIMP2 is associated with 92% KRAS mutations and rare MSI, BRAF, or p53 mutations (0, 4, and 31% respectively). CIMP-negative cases have a high rate of p53 mutations (71%) and lower rates of MSI (12%) or mutations of BRAF (2%) or KRAS (33%). Clustering based on both genetic and epigenetic parameters also identifies three distinct (and homogeneous) groups that largely overlap with the previous classification. The three groups are independent of age, gender, or stage, but CIMP1 and 2 are more common in proximal tumors. Together, our integrated genetic and epigenetic analysis reveals that colon cancers correspond to three molecularly distinct subclasses of disease. PMID:18003927

  13. Vineyard zonal management for grape quality assessment by combining airborne remote sensed imagery and soil sensors

    NASA Astrophysics Data System (ADS)

    Bonilla, I.; Martínez De Toda, F.; Martínez-Casasnovas, J. A.

    2014-10-01

    Vineyard variability within the fields is well known by grape growers, producing different plant responses and fruit characteristics. Many technologies have been developed in recent decades in order to assess this spatial variability, including remote sensing and soil sensors. In this paper we study the possibility of creating a stable classification system that better provides useful information for the grower, especially in terms of grape batch quality sorting. The work was carried out during 4 years in a rain-fed Tempranillo vineyard located in Rioja (Spain). NDVI was extracted from airborne imagery, and soil conductivity (EC) data was acquired by an EM38 sensor. Fifty-four vines were sampled at véraison for vegetative parameters and before harvest for yield and grape analysis. An Isocluster unsupervised classification in two classes was performed in 5 different ways, combining NDVI maps individually, collectively and combined with EC. The target vines were assigned to different zones depending on the clustering combination. Analysis of variance was performed in order to verify the ability of the combinations to provide the most accurate information. All combinations showed a similar behaviour concerning vegetative parameters. Yield parameters were classified better by the EC-based clustering, whilst grape maturity parameters seemed to be classified more accurately by combining all NDVIs and EC. Grape quality parameters (anthocyanins and phenolics) presented similar results for all combinations except for the NDVI map of the individual year, where the results were poorer. These results reveal that clustering on stable parameters (EC and/or all NDVIs together) yields better information for a vineyard zonal management strategy.

  14. A novel framework for feature extraction in multi-sensor action potential sorting.

    PubMed

    Wu, Shun-Chi; Swindlehurst, A Lee; Nenadic, Zoran

    2015-09-30

    Extracellular recordings of multi-unit neural activity have become indispensable in neuroscience research. The analysis of the recordings begins with the detection of the action potentials (APs), followed by a classification step where each AP is associated with a given neural source. A feature extraction step is required prior to classification in order to reduce the dimensionality of the data and the impact of noise, allowing source clustering algorithms to work more efficiently. In this paper, we propose a novel framework for multi-sensor AP feature extraction based on the so-called Matched Subspace Detector (MSD), which is shown to be a natural generalization of standard single-sensor algorithms. Clustering using both simulated data and real AP recordings taken in the locust antennal lobe demonstrates that the proposed approach yields features that are discriminatory and lead to promising results. Unlike existing methods, the proposed algorithm finds joint spatio-temporal feature vectors that match the dominant subspace observed in the two-dimensional data without the need for a forward propagation model or AP templates. The proposed MSD approach provides more discriminatory features for unsupervised AP sorting applications. Copyright © 2015 Elsevier B.V. All rights reserved.

  15. Low-cost multispectral imaging for remote sensing of lettuce health

    NASA Astrophysics Data System (ADS)

    Ren, David D. W.; Tripathi, Siddhant; Li, Larry K. B.

    2017-01-01

    In agricultural remote sensing, unmanned aerial vehicle (UAV) platforms offer many advantages over conventional satellite and full-scale airborne platforms. One of the most important advantages is their ability to capture high spatial resolution images (1-10 cm) on-demand and at different viewing angles. However, UAV platforms typically rely on the use of multiple cameras, which can be costly and difficult to operate. We present the development of a simple low-cost imaging system for remote sensing of crop health and demonstrate it on lettuce (Lactuca sativa) grown in Hong Kong. To identify the optimal vegetation index, we recorded images of both healthy and unhealthy lettuce, and used them as input in an expectation maximization cluster analysis with a Gaussian mixture model. Results from unsupervised and supervised clustering show that, among four widely used vegetation indices, the blue wide-dynamic range vegetation index is the most accurate. This study shows that it is readily possible to design and build a remote sensing system capable of determining the health status of lettuce at a reasonably low cost (

  16. Hierarchical clustering of EMD based interest points for road sign detection

    NASA Astrophysics Data System (ADS)

    Khan, Jesmin; Bhuiyan, Sharif; Adhami, Reza

    2014-04-01

    This paper presents an automatic road traffic sign detection and recognition system based on hierarchical clustering of interest points and joint transform correlation. The proposed algorithm consists of the following three stages: interest point detection, clustering of those points and similarity search. At the first stage, good discriminative, rotation and scale invariant interest points are selected from the image edges based on the 1-D empirical mode decomposition (EMD). We propose a two-step unsupervised clustering technique, which is adaptive and based on two criteria. In this context, the detected points are initially clustered based on the stable local features related to the brightness and color, which are extracted using a Gabor filter. Then points belonging to each partition are reclustered depending on the dispersion of the points in the initial cluster using the position feature. This two-step hierarchical clustering yields the possible candidate road signs or the regions of interest (ROIs). Finally, a fringe-adjusted joint transform correlation (JTC) technique is used for matching the unknown signs with the existing known reference road signs stored in the database. The presented framework provides a novel way to detect a road sign from natural scenes, and the results demonstrate the efficacy of the proposed technique, which yields a very low false hit rate.

  17. Classifying seismic noise and sources from OBS data using unsupervised machine learning

    NASA Astrophysics Data System (ADS)

    Mosher, S. G.; Audet, P.

    2017-12-01

    The paradigm of plate tectonics was established mainly by recognizing the central role of oceanic plates in the production and destruction of tectonic plates at their boundaries. Since that realization, however, seismic studies of tectonic plates and their associated deformation have slowly shifted their attention toward continental plates due to the ease of installation and maintenance of high-quality seismic networks on land. The result has been a much more detailed understanding of the seismicity patterns associated with continental plate deformation in comparison with the low-magnitude deformation patterns within oceanic plates and at their boundaries. While the number of high-quality ocean-bottom seismometer (OBS) deployments within the past decade has demonstrated the potential to significantly increase our understanding of tectonic systems in oceanic settings, OBS data poses significant challenges to many of the traditional data processing techniques in seismology. In particular, problems involving the detection, location, and classification of seismic sources occurring within oceanic settings are much more difficult due to the extremely noisy seafloor environment in which data are recorded. However, classifying data without a priori constraints is a problem that is routinely pursued via unsupervised machine learning algorithms, which remain robust even in cases involving complicated datasets. In this research, we apply simple unsupervised machine learning algorithms (e.g., clustering) to OBS data from the Cascadia Initiative in an attempt to classify and detect a broad range of seismic sources, including various noise sources and tremor signals occurring within ocean settings.

  18. Towards a robust framework for catchment classification

    NASA Astrophysics Data System (ADS)

    Deshmukh, A.; Samal, A.; Singh, R.

    2017-12-01

    Classification of catchments based on various measures of similarity has emerged as an important technique to understand regional scale hydrologic behavior. Classification of catchment characteristics and/or streamflow response has been used to reveal which characteristics are more likely to explain the observed variability of hydrologic response. However, numerous algorithms for supervised or unsupervised classification are available, making it hard to identify the algorithm most suitable for the dataset at hand. Consequently, existing catchment classification studies vary significantly in the classification algorithms employed, with no previous attempt at understanding the degree of uncertainty in classification due to this algorithmic choice. This hinders the generalizability of interpretations related to hydrologic behavior. Our goal is to develop a protocol that can be followed while classifying hydrologic datasets. We focus on a classification framework for unsupervised classification and provide a step-by-step classification procedure. The steps include testing the clusterability of the original dataset prior to classification, feature selection, validation of clustered data, and quantification of the similarity of two clusterings. We test several commonly available methods within this framework to understand the level of similarity of classification results across algorithms. We apply the proposed framework to recently developed datasets for India to analyze to what extent catchment properties can explain observed catchment response. Our testing dataset includes watershed characteristics for over 200 watersheds, which comprise both natural (physio-climatic) characteristics and socio-economic characteristics. This framework allows us to understand the controls on observed hydrologic variability across India.
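
    The last step listed in this abstract, quantifying the similarity of two clusterings, can be illustrated with the Rand index: the fraction of item pairs on which two labelings agree (both place the pair together, or both place it apart). The abstract does not name the measure used, so choosing the Rand index is an assumption of this sketch, and the labels below are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs that both clusterings treat the same way (together or apart)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items to compare clusterings")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical cluster labels for six catchments from two different algorithms.
algorithm_1 = [0, 0, 1, 1, 2, 2]
algorithm_2 = [1, 1, 0, 0, 0, 2]
print(rand_index(algorithm_1, algorithm_2))  # 12 of 15 pairs agree
```

    The index depends only on which items are grouped together, not on the label values themselves, so relabeled but otherwise identical clusterings score 1.0. A chance-corrected variant such as the adjusted Rand index is often preferred when cluster counts differ.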

  19. LCC: Light Curves Classifier

    NASA Astrophysics Data System (ADS)

    Vo, Martin

    2017-08-01

    Light Curves Classifier uses data mining and machine learning to obtain and classify desired objects. This task can be accomplished by attributes of light curves or any time series, including shapes, histograms, or variograms, or by other available information about the inspected objects, such as color indices, temperatures, and abundances. After specifying features which describe the objects to be searched, the software trains on a given training sample, and can then be used for unsupervised clustering for visualizing the natural separation of the sample. The package can also be used for automatic tuning of the parameters of the methods used (for example, number of hidden neurons or binning ratio). Trained classifiers can be used for filtering outputs from astronomical databases or data stored locally. Light Curves Classifier can also be used for simple downloading of light curves and all available information of queried stars. It can natively connect to OgleII, OgleIII, ASAS, CoRoT, Kepler, Catalina and MACHO, and new connectors or descriptors can be implemented. In addition to direct usage of the package and command line UI, the program can be used through a web interface. Users can create jobs for "training" methods on given objects, querying databases and filtering outputs by trained filters. Preimplemented descriptors, classifiers and connectors can be picked by simple clicks and their parameters can be tuned by giving ranges of these values. All combinations are then calculated and the best one is used for creating the filter. Natural separation of the data can be visualized by unsupervised clustering.

  20. Subgraph augmented non-negative tensor factorization (SANTF) for modeling clinical narrative text

    PubMed Central

    Xin, Yu; Hochberg, Ephraim; Joshi, Rohit; Uzuner, Ozlem; Szolovits, Peter

    2015-01-01

    Objective Extracting medical knowledge from electronic medical records requires automated approaches to combat scalability limitations and selection biases. However, existing machine learning approaches are often regarded by clinicians as black boxes. Moreover, training data for these automated approaches are often sparsely annotated at best. The authors target unsupervised learning for modeling clinical narrative text, aiming at improving both accuracy and interpretability. Methods The authors introduce a novel framework named subgraph augmented non-negative tensor factorization (SANTF). In addition to relying on atomic features (e.g., words in clinical narrative text), SANTF automatically mines higher-order features (e.g., relations of lymphoid cells expressing antigens) from clinical narrative text by converting sentences into a graph representation and identifying important subgraphs. The authors compose a tensor using patients, higher-order features, and atomic features as its respective modes. We then apply non-negative tensor factorization to cluster patients, and simultaneously identify latent groups of higher-order features that link to patient clusters, as in clinical guidelines where a panel of immunophenotypic features and laboratory results are used to specify diagnostic criteria. Results and Conclusion SANTF demonstrated over 10% improvement in averaged F-measure on patient clustering compared to widely used non-negative matrix factorization (NMF) and k-means clustering methods. Multiple baselines were established by modeling patient data using patient-by-features matrices with different feature configurations and then performing NMF or k-means to cluster patients. Feature analysis identified latent groups of higher-order features that lead to medical insights. We also found that the latent groups of atomic features help to better correlate the latent groups of higher-order features. PMID:25862765

  1. Identification of biomarkers of response to abatacept in patients with SLE using deconvolution of whole blood transcriptomic data from a phase IIb clinical trial.

    PubMed

    Bandyopadhyay, Somnath; Connolly, Sean E; Jabado, Omar; Ye, June; Kelly, Sheila; Maldonado, Michael A; Westhovens, Rene; Nash, Peter; Merrill, Joan T; Townsend, Robert M

    2017-01-01

    To characterise patients with active SLE based on pretreatment gene expression-defined peripheral immune cell patterns and identify clusters enriched for potential responders to abatacept treatment. This post hoc analysis used baseline peripheral whole blood transcriptomic data from patients in a phase IIb trial of intravenous abatacept (~10 mg/kg/month). Cell-specific genes were used with a published deconvolution algorithm to identify immune cell proportions in patient samples, and unsupervised consensus clustering was generated. Efficacy data were re-analysed. Patient data (n=144: abatacept: n=98; placebo: n=46) were grouped into four main clusters (C) by predominant characteristic cells: C1-neutrophils; C2-cytotoxic T cells, B-cell receptor-ligated B cells, monocytes, IgG memory B cells, activated T helper cells; C3-plasma cells, activated dendritic cells, activated natural killer cells, neutrophils; C4-activated dendritic cells, cytotoxic T cells. C3 had the highest baseline total British Isles Lupus Assessment Group (BILAG) scores, highest antidouble-stranded DNA autoantibody levels and shortest time to flare (TTF), plus trends in favour of response to abatacept over placebo: adjusted mean difference in BILAG score over 1 year, -4.78 (95% CI -12.49 to 2.92); median TTF, 56 vs 6 days; greater normalisation of complement component 3 and 4 levels. Differential improvements with abatacept were not seen in other clusters, except for median TTF in C1 (201 vs 109 days). Immune cell clustering segmented disease severity and responsiveness to abatacept. Definition of immune response cell types may inform design and interpretation of SLE trials and treatment decisions. NCT00119678; results.

  2. Unsupervised Feature Selection Based on the Morisita Index for Hyperspectral Images

    NASA Astrophysics Data System (ADS)

    Golay, Jean; Kanevski, Mikhail

    2017-04-01

    Hyperspectral sensors are capable of acquiring images with hundreds of narrow and contiguous spectral bands. Compared with traditional multispectral imagery, the use of hyperspectral images allows better performance in discriminating between land-cover classes, but it also results in large redundancy and high computational data processing. To alleviate such issues, unsupervised feature selection techniques for redundancy minimization can be implemented. Their goal is to select the smallest subset of features (or bands) in such a way that all the information content of a data set is preserved as much as possible. The present research deals with the application to hyperspectral images of a recently introduced technique of unsupervised feature selection: the Morisita-Based filter for Redundancy Minimization (MBRM). MBRM is based on the (multipoint) Morisita index of clustering and on the Morisita estimator of Intrinsic Dimension (ID). The fundamental idea of the technique is to retain only the bands which contribute to increasing the ID of an image. In this way, redundant bands are disregarded, since they have no impact on the ID. Besides, MBRM has several advantages over benchmark techniques: in addition to its ability to deal with large data sets, it can capture highly-nonlinear dependences and its implementation is straightforward in any programming environment. Experimental results on freely available hyperspectral images show the good effectiveness of MBRM in remote sensing data processing. Comparisons with benchmark techniques are carried out and random forests are used to assess the performance of MBRM in reducing the data dimensionality without loss of relevant information.

    References [1] C. Traina Jr., A.J.M. Traina, L. Wu, C. Faloutsos, Fast feature selection using fractal dimension, in: Proceedings of the XV Brazilian Symposium on Databases, SBBD, pp. 158-171, 2000. [2] J. Golay, M. Kanevski, A new estimator of intrinsic dimension based on the multipoint Morisita index, Pattern Recognition 48(12), pp. 4070-4081, 2015. [3] J. Golay, M. Kanevski, Unsupervised feature selection based on the Morisita estimator of intrinsic dimension, arXiv:1608.05581, 2016.
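
    As a rough illustration of the kind of quantity MBRM builds on, the sketch below computes the classic two-point Morisita index of clustering on a regular grid. It is not the multipoint index or the intrinsic-dimension estimator used by the authors; the function name `morisita_index`, the unit-hypercube scaling and the `cells_per_axis` parameter are assumptions made for this example only.

```python
from collections import Counter


def morisita_index(points, cells_per_axis):
    """Classic two-point Morisita index for points scaled to [0, 1]^d.

    I = Q * sum_i n_i (n_i - 1) / (N (N - 1)), where Q is the number of
    grid cells, n_i the count in cell i and N the total number of points.
    Values near 1 suggest a random pattern, larger values suggest clustering.
    """
    n_points = len(points)
    if n_points < 2:
        raise ValueError("need at least two points")
    dim = len(points[0])
    # Assign each point to a grid cell; clamp x == 1.0 into the last cell.
    counts = Counter(
        tuple(min(int(x * cells_per_axis), cells_per_axis - 1) for x in p)
        for p in points
    )
    n_cells = cells_per_axis ** dim
    pairs_in_cells = sum(c * (c - 1) for c in counts.values())
    return n_cells * pairs_in_cells / (n_points * (n_points - 1))


if __name__ == "__main__":
    clustered = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2)]
    spread = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
    print(morisita_index(clustered, 2), morisita_index(spread, 2))
```

    In this toy setting, four points packed into one of four cells give a high index, while one point per cell gives zero, which is the contrast the index is meant to capture.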

  3. Multi-scale Modeling of Radiation Damage: Large Scale Data Analysis

    NASA Astrophysics Data System (ADS)

    Warrier, M.; Bhardwaj, U.; Bukkuru, S.

    2016-10-01

    Modification of materials in nuclear reactors due to neutron irradiation is a multiscale problem. These neutrons pass through materials creating several energetic primary knock-on atoms (PKA) which cause localized collision cascades creating damage tracks, defects (interstitials and vacancies) and defect clusters depending on the energy of the PKA. These defects diffuse and recombine throughout the whole duration of operation of the reactor, thereby changing the micro-structure of the material and its properties. It is therefore desirable to develop predictive computational tools to simulate the micro-structural changes of irradiated materials. In this paper we describe how statistical averages of the collision cascades from thousands of MD simulations are used to provide inputs to Kinetic Monte Carlo (KMC) simulations which can handle larger sizes, more defects and longer time durations. Use of unsupervised learning and graph optimization in handling and analyzing large scale MD data will be highlighted.

  4. Entanglement-Based Machine Learning on a Quantum Computer

    NASA Astrophysics Data System (ADS)

    Cai, X.-D.; Wu, D.; Su, Z.-E.; Chen, M.-C.; Wang, X.-L.; Li, Li; Liu, N.-L.; Lu, C.-Y.; Pan, J.-W.

    2015-03-01

    Machine learning, a branch of artificial intelligence, learns from previous experience to optimize performance, which is ubiquitous in various fields such as computer sciences, financial analysis, robotics, and bioinformatics. A challenge is that machine learning with the rapidly growing "big data" could become intractable for classical computers. Recently, quantum machine learning algorithms [Lloyd, Mohseni, and Rebentrost, arXiv:1307.0411] were proposed which could offer an exponential speedup over classical algorithms. Here, we report the first experimental entanglement-based classification of two-, four-, and eight-dimensional vectors to different clusters using a small-scale photonic quantum computer, which are then used to implement supervised and unsupervised machine learning. The results demonstrate the working principle of using quantum computers to manipulate and classify high-dimensional vectors, the core mathematical routine in machine learning. The method can, in principle, be scaled to larger numbers of qubits, and may provide a new route to accelerate machine learning.

  5. DEFINITION OF MULTIVARIATE GEOCHEMICAL ASSOCIATIONS WITH POLYMETALLIC MINERAL OCCURRENCES USING A SPATIALLY DEPENDENT CLUSTERING TECHNIQUE AND RASTERIZED STREAM SEDIMENT DATA - AN ALASKAN EXAMPLE.

    USGS Publications Warehouse

    Jenson, Susan K.; Trautwein, C.M.

    1984-01-01

    The application of an unsupervised, spatially dependent clustering technique (AMOEBA) to interpolated raster arrays of stream sediment data has been found to provide useful multivariate geochemical associations for modeling regional polymetallic resource potential. The technique is based on three assumptions regarding the compositional and spatial relationships of stream sediment data and their regional significance. These assumptions are: (1) compositionally separable classes exist and can be statistically distinguished; (2) the classification of multivariate data should minimize the pair probability of misclustering to establish useful compositional associations; and (3) a compositionally defined class represented by three or more contiguous cells within an array is a more important descriptor of a terrane than a class represented by spatial outliers.

  6. Effects of Supervised vs. Unsupervised Training Programs on Balance and Muscle Strength in Older Adults: A Systematic Review and Meta-Analysis.

    PubMed

    Lacroix, André; Hortobágyi, Tibor; Beurskens, Rainer; Granacher, Urs

    2017-11-01

    Balance and resistance training can improve healthy older adults' balance and muscle strength. Delivering such exercise programs at home without supervision may facilitate participation for older adults because they do not have to leave their homes. To date, no systematic literature analysis has been conducted to determine if supervision affects the effectiveness of these programs to improve healthy older adults' balance and muscle strength/power. The objective of this systematic review and meta-analysis was to quantify the effectiveness of supervised vs. unsupervised balance and/or resistance training programs on measures of balance and muscle strength/power in healthy older adults. In addition, the impact of supervision on training-induced adaptive processes was evaluated in the form of dose-response relationships by analyzing randomized controlled trials that compared supervised with unsupervised trials. A computerized systematic literature search was performed in the electronic databases PubMed, Web of Science, and SportDiscus to detect articles examining the role of supervision in balance and/or resistance training in older adults. The initially identified 6041 articles were systematically screened. Studies were included if they examined balance and/or resistance training in adults aged ≥65 years with no relevant diseases and registered at least one behavioral balance (e.g., time during single leg stance) and/or muscle strength/power outcome (e.g., time for 5-Times-Chair-Rise-Test). Finally, 11 studies were eligible for inclusion in this meta-analysis. Weighted mean standardized mean differences between subjects (SMD_bs) of supervised vs. unsupervised balance/resistance training studies were calculated. The included studies were coded for the following variables: number of participants, sex, age, number and type of interventions, type of balance/strength tests, and change (%) from pre- to post-intervention values.
Additionally, we coded training according to the following modalities: period, frequency, volume, modalities of supervision (i.e., number of supervised/unsupervised sessions within the supervised or unsupervised training groups, respectively). Heterogeneity was computed using I² and χ² statistics. The methodological quality of the included studies was evaluated using the Physiotherapy Evidence Database scale. Our analyses revealed that in older adults, supervised balance/resistance training was superior compared with unsupervised balance/resistance training in improving measures of static steady-state balance (mean SMD_bs = 0.28, p = 0.39), dynamic steady-state balance (mean SMD_bs = 0.35, p = 0.02), proactive balance (mean SMD_bs = 0.24, p = 0.05), balance test batteries (mean SMD_bs = 0.53, p = 0.02), and measures of muscle strength/power (mean SMD_bs = 0.51, p = 0.04). Regarding the examined dose-response relationships, our analyses showed that a number of 10-29 additional supervised sessions in the supervised training groups compared with the unsupervised training groups resulted in the largest effects for static steady-state balance (mean SMD_bs = 0.35), dynamic steady-state balance (mean SMD_bs = 0.37), and muscle strength/power (mean SMD_bs = 1.12). Further, ≥30 additional supervised sessions in the supervised training groups were needed to produce the largest effects on proactive balance (mean SMD_bs = 0.30) and balance test batteries (mean SMD_bs = 0.77). Effects in favor of supervised programs were larger for studies that did not include any supervised sessions in their unsupervised programs (mean SMD_bs: 0.28-1.24) compared with studies that implemented a few supervised sessions in their unsupervised programs (e.g., three supervised sessions throughout the entire intervention program; SMD_bs: -0.06 to 0.41).
The present findings have to be interpreted with caution because of the low number of eligible studies and the moderate methodological quality of the included studies, which is indicated by a median Physiotherapy Evidence Database scale score of 5. Furthermore, we indirectly compared dose-response relationships across studies and not from single controlled studies. Our analyses suggest that supervised balance and/or resistance training improved measures of balance and muscle strength/power to a greater extent than unsupervised programs in older adults. Owing to the small number of available studies, we were unable to establish a clear dose-response relationship with regard to the impact of supervision. However, the positive effects of supervised training are particularly prominent when compared with completely unsupervised training programs. It is therefore recommended to include supervised sessions (i.e., two out of three sessions/week) in balance/resistance training programs to effectively improve balance and muscle strength/power in older adults.
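
    The abstract above reports weighted mean effect sizes and an I² heterogeneity statistic without giving the computation. The sketch below shows one standard way such quantities are obtained: inverse-variance weighting, Cochran's Q and I² = max(0, (Q - df) / Q) × 100. The function name and the toy numbers are assumptions for illustration, and the review's exact weighting scheme may differ.

```python
def pooled_effect_and_i2(effects, variances):
    """Inverse-variance pooled effect, Cochran's Q, and I² (in percent).

    effects   -- per-study standardized mean differences
    variances -- per-study sampling variances of those effects
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching lists with at least two studies")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I² is the share of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


if __name__ == "__main__":
    # Hypothetical effect sizes and variances, not taken from the review.
    print(pooled_effect_and_i2([0.0, 1.0], [0.01, 0.01]))
```

    Two precise studies that disagree strongly, as in the toy call, yield a high I², whereas identical effects yield an I² of zero.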

  7. Application of global metabolomic profiling of synovial fluid for osteoarthritis biomarkers.

    PubMed

    Carlson, Alyssa K; Rawle, Rachel A; Adams, Erik; Greenwood, Mark C; Bothner, Brian; June, Ronald K

    2018-05-05

    Osteoarthritis affects over 250 million individuals worldwide. Currently, there are no options for early diagnosis of osteoarthritis, demonstrating the need for biomarker discovery. To find biomarkers of osteoarthritis in human synovial fluid, we used high performance liquid-chromatography mass spectrometry for global metabolomic profiling. Metabolites were extracted from human osteoarthritic (n = 5), rheumatoid arthritic (n = 3), and healthy (n = 5) synovial fluid, and a total of 1233 metabolites were detected. Principal components analysis clearly distinguished the metabolomic profiles of diseased from healthy synovial fluid. Synovial fluid from rheumatoid arthritis patients contained expected metabolites consistent with the inflammatory nature of the disease. Similarly, unsupervised clustering analysis found that each disease state was associated with distinct metabolomic profiles and clusters of co-regulated metabolites. For osteoarthritis, co-regulated metabolites that were upregulated compared to healthy synovial fluid mapped to known disease processes including chondroitin sulfate degradation, arginine and proline metabolism, and nitric oxide metabolism. We utilized receiver operating characteristic analysis to determine the diagnostic value of each metabolite and identified 35 metabolites as potential biomarkers of osteoarthritis, with an area under the receiver operating characteristic curve >0.9. These metabolites included phosphatidylcholine, lysophosphatidylcholine, ceramides, myristate derivatives, and carnitine derivatives. This pilot study provides strong justification for a larger cohort-based study of human osteoarthritic synovial fluid using global metabolomics. The significance of these data is the demonstration that metabolomic profiling of synovial fluid can identify relevant biomarkers of joint disease. Copyright © 2018 Elsevier Inc. All rights reserved.
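
    The per-metabolite screening described above (area under the ROC curve > 0.9) can be sketched as follows. This is a minimal illustration using the rank-based (Mann-Whitney) form of the AUC; the function names, the dictionary layout and the intensity values are hypothetical and are not taken from the study.

```python
def roc_auc(case_values, control_values):
    """AUC as the probability that a case value exceeds a control value.

    Ties count as one half, which matches the Mann-Whitney U formulation.
    """
    wins = 0.0
    for c in case_values:
        for h in control_values:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


def candidate_biomarkers(intensities, threshold=0.9):
    """Return names of features whose AUC exceeds the threshold.

    intensities maps a feature name to (case_values, control_values).
    Only upregulation in cases is considered in this sketch.
    """
    return sorted(
        name
        for name, (cases, controls) in intensities.items()
        if roc_auc(cases, controls) > threshold
    )


if __name__ == "__main__":
    # Hypothetical intensities for two made-up features.
    data = {
        "feature_a": ([5.1, 6.0, 5.7], [1.2, 2.0, 1.9]),
        "feature_b": ([1.0, 3.0], [2.0, 4.0]),
    }
    print(candidate_biomarkers(data))
```

    With only a handful of samples per group, as in a pilot study, many features can reach a high AUC by chance, so a threshold of this kind is best read as a screening step.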

  8. User Activity Recognition in Smart Homes Using Pattern Clustering Applied to Temporal ANN Algorithm.

    PubMed

    Bourobou, Serge Thomas Mickala; Yoo, Younghwan

    2015-05-21

    This paper discusses the possibility of recognizing and predicting user activities in an IoT (Internet of Things) based smart environment. Activity recognition is usually done in two steps: activity pattern clustering and activity type decision. Although many related works have been proposed, their performance was limited because each focused on only one of the two steps. This paper tries to find the best combination of a pattern clustering method and an activity decision algorithm among various existing works. For the first step, in order to classify such varied and complex user activities, we use a relevant and efficient unsupervised learning method called the K-pattern clustering algorithm. In the second step, the smart environment is trained to recognize and predict user activities inside his/her personal space using an artificial neural network based on Allen's temporal relations. The experimental results show that our combined method provides higher recognition accuracy for various activities than other data mining classification algorithms. Furthermore, it is more appropriate for a dynamic environment like an IoT based smart home.

  9. An evaluation of ISOCLS and CLASSY clustering algorithms for forest classification in northern Idaho. [Elk River quadrangle of the Clearwater National Forest]

    NASA Technical Reports Server (NTRS)

    Werth, L. F. (Principal Investigator)

    1981-01-01

    Both the iterative self-organizing clustering system (ISOCLS) and the CLASSY algorithms were applied to forest and nonforest classes for one 1:24,000 quadrangle map of northern Idaho and the classification and mapping accuracies were evaluated with 1:30,000 color infrared aerial photography. Confusion matrices for the two clustering algorithms were generated and studied to determine which is most applicable to forest and rangeland inventories in future projects. In an unsupervised mode, ISOCLS requires many trial-and-error runs to find the proper parameters to separate desired information classes. CLASSY tells more in a single run concerning the classes that can be separated, shows more promise for forest stratification than ISOCLS, and shows more promise for consistency. One major drawback to CLASSY is that important forest and range classes that are smaller than a minimum cluster size will be combined with other classes. The algorithm requires so much computer storage that only data sets as small as a quadrangle can be used at one time.

  10. Unsupervised Machine Learning for Developing Personalised Behaviour Models Using Activity Data.

    PubMed

    Fiorini, Laura; Cavallo, Filippo; Dario, Paolo; Eavis, Alexandra; Caleb-Solly, Praminda

    2017-05-04

    The goal of this study is to address two major issues that undermine the large scale deployment of smart home sensing solutions in people's homes. These include the costs associated with having to install and maintain a large number of sensors, and the pragmatics of annotating numerous sensor data streams for activity classification. Our aim was therefore to propose a method to describe individual users' behavioural patterns starting from unannotated data analysis of a minimal number of sensors and a "blind" approach for activity recognition. The methodology included processing and analysing sensor data from 17 older adults living in community-based housing to extract activity information at different times of the day. The findings illustrate that 55 days of sensor data from a sensor configuration comprising three sensors, and extracting appropriate features including a "busyness" measure, are adequate to build robust models which can be used for clustering individuals based on their behaviour patterns with a high degree of accuracy (>85%). The obtained clusters can be used to describe individual behaviour over different times of the day. This approach suggests a scalable solution to support optimising the personalisation of care by utilising low-cost sensing and analysis. This approach could be used to track a person's needs over time and fine-tune their care plan on an ongoing basis in a cost-effective manner.

  11. Unsupervised Machine Learning for Developing Personalised Behaviour Models Using Activity Data

    PubMed Central

    Fiorini, Laura; Cavallo, Filippo; Dario, Paolo; Eavis, Alexandra; Caleb-Solly, Praminda

    2017-01-01

    The goal of this study is to address two major issues that undermine the large scale deployment of smart home sensing solutions in people’s homes. These include the costs associated with having to install and maintain a large number of sensors, and the pragmatics of annotating numerous sensor data streams for activity classification. Our aim was therefore to propose a method to describe individual users’ behavioural patterns starting from unannotated data analysis of a minimal number of sensors and a “blind” approach for activity recognition. The methodology included processing and analysing sensor data from 17 older adults living in community-based housing to extract activity information at different times of the day. The findings illustrate that 55 days of sensor data from a sensor configuration comprising three sensors, and extracting appropriate features including a “busyness” measure, are adequate to build robust models which can be used for clustering individuals based on their behaviour patterns with a high degree of accuracy (>85%). The obtained clusters can be used to describe individual behaviour over different times of the day. This approach suggests a scalable solution to support optimising the personalisation of care by utilising low-cost sensing and analysis. This approach could be used to track a person’s needs over time and fine-tune their care plan on an ongoing basis in a cost-effective manner. PMID:28471405

  12. IMMAN: free software for information theory-based chemometric analysis.

    PubMed

    Urias, Ricardo W Pino; Barigye, Stephen J; Marrero-Ponce, Yovani; García-Jacas, César R; Valdes-Martiní, José R; Perez-Gimenez, Facundo

    2015-05-01

    The features and theoretical background of a new and free computational program for chemometric analysis denominated IMMAN (acronym for Information theory-based CheMoMetrics ANalysis) are presented. This is multi-platform software developed in the Java programming language, designed with a remarkably user-friendly graphical interface for the computation of a collection of information-theoretic functions adapted for rank-based unsupervised and supervised feature selection tasks. A total of 20 feature selection parameters are presented, with the unsupervised and supervised frameworks represented by 10 approaches in each case. Several information-theoretic parameters traditionally used as molecular descriptors (MDs) are adapted for use as unsupervised rank-based feature selection methods. On the other hand, a generalization scheme for the previously defined differential Shannon's entropy is discussed, as well as the introduction of Jeffreys information measure for supervised feature selection. Moreover, well-known information-theoretic feature selection parameters, such as information gain, gain ratio, and symmetrical uncertainty are incorporated to the IMMAN software ( http://mobiosd-hub.com/imman-soft/ ), following an equal-interval discretization approach. IMMAN offers data pre-processing functionalities, such as missing values processing, dataset partitioning, and browsing. Moreover, single parameter or ensemble (multi-criteria) ranking options are provided. Consequently, this software is suitable for tasks like dimensionality reduction, feature ranking, as well as comparative diversity analysis of data matrices. Simple examples of applications performed with this program are presented. A comparative study between IMMAN and WEKA feature selection tools using the Arcene dataset was performed, demonstrating similar behavior. 
In addition, it is revealed that the use of IMMAN unsupervised feature selection methods improves the performance of both IMMAN and WEKA supervised algorithms. Graphic representation for Shannon's distribution of MD calculating software.
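
    Information gain and symmetrical uncertainty after equal-interval discretization, which the abstract lists among the parameters incorporated in IMMAN, can be sketched in a few lines. The code below is an independent illustration and not IMMAN's Java implementation; the function names and the default of two bins are assumptions.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(symbols)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(symbols).values())


def equal_interval_bins(values, n_bins):
    """Map each value to one of n_bins equal-width intervals over its range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def information_gain(values, labels, n_bins=2):
    """H(labels) minus H(labels | discretized feature)."""
    bins = equal_interval_bins(values, n_bins)
    n = len(labels)
    conditional = 0.0
    for b in set(bins):
        subset = [lab for bb, lab in zip(bins, labels) if bb == b]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def symmetrical_uncertainty(values, labels, n_bins=2):
    """2 * IG / (H(feature) + H(labels)), a normalized score in [0, 1]."""
    h_sum = entropy(equal_interval_bins(values, n_bins)) + entropy(labels)
    if h_sum == 0:
        return 0.0
    return 2.0 * information_gain(values, labels, n_bins) / h_sum


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    print(information_gain([0.1, 0.2, 0.8, 0.9], labels))  # informative feature
    print(information_gain([0.1, 0.9, 0.1, 0.9], labels))  # uninformative feature
```

    Ranking features by either score and keeping the top-ranked ones is the usual way such measures are applied to supervised, rank-based feature selection.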

  13. Unsupervised algorithms for intrusion detection and identification in wireless ad hoc sensor networks

    NASA Astrophysics Data System (ADS)

    Hortos, William S.

    2009-05-01

    In previous work by the author, parameters across network protocol layers were selected as features in supervised algorithms that detect and identify certain intrusion attacks on wireless ad hoc sensor networks (WSNs) carrying multisensor data. The algorithms improved the residual performance of the intrusion prevention measures provided by any dynamic key-management schemes and trust models implemented among network nodes. The approach of this paper does not train algorithms on the signature of known attack traffic, but, instead, the approach is based on unsupervised anomaly detection techniques that learn the signature of normal network traffic. Unsupervised learning does not require the data to be labeled or to be purely of one type, i.e., normal or attack traffic. The approach can be augmented to add any security attributes and quantified trust levels, established during data exchanges among nodes, to the set of cross-layer features from the WSN protocols. A two-stage framework is introduced for the security algorithms to overcome the problems of input size and resource constraints. The first stage is an unsupervised clustering algorithm which reduces the payload of network data packets to a tractable size. The second stage is a traditional anomaly detection algorithm based on a variation of support vector machines (SVMs), whose efficiency is improved by the availability of data in the packet payload. In the first stage, selected algorithms are adapted to WSN platforms to meet system requirements for simple parallel distributed computation, distributed storage and data robustness. A set of mobile software agents, acting like an ant colony in securing the WSN, are distributed at the nodes to implement the algorithms. The agents move among the layers involved in the network response to the intrusions at each active node and trustworthy neighborhood, collecting parametric values and executing assigned decision tasks. 
This minimizes the need to move large amounts of audit-log data through resource-limited nodes and locates routines closer to that data. Performance of the unsupervised algorithms is evaluated against the network intrusions of black hole, flooding, Sybil and other denial-of-service attacks in simulations of published scenarios. Results for scenarios with intentionally malfunctioning sensors show the robustness of the two-stage approach to intrusion anomalies.

  14. A SOFTWARE PACKAGE FOR UNSUPERVISED PATTERN RECOGNITION AND SYNOPTIC REPRESENTATION OF RESULTS: APPLICATION TO VOLCANIC TREMOR DATA OF MT ETNA

    NASA Astrophysics Data System (ADS)

    Langer, H. K.; Falsaperla, S. M.; Behncke, B.; Messina, A.; Spampinato, S.

    2009-12-01

    Artificial Intelligence (AI) has found broad applications in volcano observatories worldwide with the aim of reducing volcanic hazard. The need to process ever larger quantities of data makes AI techniques appealing for monitoring purposes. Tools based on Artificial Neural Networks and Support Vector Machines have proved to be particularly successful in the classification of seismic events and volcanic tremor changes heralding eruptive activity, such as paroxysmal explosions and lava fountaining at Stromboli and Mt Etna, Italy (e.g., Falsaperla et al., 1996; Langer et al., 2009). Moving on from the excellent results obtained from these applications, we present KKAnalysis, a MATLAB-based software package that combines several unsupervised pattern classification methods, exploiting routines of the SOM Toolbox 2 for MATLAB (http://www.cis.hut.fi/projects/somtoolbox). KKAnalysis is based on Self Organizing Maps (SOM) and clustering methods consisting of K-Means, Fuzzy C-Means, and a scheme based on a metric accounting for correlation between components of the feature vector. We show examples of applications of this tool to volcanic tremor data recorded at Mt Etna between 2007 and 2009. This time span - during which Strombolian explosions, 7 episodes of lava fountaining and effusive activity occurred - is particularly interesting, as it encompassed different states of volcanic activity (i.e., non-eruptive, eruptive according to different styles) for the unsupervised classifier to identify, highlighting their development in time. Even subtle changes in the signal characteristics allow the unsupervised classifier to recognize features belonging to the different classes and stages of volcanic activity. A convenient color-coded representation shows the temporal development of the different classes of signal, making this method extremely helpful for monitoring purposes and surveillance.
Though being developed for volcanic tremor classification, KKAnalysis is generally applicable to any type of physical or chemical pattern, provided that feature vectors are given in numerical form. References: Falsaperla, S., S. Graziani, G. Nunnari, and S. Spampinato (1996). Automatic classification of volcanic earthquakes by using multy-layered neural networks. Natural Hazard, 13, 205-228. Langer, H., S. Falsaperla, M. Masotti, R. Campanini, S. Spampinato, and A. Messina (2008). Synopsis of supervised and unsupervised pattern classification techniques applied to volcanic tremor data at Mt Etna, Italy. Geophys. J. Int., doi:10.1111/j.1365-246X.2009.04179.x.
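
    K-means is one of the clustering methods the abstract names. For readers unfamiliar with it, the sketch below shows the basic Lloyd iteration on numerical feature vectors. It is written in Python rather than MATLAB, is not the KKAnalysis code, and takes the initial centroids as an explicit argument (an assumption made here to keep the example deterministic).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain Lloyd's algorithm: assign points, recompute means, repeat.

    Returns (labels, centroids). An empty cluster keeps its old centroid.
    """
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: the nearest centroid wins.
        labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(old)
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical two-dimensional feature vectors.
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

    In practice the result depends on the initial centroids, so implementations commonly run several random restarts and keep the partition with the lowest within-cluster sum of squares.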

  15. True Zero-Training Brain-Computer Interfacing – An Online Study

    PubMed Central

    Kindermans, Pieter-Jan; Schreuder, Martijn; Schrauwen, Benjamin; Müller, Klaus-Robert; Tangermann, Michael

    2014-01-01

    Despite several approaches to realize subject-to-subject transfer of pre-trained classifiers, the full performance of a Brain-Computer Interface (BCI) for a novel user can only be reached by presenting the BCI system with data from the novel user. In typical state-of-the-art BCI systems with a supervised classifier, the labeled data is collected during a calibration recording, in which the user is asked to perform a specific task. Based on the known labels of this recording, the BCI's classifier can learn to decode the individual's brain signals. Unfortunately, this calibration recording consumes valuable time. Furthermore, it is unproductive with respect to the final BCI application, e.g. text entry. Therefore, the calibration period must be reduced to a minimum, which is especially important for patients with a limited concentration ability. The main contribution of this manuscript is an online study on unsupervised learning in an auditory event-related potential (ERP) paradigm. Our results demonstrate that the calibration recording can be bypassed by utilizing an unsupervised trained classifier, that is initialized randomly and updated during usage. Initially, the unsupervised classifier tends to make decoding mistakes, as the classifier might not have seen enough data to build a reliable model. Using a constant re-analysis of the previously spelled symbols, these initially misspelled symbols can be rectified posthoc when the classifier has learned to decode the signals. We compare the spelling performance of our unsupervised approach and of the unsupervised posthoc approach to the standard supervised calibration-based dogma for n = 10 healthy users. To assess the learning behavior of our approach, it is unsupervised trained from scratch three times per user. 
Even with the relatively low SNR of an auditory ERP paradigm, the results show that after a limited number of trials (30 trials), the unsupervised approach performs comparably to a classic supervised model. PMID:25068464

  16. Unsupervised object segmentation with a hybrid graph model (HGM).

    PubMed

    Liu, Guangcan; Lin, Zhouchen; Yu, Yong; Tang, Xiaoou

    2010-05-01

    In this work, we address the problem of performing class-specific unsupervised object segmentation, i.e., automatic segmentation without annotated training images. Object segmentation can be regarded as a special data clustering problem where both class-specific information and local texture/color similarities have to be considered. To this end, we propose a hybrid graph model (HGM) that can make effective use of both symmetric and asymmetric relationship among samples. The vertices of a hybrid graph represent the samples and are connected by directed edges and/or undirected ones, which represent the asymmetric and/or symmetric relationship between them, respectively. When applied to object segmentation, vertices are superpixels, the asymmetric relationship is the conditional dependence of occurrence, and the symmetric relationship is the color/texture similarity. By combining the Markov chain formed by the directed subgraph and the minimal cut of the undirected subgraph, the object boundaries can be determined for each image. Using the HGM, we can conveniently achieve simultaneous segmentation and recognition by integrating both top-down and bottom-up information into a unified process. Experiments on 42 object classes (9,415 images in total) show promising results.

  17. UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets.

    PubMed

    Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K

    2015-06-04

    Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth-knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods, and of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets as it produces focused clusters that conform well to known biological facts.
Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.

  18. The Initial Development of Object Knowledge by a Learning Robot

    PubMed Central

    Modayil, Joseph; Kuipers, Benjamin

    2008-01-01

    We describe how a robot can develop knowledge of the objects in its environment directly from unsupervised sensorimotor experience. The object knowledge consists of multiple integrated representations: trackers that form spatio-temporal clusters of sensory experience, percepts that represent properties for the tracked objects, classes that support efficient generalization from past experience, and actions that reliably change object percepts. We evaluate how well this intrinsically acquired object knowledge can be used to solve externally specified tasks including object recognition and achieving goals that require both planning and continuous control. PMID:19953188

  19. 3D Visualization of Machine Learning Algorithms with Astronomical Data

    NASA Astrophysics Data System (ADS)

    Kent, Brian R.

    2016-01-01

    We present innovative machine learning (ML) methods using unsupervised clustering with minimum spanning trees (MSTs) to study 3D astronomical catalogs. Utilizing Python code to build trees based on galaxy catalogs, we can render the results with the visualization suite Blender to produce interactive 360 degree panoramic videos. The catalogs and their ML results can be explored in a 3D space using mobile devices, tablets or desktop browsers. We compare the statistics of the MST results to a number of machine learning methods relating to optimization and efficiency.
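
    The MST-based clustering idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the author's code: the function names, the toy coordinates, and the rule of cutting edges longer than a fixed `max_edge` are assumptions made for illustration, and only the Python standard library is used.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, length) edges."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # shortest known link from the tree to each point
    best_from = [-1] * n         # tree vertex that provides that link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the point outside the tree that is closest to it
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u, best_dist[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges

def mst_clusters(points, max_edge):
    """Drop MST edges longer than max_edge and label the remaining connected components."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j, length in minimum_spanning_tree(points):
        if length <= max_edge:
            parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

# Toy 3D "catalog" with made-up coordinates: two tight groups far apart.
catalog = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (10, 10, 10), (11, 10, 10), (10, 11, 10)]
labels = mst_clusters(catalog, max_edge=2.0)
```

    The quadratic-time Prim loop is adequate for a small example; a real galaxy catalog would more plausibly use a spatial index or a sparse-graph library.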

  20. Adaptive Water Sampling based on Unsupervised Clustering

    NASA Astrophysics Data System (ADS)

    Py, F.; Ryan, J.; Rajan, K.; Sherman, A.; Bird, L.; Fox, M.; Long, D.

    2007-12-01

    Autonomous Underwater Vehicles (AUVs) are widely used for oceanographic surveys, during which data is collected from a number of on-board sensors. Engineers and scientists at MBARI have extended this approach by developing a water sampler specially for the AUV, which can sample a specific patch of water at a specific time. The sampler, named the Gulper, captures 2 liters of seawater in less than 2 seconds on a 21" MBARI Odyssey AUV. Each sample chamber of the Gulper is filled with seawater through a one-way valve, which protrudes through the fairing of the AUV. This new kind of device raises a new problem: when to trigger the Gulper autonomously? For example, scientists interested in studying the mobilization and transport of shelf sediments would like to detect intermediate nepheloïd layers (INLs). To be able to detect this phenomenon we need to extract a model based on AUV sensors that can detect this feature in-situ. The formation of such a model is not obvious as identification of this feature is generally based on data from multiple sensors. We have developed an unsupervised data clustering technique to extract the different features which will then be used for on-board classification and triggering of the Gulper. We use a three-phase approach: 1) Use data from past missions to learn the different classes of data from sensor inputs. The clustering algorithm will then extract the set of features that can be distinguished within this large data set. 2) Scientists on shore then identify these features and point out which correspond to those of interest (e.g. nepheloïd layer, upwelling material etc.). 3) Embed the corresponding classifier into the AUV control system to indicate the most probable feature of the water depending on sensory input. The triggering algorithm looks to this result and triggers the Gulper if the classifier indicates that we are within the feature of interest with a predetermined threshold of confidence.
We have deployed this method of online classification and sampling based on AUV depth and HOBI Labs Hydroscat-2 sensor data. Using approximately 20,000 data samples the clustering algorithm generated 14 clusters with one identified as corresponding to a nepheloïd layer. We demonstrate that such a technique can be used to reliably and efficiently sample water based on multiple sources of data in real-time.
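
    The third phase described above (classify each reading against the learned clusters, then trigger only above a confidence threshold) can be sketched as follows. This is a hypothetical illustration, not the MBARI implementation: the centroids, the standardised two-feature readings, the softmax-style confidence, and the names `classify` and `should_trigger` are all assumptions.

```python
import math

def classify(sample, centroids):
    """Return (best_cluster_index, confidence) for one sensor reading.

    Confidence is a softmax over negative squared distances to the centroids,
    an assumed stand-in for whatever measure the on-board classifier uses.
    """
    sq_dists = [sum((s - c) ** 2 for s, c in zip(sample, centroid)) for centroid in centroids]
    closest = min(sq_dists)
    # subtract the smallest distance before exponentiating for numerical stability
    weights = [math.exp(-(d - closest)) for d in sq_dists]
    best = sq_dists.index(closest)
    return best, weights[best] / sum(weights)

def should_trigger(sample, centroids, target_cluster, threshold=0.9):
    """Fire the sampler only if the reading falls in the target cluster with enough confidence."""
    cluster, confidence = classify(sample, centroids)
    return cluster == target_cluster and confidence >= threshold

# Hypothetical centroids in standardised (depth, backscatter) units;
# cluster 1 plays the role of the feature of interest.
centroids = [(0.0, 0.0), (3.0, 3.0)]
fire = should_trigger((2.9, 3.1), centroids, target_cluster=1)        # well inside cluster 1
hold = should_trigger((1.6, 1.6), centroids, target_cluster=1)        # nearest to 1, but ambiguous
```

    Requiring both the right cluster and a minimum confidence keeps the limited number of sample chambers from being spent on borderline readings.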

  1. Housing and sexual health among street-involved youth.

    PubMed

    Kumar, Maya M; Nisenbaum, Rosane; Barozzino, Tony; Sgro, Michael; Bonifacio, Herbert J; Maguire, Jonathon L

    2015-10-01

    Street-involved youth (SIY) carry a disproportionate burden of sexually transmitted diseases (STD). Studies among adults suggest that improving housing stability may be an effective primary prevention strategy for improving sexual health. Housing options available to SIY offer varying degrees of stability and adult supervision. This study investigated whether housing options offering more stability and adult supervision are associated with fewer STD and related risk behaviors among SIY. A cross-sectional study was performed using public health survey and laboratory data collected from Toronto SIY in 2010. Three exposure categories were defined a priori based on housing situation: (1) stable and supervised housing, (2) stable and unsupervised housing, and (3) unstable and unsupervised housing. Multivariate logistic regression was used to test the association between housing category and current or recent STD. Secondary analyses were performed using the following secondary outcomes: blood-borne infection, recent binge-drinking, and recent high-risk sexual behavior. The final analysis included 184 SIY. Of these, 28.8 % had a current or recent STD. Housing situation was stable and supervised for 12.5 %, stable and unsupervised for 46.2 %, and unstable and unsupervised for 41.3 %. Compared to stable and supervised housing, there was no significant association between current or recent STD among stable and unsupervised housing or unstable and unsupervised housing. There was no significant association between housing category and risk of blood-borne infection, binge-drinking, or high-risk sexual behavior. Although we did not demonstrate a significant association between stable and supervised housing and lower STD risk, our incorporation of both housing stability and adult supervision into a priori defined exposure groups may inform future studies of housing-related prevention strategies among SIY. 
Multi-modal interventions beyond housing alone may also be required to prevent sexual morbidity among these vulnerable youth.

  2. Parametric Analysis of a Hover Test Vehicle using Advanced Test Generation and Data Analysis

    NASA Technical Reports Server (NTRS)

    Gundy-Burlet, Karen; Schumann, Johann; Menzies, Tim; Barrett, Tony

    2009-01-01

    Large complex aerospace systems are generally validated in regions local to anticipated operating points rather than through characterization of the entire feasible operational envelope of the system. This is due to the large parameter space, and complex, highly coupled nonlinear nature of the different systems that contribute to the performance of the aerospace system. We have addressed the factors deterring such an analysis by applying a combination of technologies to the area of flight envelope assessment. We utilize n-factor (2,3) combinatorial parameter variations to limit the number of cases, but still explore important interactions in the parameter space in a systematic fashion. The data generated is automatically analyzed through a combination of unsupervised learning using a Bayesian multivariate clustering technique (AutoBayes) and supervised learning of critical parameter ranges using the machine-learning tool TAR3, a treatment learner. Covariance analysis with scatter plots and likelihood contours is used to visualize correlations between simulation parameters and simulation results, a task that requires tool support, especially for large and complex models. We present results of simulation experiments for a cold-gas-powered hover test vehicle.

  3. Variation of δ2H, δ18O & δ13C in crude palm oil from different regions in Malaysia: Potential of stable isotope signatures as a key traceability parameter.

    PubMed

    Muhammad, Syahidah Akmal; Seow, Eng-Keng; Mohd Omar, A K; Rodhi, Ainolsyakira Mohd; Mat Hassan, Hasnuri; Lalung, Japareng; Lee, Sze-Chi; Ibrahim, Baharudin

    2018-01-01

    A total of 33 crude palm oil samples were randomly collected from different regions in Malaysia. Stable carbon isotopic composition (δ13C) was determined using Flash 2000 elemental analyzer while hydrogen and oxygen isotopic compositions (δ2H and δ18O) were analyzed by Thermo Finnigan TC/EA, wherein both instruments were coupled to an isotope ratio mass spectrometer. The bulk δ2H, δ18O and δ13C of the samples were analyzed by Hierarchical Cluster Analysis (HCA), Principal Component Analysis (PCA) and Orthogonal Partial Least Square-Discriminant Analysis (OPLS-DA). Unsupervised HCA and PCA methods have demonstrated that crude palm oil samples were grouped into clusters according to their respective states. A predictive model was constructed by supervised OPLS-DA with good predictive power of 52.60%. Robustness of the predictive model was validated with overall accuracy of 71.43%. Blind test samples were correctly assigned to their respective cluster except for samples from southern region. δ18O was proposed as the promising discriminatory marker for discerning crude palm oil samples obtained from different regions. The stable isotope profile was proven to be useful for origin traceability of crude palm oil samples at a narrower geographical area, i.e. based on regions in Malaysia. Predictive power and accuracy of the predictive model are expected to improve with the increase in sample size. Conclusively, the results of this study have fulfilled the main objective of this work: the simple approach of combining stable isotope analysis with chemometrics can be used to discriminate crude palm oil samples obtained from different regions in Malaysia. Overall, this study shows the feasibility of this approach to be used as a traceability assessment of crude palm oils. Copyright © 2017 The Chartered Society of Forensic Sciences. Published by Elsevier B.V. All rights reserved.
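
    As a rough illustration of the hierarchical cluster analysis step named in this abstract, the sketch below performs average-linkage agglomerative clustering in plain Python. It is not the study's procedure: the linkage choice, the function name `agglomerative`, and the four isotope triples are assumptions, and the numbers are invented toy values rather than measurements. It assumes `n_clusters` is at least 1.

```python
import math

def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering; returns a list of index lists."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # mean pairwise Euclidean distance between members of the two clusters
        return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # find the closest pair of clusters and merge the second into the first
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented (d2H, d18O, d13C) triples for four samples; not data from the study.
samples = [
    (-220.0, 20.0, -29.5),
    (-221.0, 20.5, -29.7),
    (-235.0, 17.0, -30.5),
    (-236.0, 16.5, -30.2),
]
groups = agglomerative(samples, n_clusters=2)
```

    In practice each variable would usually be standardised first, since δ2H spans a much larger numeric range than δ18O or δ13C and would otherwise dominate the distances.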

  4. Diagnostic index of 3D osteoarthritic changes in TMJ condylar morphology

    NASA Astrophysics Data System (ADS)

    Gomes, Liliane R.; Gomes, Marcelo; Jung, Bryan; Paniagua, Beatriz; Ruellas, Antonio C.; Gonçalves, João Roberto; Styner, Martin A.; Wolford, Larry; Cevidanes, Lucia

    2015-03-01

    The aim of this study was to investigate imaging statistical approaches for classifying 3D osteoarthritic morphological variations among 169 Temporomandibular Joint (TMJ) condyles. Cone beam Computed Tomography (CBCT) scans were acquired from 69 patients with long-term TMJ Osteoarthritis (OA) (39.1 ± 15.7 years), 15 patients at initial diagnosis of OA (44.9 ± 14.8 years) and 7 healthy controls (43 ± 12.4 years). 3D surface models of the condyles were constructed and Shape Correspondence was used to establish correspondent points on each model. The statistical framework included a multivariate analysis of covariance (MANCOVA) and Direction-Projection-Permutation (DiProPerm) for testing statistical significance of the differences between healthy control and the OA group determined by clinical and radiographic diagnoses. Unsupervised classification using hierarchical agglomerative clustering (HAC) was then conducted. Condylar morphology in OA and healthy subjects varied widely. Compared with healthy controls, the OA average condyle was statistically significantly smaller in all dimensions except its anterior surface. Significant flattening of the lateral pole was noticed at initial diagnosis (p < 0.05). Areas of 3.88 mm bone resorption at the superior surface and 3.10 mm bone apposition at the anterior aspect of the long-term OA average model were observed. 1000 permutation statistics of DiProPerm supported a significant difference between the healthy control group and OA group (t = 6.7, empirical p-value = 0.001). Clinically meaningful unsupervised classification of TMJ condylar morphology determined a preliminary diagnostic index of 3D osteoarthritic changes, which may be the first step towards a more targeted diagnosis of this condition.

  5. BinSanity: unsupervised clustering of environmental microbial assemblies using coverage and affinity propagation

    PubMed Central

    Heidelberg, John F.; Tully, Benjamin J.

    2017-01-01

    Metagenomics has become an integral part of defining microbial diversity in various environments. Many ecosystems have characteristically low biomass and few cultured representatives. Linking potential metabolisms to phylogeny in environmental microorganisms is important for interpreting microbial community functions and the impacts these communities have on geochemical cycles. However, with metagenomic studies there is the computational hurdle of ‘binning’ contigs into phylogenetically related units or putative genomes. Binning methods have been implemented with varying approaches such as k-means clustering, Gaussian mixture models, hierarchical clustering, neural networks, and two-way clustering; however, many of these suffer from biases against low coverage/abundance organisms and closely related taxa/strains. We are introducing a new binning method, BinSanity, that utilizes the clustering algorithm affinity propagation (AP), to cluster assemblies using coverage with compositional based refinement (tetranucleotide frequency and percent GC content) to optimize bins containing multiple source organisms. This separation of composition and coverage based clustering reduces bias for closely related taxa. BinSanity was developed and tested on artificial metagenomes varying in size and complexity. Results indicate that BinSanity has a higher precision, recall, and Adjusted Rand Index compared to five commonly implemented methods. When tested on a previously published environmental metagenome, BinSanity generated high completion and low redundancy bins corresponding with the published metagenome-assembled genomes. PMID:28289564

  6. Competitive repetition suppression (CoRe) clustering: a biologically inspired learning model with application to robust clustering.

    PubMed

    Bacciu, Davide; Starita, Antonina

    2008-11-01

    Determining a compact neural coding for a set of input stimuli is an issue that encompasses several biological memory mechanisms as well as various artificial neural network models. In particular, establishing the optimal network structure is still an open problem when dealing with unsupervised learning models. In this paper, we introduce a novel learning algorithm, named competitive repetition-suppression (CoRe) learning, inspired by a cortical memory mechanism called repetition suppression (RS). We show how such a mechanism is used, at various levels of the cerebral cortex, to generate compact neural representations of the visual stimuli. From the general CoRe learning model, we derive a clustering algorithm, named CoRe clustering, that can automatically estimate the unknown cluster number from the data without using a priori information concerning the input distribution. We illustrate how CoRe clustering, besides its biological plausibility, possesses strong theoretical properties in terms of robustness to noise and outliers, and we provide an error function describing CoRe learning dynamics. Such a description is used to analyze CoRe relationships with the state-of-the-art clustering models and to highlight CoRe similitude with rival penalized competitive learning (RPCL), showing how CoRe extends such a model by strengthening the rival penalization estimation by means of loss functions from robust statistics.

  7. Unsupervised Outlier Profile Analysis

    PubMed Central

    Ghosh, Debashis; Li, Song

    2014-01-01

    In much of the analysis of high-throughput genomic data, “interesting” genes have been selected based on assessment of differential expression between two groups or generalizations thereof. Most of the literature focuses on changes in mean expression or the entire distribution. In this article, we explore the use of C(α) tests, which have been applied in other genomic data settings. Their use for the outlier expression problem, in particular with continuous data, is problematic but nevertheless motivates new statistics that give an unsupervised analog to previously developed outlier profile analysis approaches. Some simulation studies are used to evaluate the proposal. A bivariate extension is described that can accommodate data from two platforms on matched samples. The proposed methods are applied to data from a prostate cancer study. PMID:25452686

  8. Network analysis of patient flow in two UK acute care hospitals identifies key sub-networks for A&E performance

    PubMed Central

    Stringer, Clive; Beeknoo, Neeraj

    2017-01-01

    The topology of the patient flow network in a hospital is complex, comprising hundreds of overlapping patient journeys, and is a determinant of operational efficiency. To understand the network architecture of patient flow, we performed a data-driven network analysis of patient flow through two acute hospital sites of King’s College Hospital NHS Foundation Trust. Administration databases were queried for all intra-hospital patient transfers in an 18-month period and modelled as a dynamic weighted directed graph. A ‘core’ subnetwork containing only 13–17% of all edges channelled 83–90% of the patient flow, while an ‘ephemeral’ network constituted the remainder. Unsupervised cluster analysis and differential network analysis identified sub-networks where traffic is most associated with A&E performance. Increased flow to clinical decision units was associated with the best A&E performance in both sites. The component analysis also detected a weekend effect on patient transfers which was not associated with performance. We have performed the first data-driven hypothesis-free analysis of patient flow which can enhance understanding of whole healthcare systems. Such analysis can drive transformation in healthcare as it has in industries such as manufacturing. PMID:28968472
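
    The notion of a small 'core' subnetwork carrying most of the flow can be made concrete with a short sketch. The study's actual definition of the core is not given in this abstract, so the rule used here (take the heaviest edges until a chosen share of all transfers is covered) is an assumption, as are the function name, the ward names, and the transfer counts.

```python
def core_subnetwork(transfers, flow_share=0.85):
    """Smallest set of heaviest edges carrying at least flow_share of all transfers.

    transfers maps (source_ward, destination_ward) to a transfer count.
    Returns the selected edges and the fraction of flow they carry.
    """
    total = sum(transfers.values())
    if total == 0:
        return [], 0.0
    core, carried = [], 0
    # visit edges from heaviest to lightest until the target share is reached
    for edge, count in sorted(transfers.items(), key=lambda item: item[1], reverse=True):
        if carried >= flow_share * total:
            break
        core.append(edge)
        carried += count
    return core, carried / total

# Hypothetical weighted directed graph of intra-hospital transfers.
transfers = {
    ("A&E", "CDU"): 500,
    ("A&E", "Medical"): 300,
    ("CDU", "Medical"): 100,
    ("Medical", "Surgical"): 40,
    ("Surgical", "ICU"): 30,
    ("ICU", "Medical"): 20,
    ("CDU", "Discharge lounge"): 10,
}
core_edges, share = core_subnetwork(transfers)
```

    In this toy graph three of the seven edges carry 90% of the transfers, which mirrors the kind of concentration the abstract reports without reproducing its figures.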

  9. Prognostic relevance of aberrant DNA methylation in g1 and g2 pancreatic neuroendocrine tumors.

    PubMed

    Stefanoli, Michele; La Rosa, Stefano; Sahnane, Nora; Romualdi, Chiara; Pastorino, Roberta; Marando, Alessandro; Capella, Carlo; Sessa, Fausto; Furlan, Daniela

    2014-01-01

    The occurrence and clinical relevance of DNA hypermethylation and global hypomethylation in pancreatic neuroendocrine tumours (PanNETs) are still unknown. We evaluated the frequency of both epigenetic alterations in PanNETs to assess the relationship between methylation profiles and chromosomal instability, tumour phenotypes and prognosis. In a well-characterized series of 56 sporadic G1 and G2 PanNETs, methylation-sensitive multiple ligation-dependent probe amplification was performed to assess hypermethylation of 33 genes and copy number alterations (CNAs) of 53 chromosomal regions. Long interspersed nucleotide element-1 (LINE-1) hypomethylation was quantified by pyrosequencing. Unsupervised hierarchical clustering identified a subset of 22 PanNETs (39%) exhibiting high frequency of gene-specific methylation and low CNA percentages. This tumour cluster was significantly associated with stage IV (p = 0.04) and with poor prognosis in univariable analysis (p = 0.004). LINE-1 methylation levels in PanNETs were significantly lower than in normal samples (p < 0.01) and were approximately normally distributed. 12 tumours (21%) were highly hypomethylated, showing variable levels of CNA. Interestingly, only 5 PanNETs (9%) were observed to show simultaneously LINE-1 hypomethylation and high frequency of gene-specific methylation. LINE-1 hypomethylation was strongly correlated with advanced stage (p = 0.002) and with poor prognosis (p < 0.0001). In the multivariable analysis, low LINE-1 methylation status and methylation clusters were the only independent significant predictors of outcome (p = 0.034 and p = 0.029, respectively). The combination of global DNA hypomethylation and gene hypermethylation analyses may be useful to define distinct subsets of PanNETs. Both alterations are common in PanNETs and could be directly correlated with tumour progression. © 2014 S. Karger AG, Basel.

  10. AUTOMATED UNSUPERVISED CLASSIFICATION OF THE SLOAN DIGITAL SKY SURVEY STELLAR SPECTRA USING k-MEANS CLUSTERING

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanchez Almeida, J.; Allende Prieto, C.

    2013-01-20

    Large spectroscopic surveys require automated methods of analysis. This paper explores the use of k-means clustering as a tool for automated unsupervised classification of massive stellar spectral catalogs. The classification criteria are defined by the data and the algorithm, with no prior physical framework. We work with a representative set of stellar spectra associated with the Sloan Digital Sky Survey (SDSS) SEGUE and SEGUE-2 programs, which consists of 173,390 spectra from 3800 to 9200 Å sampled on 3849 wavelengths. We classify the original spectra as well as the spectra with the continuum removed. The second set only contains spectral lines, and it is less dependent on uncertainties of the flux calibration. The classification of the spectra with continuum renders 16 major classes. Roughly speaking, stars are split according to their colors, with enough finesse to distinguish dwarfs from giants of the same effective temperature, but with difficulties to separate stars with different metallicities. There are classes corresponding to particular MK types, intrinsically blue stars, dust-reddened, stellar systems, and also classes collecting faulty spectra. Overall, there is no one-to-one correspondence between the classes we derive and the MK types. The classification of spectra without continuum renders 13 classes, the color separation is not so sharp, but it distinguishes stars of the same effective temperature and different metallicities. Some classes thus obtained present a fairly small range of physical parameters (200 K in effective temperature, 0.25 dex in surface gravity, and 0.35 dex in metallicity), so that the classification can be used to estimate the main physical parameters of some stars at a minimum computational cost. We also analyze the outliers of the classification. 
Most of them turn out to be failures of the reduction pipeline, but there are also high redshift QSOs, multiple stellar systems, dust-reddened stars, galaxies, and, finally, odd spectra whose nature we have not deciphered. The template spectra representative of the classes are publicly available in the online journal.
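
    For readers unfamiliar with k-means, the sketch below shows the basic Lloyd iteration on a handful of toy vectors standing in for spectra. It is a generic illustration under stated assumptions, not the paper's pipeline: the farthest-first initialisation is chosen here only to make the result deterministic, and the function name and the four three-sample "spectra" are made up.

```python
import math

def kmeans(vectors, k, n_iter=100):
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # next seed: the vector farthest from its nearest existing centroid
        centroids.append(max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids)))
    labels = None
    for _ in range(n_iter):
        # assignment step: each vector goes to its nearest centroid
        new_labels = [min(range(k), key=lambda j: math.dist(v, centroids[j])) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: each centroid becomes the mean of its members
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Made-up "spectra": flux at three wavelengths, two blue-ish and two red-ish shapes.
spectra = [(1.0, 0.9, 0.2), (0.9, 1.0, 0.1), (0.1, 0.2, 1.0), (0.2, 0.1, 0.9)]
labels, templates = kmeans(spectra, k=2)
```

    The returned centroids play the role of class templates, the mean spectrum of each class, which is the sense in which a k-means classification can supply representative spectra.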

  11. Wavelet-based Gaussian-mixture hidden Markov model for the detection of multistage seizure dynamics: A proof-of-concept study

    PubMed Central

    2011-01-01

    Background: Epilepsy is a common neurological disorder characterized by recurrent electrophysiological activities, known as seizures. Without the appropriate detection strategies, these seizure episodes can dramatically affect the quality of life for those afflicted. The rationale of this study is to develop an unsupervised algorithm for the detection of seizure states so that it may be implemented along with potential intervention strategies. Methods: A hidden Markov model (HMM) was developed to interpret the state transitions of the in vitro rat hippocampal slice local field potentials (LFPs) during seizure episodes. It can be used to estimate the probability of state transitions and the corresponding characteristics of each state. Wavelet features were clustered and used to differentiate the electrophysiological characteristics at each corresponding HMM state. Using an unsupervised training method, the HMM and the clustering parameters were obtained simultaneously. The HMM states were then assigned to the electrophysiological data using an expert-guided technique. Minimum redundancy maximum relevance (mRMR) analysis and Akaike Information Criterion (AICc) were applied to reduce the effect of over-fitting. The sensitivity, specificity and optimality index of chronic seizure detection were compared for various HMM topologies. The ability to distinguish early and late tonic firing patterns prior to chronic seizures was also evaluated. Results: Significant improvement in state detection performance was achieved when additional wavelet coefficient rates of change information were used as features. The final HMM topology obtained using mRMR and AICc was able to detect non-ictal (interictal), early and late tonic firing, chronic seizures and postictal activities. A mean sensitivity of 95.7%, mean specificity of 98.9% and optimality index of 0.995 in the detection of chronic seizures was achieved. 
The detection of early and late tonic firing was validated with experimental intracellular electrical recordings of seizures. Conclusions: The HMM implementation of a seizure dynamics detector is an improvement over existing approaches using visual detection and complexity measures. The subjectivity involved in partitioning the observed data prior to training can be eliminated. It can also decipher the probabilities of seizure state transitions using the magnitude and rate of change wavelet information of the LFPs. PMID:21504608

  12. A 6-gene signature identifies four molecular subgroups of neuroblastoma

    PubMed Central

    2011-01-01

    Background: There are currently three postulated genomic subtypes of the childhood tumour neuroblastoma (NB): Type 1, Type 2A, and Type 2B. The most aggressive forms of NB are characterized by amplification of the oncogene MYCN (MNA) and low expression of the favourable marker NTRK1. Recently, mutations or high expression of the familial predisposition gene Anaplastic Lymphoma Kinase (ALK) were associated with unfavourable biology of sporadic NB. Also, various other genes have been linked to NB pathogenesis. Results: The present study explores subgroup discrimination by gene expression profiling using three published microarray studies on NB (47 samples). Four distinct clusters were identified by Principal Components Analysis (PCA) in two separate data sets, which could be verified by an unsupervised hierarchical clustering in a third independent data set (101 NB samples) using a set of 74 discriminative genes. The expression signature of six NB-associated genes ALK, BIRC5, CCND1, MYCN, NTRK1, and PHOX2B, significantly discriminated the four clusters (p < 0.05, one-way ANOVA test). PCA clusters p1, p2, and p3 were found to correspond well to the postulated subtypes 1, 2A, and 2B, respectively. Remarkably, a fourth novel cluster was detected in all three independent data sets. This cluster comprised mainly 11q-deleted MNA-negative tumours with low expression of ALK, BIRC5, and PHOX2B, and was significantly associated with higher tumour stage, poor outcome and poor survival compared to the Type 1-corresponding favourable group (INSS stage 4 and/or dead of disease, p < 0.05, Fisher's exact test). Conclusions: Based on expression profiling we have identified four molecular subgroups of neuroblastoma, which can be distinguished by a 6-gene signature. The fourth subgroup has not been described elsewhere, and efforts are currently being made to further investigate this group's specific characteristics. PMID:21492432

  13. Semiautomatic mapping of permafrost in the Yukon Flats, Alaska

    NASA Astrophysics Data System (ADS)

    Gulbrandsen, Mats Lundh; Minsley, Burke J.; Ball, Lyndsay B.; Hansen, Thomas Mejer

    2016-12-01

    Thawing of permafrost due to global warming can have major impacts on hydrogeological processes, climate feedback, arctic ecology, and local environments. To understand these effects and processes, it is crucial to know the distribution of permafrost. In this study we exploit the fact that airborne electromagnetic (AEM) data are sensitive to the distribution of permafrost and demonstrate how the distribution of permafrost in the Yukon Flats, Alaska, is mapped in an efficient (semiautomatic) way, using a combination of supervised and unsupervised (machine) learning algorithms, i.e., Smart Interpretation and K-means clustering. Clustering is used to sort unfrozen and frozen regions, and Smart Interpretation is used to predict the depth of permafrost based on expert interpretations. This workflow allows, for the first time, a quantitative and objective approach to efficiently map permafrost based on large amounts of AEM data.

  14. Semiautomatic mapping of permafrost in the Yukon Flats, Alaska

    USGS Publications Warehouse

    Gulbrandsen, Mats Lundh; Minsley, Burke J.; Ball, Lyndsay B.; Hansen, Thomas Mejer

    2016-01-01

    Thawing of permafrost due to global warming can have major impacts on hydrogeological processes, climate feedback, arctic ecology, and local environments. To understand these effects and processes, it is crucial to know the distribution of permafrost. In this study we exploit the fact that airborne electromagnetic (AEM) data are sensitive to the distribution of permafrost and demonstrate how the distribution of permafrost in the Yukon Flats, Alaska, is mapped in an efficient (semiautomatic) way, using a combination of supervised and unsupervised (machine) learning algorithms, i.e., Smart Interpretation and K-means clustering. Clustering is used to sort unfrozen and frozen regions, and Smart Interpretation is used to predict the depth of permafrost based on expert interpretations. This workflow allows, for the first time, a quantitative and objective approach to efficiently map permafrost based on large amounts of AEM data.

  15. Automated classification of dolphin echolocation click types from the Gulf of Mexico.

    PubMed

    Frasier, Kaitlin E; Roch, Marie A; Soldevilla, Melissa S; Wiggins, Sean M; Garrison, Lance P; Hildebrand, John A

    2017-12-01

    Delphinids produce large numbers of short duration, broadband echolocation clicks which may be useful for species classification in passive acoustic monitoring efforts. A challenge in echolocation click classification is to overcome the many sources of variability to recognize underlying patterns across many detections. An automated unsupervised network-based classification method was developed to simulate the approach a human analyst uses when categorizing click types: Clusters of similar clicks were identified by incorporating multiple click characteristics (spectral shape and inter-click interval distributions) to distinguish within-type from between-type variation, and identify distinct, persistent click types. Once click types were established, an algorithm for classifying novel detections using existing clusters was tested. The automated classification method was applied to a dataset of 52 million clicks detected across five monitoring sites over two years in the Gulf of Mexico (GOM). Seven distinct click types were identified, one of which is known to be associated with an acoustically identifiable delphinid (Risso's dolphin) and six of which are not yet identified. All types occurred at multiple monitoring locations, but the relative occurrence of types varied, particularly between continental shelf and slope locations. Automatically-identified click types from autonomous seafloor recorders without verifiable species identification were compared with clicks detected on sea-surface towed hydrophone arrays in the presence of visually identified delphinid species. These comparisons suggest potential species identities for the animals producing some echolocation click types. The network-based classification method presented here is effective for rapid, unsupervised delphinid click classification across large datasets in which the click types may not be known a priori.

  16. Automated classification of dolphin echolocation click types from the Gulf of Mexico

    PubMed Central

    Frasier, Kaitlin E.; Roch, Marie A.; Soldevilla, Melissa S.; Wiggins, Sean M.; Garrison, Lance P.; Hildebrand, John A.

    2017-01-01

    Delphinids produce large numbers of short duration, broadband echolocation clicks which may be useful for species classification in passive acoustic monitoring efforts. A challenge in echolocation click classification is to overcome the many sources of variability to recognize underlying patterns across many detections. An automated unsupervised network-based classification method was developed to simulate the approach a human analyst uses when categorizing click types: Clusters of similar clicks were identified by incorporating multiple click characteristics (spectral shape and inter-click interval distributions) to distinguish within-type from between-type variation, and identify distinct, persistent click types. Once click types were established, an algorithm for classifying novel detections using existing clusters was tested. The automated classification method was applied to a dataset of 52 million clicks detected across five monitoring sites over two years in the Gulf of Mexico (GOM). Seven distinct click types were identified, one of which is known to be associated with an acoustically identifiable delphinid (Risso’s dolphin) and six of which are not yet identified. All types occurred at multiple monitoring locations, but the relative occurrence of types varied, particularly between continental shelf and slope locations. Automatically-identified click types from autonomous seafloor recorders without verifiable species identification were compared with clicks detected on sea-surface towed hydrophone arrays in the presence of visually identified delphinid species. These comparisons suggest potential species identities for the animals producing some echolocation click types. The network-based classification method presented here is effective for rapid, unsupervised delphinid click classification across large datasets in which the click types may not be known a priori. PMID:29216184

  17. A comparison of performance of automatic cloud coverage assessment algorithm for Formosat-2 image using clustering-based and spatial thresholding methods

    NASA Astrophysics Data System (ADS)

    Hsu, Kuo-Hsien

    2012-11-01

    Formosat-2 imagery is high-spatial-resolution (2 m GSD) remote sensing satellite data comprising one panchromatic band and four multispectral bands (blue, green, red, near-infrared). An essential step in the daily processing of received Formosat-2 images is to estimate the cloud statistic of each image using an Automatic Cloud Coverage Assessment (ACCA) algorithm. The cloud statistic is subsequently recorded as important metadata for the image product catalog. In this paper, we propose an ACCA method with two consecutive stages: pre-processing and post-processing analysis. For pre-processing analysis, unsupervised K-means classification, Sobel's method, a thresholding method, non-cloudy pixel reexamination, and a cross-band filter method are implemented in sequence to determine the cloud statistic. For post-processing analysis, the box-counting fractal method is implemented. In other words, the cloud statistic is first determined via pre-processing analysis, and its correctness for the different spectral bands is then cross-examined qualitatively and quantitatively via post-processing analysis. The selection of an appropriate thresholding method is critical to the result of the ACCA method. Therefore, in this work, we first conduct a series of experiments on clustering-based and spatial thresholding methods, including Otsu's, Local Entropy (LE), Joint Entropy (JE), Global Entropy (GE), and Global Relative Entropy (GRE) methods, for performance comparison. The results show that Otsu's and GE methods both perform better than the others for Formosat-2 images. Additionally, our proposed ACCA method, with Otsu's method selected as the thresholding method, successfully extracted the cloudy pixels of Formosat-2 images for accurate cloud statistic estimation.
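
    Otsu's method, which the abstract reports as one of the better-performing thresholding choices, picks the gray level that maximizes the between-class variance of the histogram. The sketch below is a generic textbook version, not the paper's ACCA pipeline; the pixel values and the reading of "brighter than the threshold" as cloud are assumptions made for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels <= t form the lower class, pixels > t the upper class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of intensities in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to the constant factor 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

# Hypothetical 8-bit intensities: dark ground pixels and bright cloud pixels.
pixels = [20, 22, 25, 30, 28, 200, 210, 220, 205]
t = otsu_threshold(pixels)
cloud_fraction = sum(1 for p in pixels if p > t) / len(pixels)
```

    When several thresholds give the same variance, this version keeps the lowest one, which is a design choice of the sketch.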

  18. Unsupervised learning of discriminative edge measures for vehicle matching between nonoverlapping cameras.

    PubMed

    Shan, Ying; Sawhney, Harpreet S; Kumar, Rakesh

    2008-04-01

    This paper proposes a novel unsupervised algorithm for learning discriminative features in the context of matching road vehicles between two non-overlapping cameras. The matching problem is formulated as a same-different classification problem, which aims to compute the probability of vehicle images from two distinct cameras being from the same vehicle or different vehicle(s). We employ a novel measurement vector that consists of three independent edge-based measures and their associated robust measures computed from a pair of aligned vehicle edge maps. The weight of each measure is determined by an unsupervised learning algorithm that optimally separates the same-different classes in the combined measurement space. This is achieved with a weak classification algorithm that automatically collects representative samples from same-different classes, followed by a more discriminative classifier based on Fisher's Linear Discriminants and Gibbs Sampling. The robustness of the match measures and the use of unsupervised discriminant analysis in the classification ensure that the proposed method performs consistently in the presence of missing/false features, temporally and spatially changing illumination conditions, and systematic misalignment caused by different camera configurations. Extensive experiments based on real data of over 200 vehicles at different times of day demonstrate promising results.

  19. Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences

    NASA Technical Reports Server (NTRS)

    Budalakoti, Suratna; Srivastava, Ashok N.; Akella, Ram; Turkov, Eugene

    2006-01-01

    This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach taken uses unsupervised clustering of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by detailed analysis of outliers to detect anomalies. As the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms, such as the Hunt-Szymanski algorithm, that have low time-complexity. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. The algorithms provide a coherent description to an analyst of the anomalies in the sequence, compared to more normal sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection.
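
    The similarity measure named above can be illustrated with the standard dynamic-programming LCS, which is the slow baseline rather than the Hunt-Szymanski or hybrid algorithm the paper discusses. The normalization used here, LCS length divided by the geometric mean of the two sequence lengths, is an assumption; the abstract does not spell out the formula.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]

def normalized_lcs(a, b):
    """Similarity in [0, 1]; assumed form: |LCS| / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / math.sqrt(len(a) * len(b))

# Symbol sequences can be strings or lists of hashable symbols.
similarity = normalized_lcs("ABCBDAB", "BDCABA")
```

    A distance for clustering could then be taken as 1 minus this similarity, with sequences far from every cluster treated as outlier candidates.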

  20. Unsupervised classification of operator workload from brain signals.

    PubMed

    Schultze-Kraft, Matthias; Dähne, Sven; Gugler, Manfred; Curio, Gabriel; Blankertz, Benjamin

    2016-06-01

    In this study we aimed for the classification of operator workload as it is expected in many real-life workplace environments. We explored brain-signal based workload predictors that differ with respect to the level of label information required for training, including entirely unsupervised approaches. Subjects executed a task on a touch screen that required continuous effort of visual and motor processing with alternating difficulty. We first employed classical approaches for workload state classification that operate on the sensor space of EEG and compared those to the performance of three state-of-the-art spatial filtering methods: common spatial patterns (CSPs) analysis, which requires binary label information; source power co-modulation (SPoC) analysis, which uses the subjects' error rate as a target function; and canonical SPoC (cSPoC) analysis, which solely makes use of cross-frequency power correlations induced by different states of workload and thus represents an unsupervised approach. Finally, we investigated the effects of fusing brain signals and peripheral physiological measures (PPMs) and examined the added value for improving classification performance. Mean classification accuracies of 94%, 92% and 82% were achieved with CSP, SPoC, cSPoC, respectively. These methods outperformed the approaches that did not use spatial filtering and they extracted physiologically plausible components. The performance of the unsupervised cSPoC is significantly increased by augmenting it with PPM features. Our analyses ensured that the signal sources used for classification were of cortical origin and not contaminated with artifacts. Our findings show that workload states can be successfully differentiated from brain signals, even when less and less information from the experimental paradigm is used, thus paving the way for real-world applications in which label information may be noisy or entirely unavailable.

  1. Validation of a free software for unsupervised assessment of abdominal fat in MRI.

    PubMed

    Maddalo, Michele; Zorza, Ivan; Zubani, Stefano; Nocivelli, Giorgio; Calandra, Giulio; Soldini, Pierantonio; Mascaro, Lorella; Maroldi, Roberto

    2017-05-01

    To demonstrate the accuracy of unsupervised (fully automated) software for fat segmentation in magnetic resonance imaging. The proposed software is a freeware solution developed in ImageJ that enables the quantification of metabolically different adipose tissues in large cohort studies. The lumbar part of the abdomen (19 cm in craniocaudal direction, centered on L3) of eleven healthy volunteers (age range: 21-46 years, BMI range: 21.7-31.6 kg/m²) was examined in a breath hold on expiration with a GE T1 Dixon sequence. Single-slice and volumetric data were considered for each subject. The results of the visceral and subcutaneous adipose tissue assessments obtained by the unsupervised software were compared to supervised reference segmentations. The associated statistical analysis included Pearson correlations, Bland-Altman plots and volumetric differences (VD%). Values calculated by the unsupervised software correlated significantly with the corresponding supervised reference segmentations for both subcutaneous adipose tissue (SAT) (R=0.9996, p<0.001) and visceral adipose tissue (VAT) (R=0.995, p<0.001). Bland-Altman plots showed the absence of systematic errors and a limited spread of the differences. In the single-slice analysis, VD% were (1.6±2.9)% for SAT and (4.9±6.9)% for VAT. In the volumetric analysis, VD% were (1.3±0.9)% for SAT and (2.9±2.7)% for VAT. The developed software is capable of segmenting the metabolically different adipose tissues with a high degree of accuracy. This free add-on software for ImageJ can easily become widespread and enable large-scale population studies of adipose tissue and its related diseases. Copyright © 2017 Associazione Italiana di Fisica Medica. Published by Elsevier Ltd. All rights reserved.
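
    The agreement statistics mentioned above can be sketched as follows. The volumes are hypothetical and the 1.96 multiplier for the limits of agreement is the conventional choice, assumed here because the abstract does not state it. The VD% measure is left out because its exact definition is not given.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bland_altman(auto, reference):
    """Return (bias, lower limit, upper limit) of agreement."""
    diffs = [a - b for a, b in zip(auto, reference)]
    bias = statistics.mean(diffs)
    spread = 1.96 * statistics.stdev(diffs)
    return bias, bias - spread, bias + spread

# Hypothetical fat volumes in cm^3 (illustrative only, not study data).
auto_vol = [1010.0, 1495.0, 2020.0, 2490.0, 3015.0]
ref_vol = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]

r = pearson_r(auto_vol, ref_vol)
bias, lower, upper = bland_altman(auto_vol, ref_vol)
```

    A bias close to zero with narrow limits corresponds to what the abstract calls the absence of systematic errors and a limited spread of the differences.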

  2. Unsupervised classification of operator workload from brain signals

    NASA Astrophysics Data System (ADS)

    Schultze-Kraft, Matthias; Dähne, Sven; Gugler, Manfred; Curio, Gabriel; Blankertz, Benjamin

    2016-06-01

    Objective. In this study we aimed for the classification of operator workload as it is expected in many real-life workplace environments. We explored brain-signal based workload predictors that differ with respect to the level of label information required for training, including entirely unsupervised approaches. Approach. Subjects executed a task on a touch screen that required continuous effort of visual and motor processing with alternating difficulty. We first employed classical approaches for workload state classification that operate on the sensor space of EEG and compared those to the performance of three state-of-the-art spatial filtering methods: common spatial patterns (CSPs) analysis, which requires binary label information; source power co-modulation (SPoC) analysis, which uses the subjects’ error rate as a target function; and canonical SPoC (cSPoC) analysis, which solely makes use of cross-frequency power correlations induced by different states of workload and thus represents an unsupervised approach. Finally, we investigated the effects of fusing brain signals and peripheral physiological measures (PPMs) and examined the added value for improving classification performance. Main results. Mean classification accuracies of 94%, 92% and 82% were achieved with CSP, SPoC, cSPoC, respectively. These methods outperformed the approaches that did not use spatial filtering and they extracted physiologically plausible components. The performance of the unsupervised cSPoC is significantly increased by augmenting it with PPM features. Significance. Our analyses ensured that the signal sources used for classification were of cortical origin and not contaminated with artifacts. Our findings show that workload states can be successfully differentiated from brain signals, even when less and less information from the experimental paradigm is used, thus paving the way for real-world applications in which label information may be noisy or entirely unavailable.

  3. Self-organized neural maps of human protein sequences.

    PubMed Central

    Ferrán, E. A.; Pflugfelder, B.; Ferrara, P.

    1994-01-01

    We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis. PMID:8019421
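
    The network input described above, a matrix pattern derived from dipeptide composition, can be sketched as a 20 x 20 table of adjacent amino-acid pair frequencies. Normalizing counts to frequencies and skipping non-standard residues are assumptions of this sketch; the reduced 11 x 11 representation and the Kohonen training itself are not reproduced.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def dipeptide_composition(sequence):
    """Return a 20 x 20 matrix of dipeptide frequencies for a protein sequence."""
    counts = [[0.0] * 20 for _ in range(20)]
    total = 0
    # Slide over every pair of adjacent residues.
    for first, second in zip(sequence, sequence[1:]):
        if first in INDEX and second in INDEX:
            counts[INDEX[first]][INDEX[second]] += 1
            total += 1
    if total:
        counts = [[c / total for c in row] for row in counts]
    return counts

# Toy sequence (not a real protein) with dipeptides AC, CA, AC.
matrix = dipeptide_composition("ACAC")
```

    Flattening this matrix gives a fixed-length vector of 400 components regardless of protein length, which is what makes it usable as input to a self-organizing map.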

  4. Impact of Upfront Cellular Enrichment by Laser Capture Microdissection on Protein and Phosphoprotein Drug Target Signaling Activation Measurements in Human Lung Cancer: Implications for Personalized Medicine

    PubMed Central

    Baldelli, Elisa; Haura, Eric B.; Crinò, Lucio; Cress, W. Douglas; Ludovini, Vienna; Schabath, Matthew B.; Liotta, Lance A.; Petricoin, Emanuel F.; Pierobon, Mariaelena

    2015-01-01

    Purpose: The aim of this study was to evaluate whether upfront cellular enrichment via laser capture microdissection is necessary for accurately quantifying predictive biomarkers in non-small cell lung cancer tumors. Experimental design: Fifteen snap frozen surgical biopsies were analyzed. Whole tissue lysate and matched highly enriched tumor epithelium via laser capture microdissection (LCM) were obtained for each patient. The expression and activation/phosphorylation levels of 26 proteins were measured by reverse phase protein microarray. Differences in signaling architecture of dissected and undissected matched pairs were visualized using unsupervised clustering analysis, bar graphs, and scatter plots. Results: Overall, patient-matched LCM and undissected material displayed very distinct and differing signaling architectures, with 93% of the matched pairs clustering separately. These differences were seen regardless of the amount of starting tumor epithelial content present in the specimen. Conclusions and clinical relevance: These results indicate that LCM-driven upfront cellular enrichment is necessary to accurately determine the expression/activation levels of predictive protein signaling markers, although results should be evaluated in larger clinical settings. Upfront cellular enrichment of the target cell appears to be an important part of the workflow needed for the accurate quantification of predictive protein signaling biomarkers. Larger independent studies are warranted. PMID:25676683

  5. Unsupervised real-time speaker identification for daily movies

    NASA Astrophysics Data System (ADS)

    Li, Ying; Kuo, C.-C. Jay

    2002-07-01

    The problem of identifying speakers for movie content analysis is addressed in this paper. While most previous work on speaker identification was carried out in a supervised mode using pure audio data, more robust results can be obtained in real time by integrating knowledge from multiple media sources in an unsupervised mode. In this work, both audio and visual cues are employed and subsequently combined in a probabilistic framework to identify speakers. In particular, audio information is used to identify speakers with a maximum likelihood (ML)-based approach, while visual information is adopted to distinguish speakers by detecting and recognizing their talking faces based on face detection/recognition and mouth tracking techniques. Moreover, to accommodate speakers' acoustic variations over time, we update their models on the fly by adapting to their newly contributed speech data. Encouraging results have been achieved through extensive experiments, which show a promising future for the proposed audiovisual-based unsupervised speaker identification system.

  6. Neural network-based multiple robot simultaneous localization and mapping.

    PubMed

    Saeedi, Sajad; Paull, Liam; Trentini, Michael; Li, Howard

    2011-12-01

    In this paper, a decentralized platform for simultaneous localization and mapping (SLAM) with multiple robots is developed. Each robot performs single robot view-based SLAM using an extended Kalman filter to fuse data from two encoders and a laser ranger. To extend this approach to multiple robot SLAM, a novel occupancy grid map fusion algorithm is proposed. Map fusion is achieved through a multistep process that includes image preprocessing, map learning (clustering) using neural networks, relative orientation extraction using norm histogram cross correlation and a Radon transform, relative translation extraction using matching norm vectors, and then verification of the results. The proposed map learning method is a process based on the self-organizing map. In the learning phase, the obstacles of the map are learned by clustering the occupied cells of the map into clusters. The learning is an unsupervised process which can be done on the fly without any need to have output training patterns. The clusters represent the spatial form of the map and make further analyses of the map easier and faster. Also, clusters can be interpreted as features extracted from the occupancy grid map so the map fusion problem becomes a task of matching features. Results of the experiments from tests performed on a real environment with multiple robots prove the effectiveness of the proposed solution.

  7. Highly efficient classification and identification of human pathogenic bacteria by MALDI-TOF MS.

    PubMed

    Hsieh, Sen-Yung; Tseng, Chiao-Li; Lee, Yun-Shien; Kuo, An-Jing; Sun, Chien-Feng; Lin, Yen-Hsiu; Chen, Jen-Kun

    2008-02-01

    Accurate and rapid identification of pathogenic microorganisms is of critical importance in disease treatment and public health. Conventional workflows are time-consuming, and procedures are multifaceted. MS can be an alternative but is limited by low efficiency for amino acid sequencing as well as low reproducibility for spectrum fingerprinting. We systematically analyzed the feasibility of applying MS for rapid and accurate bacterial identification. Directly applying bacterial colonies without further protein extraction to MALDI-TOF MS analysis revealed rich peak contents and high reproducibility. The MS spectra derived from 57 isolates comprising six human pathogenic bacterial species were analyzed using both unsupervised hierarchical clustering and supervised model construction via the Genetic Algorithm. Hierarchical clustering analysis categorized the spectra into six groups precisely corresponding to the six bacterial species. Precise classification was also maintained in an independently prepared set of bacteria even when the number of m/z values was reduced to six. In parallel, classification models were constructed via Genetic Algorithm analysis. A model containing 18 m/z values accurately classified independently prepared bacteria and identified those species originally not used for model construction. Moreover, bacteria at fewer than 10^4 cells and different species in bacterial mixtures were identified using the classification model approach. In conclusion, the application of MALDI-TOF MS in combination with a suitable model construction provides a highly accurate method for bacterial classification and identification. The approach can identify bacteria with low abundance even in mixed flora, suggesting that rapid and accurate bacterial identification using MS techniques even before culture can be attained in the near future.

  8. Machine learning for neuroimaging with scikit-learn.

    PubMed

    Abraham, Alexandre; Pedregosa, Fabian; Eickenberg, Michael; Gervais, Philippe; Mueller, Andreas; Kossaifi, Jean; Gramfort, Alexandre; Thirion, Bertrand; Varoquaux, Gaël

    2014-01-01

    Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g., multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g., resting state functional MRI) or find sub-populations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps. Scikit-learn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain.

  9. Machine learning for neuroimaging with scikit-learn

    PubMed Central

    Abraham, Alexandre; Pedregosa, Fabian; Eickenberg, Michael; Gervais, Philippe; Mueller, Andreas; Kossaifi, Jean; Gramfort, Alexandre; Thirion, Bertrand; Varoquaux, Gaël

    2014-01-01

    Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g., multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g., resting state functional MRI) or find sub-populations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps. Scikit-learn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain. PMID:24600388

  10. Data Analytics for Smart Parking Applications.

    PubMed

    Piovesan, Nicola; Turi, Leo; Toigo, Enrico; Martinez, Borja; Rossi, Michele

    2016-09-23

    We consider real-life smart parking systems where parking lot occupancy data are collected from field sensor devices and sent to backend servers for further processing and usage for applications. Our objective is to make these data useful to end users, such as parking managers, and, ultimately, to citizens. To this end, we concoct and validate an automated classification algorithm having two objectives: (1) outlier detection: to detect sensors with anomalous behavioral patterns, i.e., outliers; and (2) clustering: to group the parking sensors exhibiting similar patterns into distinct clusters. We first analyze the statistics of real parking data, obtaining suitable simulation models for parking traces. We then consider a simple classification algorithm based on the empirical complementary distribution function of occupancy times and show its limitations. Hence, we design a more sophisticated algorithm exploiting unsupervised learning techniques (self-organizing maps). These are tuned following a supervised approach using our trace generator and are compared against other clustering schemes, namely expectation maximization, k-means clustering and DBSCAN, considering six months of data from a real sensor deployment. Our approach is found to be superior in terms of classification accuracy, while also being capable of identifying all of the outliers in the dataset.
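
    The simple baseline mentioned above relies on the empirical complementary distribution function of occupancy times, which for each duration x gives the fraction of observed parking events that lasted longer than x. A minimal sketch follows; the occupancy times are invented, and the classification rule built on top of the function is not reproduced.

```python
def eccdf(samples):
    """Empirical complementary CDF: list of (x, fraction of samples > x)."""
    n = len(samples)
    points = []
    for x in sorted(set(samples)):
        longer = sum(1 for s in samples if s > x)
        points.append((x, longer / n))
    return points

# Hypothetical occupancy times in hours for one parking sensor.
occupancy_hours = [1, 2, 2, 4]
curve = eccdf(occupancy_hours)
```

    Comparing such curves across sensors is one plausible way to spot a sensor whose occupancy durations deviate strongly from the rest, although the abstract notes that this simple approach has limitations.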

  11. User Activity Recognition in Smart Homes Using Pattern Clustering Applied to Temporal ANN Algorithm

    PubMed Central

    Bourobou, Serge Thomas Mickala; Yoo, Younghwan

    2015-01-01

    This paper discusses the possibility of recognizing and predicting user activities in an IoT (Internet of Things) based smart environment. Activity recognition is usually done in two steps: activity pattern clustering and activity type decision. Although many related works have been proposed, their performance was limited because they focused on only one of the two steps. This paper tries to find the best combination of a pattern clustering method and an activity decision algorithm among various existing works. For the first step, in order to classify such varied and complex user activities, we use a relevant and efficient unsupervised learning method called the K-pattern clustering algorithm. In the second step, the smart environment is trained to recognize and predict user activities inside his/her personal space using an artificial neural network based on Allen's temporal relations. The experimental results show that our combined method provides higher recognition accuracy for various activities than other data mining classification algorithms. Furthermore, it is more appropriate for a dynamic environment like an IoT based smart home. PMID:26007738

  12. Change detection in synthetic aperture radar images based on image fusion and fuzzy clustering.

    PubMed

    Gong, Maoguo; Zhou, Zhiqiang; Ma, Jingjing

    2012-04-01

    This paper presents an unsupervised distribution-free change detection approach for synthetic aperture radar (SAR) images based on an image fusion strategy and a novel fuzzy clustering algorithm. The image fusion technique is introduced to generate a difference image by using complementary information from a mean-ratio image and a log-ratio image. In order to restrain the background information and enhance the information of changed regions in the fused difference image, wavelet fusion rules based on an average operator and minimum local area energy are chosen to fuse the wavelet coefficients for a low-frequency band and a high-frequency band, respectively. A reformulated fuzzy local-information C-means clustering algorithm is proposed for classifying changed and unchanged regions in the fused difference image. It incorporates the information about spatial context in a novel fuzzy way for the purpose of enhancing the changed information and of reducing the effect of speckle noise. Experiments on real SAR images show that the image fusion strategy integrates the advantages of the log-ratio operator and the mean-ratio operator and achieves better performance. The change detection results obtained by the improved fuzzy clustering algorithm exhibited lower error than those of its predecessors.
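
    The two difference images that the fusion step combines can be sketched with the commonly used forms of the operators: the log-ratio as the absolute log of the pixel ratio, and the mean-ratio as one minus the smaller of the two local-mean ratios. These formulas, the 3 x 3 neighborhood, and the small `eps` guard are assumptions of this sketch; the wavelet fusion and the fuzzy clustering are not shown.

```python
import math

def local_mean(img, r, c, radius=1):
    """Mean over the (2*radius+1)^2 window around (r, c), clipped at borders."""
    rows, cols = len(img), len(img[0])
    window = [
        img[i][j]
        for i in range(max(0, r - radius), min(rows, r + radius + 1))
        for j in range(max(0, c - radius), min(cols, c + radius + 1))
    ]
    return sum(window) / len(window)

def log_ratio(img1, img2, eps=1e-6):
    """Pixel-wise |log(img2 / img1)|; eps avoids division by zero."""
    return [
        [abs(math.log((b + eps) / (a + eps))) for a, b in zip(row1, row2)]
        for row1, row2 in zip(img1, img2)
    ]

def mean_ratio(img1, img2, eps=1e-6):
    """Pixel-wise 1 - min(m1/m2, m2/m1) using local means m1 and m2."""
    out = []
    for r in range(len(img1)):
        row = []
        for c in range(len(img1[0])):
            m1 = local_mean(img1, r, c) + eps
            m2 = local_mean(img2, r, c) + eps
            row.append(1.0 - min(m1 / m2, m2 / m1))
        out.append(row)
    return out

# Hypothetical 3 x 3 intensity images; only the center pixel changes.
before = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
after = [[10.0, 10.0, 10.0], [10.0, 80.0, 10.0], [10.0, 10.0, 10.0]]

d_log = log_ratio(before, after)
d_mean = mean_ratio(before, after)
```

    In this toy case the log-ratio image is nonzero only at the changed pixel, while the mean-ratio image spreads the change over the neighborhood, which hints at why the two carry complementary information.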

  13. Data Analytics for Smart Parking Applications

    PubMed Central

    Piovesan, Nicola; Turi, Leo; Toigo, Enrico; Martinez, Borja; Rossi, Michele

    2016-01-01

    We consider real-life smart parking systems where parking lot occupancy data are collected from field sensor devices and sent to backend servers for further processing and usage for applications. Our objective is to make these data useful to end users, such as parking managers, and, ultimately, to citizens. To this end, we concoct and validate an automated classification algorithm having two objectives: (1) outlier detection: to detect sensors with anomalous behavioral patterns, i.e., outliers; and (2) clustering: to group the parking sensors exhibiting similar patterns into distinct clusters. We first analyze the statistics of real parking data, obtaining suitable simulation models for parking traces. We then consider a simple classification algorithm based on the empirical complementary distribution function of occupancy times and show its limitations. Hence, we design a more sophisticated algorithm exploiting unsupervised learning techniques (self-organizing maps). These are tuned following a supervised approach using our trace generator and are compared against other clustering schemes, namely expectation maximization, k-means clustering and DBSCAN, considering six months of data from a real sensor deployment. Our approach is found to be superior in terms of classification accuracy, while also being capable of identifying all of the outliers in the dataset. PMID:27669259

  14. Support vector machine multiuser receiver for DS-CDMA signals in multipath channels.

    PubMed

    Chen, S; Samingan, A K; Hanzo, L

    2001-01-01

    The problem of constructing an adaptive multiuser detector (MUD) is considered for direct sequence code division multiple access (DS-CDMA) signals transmitted through multipath channels. The emerging learning technique, called support vector machines (SVM), is proposed as a method of obtaining a nonlinear MUD from a relatively small training data block. Computer simulation is used to study this SVM MUD, and the results show that it can closely match the performance of the optimal Bayesian one-shot detector. Comparisons with an adaptive radial basis function (RBF) MUD trained by an unsupervised clustering algorithm are discussed.

  15. Discharge-nitrate data clustering for characterizing surface-subsurface flow interaction and calibration of a hydrologic model

    NASA Astrophysics Data System (ADS)

    Shrestha, R. R.; Rode, M.

    2008-12-01

    Concentration of reactive chemicals has different chemical signatures in baseflow and surface runoff. Previous studies on nitrate export from a catchment indicate that the transport processes are driven by subsurface flow. Therefore, the nitrate signature can be used for understanding the event and pre-event contributions to streamflow and surface-subsurface flow interactions. The study uses flow and nitrate concentration time series data for understanding the relationship between these two variables. An unsupervised artificial neural network based learning method called the self-organizing map is used for the identification of clusters in the datasets. Based on the cluster results, five different patterns in the datasets are identified, which correspond to (i) baseflow, (ii) subsurface flow increase, (iii) surface runoff increase, (iv) surface runoff recession, and (v) subsurface flow decrease regions. The cluster results in combination with a hydrologic model are used for discharge separation. For this purpose, a multi-objective optimization tool, NSGA-II, is used, where violation of cluster results is used as one of the objective functions. The results show that the use of cluster results as supplementary information for the calibration of a hydrologic model gives a plausible simulation of subsurface flow as well as total runoff at the catchment outlet. The study is undertaken using data from the Weida catchment in north-eastern Germany, which is a sub-catchment of the Weisse Elster river in the Elbe river basin.

  16. Statistical mechanics of unsupervised feature learning in a restricted Boltzmann machine with binary synapses

    NASA Astrophysics Data System (ADS)

    Huang, Haiping

    2017-05-01

    Revealing hidden features in unlabeled data is called unsupervised feature learning, which plays an important role in pretraining a deep neural network. Here we provide a statistical mechanics analysis of the unsupervised learning in a restricted Boltzmann machine with binary synapses. A message passing equation to infer the hidden feature is derived, and furthermore, variants of this equation are analyzed. A statistical analysis by replica theory describes the thermodynamic properties of the model. Our analysis confirms an entropy crisis preceding the non-convergence of the message passing equation, suggesting a discontinuous phase transition as a key characteristic of the restricted Boltzmann machine. Continuous phase transition is also confirmed depending on the embedded feature strength in the data. The mean-field result under the replica symmetric assumption agrees with that obtained by running message passing algorithms on single instances of finite sizes. Interestingly, in an approximate Hopfield model, the entropy crisis is absent, and a continuous phase transition is observed instead. We also develop an iterative equation to infer the hyper-parameter (temperature) hidden in the data, which in physics corresponds to iteratively imposing Nishimori condition. Our study provides insights towards understanding the thermodynamic properties of the restricted Boltzmann machine learning, and moreover important theoretical basis to build simplified deep networks.

  17. Genomic copy number analysis of Chernobyl papillary thyroid carcinoma in the Ukrainian–American Cohort

    PubMed Central

    Selmansberger, Martin; Braselmann, Herbert; Hess, Julia; Bogdanova, Tetiana; Abend, Michael; Tronko, Mykola; Brenner, Alina; Zitzelsberger, Horst; Unger, Kristian

    2015-01-01

    One of the major consequences of the 1986 Chernobyl reactor accident was a dramatic increase in papillary thyroid carcinoma (PTC) incidence, predominantly in patients exposed to the radioiodine fallout at young age. The present study is the first on genomic copy number alterations (CNAs) of PTCs of the Ukrainian–American cohort (UkrAm) generated by array comparative genomic hybridization (aCGH). Unsupervised hierarchical clustering of CNA profiles revealed a significant enrichment of a subgroup of patients with female gender, long latency (>17 years) and negative lymph node status. Further, we identified single CNAs that were significantly associated with latency, gender, radiation dose and BRAF V600E mutation status. Multivariate analysis revealed no interactions but additive effects of parameters gender, latency and dose on CNAs. The previously identified radiation-associated gain of the chromosomal bands 7q11.22-11.23 was present in 29% of cases. Moreover, comparison of our radiation-associated PTC data set with the TCGA data set on sporadic PTCs revealed altered copy numbers of the tumor driver genes NF2 and CHEK2. Further, we integrated the CNA data with transcriptomic data that were available on a subset of the herein analyzed cohort and did not find statistically significant associations between the two molecular layers. However, applying hierarchical clustering on a ‘BRAF-like/RAS-like’ transcriptome signature split the cases into four groups, one of which containing all BRAF-positive cases validating the signature in an independent data set. PMID:26320103

  18. Statewide land cover derived from multiseasonal Landsat TM data: A retrospective of the WISCLAND project

    USGS Publications Warehouse

    Reese, H.M.; Lillesand, T.M.; Nagel, D.E.; Stewart, J.S.; Goldmann, R.A.; Simmons, T.E.; Chipman, J.W.; Tessar, P.A.

    2002-01-01

    Landsat Thematic Mapper (TM) data were the basis in production of a statewide land cover data set for Wisconsin, undertaken in partnership with U.S. Geological Survey's (USGS) Gap Analysis Program (GAP). The data set contained seven classes comparable to Anderson Level I and 24 classes comparable to Anderson Level II/III. Twelve scenes of dual-date TM data were processed with methods that included principal components analysis, stratification into spectrally consistent units, separate classification of upland, wetland, and urban areas, and a hybrid supervised/unsupervised classification called "guided clustering." The final data had overall accuracies of 94% for Anderson Level I upland classes, 77% for Level II/III upland classes, and 84% for Level II/III wetland classes. Classification accuracies for deciduous and coniferous forest were 95% and 93%, respectively, and forest species' overall accuracies ranged from 70% to 84%. Limited availability of acceptable imagery necessitated use of an early May date in a majority of scene pairs, perhaps contributing to lower accuracy for upland deciduous forest species. The mixed deciduous/coniferous forest class had the lowest accuracy, most likely due to distinctly classifying a purely mixed class. Mixed forest signatures containing oak were often confused with pure oak. Guided clustering was seen as an efficient classification method, especially at the tree species level, although its success relied in part on image dates, accurate ground truth, and some analyst intervention. © 2002 Elsevier Science Inc. All rights reserved.

  19. Integrative analysis of gene expression and DNA methylation using unsupervised feature extraction for detecting candidate cancer biomarkers.

    PubMed

    Moon, Myungjin; Nakai, Kenta

    2018-04-01

    Currently, cancer biomarker discovery is one of the important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for early diagnosis and treatment of cancer. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extractions to identify candidate biomarkers of cancer using renal cell carcinoma RNA-seq datasets. Gene expression and DNA methylation datasets are normalized by Box-Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets by unsupervised feature extraction methods, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance as compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired from the proposed analysis method. Furthermore, we expect that the proposed method can be expanded for applications involving various types of multi-omics datasets.
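    The normalization step named in this abstract is the Box-Cox transformation. The sketch below shows the standard one-parameter form for a fixed exponent; how the authors chose the exponent is not stated in the abstract, so `lam` is simply passed in, and the function name is hypothetical.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a strictly positive value x.

    (x**lam - 1) / lam for lam != 0, and ln(x) for lam == 0.
    """
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# Hypothetical expression values, transformed with lam = 0.5.
transformed = [box_cox(v, 0.5) for v in [1.0, 4.0, 9.0]]
```

    Because the input must be strictly positive, count data containing zeros is usually shifted by a small constant before the transform is applied; that shift is a separate modelling choice.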

  20. Ripening-dependent metabolic changes in the volatiles of pineapple (Ananas comosus (L.) Merr.) fruit: II. Multivariate statistical profiling of pineapple aroma compounds based on comprehensive two-dimensional gas chromatography-mass spectrometry.

    PubMed

    Steingass, Christof Björn; Jutzi, Manfred; Müller, Jenny; Carle, Reinhold; Schmarr, Hans-Georg

    2015-03-01

    Ripening-dependent changes of pineapple volatiles were studied in a nontargeted profiling analysis. Volatiles were isolated via headspace solid phase microextraction and analyzed by comprehensive 2D gas chromatography and mass spectrometry (HS-SPME-GC×GC-qMS). Profile patterns presented in the contour plots were evaluated applying image processing techniques and subsequent multivariate statistical data analysis. Statistical methods comprised unsupervised hierarchical cluster analysis (HCA) and principal component analysis (PCA) to classify the samples. Supervised partial least squares discriminant analysis (PLS-DA) and partial least squares (PLS) regression were applied to discriminate different ripening stages and describe the development of volatiles during postharvest storage, respectively. Hereby, substantial chemical markers allowing for class separation were revealed. The workflow permitted the rapid distinction between premature green-ripe pineapples and postharvest-ripened sea-freighted fruits. Volatile profiles of fully ripe air-freighted pineapples were similar to those of green-ripe fruits postharvest ripened for 6 days after simulated sea freight export, after PCA with only two principal components. However, PCA considering also the third principal component allowed differentiation between air-freighted fruits and the four progressing postharvest maturity stages of sea-freighted pineapples.

  1. Microarray identifies ADAM family members as key responders to TGF-beta1 in alveolar epithelial cells.

    PubMed

    Keating, Dominic T; Sadlier, Denise M; Patricelli, Andrea; Smith, Sinead M; Walls, Dermot; Egan, Jim J; Doran, Peter P

    2006-09-01

    The molecular mechanisms of Idiopathic Pulmonary Fibrosis (IPF) remain elusive. Transforming Growth Factor beta 1 (TGF-beta1) is a key effector cytokine in the development of lung fibrosis. We used microarray and computational biology strategies to identify genes whose expression is significantly altered in alveolar epithelial cells (A549) in response to TGF-beta1, IL-4 and IL-13 and Epstein Barr virus. A549 cells were exposed to 10 ng/ml TGF-beta1, IL-4 and IL-13 at serial time points. Total RNA was used for hybridisation to Affymetrix Human Genome U133A microarrays. Each in vitro time-point was studied in duplicate and an average RMA value computed. Expression data for each time point were compared to control and a signal log ratio of 0.6 or greater taken to identify significant differential regulation. Using normalised RMA values and unsupervised Average Linkage Hierarchical Cluster Analysis, a list of 312 extracellular matrix (ECM) proteins or modulators of matrix turnover was curated via Onto-Compare and Gene-Ontology (GO) databases for baited cluster analysis of ECM associated genes. Interrogation of the dataset using ontological classification focused cluster analysis revealed coordinate differential expression of a large cohort of extracellular matrix associated genes. Of this grouping, members of the ADAM (A disintegrin and Metalloproteinase domain containing) family of genes were differentially expressed. ADAM gene expression was also identified in EBV infected A549 cells as well as IL-13 and IL-4 stimulated cells. We probed pathologenomic activities (activation and functional activity) of ADAM19 and ADAMTS9 using siRNA and collagen assays. Knockdown of these genes resulted in diminished production of collagen in A549 cells exposed to TGF-beta1, suggesting a potential role for these molecules in ECM accumulation in IPF.
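    The thresholding step described in this abstract can be sketched as follows. The sketch assumes that the averaged RMA values are on a log2 scale, so that the signal log ratio is the difference between treated and control values, and that a magnitude of 0.6 in either direction counts as differential regulation; the abstract does not say whether down-regulation was treated symmetrically. The expression values and function names are hypothetical.

```python
def signal_log_ratios(treated, control):
    """Per-gene log ratio, assuming both inputs are log2-scale values."""
    return {gene: treated[gene] - control[gene]
            for gene in treated if gene in control}

def differentially_regulated(treated, control, threshold=0.6):
    """Genes whose signal log ratio magnitude meets the threshold."""
    ratios = signal_log_ratios(treated, control)
    return {gene: r for gene, r in ratios.items() if abs(r) >= threshold}

# Hypothetical averaged log2 expression values for one time point.
treated = {"ADAM19": 9.1, "ADAMTS9": 7.0, "GAPDH": 12.0}
control = {"ADAM19": 8.0, "ADAMTS9": 8.0, "GAPDH": 12.1}
hits = differentially_regulated(treated, control)
```

    With these made-up numbers the first two genes pass the threshold and the third does not; on a log2 scale a ratio of 0.6 corresponds to roughly a 1.5-fold change.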

  2. Unsupervised machine learning account of magnetic transitions in the Hubbard model

    NASA Astrophysics Data System (ADS)

    Ch'ng, Kelvin; Vazquez, Nick; Khatami, Ehsan

    2018-01-01

    We employ several unsupervised machine learning techniques, including autoencoders, random trees embedding, and t-distributed stochastic neighboring ensemble (t-SNE), to reduce the dimensionality of, and therefore classify, raw (auxiliary) spin configurations generated, through Monte Carlo simulations of small clusters, for the Ising and Fermi-Hubbard models at finite temperatures. Results from a convolutional autoencoder for the three-dimensional Ising model can be shown to produce the magnetization and the susceptibility as a function of temperature with a high degree of accuracy. Quantum fluctuations distort this picture and prevent us from making such connections between the output of the autoencoder and physical observables for the Hubbard model. However, we are able to define an indicator based on the output of the t-SNE algorithm that shows a near perfect agreement with the antiferromagnetic structure factor of the model in two and three spatial dimensions in the weak-coupling regime. t-SNE also predicts a transition to the canted antiferromagnetic phase for the three-dimensional model when a strong magnetic field is present. We show that these techniques cannot be expected to work away from half filling when the "sign problem" in quantum Monte Carlo simulations is present.

  3. Evolution patterns and parameter regimes in edge localized modes on the National Spherical Torus Experiment

    DOE Data Explorer

    Smith, D. R. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); Bell, R. E. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); Podesta, M. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); Smith, D. R. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); Fonck, R. J. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); McKee, G. R. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); Diallo, A. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); Kaye, S. M. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); LeBlanc, B. P. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States); Sabbagh, S. A. [Princeton Plasma Physics Lab. (PPPL), Princeton, NJ (United States)

    2015-09-01

    We implement unsupervised machine learning techniques to identify characteristic evolution patterns and associated parameter regimes in edge localized mode (ELM) events observed on the National Spherical Torus Experiment. Multi-channel, localized measurements spanning the pedestal region capture the complex evolution patterns of ELM events on Alfven timescales. Some ELM events are active for less than 100 μs, but others persist for up to 1 ms. Also, some ELM events exhibit a single dominant perturbation, but others are oscillatory. Clustering calculations with time-series similarity metrics indicate the ELM database contains at least two and possibly three groups of ELMs with similar evolution patterns. The identified ELM groups trigger similar stored energy loss, but the groups occupy distinct parameter regimes for ELM-relevant quantities like plasma current, triangularity, and pedestal height. Notably, the pedestal electron pressure gradient is not an effective parameter for distinguishing the ELM groups, but the ELM groups segregate in terms of electron density gradient and electron temperature gradient. The ELM evolution patterns and corresponding parameter regimes can shape the formulation or validation of nonlinear ELM models. Finally, the techniques and results demonstrate an application of unsupervised machine learning at a data-rich fusion facility.

  4. Why so GLUMM? Detecting depression clusters through graphing lifestyle-environs using machine-learning methods (GLUMM).

    PubMed

    Dipnall, J F; Pasco, J A; Berk, M; Williams, L J; Dodd, S; Jacka, F N; Meyer, D

    2017-01-01

    Key lifestyle-environ risk factors are operative for depression, but it is unclear how risk factors cluster. Machine-learning (ML) algorithms exist that learn, extract, identify and map underlying patterns to identify groupings of depressed individuals without constraints. The aim of this research was to use a large epidemiological study to identify and characterise depression clusters through "Graphing lifestyle-environs using machine-learning methods" (GLUMM). Two ML algorithms were implemented: unsupervised Self-organised mapping (SOM) to create GLUMM clusters and a supervised boosted regression algorithm to describe clusters. Ninety-six "lifestyle-environ" variables were used from the National health and nutrition examination study (2009-2010). Multivariate logistic regression validated clusters and controlled for possible sociodemographic confounders. The SOM identified two GLUMM cluster solutions. These solutions contained one dominant depressed cluster (GLUMM5-1, GLUMM7-1). Equal proportions of members in each cluster rated as highly depressed (17%). Alcohol consumption and demographics validated clusters. Boosted regression identified GLUMM5-1 as more informative than GLUMM7-1. Members were more likely to: have problems sleeping; unhealthy eating; ≤2 years in their home; an old home; perceive themselves underweight; exposed to work fumes; experienced sex at ≤14 years; not perform moderate recreational activities. A positive relationship between GLUMM5-1 (OR: 7.50, P<0.001) and GLUMM7-1 (OR: 7.88, P<0.001) with depression was found, with significant interactions with those married/living with partner (P=0.001). Using ML based GLUMM to form ordered depressive clusters from multitudinous lifestyle-environ variables enabled a deeper exploration of the heterogeneous data to uncover better understandings into relationships between the complex mental health factors. Copyright © 2016 Elsevier Masson SAS. All rights reserved.

  5. Time-resolved metabolomics reveals metabolic modulation in rice foliage

    PubMed Central

    Sato, Shigeru; Arita, Masanori; Soga, Tomoyoshi; Nishioka, Takaaki; Tomita, Masaru

    2008-01-01

    Background To elucidate the interaction of dynamics among modules that constitute biological systems, comprehensive datasets obtained from "omics" technologies have been used. In recent plant metabolomics approaches, the reconstruction of metabolic correlation networks has been attempted using statistical techniques. However, the results were unsatisfactory and effective data-mining techniques that apply appropriate comprehensive datasets are needed. Results Using capillary electrophoresis mass spectrometry (CE-MS) and capillary electrophoresis diode-array detection (CE-DAD), we analyzed the dynamic changes in the level of 56 basic metabolites in plant foliage (Oryza sativa L. ssp. japonica) at hourly intervals over a 24-hr period. Unsupervised clustering of comprehensive metabolic profiles using Kohonen's self-organizing map (SOM) allowed classification of the biochemical pathways activated by the light and dark cycle. The carbon and nitrogen (C/N) metabolism in both periods was also visualized as a phenotypic linkage map that connects network modules on the basis of traditional metabolic pathways rather than pairwise correlations among metabolites. The regulatory networks of C/N assimilation/dissimilation at each time point were consistent with previous works on plant metabolism. In response to environmental stress, glutathione and spermidine fluctuated synchronously with their regulatory targets. Adenine nucleosides and nicotinamide coenzymes were regulated by phosphorylation and dephosphorylation. We also demonstrated that SOM analysis was applicable to the estimation of unidentifiable metabolites in metabolome analysis. Hierarchical clustering of a correlation coefficient matrix could help identify the bottleneck enzymes that regulate metabolic networks. 
    Conclusion Our results showed that our SOM analysis with appropriate metabolic time-courses effectively revealed the synchronous dynamics among metabolic modules and elucidated the underlying biochemical functions. The application of discrimination of unidentified metabolites and the identification of bottleneck enzymatic steps even to non-targeted comprehensive analysis promise to facilitate an understanding of large-scale interactions among components in biological systems. PMID:18564421

  6. Rapid and sensitive analysis of 27 underivatized free amino acids, dipeptides, and tripeptides in fruits of Siraitia grosvenorii Swingle using HILIC-UHPLC-QTRAP(®)/MS(2) combined with chemometrics methods.

    PubMed

    Zhou, Guisheng; Wang, Mengyue; Li, Yang; Peng, Ying; Li, Xiaobo

    2015-08-01

    In the present study, a new strategy based on chemical analysis and chemometrics methods was proposed for the comprehensive analysis and profiling of underivatized free amino acids (FAAs) and small peptides among various Luo-Han-Guo (LHG) samples. Firstly, the ultrasound-assisted extraction (UAE) parameters were optimized using Plackett-Burman (PB) screening and Box-Behnken designs (BBD), and the following optimal UAE conditions were obtained: ultrasound power of 280 W, extraction time of 43 min, and the solid-liquid ratio of 302 mL/g. Secondly, a rapid and sensitive analytical method was developed for simultaneous quantification of 24 FAAs and 3 active small peptides in LHG at trace levels using hydrophilic interaction ultra-performance liquid chromatography coupled with triple-quadrupole linear ion-trap tandem mass spectrometry (HILIC-UHPLC-QTRAP(®)/MS(2)). The analytical method was validated by matrix effects, linearity, LODs, LOQs, precision, repeatability, stability, and recovery. Thirdly, the proposed optimal UAE conditions and analytical methods were applied to measurement of LHG samples. It was shown that LHG was rich in essential amino acids, which were beneficial nutrient substances for human health. Finally, based on the contents of the 27 analytes, the chemometrics methods of unsupervised principal component analysis (PCA) and supervised counter propagation artificial neural network (CP-ANN) were applied to differentiate and classify the 40 batches of LHG samples from different cultivated forms, regions, and varieties. As a result, these samples were mainly clustered into three clusters, which illustrated the cultivating disparity among the samples. In summary, the presented strategy had potential for the investigation of edible plants and agricultural products containing FAAs and small peptides.

  7. An extended transfer operator approach to identify separatrices in open flows

    NASA Astrophysics Data System (ADS)

    Lünsmann, Benedict; Kantz, Holger

    2018-05-01

    Vortices of coherent fluid volume are considered to have a substantial impact on transport processes in turbulent media. Yet, due to their Lagrangian nature, detecting these structures is highly nontrivial. In this respect, transfer operator approaches have been proven to provide useful tools: Approximating a possibly time-dependent flow as a discrete Markov process in space and time, information about coherent structures is contained in the operator's eigenvectors, which is usually extracted by employing clustering methods. Here, we propose an extended approach that couples surrounding filaments using "mixing boundary conditions" and focuses on the separation of the inner coherent set and embedding outer flow. The approach refrains from using unsupervised machine learning techniques such as clustering and uses physical arguments by maximizing a coherence ratio instead. We show that this technique improves the reconstruction of separatrices in stationary open flows and succeeds in finding almost-invariant sets in periodically perturbed flows.

  8. Automatic Cell Segmentation Using a Shape-Classification Model in Immunohistochemically Stained Cytological Images

    NASA Astrophysics Data System (ADS)

    Shah, Shishir

    This paper presents a segmentation method for detecting cells in immunohistochemically stained cytological images. A two-phase approach to segmentation is used where an unsupervised clustering approach coupled with cluster merging based on a fitness function is used as the first phase to obtain a first approximation of the cell locations. A joint segmentation-classification approach incorporating ellipse as a shape model is used as the second phase to detect the final cell contour. The segmentation model estimates a multivariate density function of low-level image features from training samples and uses it as a measure of how likely each image pixel is to be a cell. This estimate is constrained by the zero level set, which is obtained as a solution to an implicit representation of an ellipse. Results of segmentation are presented and compared to ground truth measurements.

  9. The effect of the atmosphere on the classification of satellite observations to identify surface features

    NASA Technical Reports Server (NTRS)

    Fraser, R. S.; Bahethi, O. P.; Al-Abbas, A. H.

    1977-01-01

    The effect of differences in atmospheric turbidity on the classification of Landsat 1 observations of a rural scene is presented. The observations are classified by an unsupervised clustering technique. These clusters serve as a training set for use of a maximum-likelihood algorithm. The measured radiances in each of the four spectral bands are then changed by amounts measured by Landsat 1. These changes can be associated with a decrease in atmospheric turbidity by a factor of 1.3. The classification of 22% of the pixels changes as a result of the modification. The modified observations are then reclassified as an independent set. Only 3% of the pixels have a different classification than the unmodified set. Hence, if classification errors of rural areas are not to exceed 15%, a new training set has to be developed whenever the difference in turbidity between the training and test sets reaches unity.

  10. Correlation between aircraft MSS and LIDAR remotely sensed data on a forested wetland in South Carolina

    NASA Technical Reports Server (NTRS)

    Jensen, John R.; Hodgson, Michael E.; Mackey, Halkard E., Jr.; Krabill, William

    1987-01-01

    Wetlands in a portion of the Savannah River swamp forest, the Steel Creek Delta, were mapped using April 26, 1985 high-resolution aircraft multispectral scanner (MSS) data. Due to the complex spectral characteristics of the wetland vegetation, it was necessary to implement several techniques in the classification of the MSS imagery of the Steel Creek Delta. In particular, when performing unsupervised classification, an iterative cluster busting technique was used which simplified the cluster labeling process. In addition to the MSS data, light detecting and ranging (LIDAR) data were acquired by National Aeronautics and Space Administration (NASA) personnel along two flightlines over the Steel Creek Delta. These data were registered with the wetland classification map and correlated. Statistical analyses demonstrated that the laser derived canopy height information was significantly correlated with the Steel Creek Delta wetland classes encountered along the profiling transect of the LIDAR data.

  11. Quantify spatial relations to discover handwritten graphical symbols

    NASA Astrophysics Data System (ADS)

    Li, Jinpeng; Mouchère, Harold; Viard-Gaudin, Christian

    2012-01-01

    To model a handwritten graphical language, spatial relations describe how the strokes are positioned in the 2-dimensional space. Most existing handwriting recognition systems make use of some predefined spatial relations. However, for a complex graphical language, it is hard to express all the spatial relations manually. Another possibility would be to use a clustering technique to discover the spatial relations. In this paper, we discuss how to create a relational graph between strokes (nodes) labeled with graphemes in a graphical language. Then we vectorize spatial relations (edges) for clustering and quantization. As the targeted application, we extract the repetitive sub-graphs (graphical symbols) composed of graphemes and learned spatial relations. On two handwriting databases, a simple mathematical expression database and a complex flowchart database, the unsupervised spatial relations outperform the predefined spatial relations. In addition, we visualize the frequent patterns on two text-lines containing Chinese characters.

  12. Flexible Kernel Memory

    PubMed Central

    Nowicki, Dimitri; Siegelmann, Hava

    2010-01-01

    This paper introduces a new model of associative memory, capable of both binary and continuous-valued inputs. Based on kernel theory, the memory model is on one hand a generalization of Radial Basis Function networks and, on the other, is in feature space, analogous to a Hopfield network. Attractors can be added, deleted, and updated on-line simply, without harming existing memories, and the number of attractors is independent of input dimension. Input vectors do not have to adhere to a fixed or bounded dimensionality; they can increase and decrease it without relearning previous memories. A memory consolidation process enables the network to generalize concepts and form clusters of input data, which outperforms many unsupervised clustering techniques; this process is demonstrated on handwritten digits from MNIST. Another process, reminiscent of memory reconsolidation is introduced, in which existing memories are refreshed and tuned with new inputs; this process is demonstrated on series of morphed faces. PMID:20552013

  13. Characterization of computer network events through simultaneous feature selection and clustering of intrusion alerts

    NASA Astrophysics Data System (ADS)

    Chen, Siyue; Leung, Henry; Dondo, Maxwell

    2014-05-01

    As computer network security threats increase, many organizations implement multiple Network Intrusion Detection Systems (NIDS) to maximize the likelihood of intrusion detection and provide a comprehensive understanding of intrusion activities. However, NIDS trigger a massive number of alerts on a daily basis. This can be overwhelming for computer network security analysts since it is a slow and tedious process to manually analyse each alert produced. Thus, automated and intelligent clustering of alerts is important to reveal the structural correlation of events by grouping alerts with common features. As the nature of computer network attacks, and therefore alerts, is not known in advance, unsupervised alert clustering is a promising approach to achieve this goal. We propose a joint optimization technique for feature selection and clustering to aggregate similar alerts and to reduce the number of alerts that analysts have to handle individually. More precisely, each identified feature is assigned a binary value, which reflects the feature's saliency. This value is treated as a hidden variable and incorporated into a likelihood function for clustering. Since computing the optimal solution of the likelihood function directly is analytically intractable, we use the Expectation-Maximisation (EM) algorithm to iteratively update the hidden variable and use it to maximize the expected likelihood. Our empirical results, using a labelled Defense Advanced Research Projects Agency (DARPA) 2000 reference dataset, show that the proposed method gives better results than the EM clustering without feature selection in terms of the clustering accuracy.

  14. Dissecting psychiatric spectrum disorders by generative embedding

    PubMed Central

    Brodersen, Kay H.; Deserno, Lorenz; Schlagenhauf, Florian; Lin, Zhihao; Penny, Will D.; Buhmann, Joachim M.; Stephan, Klaas E.

    2013-01-01

    This proof-of-concept study examines the feasibility of defining subgroups in psychiatric spectrum disorders by generative embedding, using dynamical system models which infer neuronal circuit mechanisms from neuroimaging data. To this end, we re-analysed an fMRI dataset of 41 patients diagnosed with schizophrenia and 42 healthy controls performing a numerical n-back working-memory task. In our generative-embedding approach, we used parameter estimates from a dynamic causal model (DCM) of a visual–parietal–prefrontal network to define a model-based feature space for the subsequent application of supervised and unsupervised learning techniques. First, using a linear support vector machine for classification, we were able to predict individual diagnostic labels significantly more accurately (78%) from DCM-based effective connectivity estimates than from functional connectivity between (62%) or local activity within the same regions (55%). Second, an unsupervised approach based on variational Bayesian Gaussian mixture modelling provided evidence for two clusters which mapped onto patients and controls with nearly the same accuracy (71%) as the supervised approach. Finally, when restricting the analysis only to the patients, Gaussian mixture modelling suggested the existence of three patient subgroups, each of which was characterised by a different architecture of the visual–parietal–prefrontal working-memory network. Critically, even though this analysis did not have access to information about the patients' clinical symptoms, the three neurophysiologically defined subgroups mapped onto three clinically distinct subgroups, distinguished by significant differences in negative symptom severity, as assessed on the Positive and Negative Syndrome Scale (PANSS). 
    In summary, this study provides a concrete example of how psychiatric spectrum diseases may be split into subgroups that are defined in terms of neurophysiological mechanisms specified by a generative model of network dynamics such as DCM. The results corroborate our previous findings in stroke patients that generative embedding, compared to analyses of more conventional measures such as functional connectivity or regional activity, can significantly enhance both the interpretability and performance of computational approaches to clinical classification. PMID:24363992

  15. MutSα's Multi-Domain Allosteric Response to Three DNA Damage Types Revealed by Machine Learning

    NASA Astrophysics Data System (ADS)

    Melvin, Ryan L.; Thompson, William G.; Godwin, Ryan C.; Gmeiner, William H.; Salsbury, Freddie R.

    2017-03-01

    MutSalpha is a key component in the mismatch repair (MMR) pathway. This protein is responsible for initiating the signaling pathways for DNA repair or cell death. Herein we investigate this heterodimer’s post-recognition, post-binding response to three types of DNA damage involving cytotoxic, anti-cancer agents - carboplatin, cisplatin, and FdU. Through a combination of supervised and unsupervised machine learning techniques along with more traditional structural and kinetic analysis applied to all-atom molecular dynamics (MD) calculations, we predict that MutSalpha has a distinct response to each of the three damage types. Via a binary classification tree (a supervised machine learning technique), we identify key hydrogen bond motifs unique to each type of damage and suggest residues for experimental mutation studies. Through a combination of a recently developed clustering (unsupervised learning) algorithm, RMSF calculations, PCA, and correlated motions, we predict that each type of damage causes MutSα to explore a specific region of conformation space. Detailed analysis suggests a short-range effect for carboplatin - primarily altering the structures and kinetics of residues within 10 angstroms of the damaged DNA - and distinct longer-range effects for cisplatin and FdU. In our simulations, we also observe that a key phenylalanine residue - known to stack with mismatched or unmatched bases in MMR - stacks with the base complementary to the damaged base in 88.61% of MD frames containing carboplatinated DNA. Similarly, this Phe71 stacks with the base complementary to damage in 91.73% of frames with cisplatinated DNA. This residue, however, stacks with the damaged base itself in 62.18% of trajectory frames with FdU-substituted DNA and has no stacking interaction at all in 30.72% of these frames.
    Each drug investigated here induces a unique perturbation in the MutSα complex, indicating the possibility of a distinct signaling event and specific repair or death pathway (or set of pathways) for a given type of damage.

  16. AHaH computing-from metastable switches to attractors to machine learning.

    PubMed

    Nugent, Michael Alexander; Molter, Timothy Wesley

    2014-01-01

    Modern computing architecture based on the separation of memory and processing leads to a well known problem called the von Neumann bottleneck, a restrictive limit on the data bandwidth between CPU and RAM. This paper introduces a new approach to computing we call AHaH computing where memory and processing are combined. The idea is based on the attractor dynamics of volatile dissipative electronics inspired by biological systems, presenting an attractive alternative architecture that is able to adapt, self-repair, and learn from interactions with the environment. We envision that both von Neumann and AHaH computing architectures will operate together on the same machine, but that the AHaH computing processor may reduce the power consumption and processing time for certain adaptive learning tasks by orders of magnitude. The paper begins by drawing a connection between the properties of volatility, thermodynamics, and Anti-Hebbian and Hebbian (AHaH) plasticity. We show how AHaH synaptic plasticity leads to attractor states that extract the independent components of applied data streams and how they form a computationally complete set of logic functions. After introducing a general memristive device model based on collections of metastable switches, we show how adaptive synaptic weights can be formed from differential pairs of incremental memristors. We also disclose how arrays of synaptic weights can be used to build a neural node circuit operating AHaH plasticity. By configuring the attractor states of the AHaH node in different ways, high level machine learning functions are demonstrated. This includes unsupervised clustering, supervised and unsupervised classification, complex signal prediction, unsupervised robotic actuation and combinatorial optimization of procedures-all key capabilities of biological nervous systems and modern machine learning algorithms with real world application.

  17. AHaH Computing–From Metastable Switches to Attractors to Machine Learning

    PubMed Central

    Nugent, Michael Alexander; Molter, Timothy Wesley

    2014-01-01

    Modern computing architecture based on the separation of memory and processing leads to a well known problem called the von Neumann bottleneck, a restrictive limit on the data bandwidth between CPU and RAM. This paper introduces a new approach to computing we call AHaH computing where memory and processing are combined. The idea is based on the attractor dynamics of volatile dissipative electronics inspired by biological systems, presenting an attractive alternative architecture that is able to adapt, self-repair, and learn from interactions with the environment. We envision that both von Neumann and AHaH computing architectures will operate together on the same machine, but that the AHaH computing processor may reduce the power consumption and processing time for certain adaptive learning tasks by orders of magnitude. The paper begins by drawing a connection between the properties of volatility, thermodynamics, and Anti-Hebbian and Hebbian (AHaH) plasticity. We show how AHaH synaptic plasticity leads to attractor states that extract the independent components of applied data streams and how they form a computationally complete set of logic functions. After introducing a general memristive device model based on collections of metastable switches, we show how adaptive synaptic weights can be formed from differential pairs of incremental memristors. We also disclose how arrays of synaptic weights can be used to build a neural node circuit operating AHaH plasticity. By configuring the attractor states of the AHaH node in different ways, high level machine learning functions are demonstrated. This includes unsupervised clustering, supervised and unsupervised classification, complex signal prediction, unsupervised robotic actuation and combinatorial optimization of procedures–all key capabilities of biological nervous systems and modern machine learning algorithms with real world application. PMID:24520315

  18. Iterative quantization: a Procrustean approach to learning binary codes for large-scale image retrieval.

    PubMed

    Gong, Yunchao; Lazebnik, Svetlana; Gordo, Albert; Perronnin, Florent

    2013-12-01

    This paper addresses the problem of learning similarity-preserving binary codes for efficient similarity search in large-scale image collections. We formulate this problem in terms of finding a rotation of zero-centered data so as to minimize the quantization error of mapping this data to the vertices of a zero-centered binary hypercube, and propose a simple and efficient alternating minimization algorithm to accomplish this task. This algorithm, dubbed iterative quantization (ITQ), has connections to multiclass spectral clustering and to the orthogonal Procrustes problem, and it can be used both with unsupervised data embeddings such as PCA and supervised embeddings such as canonical correlation analysis (CCA). The resulting binary codes significantly outperform several other state-of-the-art methods. We also show that further performance improvements can result from transforming the data with a nonlinear kernel mapping prior to PCA or CCA. Finally, we demonstrate an application of ITQ to learning binary attributes or "classemes" on the ImageNet data set.

  19. Weak Genetic Structure in Northern African Dromedary Camels Reflects Their Unique Evolutionary History

    PubMed Central

    Cherifi, Youcef Amine; Gaouar, Suheil Bechir Semir; Guastamacchia, Rosangela; El-Bahrawy, Khalid Ahmed; Abushady, Asmaa Mohammed Aly; Sharaf, Abdoallah Aboelnasr; Harek, Derradji; Lacalandra, Giovanni Michele; Saïdi-Mehtar, Nadhira

    2017-01-01

    Knowledge of the genetic diversity and structure of camel populations is fundamental for sustainable herd management and breeding program implementation in this species. Here we characterized a total of 331 camels from Northern Africa, representative of six populations and thirteen Algerian and Egyptian geographic regions, using 20 STR markers. The nineteen polymorphic loci displayed an average of 9.79 ± 5.31 alleles, ranging from 2 (CVRL8) to 24 (CVRL1D). Average He was 0.647 ± 0.173. Eleven loci deviated significantly from Hardy-Weinberg proportions (P<0.05), due to excess of homozygous genotypes in all cases except one (CMS18). Distribution of genetic diversity along a weak geographic gradient, as suggested by network analysis, was not supported by either unsupervised or supervised Bayesian clustering. Traditional extensive/nomadic herding practices, together with the historical use as a long-range beast of burden and its peculiar evolutionary history, with domestication likely occurring from a bottlenecked and geographically confined wild progenitor, may explain the observed genetic patterns. PMID:28103238
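
    The He value quoted above is an expected heterozygosity. As a minimal sketch, assuming the textbook definition (one minus the sum of squared allele frequencies at a locus) rather than whatever estimator the authors actually used, it can be computed from a list of observed alleles. The function name and the toy allele labels are hypothetical.

```python
from collections import Counter

def expected_heterozygosity(alleles):
    """Textbook expected heterozygosity at one locus: 1 - sum(p_i ** 2).

    alleles: list of allele labels observed at the locus (two per diploid animal).
    This is the plain (biased) form; an unbiased small-sample correction exists
    but is not applied here.
    """
    if not alleles:
        raise ValueError("at least one allele observation is required")
    counts = Counter(alleles)
    n = len(alleles)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

    Two equally frequent alleles give 0.5, and a monomorphic locus gives 0, which is why the single non-polymorphic marker contributes no diversity.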

  20. PANTHER. Trajectory Analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rintoul, Mark Daniel; Wilson, Andrew T.; Valicka, Christopher G.

    We want to organize a body of trajectories in order to identify, search for, classify and predict behavior among objects such as aircraft and ships. Existing comparison functions such as the Fréchet distance are computationally expensive and yield counterintuitive results in some cases. We propose an approach using feature vectors whose components represent succinctly the salient information in trajectories. These features incorporate basic information such as total distance traveled and distance between start/stop points as well as geometric features related to the properties of the convex hull, trajectory curvature and general distance geometry. Additionally, these features can generally be mapped easily to behaviors of interest to humans that are searching large databases. Most of these geometric features are invariant under rigid transformation. We demonstrate the use of different subsets of these features to identify trajectories similar to an exemplar, cluster a database of several hundred thousand trajectories, predict destination and apply unsupervised machine learning algorithms.
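
    Two of the basic features named above, total distance traveled and start/stop distance, can be sketched in a few lines. This is an illustrative sketch only, not the PANTHER implementation: the function name, the planar (x, y) coordinates, and the derived straightness ratio are assumptions introduced here.

```python
import math

def trajectory_features(points):
    """Return a small feature dictionary for a 2-D trajectory.

    points: list of (x, y) tuples in a planar coordinate system (an assumption;
    real aircraft or ship tracks would need a geodesic distance instead).
    """
    if len(points) < 2:
        raise ValueError("a trajectory needs at least two points")
    # Total distance traveled: sum of the lengths of consecutive segments.
    total = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    # Straight-line distance between the start and stop points.
    end_to_end = math.dist(points[0], points[-1])
    # Ratio in [0, 1]; 1 means a perfectly straight path (hypothetical extra feature).
    straightness = end_to_end / total if total > 0 else 0.0
    return {
        "total_distance": total,
        "end_to_end": end_to_end,
        "straightness": straightness,
    }
```

    Because every value is built from distances between points, translating or rotating the whole trajectory leaves the features unchanged, which matches the invariance under rigid transformation mentioned in the abstract.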

  1. A segmentation and classification scheme for single tooth in MicroCT images based on 3D level set and k-means++.

    PubMed

    Wang, Liansheng; Li, Shusheng; Chen, Rongzhen; Liu, Sze-Yu; Chen, Jyh-Cheng

    2017-04-01

    Accurate classification of different anatomical structures of teeth from medical images provides crucial information for the stress analysis in dentistry. Usually, the anatomical structures of teeth are manually labeled by experienced clinical doctors, which is time consuming. However, automatic segmentation and classification is a challenging task because the anatomical structures and surroundings of the tooth in medical images are rather complex. Therefore, in this paper, we propose an effective framework which is designed to segment the tooth with a Selective Binary and Gaussian Filtering Regularized Level Set (GFRLS) method improved by fully utilizing 3-dimensional (3D) information, and to classify the tooth by employing unsupervised learning, i.e., the k-means++ method. In order to evaluate the proposed method, the experiments are conducted on extensive datasets of mandibular molars. The experimental results show that our method can achieve higher accuracy and robustness compared with three other clustering methods. Copyright © 2016 Elsevier Ltd. All rights reserved.
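
    The k-means++ method differs from plain k-means in how the initial centers are chosen: each new center is drawn with probability proportional to its squared distance from the nearest center already picked. The sketch below shows only that seeding step, in its standard form; it is not taken from the paper, and the function names and the fixed random seed are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers from points using k-means++ seeding."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to a uniform pick.
            centers.append(rng.choice(points))
            continue
        # Far-away points are proportionally more likely to become a center.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

    The returned centers would then be handed to an ordinary k-means loop; points already chosen have weight zero, so distinct input points always yield distinct seeds.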

  2. Classification of simulated and actual NOAA-6 AVHRR data for hydrologic land-surface feature definition. [Advanced Very High Resolution Radiometer]

    NASA Technical Reports Server (NTRS)

    Ormsby, J. P.

    1982-01-01

    An examination of the possibilities of using Landsat data to simulate NOAA-6 Advanced Very High Resolution Radiometer (AVHRR) data on two channels, as well as using actual NOAA-6 imagery, for large-scale hydrological studies is presented. A running average of 18 consecutive pixels of 1 km resolution taken by the Landsat scanners was obtained, scaled up to 8-bit data, and investigated for different gray levels. AVHRR data comprising five channels of 10-bit, band-interleaved information covering 10 deg latitude were analyzed and a suitable pixel grid was chosen for comparison with the Landsat data in a supervised classification format, an unsupervised mode, and with ground truth. Landcover delineation was explored by removing snow, water, and cloud features from the cluster analysis, and resulted in less than 10% difference. Low-resolution, large-scale data were found to be useful for characterizing some landcover features if weekly and/or monthly updates are maintained.

  3. Experiments in automatic word class and word sense identification for information retrieval

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gauch, S.; Futrelle, R.P.

    Automatic identification of related words and automatic detection of word senses are two long-standing goals of researchers in natural language processing. Word class information and word sense identification may enhance the performance of information retrieval systems. Large online corpora and increased computational capabilities make new techniques based on corpus linguistics feasible. Corpus-based analysis is especially needed for corpora from specialized fields for which no electronic dictionaries or thesauri exist. The methods described here use a combination of mutual information and word context to establish word similarities. Then, unsupervised classification is done using clustering in the word space, identifying word classes without pretagging. We also describe an extension of the method to handle the difficult problems of disambiguation and of determining part-of-speech and semantic information for low-frequency words. The method is powerful enough to produce high-quality results on a small corpus of 200,000 words from abstracts in a field of molecular biology.
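
    To illustrate how mutual information can be read off raw corpus counts, the sketch below computes pointwise mutual information (PMI) for word pairs that occur within a short window. PMI is used here as a simple stand-in; the abstract does not specify the exact measure, and the function name, the window size, and the toy sentence are assumptions.

```python
import math
from collections import Counter

def pmi_table(tokens, window=1):
    """Pointwise mutual information for ordered word pairs within a window.

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities estimated
    from raw counts in the token list (no smoothing, so rare pairs score high).
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, w in enumerate(tokens):
        # Pair each word with the next `window` words to its right.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair_counts[(w, tokens[j])] += 1
    n_words = len(tokens)
    n_pairs = sum(pair_counts.values())
    pmi = {}
    for (a, b), c in pair_counts.items():
        p_ab = c / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi
```

    Each word can then be described by the vector of its PMI scores against context words, and those vectors are what a clustering step would group into word classes.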

  4. Mapping the Philippines' mangrove forests using Landsat imagery

    USGS Publications Warehouse

    Long, Jordan; Giri, Chandra

    2011-01-01

    Current, accurate, and reliable information on the areal extent and spatial distribution of mangrove forests in the Philippines is limited. Previous estimates of mangrove extent do not illustrate the spatial distribution for the entire country. This study, part of a global assessment of mangrove dynamics, mapped the spatial distribution and areal extent of the Philippines’ mangroves circa 2000. We used publicly available Landsat data acquired primarily from the Global Land Survey to map the total extent and spatial distribution. ISODATA clustering, an unsupervised classification technique, was applied to 61 Landsat images. Statistical analysis indicates the total area of mangrove forest cover was approximately 256,185 hectares circa 2000 with overall classification accuracy of 96.6% and a kappa coefficient of 0.926. These results differ substantially from most recent estimates of mangrove area in the Philippines. The results of this study may assist the decision making processes for rehabilitation and conservation efforts that are currently needed to protect and restore the Philippines’ degraded mangrove forests.
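
    The two quality figures reported above, overall classification accuracy and the kappa coefficient, are both derived from a confusion matrix. The sketch below uses the standard definition of Cohen's kappa; the function name and the 2x2 example matrix are made up for illustration and are not the study's data.

```python
def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] = number of samples of reference class i mapped to class j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the row and column marginals.
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

    For the hypothetical matrix [[45, 5], [10, 40]] this gives an accuracy of 0.85 and a kappa of 0.70, showing how kappa discounts the agreement that chance alone would produce.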

  5. Data Exploration using Unsupervised Feature Extraction for Mixed Micro-Seismic Signals

    NASA Astrophysics Data System (ADS)

    Meyer, Matthias; Weber, Samuel; Beutel, Jan

    2017-04-01

    We present a system for the analysis of data originating in a multi-sensor and multi-year experiment focusing on slope stability and its underlying processes in fractured permafrost rock walls undertaken at 3500 m a.s.l. on the Matterhorn Hörnligrat (Zermatt, Switzerland). This system incorporates facilities for the transmission, management and storage of large volumes of data (7 GB/day), preprocessing and aggregation of multiple sensor types, machine-learning based automatic feature extraction for micro-seismic and acoustic emission data and interactive web-based visualization of the data. Specifically, a combination of three types of sensors is used to profile the frequency spectrum from 1 Hz to 80 kHz with the goal of identifying the relevant destructive processes (e.g. micro-cracking and fracture propagation) leading to the eventual destabilization of large rock masses. The sensors installed for this profiling experiment (2 geophones, 1 accelerometer and 2 piezo-electric sensors for detecting acoustic emission) are further augmented with sensors originating from a previous activity focusing on long-term monitoring of temperature evolution and rock kinematics with the help of wireless sensor networks (crackmeters, cameras, weather station, rock temperature profiles, differential GPS) [Hasler2012]. In raw format, the data generated by the different types of sensors, specifically the micro-seismic and acoustic emission sensors, is strongly heterogeneous and in part unsynchronized, and the storage and processing demand is large. Therefore, a purpose-built signal preprocessing and event-detection system is used. While the analysis of data from each individual sensor follows established methods, the application of all these sensor types in combination within a field experiment is unique.
    Furthermore, experience and methods from using such sensors in laboratory settings cannot be readily transferred to the mountain field site setting with its scale and full exposure to the natural environment. Consequently, many state-of-the-art algorithms for big data analysis and event classification requiring a ground truth dataset cannot be applied. The above-mentioned challenges require a tool for data exploration. In the presented system, data exploration is supported by unsupervised feature learning based on convolutional neural networks, which is used to automatically extract common features for preliminary clustering and outlier detection. With this information, an interactive web-tool allows for a fast identification of interesting time segments on which segment-selective algorithms for visualization, feature extraction and statistics can be applied. The combination of manual labeling and unsupervised feature extraction provides an event catalog for classification of different characteristic events related to internal progression of micro-cracks in steep fractured bedrock permafrost. References Hasler, A., S. Gruber, and J. Beutel (2012), Kinematics of steep bedrock permafrost, J. Geophys. Res., 117, F01016, doi:10.1029/2011JF001981.

  6. Analysis of thematic mapper simulator data collected over eastern North Dakota

    NASA Technical Reports Server (NTRS)

    Anderson, J. E. (Principal Investigator)

    1982-01-01

    The results of the analysis of aircraft-acquired thematic mapper simulator (TMS) data, collected to investigate the utility of thematic mapper data in crop area and land cover estimates, are discussed. Results of the analysis indicate that the seven-channel TMS data are capable of delineating the 13 crop types included in the study to an overall pixel classification accuracy of 80.97% correct, with relative efficiencies for four crop types examined between 1.62 and 26.61. Both supervised and unsupervised spectral signature development techniques were evaluated. The unsupervised methods proved to be inferior (based on analysis of variance) for the majority of crop types considered. Given the ground truth data set used for spectral signature development as well as evaluation of performance, it is possible to demonstrate which signature development technique would produce the highest percent correct classification for each crop type.

  7. Model-Based Clustering of Regression Time Series Data via APECM -- An AECM Algorithm Sung to an Even Faster Beat

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, Wei-Chen; Maitra, Ranjan

    2011-01-01

    We propose a model-based approach for clustering time series regression data in an unsupervised machine learning framework to identify groups under the assumption that each mixture component follows a Gaussian autoregressive regression model of order p. Given the number of groups, the traditional maximum likelihood approach of estimating the parameters using the expectation-maximization (EM) algorithm can be employed, although it is computationally demanding. The somewhat fast tune to the EM folk song provided by the Alternating Expectation Conditional Maximization (AECM) algorithm can alleviate the problem to some extent. In this article, we develop an alternative partial expectation conditional maximization algorithm (APECM) that uses an additional data augmentation storage step to efficiently implement AECM for finite mixture models. Results on our simulation experiments show improved performance in both fewer numbers of iterations and computation time. The methodology is applied to the problem of clustering mutual funds data on the basis of their average annual per cent returns and in the presence of economic indicators.

  8. A New MI-Based Visualization Aided Validation Index for Mining Big Longitudinal Web Trial Data

    PubMed Central

    Zhang, Zhaoyang; Fang, Hua; Wang, Honggang

    2016-01-01

    Web-delivered clinical trials generate big complex data. To help untangle the heterogeneity of treatment effects, unsupervised learning methods have been widely applied. However, identifying valid patterns is a priority but challenging issue for these methods. This paper, built upon our previous research on multiple imputation (MI)-based fuzzy clustering and validation, proposes a new MI-based Visualization-aided validation index (MIVOOS) to determine the optimal number of clusters for big incomplete longitudinal Web-trial data with inflated zeros. Unlike a recently developed fuzzy clustering validation index, MIVOOS uses more suitable overlap and separation measures for Web-trial data and, unlike the widely used Xie and Beni (XB) index, does not depend on the choice of fuzzifiers. Through optimizing the view angles of 3-D projections using Sammon mapping, the optimal 2-D projection-guided MIVOOS is obtained to better visualize and verify the patterns in conjunction with trajectory patterns. Compared with XB and VOS, our newly proposed MIVOOS shows its robustness in validating big Web-trial data under different missing data mechanisms using real and simulated Web-trial data. PMID:27482473

  9. Fast detection of vascular plaque in optical coherence tomography images using a reduced feature set

    NASA Astrophysics Data System (ADS)

    Prakash, Ammu; Ocana Macias, Mariano; Hewko, Mark; Sowa, Michael; Sherif, Sherif

    2018-03-01

    Optical coherence tomography (OCT) images can be used to detect vascular plaque by using the full set of 26 Haralick textural features and a standard K-means clustering algorithm. However, the use of the full set of 26 textural features is computationally expensive and may not be feasible for real-time implementation. In this work, we identified a reduced set of 3 textural features that characterizes vascular plaque and used a generalized Fuzzy C-means clustering algorithm. Our work involves three steps: 1) the reduction of the full set of 26 textural features to a reduced set of 3 textural features by using a genetic algorithm (GA) optimization method, 2) the implementation of an unsupervised generalized clustering algorithm (Fuzzy C-means) on the reduced feature space, and 3) the validation of our results using histology and actual photographic images of vascular plaque. Our results show an excellent match with histology and actual photographic images of vascular tissue. Therefore, our results could provide an efficient pre-clinical tool for the detection of vascular plaque in real-time OCT imaging.
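
    Unlike K-means, Fuzzy C-means gives each sample a graded membership in every cluster. The sketch below shows the membership update of the standard Fuzzy C-means algorithm for a single feature vector; the generalized variant used in the paper is not described in the abstract, so this is only the textbook form, and the function name and the default fuzzifier m = 2 are assumptions.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard Fuzzy C-means memberships of one point in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the Euclidean
    distance from the point to center i and m > 1 is the fuzzifier.
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    d = [math.dist(point, c) for c in centers]
    zeros = [i for i, di in enumerate(d) if di == 0.0]
    if zeros:
        # The point sits exactly on one or more centers: share membership there.
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(d))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** power for dk in d) for di in d]
```

    With a 3-element feature vector per pixel, as in the reduced feature set, each membership computation touches only three coordinates per center, which is where the saving over 26 features would come from.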

  10. An unsupervised classification scheme for improving predictions of prokaryotic TIS.

    PubMed

    Tech, Maike; Meinicke, Peter

    2006-03-09

    Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation of prokaryotic TIS. However, inherent difficulties of these approaches arise from the considerable variation of TIS characteristics across different species. Therefore prior assumptions about the properties of prokaryotic gene starts may cause suboptimal predictions for newly sequenced genomes with TIS signals differing from those of well-investigated genomes. We introduce a clustering algorithm for completely unsupervised scoring of potential TIS, based on positionally smoothed probability matrices. The algorithm requires an initial gene prediction and the genomic sequence of the organism to perform the reannotation. As compared with other methods for improving predictions of gene starts in bacterial genomes, our approach is not based on any specific assumptions about prokaryotic TIS. Despite the generality of the underlying algorithm, the prediction rate of our method is competitive on experimentally verified test data from E. coli and B. subtilis. Regarding genomes with high G+C content, in contrast to some previously proposed methods, our algorithm also provides good performance on P. aeruginosa, B. pseudomallei and R. solanacearum. On reliable test data we showed that our method provides good results in post-processing the predictions of the widely-used program GLIMMER. The underlying clustering algorithm is robust with respect to variations in the initial TIS annotation and does not require specific assumptions about prokaryotic gene starts. These features are particularly useful on genomes with high G+C content. The algorithm has been implemented in the tool "TICO" (TIs COrrector) which is publicly available from our web site.

  11. Classification and unsupervised clustering of LIGO data with Deep Transfer Learning

    NASA Astrophysics Data System (ADS)

    George, Daniel; Shen, Hongyu; Huerta, E. A.

    2018-05-01

    Gravitational wave detection requires a detailed understanding of the response of the LIGO and Virgo detectors to true signals in the presence of environmental and instrumental noise. Of particular interest is the study of anomalous non-Gaussian transients, such as glitches, since their occurrence rate in LIGO and Virgo data can obscure or even mimic true gravitational wave signals. Therefore, successfully identifying and excising these anomalies from gravitational wave data is of utmost importance for the detection and characterization of true signals and for the accurate computation of their significance. To facilitate this work, we present the first application of deep learning combined with transfer learning to show that knowledge from pretrained models for real-world object recognition can be transferred for classifying spectrograms of glitches. To showcase this new method, we use a data set of twenty-two classes of glitches, curated and labeled by the Gravity Spy project using data collected during LIGO's first discovery campaign. We demonstrate that our Deep Transfer Learning method enables an optimal use of very deep convolutional neural networks for glitch classification given small and unbalanced training data sets, significantly reduces the training time, and achieves state-of-the-art accuracy above 98.8%, lowering the previous error rate by over 60%. More importantly, once trained via transfer learning on the known classes, we show that our neural networks can be truncated and used as feature extractors for unsupervised clustering to automatically group together new unknown classes of glitches and anomalous signals. This novel capability is of paramount importance to identify and remove new types of glitches which will occur as the LIGO/Virgo detectors gradually attain design sensitivity.

  12. Spatio-spectral classification of hyperspectral images for brain cancer detection during surgical operations.

    PubMed

    Fabelo, Himar; Ortega, Samuel; Ravi, Daniele; Kiran, B Ravi; Sosa, Coralia; Bulters, Diederik; Callicó, Gustavo M; Bulstrode, Harry; Szolna, Adam; Piñeiro, Juan F; Kabwama, Silvester; Madroñal, Daniel; Lazcano, Raquel; J-O'Shanahan, Aruma; Bisshopp, Sara; Hernández, María; Báez, Abelardo; Yang, Guang-Zhong; Stanciulescu, Bogdan; Salvador, Rubén; Juárez, Eduardo; Sarmiento, Roberto

    2018-01-01

    Surgery for brain cancer is a major problem in neurosurgery. The diffuse infiltration into the surrounding normal brain by these tumors makes their accurate identification by the naked eye difficult. Since surgery is the common treatment for brain cancer, an accurate radical resection of the tumor leads to improved survival rates for patients. However, the identification of the tumor boundaries during surgery is challenging. Hyperspectral imaging is a non-contact, non-ionizing and non-invasive technique suitable for medical diagnosis. This study presents the development of a novel classification method taking into account the spatial and spectral characteristics of the hyperspectral images to help neurosurgeons to accurately determine the tumor boundaries in surgical-time during the resection, avoiding excessive excision of normal tissue or unintentionally leaving residual tumor. The algorithm proposed in this study to approach an efficient solution consists of a hybrid framework that combines both supervised and unsupervised machine learning methods. Firstly, a supervised pixel-wise classification using a Support Vector Machine classifier is performed. The generated classification map is spatially homogenized using a one-band representation of the HS cube, employing the Fixed Reference t-Stochastic Neighbors Embedding dimensional reduction algorithm, and performing a K-Nearest Neighbors filtering. The information generated by the supervised stage is combined with a segmentation map obtained via unsupervised clustering employing a Hierarchical K-Means algorithm. The fusion is performed using a majority voting approach that associates each cluster with a certain class. To evaluate the proposed approach, five hyperspectral images of surface of the brain affected by glioblastoma tumor in vivo from five different patients have been used. The final classification maps obtained have been analyzed and validated by specialists. 
    These preliminary results are promising, obtaining an accurate delineation of the tumor area.

  13. An unsupervised machine learning method for delineating stratum corneum in reflectance confocal microscopy stacks of human skin in vivo

    NASA Astrophysics Data System (ADS)

    Bozkurt, Alican; Kose, Kivanc; Fox, Christi A.; Dy, Jennifer; Brooks, Dana H.; Rajadhyaksha, Milind

    2016-02-01

    Study of the stratum corneum (SC) in human skin is important for research in barrier structure and function, drug delivery, and water permeability of skin. The optical sectioning and high resolution of reflectance confocal microscopy (RCM) allow visual examination of the SC non-invasively. Here, we present an unsupervised segmentation algorithm that can automatically delineate the thickness of the SC in RCM images of human skin in vivo. We mimic the clinician's visual process by applying a complex wavelet transform over non-overlapping local regions of size 16 x 16 μm, called tiles, and analyzing the textural changes between consecutive tiles in the axial (depth) direction. We use the dual-tree complex wavelet transform (DT-CWT) to represent textural structures in each tile. This transform is almost shift-invariant and directionally selective, which makes it highly efficient in texture representation. Using DT-CWT, we decompose each tile into 6 directional sub-bands with orientations of ±15, ±45, and ±75 degrees and a low-pass band, which is the decimated version of the input. We apply 3 scales of decomposition by recursively transforming the low-pass bands and obtain 18 bands of different directionality at different scales. We then calculate the mean and variance of each band, resulting in a feature vector of 36 entries. Feature vectors obtained for each stack of tiles in the axial direction are then clustered using spectral clustering in order to detect the textural changes in the depth direction. Testing on a set of 15 RCM stacks produced a mean error of 5.45 ± 1.32 μm, compared to the "ground truth" segmentation provided by a clinical expert reader.
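
    The feature construction in this abstract (18 directional sub-bands, each summarized by its mean and variance, giving 36 entries per tile) can be illustrated with a minimal sketch. The function name, the use of coefficient magnitudes, and the toy coefficients are assumptions made for illustration; a real pipeline would take the sub-bands from an actual DT-CWT implementation, which is not shown here.

```python
from statistics import fmean, pvariance

def tile_feature_vector(bands):
    """Concatenate the mean and variance of each sub-band into one vector."""
    features = []
    for coeffs in bands:
        # Assumption: statistics are taken over coefficient magnitudes.
        magnitudes = [abs(c) for c in coeffs]
        features.append(fmean(magnitudes))
        features.append(pvariance(magnitudes))
    return features

# 3 scales x 6 orientations = 18 sub-bands; toy complex values stand in
# for real DT-CWT output (hypothetical data).
toy_bands = [[complex(b + k, k) for k in range(4)] for b in range(18)]
vector = tile_feature_vector(toy_bands)
print(len(vector))  # 36
```

    One such vector per tile would then be passed to the clustering step; the spectral clustering itself is omitted from this sketch.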

  14. Molecular Subtypes of Glioblastoma Are Relevant to Lower Grade Glioma

    PubMed Central

    Sloan, Andrew E.; Chen, Yanwen; Brat, Daniel J.; O’Neill, Brian Patrick; de Groot, John; Yust-Katz, Shlomit; Yung, Wai-Kwan Alfred; Cohen, Mark L.; Aldape, Kenneth D.; Rosenfeld, Steven; Verhaak, Roeland G. W.; Barnholtz-Sloan, Jill S.

    2014-01-01

    Background Gliomas are the most common primary malignant brain tumors in adults with great heterogeneity in histopathology and clinical course. The intent was to evaluate the relevance of known glioblastoma (GBM) expression and methylation based subtypes to grade II and III gliomas (i.e., lower grade gliomas). Methods Gene expression array, single nucleotide polymorphism (SNP) array and clinical data were obtained for 228 GBMs and 176 grade II/III gliomas (GII/III) from the publicly available Rembrandt dataset. Two additional datasets with IDH1 mutation status were utilized as validation datasets (one publicly available dataset and one newly generated dataset from MD Anderson). Unsupervised clustering was performed and compared to gene expression subtypes assigned using the Verhaak et al 840-gene classifier. The glioma-CpG Island Methylator Phenotype (G-CIMP) was assigned using prediction models by Fine et al. Results Unsupervised clustering by gene expression aligned with the Verhaak 840-gene subtype group assignments. GII/IIIs were preferentially assigned to the proneural subtype with IDH1 mutation and G-CIMP. GBMs were evenly distributed among the four subtypes. Proneural, IDH1 mutant, G-CIMP GII/IIIs had significantly better survival than other molecular subtypes. Only 6% of GBMs were proneural and had either IDH1 mutation or G-CIMP but these tumors had significantly better survival than other GBMs. Copy number changes in chromosomes 1p and 19q were associated with GII/IIIs, while these changes in CDKN2A, PTEN and EGFR were more commonly associated with GBMs. Conclusions GBM gene-expression and methylation based subtypes are relevant for GII/IIIs and associate with overall survival differences. A better understanding of the association between these subtypes and GII/IIIs could further knowledge regarding prognosis and mechanisms of glioma progression. PMID:24614622

  15. Spatio-spectral classification of hyperspectral images for brain cancer detection during surgical operations

    PubMed Central

    Kabwama, Silvester; Madroñal, Daniel; Lazcano, Raquel; J-O’Shanahan, Aruma; Bisshopp, Sara; Hernández, María; Báez, Abelardo; Yang, Guang-Zhong; Stanciulescu, Bogdan; Salvador, Rubén; Juárez, Eduardo; Sarmiento, Roberto

    2018-01-01

    Surgery for brain cancer is a major problem in neurosurgery. The diffuse infiltration into the surrounding normal brain by these tumors makes their accurate identification by the naked eye difficult. Since surgery is the common treatment for brain cancer, an accurate radical resection of the tumor leads to improved survival rates for patients. However, the identification of the tumor boundaries during surgery is challenging. Hyperspectral imaging is a non-contact, non-ionizing and non-invasive technique suitable for medical diagnosis. This study presents the development of a novel classification method taking into account the spatial and spectral characteristics of the hyperspectral images to help neurosurgeons to accurately determine the tumor boundaries in surgical-time during the resection, avoiding excessive excision of normal tissue or unintentionally leaving residual tumor. The algorithm proposed in this study to approach an efficient solution consists of a hybrid framework that combines both supervised and unsupervised machine learning methods. Firstly, a supervised pixel-wise classification using a Support Vector Machine classifier is performed. The generated classification map is spatially homogenized using a one-band representation of the HS cube, employing the Fixed Reference t-Stochastic Neighbors Embedding dimensional reduction algorithm, and performing a K-Nearest Neighbors filtering. The information generated by the supervised stage is combined with a segmentation map obtained via unsupervised clustering employing a Hierarchical K-Means algorithm. The fusion is performed using a majority voting approach that associates each cluster with a certain class. To evaluate the proposed approach, five hyperspectral images of surface of the brain affected by glioblastoma tumor in vivo from five different patients have been used. The final classification maps obtained have been analyzed and validated by specialists. 
    These preliminary results are promising, obtaining an accurate delineation of the tumor area. PMID:29554126
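
    The fusion step in this abstract, a majority vote that associates each unsupervised cluster with one class from the supervised map, can be sketched as follows. This is only an illustration under stated assumptions: the function name, the flat per-pixel lists, and the toy labels are hypothetical and are not taken from the authors' implementation.

```python
from collections import Counter, defaultdict

def fuse_by_majority_vote(cluster_ids, class_labels):
    """Relabel every pixel with the most frequent class inside its cluster."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        votes[cluster][label] += 1
    # Ties go to the class seen first within the cluster.
    cluster_to_class = {
        cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()
    }
    return [cluster_to_class[cluster] for cluster in cluster_ids]

# Hypothetical per-pixel inputs: cluster ids from the unsupervised segmentation
# and class labels from the supervised classification map.
clusters = [0, 0, 0, 1, 1, 1, 1]
supervised_map = ["tumor", "tumor", "normal", "normal", "normal", "tumor", "normal"]
fused = fuse_by_majority_vote(clusters, supervised_map)
print(fused)
```

    In this toy case the isolated disagreeing pixel in each cluster is overruled by its neighbours, which is the homogenizing effect the fusion is meant to provide.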

  16. Two Clinical Phenotypes in Polycythemia Vera

    PubMed Central

    Spivak, Jerry L.; Considine, Michael; Williams, Donna M.; Talbot, Conover C.; Rogers, Ophelia; Moliterno, Alison R.; Jie, Chunfa; Ochs, Michael F.

    2014-01-01

    BACKGROUND Polycythemia vera is the ultimate phenotypic consequence of the V617F mutation in Janus kinase 2 (encoded by JAK2), but the extent to which this mutation influences the behavior of the involved CD34+ hematopoietic stem cells is unknown. METHODS We analyzed gene expression in CD34+ peripheral-blood cells from 19 patients with polycythemia vera, using oligonucleotide microarray technology after correcting for potential confounding by sex, since the phenotypic features of the disease differ between men and women. RESULTS Men with polycythemia vera had twice as many up-regulated or down-regulated genes as women with polycythemia vera, in a comparison of gene expression in the patients and in healthy persons of the same sex, but there were 102 genes with differential regulation that was concordant in men and women. When these genes were used for class discovery by means of unsupervised hierarchical clustering, the 19 patients could be divided into two groups that did not differ significantly with respect to age, neutrophil JAK2 V617F allele burden, white-cell count, platelet count, or clonal dominance. However, they did differ significantly with respect to disease duration; hemoglobin level; frequency of thromboembolic events, palpable splenomegaly, and splenectomy; chemotherapy exposure; leukemic transformation; and survival. The unsupervised clustering was confirmed by a supervised approach with the use of a top-scoring-pair classifier that segregated the 19 patients into the same two phenotypic groups with 100% accuracy. CONCLUSIONS Removing sex as a potential confounder, we identified an accurate molecular method for classifying patients with polycythemia vera according to disease behavior, independently of their JAK2 V617F allele burden, and identified previously unrecognized molecular pathways in polycythemia vera outside the canonical JAK2 pathway that may be amenable to targeted therapy. PMID:25162887
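
    For readers unfamiliar with the top-scoring-pair classifier mentioned in this abstract, the general idea is to find the pair of genes whose relative ordering differs most between two classes and to classify new samples by that ordering alone. The sketch below shows the generic technique on hypothetical toy data; the function names and values are illustrative assumptions, not the study's code or data.

```python
from itertools import combinations

def fraction_less(rows, i, j):
    """Fraction of samples in which gene i is expressed below gene j."""
    return sum(row[i] < row[j] for row in rows) / len(rows)

def fit_top_scoring_pair(samples, labels):
    """Return (i, j, class_if_less, class_otherwise) for the best gene pair."""
    class_a, class_b = sorted(set(labels))  # exactly two classes assumed
    rows_a = [s for s, l in zip(samples, labels) if l == class_a]
    rows_b = [s for s, l in zip(samples, labels) if l == class_b]

    def score(pair):
        i, j = pair
        return abs(fraction_less(rows_a, i, j) - fraction_less(rows_b, i, j))

    i, j = max(combinations(range(len(samples[0])), 2), key=score)
    if fraction_less(rows_a, i, j) >= fraction_less(rows_b, i, j):
        return i, j, class_a, class_b
    return i, j, class_b, class_a

def predict(sample, rule):
    """Classify one sample by comparing the two selected genes."""
    i, j, class_if_less, class_otherwise = rule
    return class_if_less if sample[i] < sample[j] else class_otherwise

# Hypothetical expression values: rows are patients, columns are genes.
toy_samples = [[1, 5, 9], [2, 6, 0], [5, 1, 9], [6, 2, 0]]
toy_labels = ["group1", "group1", "group2", "group2"]
rule = fit_top_scoring_pair(toy_samples, toy_labels)
print(rule, predict([3, 8, 1], rule))
```

    Because the rule depends only on which of two genes is higher within a sample, it is insensitive to monotone normalization of the expression values.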

  17. Prospective Molecular Profiling of Canine Cancers Provides a Clinically Relevant Comparative Model for Evaluating Personalized Medicine (PMed) Trials

    PubMed Central

    Mazcko, Christina; Cherba, David; Hendricks, William; Lana, Susan; Ehrhart, E. J.; Charles, Brad; Fehling, Heather; Kumar, Leena; Vail, David; Henson, Michael; Childress, Michael; Kitchell, Barbara; Kingsley, Christopher; Kim, Seungchan; Neff, Mark; Davis, Barbara

    2014-01-01

    Background Molecularly-guided trials (i.e. PMed) now seek to aid clinical decision-making by matching cancer targets with therapeutic options. Progress has been hampered by the lack of cancer models that account for individual-to-individual heterogeneity within and across cancer types. Naturally occurring cancers in pet animals are heterogeneous and thus provide an opportunity to answer questions about these PMed strategies and optimize translation to human patients. In order to realize this opportunity, it is now necessary to demonstrate the feasibility of conducting molecularly-guided analysis of tumors from dogs with naturally occurring cancer in a clinically relevant setting. Methodology A proof-of-concept study was conducted by the Comparative Oncology Trials Consortium (COTC) to determine if tumor collection, prospective molecular profiling, and PMed report generation within 1 week was feasible in dogs. Thirty-one dogs with cancers of varying histologies were enrolled. Twenty-four of 31 samples (77%) successfully met all predefined QA/QC criteria and were analyzed via Affymetrix gene expression profiling. A subsequent bioinformatics workflow transformed genomic data into a personalized drug report. Average turnaround from biopsy to report generation was 116 hours (4.8 days). Unsupervised clustering of canine tumor expression data clustered by cancer type, but supervised clustering of tumors based on the personalized drug report clustered by drug class rather than cancer type. Conclusions Collection and turnaround of high quality canine tumor samples, centralized pathology, analyte generation, array hybridization, and bioinformatic analyses matching gene expression to therapeutic options is achievable in a practical clinical window (<1 week). Clustering data show robust signatures by cancer type but also showed patient-to-patient heterogeneity in drug predictions. 
    This lends further support to the inclusion of a heterogeneous population of dogs with cancer into the preclinical modeling of personalized medicine. Future comparative oncology studies optimizing the delivery of PMed strategies may aid cancer drug development. PMID:24637659

  18. Prospective molecular profiling of canine cancers provides a clinically relevant comparative model for evaluating personalized medicine (PMed) trials.

    PubMed

    Paoloni, Melissa; Webb, Craig; Mazcko, Christina; Cherba, David; Hendricks, William; Lana, Susan; Ehrhart, E J; Charles, Brad; Fehling, Heather; Kumar, Leena; Vail, David; Henson, Michael; Childress, Michael; Kitchell, Barbara; Kingsley, Christopher; Kim, Seungchan; Neff, Mark; Davis, Barbara; Khanna, Chand; Trent, Jeffrey

    2014-01-01

    Molecularly-guided trials (i.e. PMed) now seek to aid clinical decision-making by matching cancer targets with therapeutic options. Progress has been hampered by the lack of cancer models that account for individual-to-individual heterogeneity within and across cancer types. Naturally occurring cancers in pet animals are heterogeneous and thus provide an opportunity to answer questions about these PMed strategies and optimize translation to human patients. In order to realize this opportunity, it is now necessary to demonstrate the feasibility of conducting molecularly-guided analysis of tumors from dogs with naturally occurring cancer in a clinically relevant setting. A proof-of-concept study was conducted by the Comparative Oncology Trials Consortium (COTC) to determine if tumor collection, prospective molecular profiling, and PMed report generation within 1 week was feasible in dogs. Thirty-one dogs with cancers of varying histologies were enrolled. Twenty-four of 31 samples (77%) successfully met all predefined QA/QC criteria and were analyzed via Affymetrix gene expression profiling. A subsequent bioinformatics workflow transformed genomic data into a personalized drug report. Average turnaround from biopsy to report generation was 116 hours (4.8 days). Unsupervised clustering of canine tumor expression data clustered by cancer type, but supervised clustering of tumors based on the personalized drug report clustered by drug class rather than cancer type. Collection and turnaround of high quality canine tumor samples, centralized pathology, analyte generation, array hybridization, and bioinformatic analyses matching gene expression to therapeutic options is achievable in a practical clinical window (<1 week). Clustering data show robust signatures by cancer type but also showed patient-to-patient heterogeneity in drug predictions. 
    This lends further support to the inclusion of a heterogeneous population of dogs with cancer into the preclinical modeling of personalized medicine. Future comparative oncology studies optimizing the delivery of PMed strategies may aid cancer drug development.

  19. Hyperspectral Image Classification using a Self-Organizing Map

    NASA Technical Reports Server (NTRS)

    Martinez, P.; Gualtieri, J. A.; Aguilar, P. L.; Perez, R. M.; Linaje, M.; Preciado, J. C.; Plaza, A.

    2001-01-01

    The use of hyperspectral data to determine the abundance of constituents in a certain portion of the Earth's surface relies on the capability of imaging spectrometers to provide a large amount of information at each pixel of a certain scene. Today, hyperspectral imaging sensors are capable of generating unprecedented volumes of radiometric data. The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), for example, routinely produces image cubes with 224 spectral bands. This undoubtedly opens a wide range of new possibilities, but the analysis of such a massive amount of information is not an easy task. In fact, most of the existing algorithms devoted to analyzing multispectral images are not applicable in the hyperspectral domain, because of the size and high dimensionality of the images. The application of neural networks to perform unsupervised classification of hyperspectral data has been tested by several authors and also by us in some previous work. We have also focused on analyzing the intrinsic capability of neural networks to parallelize the whole hyperspectral unmixing process. The results shown in this work indicate that neural network models are able to find clusters of closely related hyperspectral signatures, and thus can be used as a powerful tool to achieve the desired classification. The present work discusses the possibility of using a Self Organizing neural network to perform unsupervised classification of hyperspectral images. In sections 3 and 4, the topology of the proposed neural network and the training algorithm are respectively described. Section 5 provides the results we have obtained after applying the proposed methodology to real hyperspectral data, described in section 2. Different parameters in the learning stage have been modified in order to obtain a detailed description of their influence on the final results. Finally, in section 6 we provide the conclusions at which we have arrived.

  20. Generation of brain pseudo-CTs using an undersampled, single-acquisition UTE-mDixon pulse sequence and unsupervised clustering

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Su, Kuan-Hao; Hu, Lingzhi; Traughber, Melanie

    Purpose: MR-based pseudo-CT has an important role in MR-based radiation therapy planning and PET attenuation correction. The purpose of this study is to establish a clinically feasible approach, including image acquisition, correction, and CT formation, for pseudo-CT generation of the brain using a single-acquisition, undersampled ultrashort echo time (UTE)-mDixon pulse sequence. Methods: Nine patients were recruited for this study. For each patient, a 190-s, undersampled, single acquisition UTE-mDixon sequence of the brain was acquired (TE = 0.1, 1.5, and 2.8 ms). A novel method of retrospective trajectory correction of the free induction decay (FID) signal was performed based on point-spread functions of three external MR markers. Two-point Dixon images were reconstructed using the first and second echo data (TE = 1.5 and 2.8 ms). R2* images (1/T2*) were then estimated and were used to provide bone information. Three image features, i.e., Dixon-fat, Dixon-water, and R2*, were used for unsupervised clustering. Five tissue clusters, i.e., air, brain, fat, fluid, and bone, were estimated using the fuzzy c-means (FCM) algorithm. A two-step, automatic tissue-assignment approach was proposed and designed according to the prior information of the given feature space. Pseudo-CTs were generated by a voxelwise linear combination of the membership functions of the FCM. A low-dose CT was acquired for each patient and was used as the gold standard for comparison. Results: The contrast and sharpness of the FID images were improved after trajectory correction was applied. The mean of the estimated trajectory delay was 0.774 μs (max: 1.350 μs; min: 0.180 μs). The FCM-estimated centroids of different tissue types showed a distinguishable pattern for different tissues, and significant differences were found between the centroid locations of different tissue types. Pseudo-CT can provide additional skull detail and has low bias and absolute error of estimated CT numbers of voxels (−22 ± 29 HU and 130 ± 16 HU) when compared to low-dose CT. Conclusions: The MR features generated by the proposed acquisition, correction, and processing methods may provide representative clustering information and could thus be used for clinical pseudo-CT generation.
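
    The pseudo-CT step in this abstract, a voxelwise linear combination of the FCM membership functions, can be illustrated with a short sketch. The abstract does not give the combination weights, so the per-tissue CT numbers below are hypothetical placeholders, as are the function name and the example voxel.

```python
def pseudo_ct_value(memberships, tissue_hu):
    """Membership-weighted sum of per-tissue CT numbers for one voxel."""
    total = sum(memberships.values())  # FCM memberships normally sum to 1
    return sum(memberships[t] * tissue_hu[t] for t in memberships) / total

# Hypothetical representative CT numbers (HU) for the five tissue clusters
# named in the abstract; the values actually used by the authors are not given.
TISSUE_HU = {"air": -1000.0, "brain": 40.0, "fat": -100.0, "fluid": 10.0, "bone": 1000.0}

# A voxel that the clustering places halfway between brain and bone.
voxel = {"air": 0.0, "brain": 0.5, "fat": 0.0, "fluid": 0.0, "bone": 0.5}
print(pseudo_ct_value(voxel, TISSUE_HU))  # 520.0
```

    Soft memberships allow partial-volume voxels to receive intermediate CT numbers instead of being forced into a single tissue class.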

  1. Tool Support for Parametric Analysis of Large Software Simulation Systems

    NASA Technical Reports Server (NTRS)

    Schumann, Johann; Gundy-Burlet, Karen; Pasareanu, Corina; Menzies, Tim; Barrett, Tony

    2008-01-01

    The analysis of large and complex parameterized software systems, e.g., systems simulation in aerospace, is very complicated and time-consuming due to the large parameter space, and the complex, highly coupled nonlinear nature of the different system components. Thus, such systems are generally validated only in regions local to anticipated operating points rather than through characterization of the entire feasible operational envelope of the system. We have addressed the factors deterring such an analysis with a tool to support envelope assessment: we utilize a combination of advanced Monte Carlo generation with n-factor combinatorial parameter variations to limit the number of cases, but still explore important interactions in the parameter space in a systematic fashion. Additional test-cases, automatically generated from models (e.g., UML, Simulink, Stateflow) improve the coverage. The distributed test runs of the software system produce vast amounts of data, making manual analysis impossible. Our tool automatically analyzes the generated data through a combination of unsupervised Bayesian clustering techniques (AutoBayes) and supervised learning of critical parameter ranges using the treatment learner TAR3. The tool has been developed around the Trick simulation environment, which is widely used within NASA. We will present this tool with a GN&C (Guidance, Navigation and Control) simulation of a small satellite system.

  2. Extracting galactic structure parameters from multivariated density estimation

    NASA Technical Reports Server (NTRS)

    Chen, B.; Creze, M.; Robin, A.; Bienayme, O.

    1992-01-01

    Multivariate statistical analysis, including cluster analysis (unsupervised classification), discriminant analysis (supervised classification) and principal component analysis (a dimensionality reduction method), and nonparametric density estimation have been successfully used to search for meaningful associations in the 5-dimensional space of observables between observed points and the sets of simulated points generated from a synthetic approach of galaxy modelling. These methodologies can be applied as new tools to obtain information about hidden structure otherwise unrecognizable, and place important constraints on the space distribution of various stellar populations in the Milky Way. In this paper, we concentrate on illustrating how to use nonparametric density estimation to substitute for the true densities in both the simulated sample and the real sample in the five-dimensional space. In order to fit model-predicted densities to reality, we derive a set of equations which include n lines (where n is the total number of observed points) and m unknown parameters (where m is the number of predefined groups). A least-squares estimation will allow us to determine the density law of different groups and components in the Galaxy. The output from our software, which can be used in many research fields, will also give the systematic error between the model and the observation by a Bayes rule.

  3. Biases of STRUCTURE software when exploring introduction routes of invasive species.

    PubMed

    Lombaert, Eric; Guillemaud, Thomas; Deleury, Emeline

    2018-06-01

    Population genetic methods are widely used to retrace the introduction routes of invasive species. The unsupervised Bayesian clustering algorithm implemented in STRUCTURE is amongst the most frequently used of these methods, but its ability to provide reliable information about introduction routes has never been assessed. We simulated microsatellite datasets to evaluate the extent to which the results provided by STRUCTURE were misleading for the inference of introduction routes. We focused on an invasion scenario involving one native and two independently introduced populations, because it is the sole scenario that can be rejected when obtaining a particular clustering with a STRUCTURE analysis at K = 2 (two clusters). Results were classified as "misleading" or "non-misleading". We investigated the influence of effective size, bottleneck severity and number of loci on the type and frequency of misleading results. We showed that misleading STRUCTURE results were obtained for 10% of all simulated datasets. Our results highlighted two categories of misleading output. The first occurs when the native population has a low level of diversity. In this case, the two introduced populations may be very similar, despite their independent introduction histories. The second category results from convergence issues in STRUCTURE for K = 2, with strong bottleneck severity and/or large numbers of loci resulting in high levels of differentiation between the three populations. Overall, the risk of being misled by STRUCTURE in the context of introduction routes inferences is moderate, but it is important to remain cautious when low genetic diversity or genuine multimodality between runs are involved.

  4. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features.

    PubMed

    Nikfarjam, Azadeh; Sarker, Abeed; O'Connor, Karen; Ginn, Rachel; Gonzalez, Graciela

    2015-05-01

    Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  5. Identification of Chiari Type I Malformation subtypes using whole genome expression profiles and cranial base morphometrics

    PubMed Central

    2014-01-01

    Background Chiari Type I Malformation (CMI) is characterized by herniation of the cerebellar tonsils through the foramen magnum at the base of the skull, resulting in significant neurologic morbidity. As CMI patients display a high degree of clinical variability and multiple mechanisms have been proposed for tonsillar herniation, it is hypothesized that this heterogeneous disorder is due to multiple genetic and environmental factors. The purpose of the present study was to gain a better understanding of what factors contribute to this heterogeneity by using an unsupervised statistical approach to define disease subtypes within a case-only pediatric population. Methods A collection of forty-four pediatric CMI patients were ascertained to identify disease subtypes using whole genome expression profiles generated from patient blood and dura mater tissue samples, and radiological data consisting of posterior fossa (PF) morphometrics. Sparse k-means clustering and an extension to accommodate multiple data sources were used to cluster patients into more homogeneous groups using biological and radiological data both individually and collectively. Results All clustering analyses resulted in the significant identification of patient classes, with the pure biological classes derived from patient blood and dura mater samples demonstrating the strongest evidence. Those patient classes were further characterized by identifying enriched biological pathways, as well as correlated cranial base morphological and clinical traits. Conclusions Our results implicate several strong biological candidates warranting further investigation from the dura expression analysis and also identified a blood gene expression profile corresponding to a global down-regulation in protein synthesis. PMID:24962150

  6. A comparative hidden Markov model analysis pipeline identifies proteins characteristic of cereal-infecting fungi

    PubMed Central

    2013-01-01

    Background Fungal pathogens cause devastating losses in economically important cereal crops by utilising pathogen proteins to infect host plants. Secreted pathogen proteins are referred to as effectors and have thus far been identified by selecting small, cysteine-rich peptides from the secretome despite increasing evidence that not all effectors share these attributes. Results We take advantage of the availability of sequenced fungal genomes and present an unbiased method for finding putative pathogen proteins and secreted effectors in a query genome via comparative hidden Markov model analyses followed by unsupervised protein clustering. Our method returns experimentally validated fungal effectors in Stagonospora nodorum and Fusarium oxysporum as well as the N-terminal Y/F/WxC-motif from the barley powdery mildew pathogen. Application to the cereal pathogen Fusarium graminearum reveals a secreted phosphorylcholine phosphatase that is characteristic of hemibiotrophic and necrotrophic cereal pathogens and shares an ancient selection process with bacterial plant pathogens. Three F. graminearum protein clusters are found with an enriched secretion signal. One of these putative effector clusters contains proteins that share a [SG]-P-C-[KR]-P sequence motif in the N-terminal and show features not commonly associated with fungal effectors. This motif is conserved in secreted pathogenic Fusarium proteins and a prime candidate for functional testing. Conclusions Our pipeline has successfully uncovered conservation patterns, putative effectors and motifs of fungal pathogens that would have been overlooked by existing approaches that identify effectors as small, secreted, cysteine-rich peptides. It can be applied to any pathogenic proteome data, such as microbial pathogen data of plants and other organisms. PMID:24252298

  7. Spatio-Temporal Metabolite Profiling of the Barley Germination Process by MALDI MS Imaging

    PubMed Central

    Gorzolka, Karin; Kölling, Jan; Nattkemper, Tim W.; Niehaus, Karsten

    2016-01-01

    MALDI mass spectrometry imaging was performed to localize metabolites during the first seven days of the barley germination. Up to 100 mass signals were detected of which 85 signals were identified as 48 different metabolites with highly tissue-specific localizations. Oligosaccharides were observed in the endosperm and in parts of the developed embryo. Lipids in the endosperm co-localized in dependency on their fatty acid compositions with changes in the distributions of diacyl phosphatidylcholines during germination. 26 potentially antifungal hordatines were detected in the embryo with tissue-specific localizations of their glycosylated, hydroxylated, and O-methylated derivates. In order to reveal spatio-temporal patterns in local metabolite compositions, multiple MSI data sets from a time series were analyzed in one batch. This requires a new preprocessing strategy to achieve comparability between data sets as well as a new strategy for unsupervised clustering. The resulting spatial segmentation for each time point sample is visualized in an interactive cluster map and enables simultaneous interactive exploration of all time points. Using this new analysis approach and visualization tool germination-dependent developments of metabolite patterns with single MS position accuracy were discovered. This is the first study that presents metabolite profiling of a cereals’ germination process over time by MALDI MSI with the identification of a large number of peaks of agronomically and industrially important compounds such as oligosaccharides, lipids and antifungal agents. Their detailed localization as well as the MS cluster analyses for on-tissue metabolite profile mapping revealed important information for the understanding of the germination process, which is of high scientific interest. PMID:26938880

  8. A methodological study of genome-wide DNA methylation analyses using matched archival formalin-fixed paraffin embedded and fresh frozen breast tumors.

    PubMed

    Espinal, Allyson C; Wang, Dan; Yan, Li; Liu, Song; Tang, Li; Hu, Qiang; Morrison, Carl D; Ambrosone, Christine B; Higgins, Michael J; Sucheston-Campbell, Lara E

    2017-02-28

    DNA from archival formalin-fixed and paraffin embedded (FFPE) tissue is an invaluable resource for genome-wide methylation studies, although concerns about poor quality may limit its use. In this study, we compared DNA methylation profiles of breast tumors using DNA from fresh-frozen (FF) tissues and three types of matched FFPE samples. For 9/10 patients, correlation and unsupervised clustering analysis revealed that the FF and FFPE samples were consistently correlated with each other and clustered into distinct subgroups. Greater than 84% of the top 100 loci previously shown to differentiate ER+ and ER- tumors in FF tissues were also FFPE DML. Weighted Correlation Gene Network Analyses (WCGNA) grouped the DML loci into 16 modules in FF tissue, with ~85% of the module membership preserved across tissue types. Restored FFPE and matched FF samples were profiled using the Illumina Infinium HumanMethylation450K platform. Methylation levels (β-values) across all loci, and across the top 100 loci previously shown to differentiate tumors by estrogen receptor status (ER+ or ER-) in a larger FF study, were compared between matched FF and FFPE samples using Pearson's correlation, hierarchical clustering and WCGNA. Positive predictive values and sensitivity levels for detecting differentially methylated loci (DML) in FF samples were calculated in an independent FFPE cohort. FFPE breast tumor samples show lower overall detection of DMLs versus FF; however, FFPE and FF DMLs compare favorably. These results support the emerging consensus that the 450K platform can be employed to investigate epigenetics in large sets of archival FFPE tissues.
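
    The positive predictive value and sensitivity mentioned in this abstract can be illustrated with a minimal sketch. It assumes that the loci called differentially methylated in FF tissue serve as the reference set and that the FFPE calls are scored against it. The function name and the probe identifiers are hypothetical and are not taken from the study.

```python
def ppv_and_sensitivity(reference_dml, test_dml):
    """Score DML calls from a test set (e.g. FFPE) against a reference set (e.g. FF)."""
    reference, test = set(reference_dml), set(test_dml)
    true_pos = len(reference & test)    # called in both sets
    false_pos = len(test - reference)   # called only in the test set
    false_neg = len(reference - test)   # missed by the test set
    ppv = true_pos / (true_pos + false_pos) if test else float("nan")
    sensitivity = true_pos / (true_pos + false_neg) if reference else float("nan")
    return ppv, sensitivity


# Hypothetical probe identifiers, for illustration only.
ff_calls = {"probe_a", "probe_b", "probe_c", "probe_d"}
ffpe_calls = {"probe_a", "probe_b", "probe_e"}
ppv, sensitivity = ppv_and_sensitivity(ff_calls, ffpe_calls)
print(f"PPV = {ppv:.2f}, sensitivity = {sensitivity:.2f}")
```

    In this toy case two of the three FFPE calls are confirmed by FF (PPV of about 0.67) and two of the four FF loci are recovered (sensitivity of 0.50).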

  9. Tensor decomposition-based and principal-component-analysis-based unsupervised feature extraction applied to the gene expression and methylation profiles in the brains of social insects with multiple castes.

    PubMed

    Taguchi, Y-H

    2018-05-08

    Even though coexistence of multiple phenotypes sharing the same genomic background is interesting, it remains incompletely understood. Epigenomic profiles may represent key factors, with unknown contributions to the development of multiple phenotypes, and social-insect castes are a good model for elucidation of the underlying mechanisms. Nonetheless, previous studies have failed to identify genes associated with aberrant gene expression and methylation profiles because of the lack of suitable methodology that can address this problem properly. A recently proposed principal component analysis (PCA)-based and tensor decomposition (TD)-based unsupervised feature extraction (FE) can solve this problem because these two approaches can deal with gene expression and methylation profiles even when a small number of samples is available. PCA-based and TD-based unsupervised FE methods were applied to the analysis of gene expression and methylation profiles in the brains of two social insects, Polistes canadensis and Dinoponera quadriceps. Genes associated with differential expression and methylation between castes were identified, and analysis of enrichment of Gene Ontology terms confirmed reliability of the obtained sets of genes from the biological standpoint. Biologically relevant genes, shown to be associated with significant differential gene expression and methylation between castes, were identified here for the first time. The identification of these genes may help understand the mechanisms underlying epigenetic control of development of multiple phenotypes under the same genomic conditions.

  10. An unsupervised method for quantifying the behavior of paired animals

    NASA Astrophysics Data System (ADS)

    Klibaite, Ugne; Berman, Gordon J.; Cande, Jessica; Stern, David L.; Shaevitz, Joshua W.

    2017-02-01

    Behaviors involving the interaction of multiple individuals are complex and frequently crucial for an animal’s survival. These interactions, ranging across sensory modalities, length scales, and time scales, are often subtle and difficult to characterize. Contextual effects on the frequency of behaviors become even more difficult to quantify when physical interaction between animals interferes with conventional data analysis, e.g. due to visual occlusion. We introduce a method for quantifying behavior in fruit fly interaction that combines high-throughput video acquisition and tracking of individuals with recent unsupervised methods for capturing an animal’s entire behavioral repertoire. We find behavioral differences between solitary flies and those paired with an individual of the opposite sex, identifying specific behaviors that are affected by social and spatial context. Our pipeline allows for a comprehensive description of the interaction between two individuals using unsupervised machine learning methods, and will be used to answer questions about the depth of complexity and variance in fruit fly courtship.

  11. Supervised and Unsupervised Aspect Category Detection for Sentiment Analysis with Co-occurrence Data.

    PubMed

    Schouten, Kim; van der Weijde, Onne; Frasincar, Flavius; Dekker, Rommert

    2018-04-01

    Using online consumer reviews as electronic word of mouth to assist purchase-decision making has become increasingly popular. The Web provides an extensive source of consumer reviews, but one can hardly read all reviews to obtain a fair evaluation of a product or service. A text processing framework that can summarize reviews would therefore be desirable. A subtask to be performed by such a framework would be to find the general aspect categories addressed in review sentences, for which this paper presents two methods. In contrast to most existing approaches, the first method presented is an unsupervised method that applies association rule mining on co-occurrence frequency data obtained from a corpus to find these aspect categories. While not on par with state-of-the-art supervised methods, the proposed unsupervised method performs better than several simple baselines, a similar but supervised method, and a supervised baseline, with an F1-score of 67%. The second method is a supervised variant that outperforms existing methods with an F1-score of 84%.

  12. Mapping the Indonesian territory, based on pollution, social demography and geographical data, using self organizing feature map

    NASA Astrophysics Data System (ADS)

    Hernawati, Kuswari; Insani, Nur; Bambang S. H., M.; Nur Hadi, W.; Sahid

    2017-08-01

    This research aims to map the 33 (thirty-three) provinces of Indonesia into a clustered model, based on data on air, water and soil pollution as well as social demography and geography. The method used in this study was an unsupervised method based on the Kohonen Self-Organizing Feature Map (SOFM). The design parameters for the model are drawn from data related directly or indirectly to pollution: demographic and social data, pollution levels of air, water and soil, and the geographical situation of each province. The parameters consist of 19 features/characteristics, including the human development index, the number of vehicles, the availability of plants for water absorption and flood prevention, as well as the geographic and demographic situation. The data used were secondary data from the Central Statistics Agency (BPS), Indonesia. The SOFM maps the data from a high-dimensional vector space into a two-dimensional vector space according to closeness in terms of Euclidean distance. The resulting outputs are represented as clustered groupings. The thirty-three provinces are grouped into five clusters, each with different features/characteristics and a different level of pollution. The result can be used to support effective and efficient prevention and resolution of pollution problems in each cluster.
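
    The mapping from a high-dimensional feature space onto a two-dimensional grid by Euclidean closeness can be sketched as a small self-organizing map. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3x3 grid, the decay schedules, the function names `train_som` and `best_matching_unit`, and the toy data are all hypothetical, and features are assumed to be scaled to the range 0 to 1.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x (Euclidean)."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows=3, cols=3, epochs=40, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    samples = list(data)
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        rng.shuffle(samples)
        for x in samples:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac) + 0.01 * frac       # learning rate decays to 0.01
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac  # neighborhood radius decays to 0.5
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


# Hypothetical toy data: two well-separated groups of 4-feature records.
rng = random.Random(1)
group_low = [[rng.uniform(0.0, 0.1) for _ in range(4)] for _ in range(10)]
group_high = [[rng.uniform(0.9, 1.0) for _ in range(4)] for _ in range(10)]
som = train_som(group_low + group_high)
print(best_matching_unit(som, group_low[0]), best_matching_unit(som, group_high[0]))
```

    Records that land on the same or neighboring grid nodes are then read as belonging to the same cluster. How the grid nodes are merged into a fixed number of clusters is a separate design choice that this sketch does not cover.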

  13. Unsupervised frequency-recognition method of SSVEPs using a filter bank implementation of binary subband CCA.

    PubMed

    Rabiul Islam, Md; Khademul Islam Molla, Md; Nakanishi, Masaki; Tanaka, Toshihisa

    2017-04-01

    Recently developed effective methods for detecting commands in steady-state visual evoked potential (SSVEP)-based brain-computer interfaces (BCIs) need calibration for the visual stimuli, which costs more time and causes more fatigue prior to use as the number of commands increases. This paper develops a novel unsupervised method based on canonical correlation analysis (CCA) for accurate detection of stimulus frequency. A novel unsupervised technique termed binary subband CCA (BsCCA) is implemented in a multiband approach to enhance the frequency recognition performance of SSVEP. In BsCCA, two subbands are used and a CCA-based correlation coefficient is computed for the individual subbands. In addition, a reduced set of artificial reference signals is used to calculate CCA for the second subband. The SSVEP under analysis is decomposed into multiple subbands and BsCCA is implemented for each one. Then, the overall recognition score is determined by a weighted sum of the canonical correlation coefficients obtained from each band. A 12-class SSVEP dataset (frequency range: 9.25-14.75 Hz with an interval of 0.5 Hz) from ten healthy subjects is used to evaluate the performance of the proposed method. The results suggest that BsCCA significantly improves the performance of SSVEP-based BCI compared to the state-of-the-art methods. The proposed method is an unsupervised approach with an average information transfer rate (ITR) of 77.04 bits min⁻¹ across 10 subjects. The maximum individual ITR is 107.55 bits min⁻¹ for the 12-class SSVEP dataset, whereas ITRs of 69.29 and 69.44 bits min⁻¹ are achieved with CCA and NCCA, respectively. The statistical test shows that the proposed unsupervised method significantly improves the performance of the SSVEP-based BCI. It can be used in real-world applications.
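
    The abstract reports information transfer rates in bits per minute without defining the measure. A minimal sketch follows, assuming the widely used Wolpaw definition, in which the bits per selection are log2(N) + P*log2(P) + (1 - P)*log2((1 - P)/(N - 1)) for N classes and accuracy P, scaled by the number of selections per minute. The function name and the 2-second selection time are illustrative assumptions. The abstract does not state the selection time used in the study.

```python
import math


def itr_bits_per_min(n_classes, accuracy, seconds_per_selection):
    """Wolpaw information transfer rate in bits per minute."""
    if n_classes < 2 or seconds_per_selection <= 0:
        raise ValueError("need at least 2 classes and a positive selection time")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_classes)
    if 0.0 < accuracy < 1.0:
        bits += accuracy * math.log2(accuracy)
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    elif accuracy == 0.0:
        # Limit of the formula as accuracy approaches 0.
        bits += math.log2(1.0 / (n_classes - 1))
    return bits * 60.0 / seconds_per_selection


# 12 classes, perfect accuracy, hypothetical 2 s per selection.
print(round(itr_bits_per_min(12, 1.0, 2.0), 2))
```

    With 12 classes, the rate is zero at chance-level accuracy (1/12) and reaches log2(12) bits per selection at perfect accuracy, which at 2 s per selection is roughly 107.5 bits per minute.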

  14. Towards an unsupervised device for the diagnosis of childhood pneumonia in low resource settings: automatic segmentation of respiratory sounds.

    PubMed

    Sola, J; Braun, F; Muntane, E; Verjus, C; Bertschi, M; Hugon, F; Manzano, S; Benissa, M; Gervaix, A

    2016-08-01

    Pneumonia remains the leading cause of mortality worldwide among children under the age of five, with 1.4 million deaths every year. Unfortunately, in low-resource settings, very limited diagnostic support aids are provided to point-of-care practitioners. The current UNICEF/WHO case management algorithm relies on the use of a chronometer to manually count breath rates in pediatric patients; there is thus a major need for more sophisticated tools to diagnose pneumonia that increase the sensitivity and specificity of breath-rate-based algorithms. These tools should be low cost and adapted to practitioners with limited training. In this work, a novel concept of an unsupervised tool for the diagnosis of childhood pneumonia is presented. The concept relies on the automated analysis of respiratory sounds as recorded by a point-of-care electronic stethoscope. By identifying the presence of auscultation sounds at different chest locations, this diagnostic tool is intended to estimate a pneumonia likelihood score. After presenting the overall architecture of an algorithm to estimate pneumonia scores, the importance of a robust unsupervised method to identify inspiratory and expiratory phases of a respiratory cycle is highlighted. Based on data from an ongoing study involving pediatric pneumonia patients, a first algorithm to segment respiratory sounds is suggested. The unsupervised algorithm relies on a Mel-frequency filter bank, a two-step Gaussian Mixture Model (GMM) description of data, and a final Hidden Markov Model (HMM) interpretation of inspiratory-expiratory sequences. Finally, illustrative results on the first recruited patients are provided. The presented algorithm opens the door to a new family of unsupervised respiratory sound analyzers that could improve future versions of case management algorithms for the diagnosis of pneumonia in low-resource settings.
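
    The last stage of the described pipeline, an HMM interpretation of inspiratory-expiratory sequences, can be sketched with Viterbi decoding over per-frame likelihoods. The abstract gives no model parameters, so everything numeric here is made up for illustration: the two-state layout, the start and transition probabilities, and the frame likelihoods, which in the described pipeline would presumably come from the GMM stage.

```python
import math


def viterbi(frame_likelihoods, states, start_prob, trans_prob):
    """Most likely state sequence given per-frame state likelihoods (all > 0)."""
    best = {s: math.log(start_prob[s]) + math.log(frame_likelihoods[0][s])
            for s in states}
    backpointers = []
    for obs in frame_likelihoods[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this frame.
            prev = max(states, key=lambda p: best[p] + math.log(trans_prob[p][s]))
            pointers[s] = prev
            scores[s] = best[prev] + math.log(trans_prob[prev][s]) + math.log(obs[s])
        backpointers.append(pointers)
        best = scores
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


states = ["insp", "exp"]
start = {"insp": 0.5, "exp": 0.5}
# Hypothetical "sticky" transitions: a respiratory phase tends to persist.
trans = {"insp": {"insp": 0.9, "exp": 0.1}, "exp": {"insp": 0.1, "exp": 0.9}}
# Hypothetical frame likelihoods; the third frame is a noisy outlier.
frames = [
    {"insp": 0.9, "exp": 0.1},
    {"insp": 0.8, "exp": 0.2},
    {"insp": 0.4, "exp": 0.6},
    {"insp": 0.9, "exp": 0.1},
    {"insp": 0.2, "exp": 0.8},
    {"insp": 0.1, "exp": 0.9},
]
print(viterbi(frames, states, start, trans))
```

    Because switching phases is penalized by the transition probabilities, the isolated noisy third frame is absorbed into the inspiratory segment, while the sustained change in the last two frames is decoded as a switch to expiration.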

  15. Unsupervised laparoscopic appendicectomy by surgical trainees is safe and time-effective.

    PubMed

    Wong, Kenneth; Duncan, Tristram; Pearson, Andrew

    2007-07-01

    Open appendicectomy is the traditional standard treatment for appendicitis. Laparoscopic appendicectomy is perceived as a procedure with greater potential for complications and longer operative times. This paper examines the hypothesis that unsupervised laparoscopic appendicectomy by surgical trainees is a safe and time-effective valid alternative. Medical records, operating theatre records and histopathology reports of all patients undergoing laparoscopic and open appendicectomy over a 15-month period in two hospitals within an area health service were retrospectively reviewed. Data were analysed to compare patient features, pathology findings, operative times, complications, readmissions and mortality between laparoscopic and open groups and between unsupervised surgical trainee operators versus consultant surgeon operators. A total of 143 laparoscopic and 222 open appendicectomies were reviewed. Unsupervised trainees performed 64% of the laparoscopic appendicectomies and 55% of the open appendicectomies. There were no significant differences in complication rates, readmissions, mortality and length of stay between laparoscopic and open appendicectomy groups or between trainee and consultant surgeon operators. Conversion rates (laparoscopic to open approach) were similar for trainees and consultants. Unsupervised senior surgical trainees did not take significantly longer to perform laparoscopic appendicectomy when compared to unsupervised trainee-performed open appendicectomy. Unsupervised laparoscopic appendicectomy by surgical trainees is safe and time-effective.

  16. Potential impacts of robust surface roughness indexes on DTM-based segmentation

    NASA Astrophysics Data System (ADS)

    Trevisani, Sebastiano; Rocca, Michele

    2017-04-01

    In this study, we explore the impact of robust surface texture indexes based on MAD (median absolute differences), implemented by Trevisani and Rocca (2015), on the unsupervised morphological segmentation of an alpine basin. The area was already the object of a geomorphometric analysis consisting of the roughness-based segmentation of the landscape (Trevisani et al. 2012); the roughness indexes were calculated, using the variogram as estimator, on a high-resolution DTM derived from airborne Lidar. The calculated roughness indexes were then used for the fuzzy clustering (Odeh et al., 1992; Burrough et al., 2000) of the basin, revealing the highly informative geomorphometric content of the roughness-based indexes. However, the fuzzy clustering revealed a high fuzziness and a high degree of mixing between textural classes; this was ascribed both to the morphological complexity of the basin and to the high sensitivity of the variogram to non-stationarity and signal noise. Accordingly, we explore how the newly implemented roughness indexes based on MAD affect the morphological segmentation of the studied basin. References Burrough, P.A., Van Gaans, P.F.M., MacMillan, R.A., 2000. High-resolution landform classification using fuzzy k-means. Fuzzy Sets and Systems 113, 37-52. Odeh, I.O.A., McBratney, A.B., Chittleborough, D.J., 1992. Soil pattern recognition with fuzzy-c-means: application to classification and soil-landform interrelationships. Soil Science Society of America Journal 56, 505-516. Trevisani, S., Cavalli, M. & Marchi, L. 2012, "Surface texture analysis of a high-resolution DTM: Interpreting an alpine basin", Geomorphology, vol. 161-162, pp. 26-39. Trevisani, S. & Rocca, M. 2015, "MAD: Robust image texture analysis for applications in high resolution geomorphometry", Computers and Geosciences, vol. 81, pp. 78-92.
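
    The "fuzziness" and class mixing discussed in this abstract refer to the graded memberships that fuzzy clustering assigns to each cell. A minimal sketch of the standard fuzzy c-means membership formula follows, assuming Euclidean distance and a fuzzifier of m = 2. The function name and the one-dimensional toy roughness values are hypothetical, and the sketch does not reproduce the authors' clustering setup.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy c-means memberships of one feature vector; the values sum to 1."""
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    on_center = [i for i, d in enumerate(dists) if d == 0.0]
    if on_center:
        return [1.0 / len(on_center) if i in on_center else 0.0
                for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    inverse = [d ** -power for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


# Hypothetical 1-D roughness values with a "smooth" and a "rough" class center.
centers = [[0.0], [1.0]]
print(fuzzy_memberships([0.25], centers))  # mostly the first class
print(fuzzy_memberships([0.5], centers))   # evenly split between classes
```

    A cell whose largest membership is close to 1 is assigned cleanly, whereas near-equal memberships, as for the midpoint above, indicate the kind of mixing between textural classes that the abstract describes.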

  17. Meta-analysis of Clear Cell Renal Cell Carcinoma Gene Expression Defines a Variant Subgroup and Identifies Gender Influences on Tumor Biology

    PubMed Central

    Brannon, A. Rose; Haake, Scott M.; Hacker, Kathryn E.; Pruthi, Raj S.; Wallen, Eric M.; Nielsen, Matthew E.; Rathmell, W. Kimryn

    2011-01-01

    Background: Clear cell renal cell carcinoma (ccRCC) displays molecular and histologic heterogeneity. Previously described subsets of this disease, ccA and ccB, were defined based on multigene expression profiles, but it is unclear whether these subgroupings reflect the full spectrum of disease or how these molecular subtypes relate to histologic descriptions or gender. Objective: Determine whether additional subtypes of ccRCC exist and whether these subtypes are related to von Hippel-Lindau (VHL) inactivation, hypoxia-inducible factor (HIF) 1 and 2 expression, tumor histology, or gender. Design, setting, and participants: Six large, publicly available ccRCC gene expression databases were identified that cumulatively provided data for 480 tumors for meta-analysis via meta-array compilation. Measurements: Unsupervised consensus clustering was performed on the meta-arrays. Tumors were examined for the relationship of multigene-defined consensus subtypes and expression signatures of VHL mutation and HIF status, tumor histology, and gender. Results and limitations: Two dominant subsets of ccRCC were observed. However, a minor third cluster was revealed that correlated strongly with a wild type (WT) VHL expression profile and indications of variant histologies. When variant histologies were removed, ccA tumors naturally divided by gender. This technique is limited by the potential for persistent batch effect, tumor sampling bias, and restrictions of annotated information. Conclusions: The ccA and ccB subsets of ccRCC are robust in meta-analysis among histologically conventional ccRCC tumors. A third group of tumors was identified that may represent a new variant of ccRCC. Within definitively clear cell tumors, gender may delineate tumors in such a way that it could have implications regarding current treatments and future drug development. PMID:22030119

  18. Glycome Diagnosis of Human Induced Pluripotent Stem Cells Using Lectin Microarray*

    PubMed Central

    Tateno, Hiroaki; Toyota, Masashi; Saito, Shigeru; Onuma, Yasuko; Ito, Yuzuru; Hiemori, Keiko; Fukumura, Mihoko; Matsushima, Asako; Nakanishi, Mio; Ohnuma, Kiyoshi; Akutsu, Hidenori; Umezawa, Akihiro; Horimoto, Katsuhisa; Hirabayashi, Jun; Asashima, Makoto

    2011-01-01

    Induced pluripotent stem cells (iPSCs) can now be produced from various somatic cell (SC) lines by ectopic expression of the four transcription factors. Although the procedure has been demonstrated to induce global change in gene and microRNA expressions and even epigenetic modification, it remains largely unknown how this transcription factor-induced reprogramming affects the total glycan repertoire expressed on the cells. Here we performed a comprehensive glycan analysis using 114 types of human iPSCs generated from five different SCs and compared their glycomes with those of human embryonic stem cells (ESCs; nine cell types) using a high density lectin microarray. In unsupervised cluster analysis of the results obtained by lectin microarray, both undifferentiated iPSCs and ESCs were clustered as one large group. However, they were clearly separated from the group of differentiated SCs, whereas all of the four SCs had apparently distinct glycome profiles from one another, demonstrating that SCs with originally distinct glycan profiles have acquired those similar to ESCs upon induction of pluripotency. Thirty-eight lectins discriminating between SCs and iPSCs/ESCs were statistically selected, and characteristic features of the pluripotent state were then obtained at the level of the cellular glycome. The expression profiles of relevant glycosyltransferase genes agreed well with the results obtained by lectin microarray. Among the 38 lectins, rBC2LCN was found to detect only undifferentiated iPSCs/ESCs and not differentiated SCs. Hence, the high density lectin microarray has proved to be valid for not only comprehensive analysis of glycans but also diagnosis of stem cells under the concept of the cellular glycome. PMID:21471226

  19. Methylation profiling of choroid plexus tumors reveals 3 clinically distinct subgroups.

    PubMed

    Thomas, Christian; Sill, Martin; Ruland, Vincent; Witten, Anika; Hartung, Stefan; Kordes, Uwe; Jeibmann, Astrid; Beschorner, Rudi; Keyvani, Kathy; Bergmann, Markus; Mittelbronn, Michel; Pietsch, Torsten; Felsberg, Jörg; Monoranu, Camelia M; Varlet, Pascale; Hauser, Peter; Olar, Adriana; Grundy, Richard G; Wolff, Johannes E; Korshunov, Andrey; Jones, David T; Bewerunge-Hudler, Melanie; Hovestadt, Volker; von Deimling, Andreas; Pfister, Stefan M; Paulus, Werner; Capper, David; Hasselblatt, Martin

    2016-06-01

    Choroid plexus tumors are intraventricular neoplasms derived from the choroid plexus epithelium. A better knowledge of molecular factors involved in choroid plexus tumor biology may aid in identifying patients at risk for recurrence. Methylation profiles were examined in 29 choroid plexus papillomas (CPPs, WHO grade I), 32 atypical choroid plexus papillomas (aCPPs, WHO grade II), and 31 choroid plexus carcinomas (CPCs, WHO grade III) by Illumina Infinium HumanMethylation450 Bead Chip Array. Unsupervised hierarchical clustering identified 3 subgroups: methylation cluster 1 (pediatric CPP and aCPP of mainly supratentorial location), methylation cluster 2 (adult CPP and aCPP of mainly infratentorial location), and methylation cluster 3 (pediatric CPP, aCPP, and CPC of supratentorial location). In methylation cluster 3, progression-free survival (PFS) accounted for a mean of 72 months (CI, 55-89 mo), whereas only 1 of 42 tumors of methylation clusters 1 and 2 progressed (P< .001). On stratification of outcome data according to WHO grade, all CPCs clustered within cluster 3 and were associated with shorter overall survival (mean, 105 mo [CI, 81-128 mo]) and PFS (mean, 55 mo [CI, 36-73 mo]). The aCPP of methylation cluster 3 also progressed frequently (mean, 69 mo [CI, 44-93 mo]), whereas no tumor progression was observed in aCPP of methylation clusters 1 and 2 (P< .05). Only 1 of 29 CPPs recurred. Methylation profiling of choroid plexus tumors reveals 3 distinct subgroups (ie, pediatric low-risk choroid plexus tumors [cluster 1], adult low-risk choroid plexus tumors [cluster 2], and pediatric high-risk choroid plexus tumors [cluster 3]) and may provide useful prognostic information in addition to histopathology. Published by Oxford University Press on behalf of the Society for Neuro-Oncology 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  20. Discriminative clustering on manifold for adaptive transductive classification.

    PubMed

    Zhang, Zhao; Jia, Lei; Zhang, Min; Li, Bing; Zhang, Li; Li, Fanzhang

    2017-10-01

    In this paper, we mainly propose a novel adaptive transductive label propagation approach by joint discriminative clustering on manifolds for representing and classifying high-dimensional data. Our framework seamlessly combines the unsupervised manifold learning, discriminative clustering and adaptive classification into a unified model. Also, our method incorporates the adaptive graph weight construction with label propagation. Specifically, our method is capable of propagating label information using adaptive weights over low-dimensional manifold features, which is different from most existing studies that usually predict the labels and construct the weights in the original Euclidean space. For transductive classification by our formulation, we first perform the joint discriminative K-means clustering and manifold learning to capture the low-dimensional nonlinear manifolds. Then, we construct the adaptive weights over the learnt manifold features, where the adaptive weights are calculated through performing the joint minimization of the reconstruction errors over features and soft labels so that the graph weights can be joint-optimal for data representation and classification. Using the adaptive weights, we can easily estimate the unknown labels of samples. After that, our method returns the updated weights for further updating the manifold features. Extensive simulations on image classification and segmentation show that our proposed algorithm can deliver the state-of-the-art performance on several public datasets. Copyright © 2017 Elsevier Ltd. All rights reserved.
