Sample records for k-means cluster analyses

  1. Subspace K-means clustering.

    PubMed

    Timmerman, Marieke E; Ceulemans, Eva; De Roover, Kim; Van Leeuwen, Karla

    2013-12-01

    To achieve an insightful clustering of multivariate data, we propose subspace K-means. Its central idea is to model the centroids and cluster residuals in reduced spaces, which allows for dealing with a wide range of cluster types and yields rich interpretations of the clusters. We review the existing related clustering methods, including deterministic, stochastic, and unsupervised learning approaches. To evaluate subspace K-means, we performed a comparative simulation study, in which we manipulated the overlap of subspaces, the between-cluster variance, and the error variance. The study shows that the subspace K-means algorithm is sensitive to local minima but that the problem can be reasonably dealt with by using partitions of various cluster procedures as a starting point for the algorithm. Subspace K-means performs very well in recovering the true clustering across all conditions considered and appears to be superior to its competitor methods: K-means, reduced K-means, factorial K-means, mixtures of factor analyzers (MFA), and MCLUST. The best competitor method, MFA, showed a performance similar to that of subspace K-means in easy conditions but deteriorated in more difficult ones. Using data from a study on parental behavior, we show that subspace K-means analysis provides a rich insight into the cluster characteristics, in terms of both the relative positions of the clusters (via the centroids) and the shape of the clusters (via the within-cluster residuals).

  2. Canonical PSO Based K-Means Clustering Approach for Real Datasets.

    PubMed

    Dey, Lopamudra; Chakraborty, Sanjay

    2014-01-01

    "Clustering" the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms.

  3. Integration K-Means Clustering Method and Elbow Method For Identification of The Best Customer Profile Cluster

    NASA Astrophysics Data System (ADS)

    Syakur, M. A.; Khotimah, B. K.; Rochman, E. M. S.; Satoto, B. D.

    2018-04-01

    Clustering is a data mining technique used to analyse large volumes of varied data. Clustering is the process of grouping data into clusters so that each cluster contains data that are as similar as possible to one another and different from the objects in other clusters. SMEs in Indonesia have a variety of customers, but they have no mapping of these customers, so they do not know which customers are loyal and which are not. Customer mapping is a grouping of customer profiles that facilitates analysis and policy-making for SMEs in the production of goods, especially batik sales. The researchers combine the K-means method with the elbow method to make K-means more efficient and effective when processing large amounts of data. K-means clustering is a local optimization method that is sensitive to the choice of the initial cluster centres, so a poor choice of initial centres leads to high errors and poor cluster results. The K-means algorithm also has difficulty determining the best number of clusters, so the elbow method is used to find the best number of clusters for K-means. The results show that the elbow method can produce the same number of clusters K for different amounts of data. The number of clusters selected by the elbow method then serves as the default for the characterization process in the case study. The best number of clusters was determined from SSE values on 500 clusters of batik visitors. The result shows that the sharpest decrease in SSE occurs at K = 3, so K = 3 is taken as the cut-off point and the best number of clusters.
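
    The elbow procedure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the toy data, the deterministic farthest-point initialisation, and the "largest second difference" rule for locating the elbow are all assumptions made for the example.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption, chosen for reproducibility)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_sse(points, k, max_iter=100):
    """Run Lloyd's iterations and return the final sum of squared errors."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def elbow(sse):
    """Return the k whose SSE curve bends most; sse[i] belongs to k = i + 1."""
    best_k, best_bend = None, -math.inf
    for i in range(1, len(sse) - 1):
        bend = sse[i - 1] - 2 * sse[i] + sse[i + 1]
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k

# Hypothetical toy data: three tight groups around (0, 0), (10, 0) and (0, 10).
OFFSETS = [(-0.5, 0), (0.5, 0), (0, -0.5), (0, 0.5), (0, 0)]
POINTS = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in OFFSETS]
SSE = [kmeans_sse(POINTS, k) for k in range(1, 7)]
BEST_K = elbow(SSE)
```

    On this toy data the SSE curve drops steeply up to k = 3 and flattens afterwards, which is the pattern the abstract reports for its own data.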

  4. Canonical PSO Based K-Means Clustering Approach for Real Datasets

    PubMed Central

    Dey, Lopamudra; Chakraborty, Sanjay

    2014-01-01

    “Clustering” the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms. PMID:27355083

  5. *K-means and cluster models for cancer signatures.

    PubMed

    Kakushadze, Zura; Yu, Willie

    2017-09-01

    We present *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. We apply *K-means to extracting cancer signatures from genome data without using nonnegative matrix factorization (NMF). *K-means' computational cost is a fraction of NMF's. Using 1389 published samples for 14 cancer types, we find that 3 cancers (liver cancer, lung cancer and renal cell carcinoma) stand out and do not have cluster-like structures. Two clusters have especially high within-cluster correlations with 11 other cancers indicating common underlying structures. Our approach opens a novel avenue for studying such structures. *K-means is universal and can be applied in other fields. We discuss some potential applications in quantitative finance.

  6. Clustering for Binary Data Sets by Using Genetic Algorithm-Incremental K-means

    NASA Astrophysics Data System (ADS)

    Saharan, S.; Baragona, R.; Nor, M. E.; Salleh, R. M.; Asrah, N. M.

    2018-04-01

    This research was initially driven by the lack of clustering algorithms that specifically focus on binary data. To overcome this gap in knowledge, a promising technique for analysing this type of data became the main subject in this research, namely Genetic Algorithms (GA). For the purpose of this research, GA was combined with the Incremental K-means (IKM) algorithm to cluster the binary data streams. In GAIKM, the objective function was based on a few sufficient statistics that may be easily and quickly calculated on binary numbers. The implementation of IKM gives an advantage in terms of fast convergence. The results show that GAIKM is an efficient and effective new clustering algorithm compared to other clustering algorithms and to IKM itself. In conclusion, GAIKM outperformed other clustering algorithms such as GCUK, IKM, Scalable K-means (SKM) and K-means clustering and paves the way for future research involving missing data and outliers.

  7. An improved K-means clustering method for cDNA microarray image segmentation.

    PubMed

    Wang, T N; Li, T J; Shao, G F; Wu, S X

    2015-07-14

    Microarray technology is a powerful tool for human genetic research and other biomedical applications. Numerous improvements to the standard K-means algorithm have been carried out to complete the image segmentation step. However, most of the previous studies classify the image into two clusters. In this paper, we propose a novel K-means algorithm, which first classifies the image into three clusters, and then one of the three clusters is divided as the background region and the other two clusters, as the foreground region. The proposed method was evaluated on six different data sets. The analyses of accuracy, efficiency, expression values, special gene spots, and noise images demonstrate the effectiveness of our method in improving the segmentation quality.

  8. Android Malware Classification Using K-Means Clustering Algorithm

    NASA Astrophysics Data System (ADS)

    Hamid, Isredza Rahmi A.; Syafiqah Khalid, Nur; Azma Abdullah, Nurul; Rahman, Nurul Hidayah Ab; Chai Wen, Chuah

    2017-08-01

    Malware is designed to gain access to or damage a computer system without the user's notice. In addition, attackers exploit malware to commit crime or fraud. This paper proposes an Android malware classification approach based on the K-Means clustering algorithm. We evaluate the proposed model in terms of accuracy using machine learning algorithms. Two datasets, Virus Total and Malgenome, were selected to demonstrate the K-Means clustering algorithm. We classify the Android malware into three clusters: ransomware, scareware and goodware. Nine features were considered for each dataset: Lock Detected, Text Detected, Text Score, Encryption Detected, Threat, Porn, Law, Copyright and Moneypak. We used IBM SPSS Statistic software for data classification and WEKA tools to evaluate the built cluster. The proposed K-Means clustering algorithm shows promising results with high accuracy when tested using the Random Forest algorithm.

  9. Choosing the Number of Clusters in K-Means Clustering

    ERIC Educational Resources Information Center

    Steinley, Douglas; Brusco, Michael J.

    2011-01-01

    Steinley (2007) provided a lower bound for the sum-of-squares error criterion function used in K-means clustering. In this article, on the basis of the lower bound, the authors propose a method to distinguish between 1 cluster (i.e., a single distribution) versus more than 1 cluster. Additionally, conditional on indicating there are multiple…

  10. MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering

    PubMed Central

    Kim, Eun-Youn; Kim, Seon-Young; Ashlock, Daniel; Nam, Dougu

    2009-01-01

    Background: Uncovering subtypes of disease from microarray samples has important clinical implications such as survival time and sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on clusters with compact shapes and do not reflect the geometric complexity of the high dimensional microarray clusters, which limits their performance. Results: We present a cluster-number-based ensemble clustering algorithm, called MULTI-K, for microarray sample classification, which demonstrates remarkable accuracy. The method amalgamates multiple k-means runs by varying the number of clusters and identifies clusters that manifest the most robust co-memberships of elements. In addition to the original algorithm, we newly devised the entropy-plot to control the separation of singletons or small clusters. MULTI-K, unlike the simple k-means or other widely used methods, was able to capture clusters with complex and high-dimensional structures accurately. MULTI-K outperformed other methods including a recently developed ensemble clustering algorithm in tests with five simulated and eight real gene-expression data sets. Conclusion: The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering applied to the number of clusters tackles the problem very well. The C++ code and the data sets tested are available from the authors. PMID:19698124
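
    The idea of pooling co-memberships over k-means runs with different numbers of clusters can be illustrated roughly as below. This is only a sketch of the co-membership counting step under assumed inputs; the label vectors are hypothetical, and the entropy-plot and the remaining MULTI-K machinery mentioned in the abstract are not reproduced here.

```python
def co_membership(partitions):
    """Count, for every pair of elements, in how many partitions they share a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return counts

def robust_pairs(partitions, threshold=1.0):
    """Pairs (i, j) that are co-clustered in at least `threshold` of the runs."""
    counts = co_membership(partitions)
    runs = len(partitions)
    n = len(counts)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if counts[i][j] / runs >= threshold]

# Hypothetical label vectors from three k-means runs with k = 2, 3 and 4.
RUNS = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2],
    [0, 0, 1, 2, 3],
]
```

    Here elements 0 and 1 stay together in every run, while elements 2 and 3 separate once k reaches 4, so only the first pair survives a strict threshold.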

  11. Application of hybrid clustering using parallel k-means algorithm and DIANA algorithm

    NASA Astrophysics Data System (ADS)

    Umam, Khoirul; Bustamam, Alhadi; Lestari, Dian

    2017-03-01

    DNA is one of the carriers of genetic information in living organisms. Encoding, sequencing, and clustering DNA sequences have become key routine tasks in molecular biology, in particular in bioinformatics applications. There are two types of clustering: hierarchical clustering and partitioning clustering. In this paper, we combine the two types, K-Means (partitioning clustering) and DIANA (hierarchical clustering), into what we call hybrid clustering. Hybrid clustering using the Parallel K-Means algorithm and the DIANA algorithm is applied to cluster DNA sequences of Human Papillomavirus (HPV). The clustering process starts with collecting DNA sequences of HPV from NCBI (National Centre for Biotechnology Information), followed by feature extraction from the DNA sequences. The extracted features are stored in matrix form; the matrix is then normalized using Min-Max normalization, and genetic distance is calculated using Euclidean distance. The hybrid clustering is then applied using implementations of the Parallel K-Means algorithm and the DIANA algorithm. The aim of using hybrid clustering is to obtain better clustering results. To validate the resulting clusters and obtain the optimum number of clusters, we use the Davies-Bouldin Index (DBI). In this study, Parallel K-Means clustering grouped the data into 5 clusters with a minimal DBI value of 0.8741, and hybrid clustering grouped the data into 13 sub-clusters with minimal DBI values of 0.8216, 0.6845, 0.3331, 0.1994 and 0.3952. The DBI values of hybrid clustering are lower than the DBI value of Parallel K-Means clustering performed only at the first stage. This means that hybrid clustering gives a better result for clustering DNA sequences of HPV than Parallel K-Means clustering alone.
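
    The preprocessing and validation steps named in this abstract (Min-Max normalization, Euclidean distance, Davies-Bouldin Index) can be sketched as below. The function names are hypothetical and the index follows the commonly used definition with average point-to-centroid distance as the scatter; the abstract does not state which variant the authors used.

```python
import math

def min_max_normalize(rows):
    """Rescale every column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        tuple((v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans))
        for row in rows
    ]

def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    """Component-wise mean of the points in one cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def davies_bouldin(clusters):
    """Davies-Bouldin Index for a list of clusters; lower values are better."""
    cents = [centroid(c) for c in clusters]
    scatter = [
        sum(euclidean(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in range(k) if j != i
        )
    return total / k
```

    A lower index means tighter clusters that lie farther apart, which is why the abstract reads the smaller hybrid-clustering values as the better result.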

  12. Finding reproducible cluster partitions for the k-means algorithm

    PubMed Central

    2013-01-01

    K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well-known, it is common practice to assume that the partition with lowest sum-of-squares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also the closest that k-means can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions. This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset. PMID:23369085

  13. Finding reproducible cluster partitions for the k-means algorithm.

    PubMed

    Lisboa, Paulo J G; Etchells, Terence A; Jarman, Ian H; Chambers, Simon J

    2013-01-01

    K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well-known, it is common practice to assume that the partition with lowest sum-of-squares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also the closest that k-means can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions. This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset.
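
    One simple way to check whether repeated initialisations yield the same partition is to compare the label vectors pairwise. The sketch below uses the Rand index as a stand-in agreement measure; this is an assumption for illustration, since the abstract does not specify the stability index the authors use, and the helper names are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of element pairs on which two partitions agree (same/different cluster)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def mean_pairwise_agreement(partitions):
    """Average Rand index over all pairs of runs; 1.0 means fully reproducible."""
    scores = [rand_index(a, b) for a, b in combinations(partitions, 2)]
    return sum(scores) / len(scores)
```

    Two runs with nearly identical SSQ can still score well below 1.0 here, which is the situation the abstract warns about for larger values of k.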

  14. Automatic detection of erythemato-squamous diseases using k-means clustering.

    PubMed

    Ubeyli, Elif Derya; Doğdu, Erdoğan

    2010-04-01

    A new approach based on the implementation of k-means clustering is presented for automated detection of erythemato-squamous diseases. The purpose of clustering techniques is to find a structure for the given data by finding similarities between data according to data characteristics. The studied domain contained records of patients with known diagnosis. The k-means clustering algorithm's task was to classify the data points, in this case the patients with attribute data, to one of the five clusters. The algorithm was used to detect the five erythemato-squamous diseases when 33 features defining five disease indications were used. The purpose is to determine an optimum classification scheme for this problem. The present research demonstrated that the features well represent the erythemato-squamous diseases and the k-means clustering algorithm's task achieved high classification accuracies for only five erythemato-squamous diseases.

  15. Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm.

    PubMed

    Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong

    2016-01-01

    In single-particle cryo-electron microscopy (cryo-EM), K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development on clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alternative to the traditional K-means algorithm in single-particle cryo-EM analysis.

  16. Performance Analysis of Entropy Methods on K Means in Clustering Process

    NASA Astrophysics Data System (ADS)

    Dicky Syahputra Lubis, Mhd.; Mawengkang, Herman; Suwilo, Saib

    2017-12-01

    K Means is a non-hierarchical data clustering method that attempts to partition existing data into one or more clusters / groups. This method partitions the data so that data with the same characteristics are grouped into the same cluster and data with different characteristics are grouped into other clusters. The purpose of this data clustering is to minimize the objective function set in the clustering process, which generally attempts to minimize variation within a cluster and maximize the variation between clusters. However, the main disadvantage of this method is that the number k is often not known beforehand. Furthermore, a randomly chosen starting point may cause two points that are close to each other to be selected as two centroids. Therefore, the entropy method is used to determine the starting point in K Means; this method can be used to determine weights and to make a decision from a set of alternatives. Entropy is able to investigate the harmony in discrimination among a multitude of data sets. Under the entropy method, the criteria with the highest variation in values receive the highest weight. The entropy method can thus help K Means determine the starting point, which is usually chosen at random. As a result, clustering with K Means aided by the entropy method needs fewer iterations than standard K Means. Using the postoperative patient dataset from the UCI Machine Learning Repository, with only 12 records as a worked example, the entropy method reached the desired end result in only 2 iterations.

  17. Support Vector Data Descriptions and k-Means Clustering: One Class?

    PubMed

    Gornitz, Nico; Lima, Luiz Alberto; Muller, Klaus-Robert; Kloft, Marius; Nakajima, Shinichi

    2017-09-27

    We present ClusterSVDD, a methodology that unifies support vector data descriptions (SVDDs) and k-means clustering into a single formulation. This allows both methods to benefit from one another, i.e., by adding flexibility using multiple spheres for SVDDs and increasing anomaly resistance and flexibility through kernels to k-means. In particular, our approach leads to a new interpretation of k-means as a regularized mode seeking algorithm. The unifying formulation further allows for deriving new algorithms by transferring knowledge from one-class learning settings to clustering settings and vice versa. As a showcase, we derive a clustering method for structured data based on a one-class learning scenario. Additionally, our formulation can be solved via a particularly simple optimization scheme. We evaluate our approach empirically to highlight some of the proposed benefits on artificially generated data, as well as on real-world problems, and provide a Python software package comprising various implementations of primal and dual SVDD as well as our proposed ClusterSVDD.

  18. Clustering performance comparison using K-means and expectation maximization algorithms.

    PubMed

    Jung, Yong Gyu; Kang, Min Soo; Heo, Jun

    2014-11-14

    Clustering is an important means of data mining based on separating data categories by similar features. Unlike the classification algorithm, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K-means and the expectation maximization (EM) algorithm. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, the logistic regression analysis is applied to EM clusters and the K-means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.

  19. A Variable-Selection Heuristic for K-Means Clustering.

    ERIC Educational Resources Information Center

    Brusco, Michael J.; Cradit, J. Dennis

    2001-01-01

    Presents a variable selection heuristic for nonhierarchical (K-means) cluster analysis based on the adjusted Rand index for measuring cluster recovery. Subjected the heuristic to Monte Carlo testing across more than 2,200 datasets. Results indicate that the heuristic is extremely effective at eliminating masking variables. (SLD)

  20. Merging K-means with hierarchical clustering for identifying general-shaped groups.

    PubMed

    Peterson, Anna D; Ghosh, Arka P; Maitra, Ranjan

    2018-01-01

    Clustering partitions a dataset such that observations placed together in a group are similar but different from those in other groups. Hierarchical and K-means clustering are two approaches but have different strengths and weaknesses. For instance, hierarchical clustering identifies groups in a tree-like structure but suffers from computational complexity in large datasets while K-means clustering is efficient but designed to identify homogeneous spherically-shaped clusters. We present a hybrid non-parametric clustering approach that amalgamates the two methods to identify general-shaped clusters and that can be applied to larger datasets. Specifically, we first partition the dataset into spherical groups using K-means. We next merge these groups using hierarchical methods with a data-driven distance measure as a stopping criterion. Our proposal has the potential to reveal groups with general shapes and structure in a dataset. We demonstrate good performance on several simulated and real datasets.

  1. Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm

    PubMed Central

    Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong

    2016-01-01

    In single-particle cryo-electron microscopy (cryo-EM), K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development on clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alternative to the traditional K-means algorithm in single-particle cryo-EM analysis. PMID:27959895

  2. Paternal age related schizophrenia (PARS): Latent subgroups detected by k-means clustering analysis.

    PubMed

    Lee, Hyejoo; Malaspina, Dolores; Ahn, Hongshik; Perrin, Mary; Opler, Mark G; Kleinhaus, Karine; Harlap, Susan; Goetz, Raymond; Antonius, Daniel

    2011-05-01

    Paternal age related schizophrenia (PARS) has been proposed as a subgroup of schizophrenia with distinct etiology, pathophysiology and symptoms. This study uses a k-means clustering analysis approach to generate hypotheses about differences between PARS and other cases of schizophrenia. We studied PARS (operationally defined as not having any family history of schizophrenia among first and second-degree relatives and fathers' age at birth ≥ 35 years) in a series of schizophrenia cases recruited from a research unit. Data were available on demographic variables, symptoms (Positive and Negative Syndrome Scale; PANSS), cognitive tests (Wechsler Adult Intelligence Scale-Revised; WAIS-R) and olfaction (University of Pennsylvania Smell Identification Test; UPSIT). We conducted a series of k-means clustering analyses to identify clusters of cases containing high concentrations of PARS. Two analyses generated clusters with high concentrations of PARS cases. The first analysis (N=136; PARS=34) revealed a cluster containing 83% PARS cases, in which the patients showed a significant discrepancy between verbal and performance intelligence. The mean paternal and maternal ages were 41 and 33, respectively. The second analysis (N=123; PARS=30) revealed a cluster containing 71% PARS cases, of which 93% were females; the mean age of onset of psychosis, at 17.2, was significantly early. These results strengthen the evidence that PARS cases differ from other patients with schizophrenia. Hypothesis-generating findings suggest that features of PARS may include a discrepancy between verbal and performance intelligence, and in females, an early age of onset. These findings provide a rationale for separating these phenotypes from others in future clinical, genetic and pathophysiologic studies of schizophrenia and in considering responses to treatment. Copyright © 2011 Elsevier B.V. All rights reserved.

  3. Security and Correctness Analysis on Privacy-Preserving k-Means Clustering Schemes

    NASA Astrophysics Data System (ADS)

    Su, Chunhua; Bao, Feng; Zhou, Jianying; Takagi, Tsuyoshi; Sakurai, Kouichi

    Due to the fast development of the Internet and related IT technologies, it has become easier and easier to access large amounts of data. k-means clustering is a powerful and frequently used technique in data mining. Many research papers on privacy-preserving k-means clustering have been published. In this paper, we analyze the existing privacy-preserving k-means clustering schemes based on cryptographic techniques. We show that those schemes cause privacy breaches and cannot output the correct results due to faults in the protocol construction. Furthermore, we analyze our proposal as an option to mitigate such problems, although with an intermediate information breach during the computation.

  4. A comparison of latent class, K-means, and K-median methods for clustering dichotomous data.

    PubMed

    Brusco, Michael J; Shireman, Emilie; Steinley, Douglas

    2017-09-01

    The problem of partitioning a collection of objects based on their measurements on a set of dichotomous variables is a well-established problem in psychological research, with applications including clinical diagnosis, educational testing, cognitive categorization, and choice analysis. Latent class analysis and K-means clustering are popular methods for partitioning objects based on dichotomous measures in the psychological literature. The K-median clustering method has recently been touted as a potentially useful tool for psychological data and might be preferable to its close neighbor, K-means, when the variable measures are dichotomous. We conducted simulation-based comparisons of the latent class, K-means, and K-median approaches for partitioning dichotomous data. Although all 3 methods proved capable of recovering cluster structure, K-median clustering yielded the best average performance, followed closely by latent class analysis. We also report results for the 3 methods within the context of an application to transitive reasoning data, in which it was found that the 3 approaches can exhibit profound differences when applied to real data. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
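
    The practical difference between the K-means and K-median centre updates on dichotomous data can be seen in a small sketch. The cluster below is hypothetical, and pairing K-median with Manhattan distance follows the usual formulation rather than anything stated in the abstract.

```python
from statistics import median

def mean_center(cluster):
    """K-means style centre: component-wise mean (generally fractional on 0/1 data)."""
    return tuple(sum(col) / len(col) for col in zip(*cluster))

def median_center(cluster):
    """K-median style centre: component-wise median (a 0/1 profile for odd cluster sizes)."""
    return tuple(median(col) for col in zip(*cluster))

def manhattan(p, q):
    """City-block distance; on 0/1 vectors this equals the Hamming distance."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical cluster of three respondents on three dichotomous items.
CLUSTER = [(1, 0, 1), (1, 1, 0), (0, 0, 1)]
```

    The median centre is itself a valid response pattern obtained by majority vote per item, whereas the mean centre is a vector of item proportions; this is one reason the two methods can behave differently on binary measures.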

  5. Parallel k-means++

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.
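
    For orientation, the seed selection step that this record parallelizes can be sketched serially. This is a minimal single-threaded illustration in Python, not the C++ code described above; the function names and the sample points are hypothetical. Each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(points, k, rng=None):
    """Serial k-means++ seeding: returns k initial centers taken from points."""
    rng = rng or random.Random(0)
    seeds = [rng.choice(points)]  # first seed: uniform at random
    # d2[i] is the squared distance from points[i] to its nearest seed so far.
    d2 = [sq_dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) == 0:
            # Every point coincides with a seed; fall back to a uniform draw.
            nxt = rng.choice(points)
        else:
            # Draw an index with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            nxt = points[idx]
        seeds.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return seeds

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
print(kmeanspp_seeds(points, 3))
```

    The distance update in the last line of the loop is independent for each point, which is what makes this step a natural candidate for data-parallel execution.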

  6. A New Soft Computing Method for K-Harmonic Means Clustering.

    PubMed

    Yeh, Wei-Chang; Jiang, Yunzhi; Chen, Yee-Fen; Chen, Zhe

    2016-01-01

    The K-harmonic means clustering algorithm (KHM) is a new clustering method used to group data such that the sum of the harmonic averages of the distances between each entity and all cluster centroids is minimized. Because it is less sensitive to initialization than K-means (KM), many researchers have recently been attracted to studying KHM. In this study, the proposed iSSO-KHM is based on an improved simplified swarm optimization (iSSO) and integrates a variable neighborhood search (VNS) for KHM clustering. As evidence of the utility of the proposed iSSO-KHM, we present extensive computational results on eight benchmark problems. From the computational results, the comparison appears to support the superiority of the proposed iSSO-KHM over previously developed algorithms for all experiments in the literature.
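
    The objective described in the first sentence can be written as a short function. This is a sketch of the KHM objective only, not of the iSSO-KHM optimizer; the function name, the distance exponent p (commonly used in KHM formulations) and the eps guard against zero distances are assumptions made for illustration. It requires Python 3.8 or later for math.dist.

```python
import math

def khm_objective(points, centroids, p=2.0, eps=1e-12):
    """Sum over points of the harmonic average of d(x, c)**p over all centroids.

    The harmonic average of k values v_1..v_k is k / sum(1 / v_j).
    """
    k = len(centroids)
    total = 0.0
    for x in points:
        inv_sum = 0.0
        for c in centroids:
            d = max(math.dist(x, c), eps)  # eps avoids division by zero
            inv_sum += 1.0 / d ** p
        total += k / inv_sum
    return total

# One point at distances 1 and 2 from two centroids.
print(khm_objective([(0, 0)], [(1, 0), (2, 0)], p=1.0))
```

    Because the harmonic average is dominated by the smallest distance, each point is effectively scored against its nearest centroid while the other centroids still contribute, which is consistent with the reduced sensitivity to initialization noted in the abstract.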

  7. Segmentation of dermatoscopic images by frequency domain filtering and k-means clustering algorithms.

    PubMed

    Rajab, Maher I

    2011-11-01

    Since the introduction of epiluminescence microscopy (ELM), image analysis tools have been extended to the field of dermatology, in an attempt to algorithmically reproduce clinical evaluation. Accurate image segmentation of skin lesions is one of the key steps for useful, early and non-invasive diagnosis of cutaneous melanomas. This paper proposes two image segmentation algorithms based on frequency domain processing and k-means clustering/fuzzy k-means clustering. The two methods are capable of segmenting and extracting the true border that reveals the global structure irregularity (indentations and protrusions), which may suggest excessive cell growth or regression of a melanoma. As a pre-processing step, Fourier low-pass filtering is applied to reduce the surrounding noise in a skin lesion image. A quantitative comparison of the techniques is enabled by the use of synthetic skin lesion images that model lesions covered with hair to which Gaussian noise is added. The proposed techniques are also compared with an established optimal-based thresholding skin-segmentation method. It is demonstrated that for lesions with a range of different border irregularity properties, the k-means clustering and fuzzy k-means clustering segmentation methods provide the best performance over a range of signal to noise ratios. The proposed segmentation techniques are also demonstrated to have similar performance when tested on real skin lesions representing high-resolution ELM images. This study suggests that the segmentation results obtained using a combination of low-pass frequency filtering and k-means or fuzzy k-means clustering are superior to the result that would be obtained by using k-means or fuzzy k-means clustering segmentation methods alone. © 2011 John Wiley & Sons A/S.

  8. Profiling Local Optima in K-Means Clustering: Developing a Diagnostic Technique

    ERIC Educational Resources Information Center

    Steinley, Douglas

    2006-01-01

    Using the cluster generation procedure proposed by D. Steinley and R. Henson (2005), the author investigated the performance of K-means clustering under the following scenarios: (a) different probabilities of cluster overlap; (b) different types of cluster overlap; (c) varying samples sizes, clusters, and dimensions; (d) different multivariate…

  9. Long-term surface EMG monitoring using K-means clustering and compressive sensing

    NASA Astrophysics Data System (ADS)

    Balouchestani, Mohammadreza; Krishnan, Sridhar

    2015-05-01

    In this work, we present an advanced K-means clustering algorithm based on Compressed Sensing (CS) theory in combination with the K-Singular Value Decomposition (K-SVD) method for clustering long-term recordings of surface electromyography (sEMG) signals. Long-term monitoring of sEMG signals aims at recording the electrical activity produced by muscles, which is a very useful procedure for treatment and diagnostic purposes as well as for the detection of various pathologies. The proposed algorithm is examined for three scenarios of sEMG signals: a healthy person (sEMG-Healthy), a patient with myopathy (sEMG-Myopathy), and a patient with neuropathy (sEMG-Neuropathy). The proposed algorithm can easily scan large datasets of long-term sEMG recordings. We test the proposed algorithm with Principal Component Analysis (PCA) and Linear Correlation Coefficient (LCC) dimensionality reduction methods. Then, the output of the proposed algorithm is fed to K-Nearest Neighbours (K-NN) and Probabilistic Neural Network (PNN) classifiers in order to calculate the clustering performance. The proposed algorithm achieves a classification accuracy of 99.22%. This ability allows reducing 17% of Average Classification Error (ACE), 9% of Training Error (TE), and 18% of Root Mean Square Error (RMSE). The proposed algorithm also reduces clustering energy consumption by 14% compared to the existing K-Means clustering algorithm.

  10. An improved K-means clustering algorithm in agricultural image segmentation

    NASA Astrophysics Data System (ADS)

    Cheng, Huifeng; Peng, Hui; Liu, Shanmei

    Image segmentation is the first important step to image analysis and image processing. In this paper, according to color crops image characteristics, we firstly transform the color space of image from RGB to HIS, and then select proper initial clustering center and cluster number in application of mean-variance approach and rough set theory followed by clustering calculation in such a way as to automatically segment color component rapidly and extract target objects from background accurately, which provides a reliable basis for identification, analysis, follow-up calculation and process of crops images. Experimental results demonstrate that improved k-means clustering algorithm is able to reduce the computation amounts and enhance precision and accuracy of clustering.

  11. Implementation of spectral clustering on microarray data of carcinoma using k-means algorithm

    NASA Astrophysics Data System (ADS)

    Frisca; Bustamam, Alhadi; Siswantining, Titin

    2017-03-01

    Clustering is a data analysis method that aims to group data with similar characteristics together. Spectral clustering is one of the most popular modern clustering algorithms. As an effective clustering technique, spectral clustering emerged from the concepts of spectral graph theory. Spectral clustering needs a partitioning algorithm. Available partitioning methods include PAM, SOM, fuzzy c-means, and k-means. Based on the research done by Capital and Choudhury in 2013, the k-means algorithm provides better accuracy than the PAM algorithm when using Euclidean distance. So in this paper we use k-means as our partitioning algorithm. The major advantage of spectral clustering is in reducing data dimension, in this case the dimension of a large microarray dataset. A microarray is a small chip made of a glass plate containing thousands and even tens of thousands of kinds of genes in DNA fragments derived from doubling cDNA. Microarray data are widely used to detect cancer, for example carcinoma, in which cancer cells express abnormalities in their genes. The purpose of this research is to place data with high similarity in the same group and data with low similarity in different groups. In this research, the carcinoma microarray data comprise 7457 genes. The result of partitioning using the k-means algorithm is two clusters.

  12. Towards enhancement of performance of K-means clustering using nature-inspired optimization algorithms.

    PubMed

    Fong, Simon; Deb, Suash; Yang, Xin-She; Zhuang, Yan

    2014-01-01

    Traditional K-means clustering algorithms have the drawback of getting stuck at local optima that depend on the random values of initial centroids. Optimization algorithms have their advantages in guiding iterative computation to search for global optima while avoiding local optima. The algorithms help speed up the clustering process by converging into a global optimum early with multiple search agents in action. Inspired by nature, some contemporary optimization algorithms which include Ant, Bat, Cuckoo, Firefly, and Wolf search algorithms mimic the swarming behavior allowing them to cooperatively steer towards an optimal objective within a reasonable time. It is known that these so-called nature-inspired optimization algorithms have their own characteristics as well as pros and cons in different applications. When these algorithms are combined with K-means clustering mechanism for the sake of enhancing its clustering quality by avoiding local optima and finding global optima, the new hybrids are anticipated to produce unprecedented performance. In this paper, we report the results of our evaluation experiments on the integration of nature-inspired optimization methods into K-means algorithms. In addition to the standard evaluation metrics in evaluating clustering quality, the extended K-means algorithms that are empowered by nature-inspired optimization methods are applied on image segmentation as a case study of application scenario.

  13. A novel harmony search-K means hybrid algorithm for clustering gene expression data

    PubMed Central

    Nazeer, KA Abdul; Sebastian, MP; Kumar, SD Madhu

    2013-01-01

    Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. The DNA microarray technology makes it possible to simultaneously analyze large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns from large quantities of expression data that in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means algorithm. A meta-heuristic optimization algorithm named harmony search helps find out near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering the gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms. PMID:23390351

  14. A novel harmony search-K means hybrid algorithm for clustering gene expression data.

    PubMed

    Nazeer, Ka Abdul; Sebastian, Mp; Kumar, Sd Madhu

    2013-01-01

    Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. The DNA microarray technology makes it possible to simultaneously analyze large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns from large quantities of expression data that in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means algorithm. A meta-heuristic optimization algorithm named harmony search helps find out near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering the gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms.

  15. The global Minmax k-means algorithm.

    PubMed

    Wang, Xiaoyan; Bai, Yanping

    2016-01-01

    The global k-means algorithm is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure from suitable initial positions, and employs k-means to minimize the sum of the intra-cluster variances. However, the global k-means algorithm sometimes produces singleton clusters, and the initial positions are sometimes bad; after a bad initialization, the k-means algorithm can easily converge to a poor local optimum. In this paper, we first modify the global k-means algorithm to eliminate the singleton clusters, and then apply the MinMax k-means clustering error method to the global k-means algorithm to overcome the effect of bad initialization, yielding the proposed global Minmax k-means algorithm. The proposed clustering method is tested on some popular data sets and compared to the k-means algorithm, the global k-means algorithm and the MinMax k-means algorithm. The experimental results show that our proposed algorithm outperforms the other algorithms mentioned in the paper.

  16. K-means-clustering-based fiber nonlinearity equalization techniques for 64-QAM coherent optical communication system.

    PubMed

    Zhang, Junfeng; Chen, Wei; Gao, Mingyi; Shen, Gangxiang

    2017-10-30

    In this work, we proposed two k-means-clustering-based algorithms to mitigate the fiber nonlinearity for 64-quadrature amplitude modulation (64-QAM) signal, the training-sequence assisted k-means algorithm and the blind k-means algorithm. We experimentally demonstrated the proposed k-means-clustering-based fiber nonlinearity mitigation techniques in 75-Gb/s 64-QAM coherent optical communication system. The proposed algorithms have reduced clustering complexity and low data redundancy and they are able to quickly find appropriate initial centroids and select correctly the centroids of the clusters to obtain the global optimal solutions for large k value. We measured the bit-error-ratio (BER) performance of 64-QAM signal with different launched powers into the 50-km single mode fiber and the proposed techniques can greatly mitigate the signal impairments caused by the amplified spontaneous emission noise and the fiber Kerr nonlinearity and improve the BER performance.

  17. Towards Enhancement of Performance of K-Means Clustering Using Nature-Inspired Optimization Algorithms

    PubMed Central

    Deb, Suash; Yang, Xin-She

    2014-01-01

    Traditional K-means clustering algorithms have the drawback of getting stuck at local optima that depend on the random values of initial centroids. Optimization algorithms have their advantages in guiding iterative computation to search for global optima while avoiding local optima. The algorithms help speed up the clustering process by converging into a global optimum early with multiple search agents in action. Inspired by nature, some contemporary optimization algorithms which include Ant, Bat, Cuckoo, Firefly, and Wolf search algorithms mimic the swarming behavior allowing them to cooperatively steer towards an optimal objective within a reasonable time. It is known that these so-called nature-inspired optimization algorithms have their own characteristics as well as pros and cons in different applications. When these algorithms are combined with K-means clustering mechanism for the sake of enhancing its clustering quality by avoiding local optima and finding global optima, the new hybrids are anticipated to produce unprecedented performance. In this paper, we report the results of our evaluation experiments on the integration of nature-inspired optimization methods into K-means algorithms. In addition to the standard evaluation metrics in evaluating clustering quality, the extended K-means algorithms that are empowered by nature-inspired optimization methods are applied on image segmentation as a case study of application scenario. PMID:25202730

  18. The effect of mining data k-means clustering toward students profile model drop out potential

    NASA Astrophysics Data System (ADS)

    Purba, Windania; Tamba, Saut; Saragih, Jepronel

    2018-04-01

    High student success and low student failure can reflect the quality of a college. One factor in student failure is dropping out. To address the problem, data mining with K-means clustering was applied. The K-means clustering method was used to cluster students who are potential dropouts. First, the result data were clustered to obtain information on the condition of all students. Based on the resulting model, it was found that students potentially drop out because of a lack of enthusiasm for learning, unsupportive parents, diffidence, and less student behavior time. The K-means clustering results showed that the students most likely to drop out were in Cluster 1, because of its Credit Total System, Quality Total, and the lowest Grade Point Average (GPA) compared with Clusters 2 and 3.

  19. Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic Programming.

    PubMed

    Wang, Haizhou; Song, Mingzhou

    2011-12-01

    The heuristic k-means algorithm, widely used for cluster analysis, does not guarantee optimality. We developed a dynamic programming algorithm for optimal one-dimensional clustering. The algorithm is implemented as an R package called Ckmeans.1d.dp. We demonstrate its advantage in optimality and runtime over the standard iterative k-means algorithm.
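
    The idea of optimal one-dimensional clustering by dynamic programming can be sketched in a few lines of Python. This is a plain textbook-style dynamic program written for illustration, not the R package's implementation; the function name and the sample data are hypothetical. It relies on the fact that, in one dimension, optimal clusters are contiguous runs of the sorted values.

```python
def optimal_kmeans_1d(values, k):
    """Return (minimum within-cluster sum of squares, clusters) for 1-D data."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")
    # Prefix sums give the cost of any contiguous run in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i, j):
        """Sum of squared deviations from the mean for xs[i:j], with i < j."""
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[m][j]: best cost of splitting the first j values into m clusters.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):  # the last cluster is xs[i:j]
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j], cut[m][j] = c, i
    # Walk the stored cut points backwards to recover the clusters.
    clusters, j = [], n
    for m in range(k, 0, -1):
        i = cut[m][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return cost[k][n], clusters

print(optimal_kmeans_1d([11, 1, 50, 2, 12, 3, 10], 3))
```

    This version takes time quadratic in the number of values for each cluster count, which is adequate for small inputs. Unlike the iterative heuristic, the result does not depend on any initialization.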

  20. Reducing Earth Topography Resolution for SMAP Mission Ground Tracks Using K-Means Clustering

    NASA Technical Reports Server (NTRS)

    Rizvi, Farheen

    2013-01-01

    The K-means clustering algorithm is used to reduce Earth topography resolution for the SMAP mission ground tracks. As SMAP propagates in orbit, knowledge of the radar antenna footprints on Earth is required for the antenna misalignment calibration. Each antenna footprint contains a latitude and longitude location pair on the Earth surface. There are 400 pairs in one data set for the calibration model. It is computationally expensive to calculate corresponding Earth elevation for these data pairs. Thus, the antenna footprint resolution is reduced. Similar topographical data pairs are grouped together with the K-means clustering algorithm. The resolution is reduced to the mean of each topographical cluster called the cluster centroid. The corresponding Earth elevation for each cluster centroid is assigned to the entire group. Results show that 400 data points are reduced to 60 while still maintaining algorithm performance and computational efficiency. In this work, sensitivity analysis is also performed to show a trade-off between algorithm performance versus computational efficiency as the number of cluster centroids and algorithm iterations are increased.
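
    The reduction described here, in which every point is replaced by the centroid of its cluster, can be sketched with a basic Lloyd-style k-means. This is an illustrative sketch and not the mission software: the function name, the random initialization, the use of plain Euclidean distance on coordinate pairs and the sample points are all assumptions. It requires Python 3.8 or later for math.dist.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd iteration; returns (centers, label for each point)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct input points

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centers[c]))

    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)
        # New center = mean of its group; an empty group keeps its old center.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    return centers, [nearest(p) for p in points]

# Hypothetical coordinate pairs forming two well-separated groups.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = kmeans(pts, 2)
reduced = [centers[j] for j in labels]  # each point replaced by its centroid
print(centers)
```

    Any expensive quantity, such as the elevation lookup mentioned above, then needs to be computed only once per centroid and can be shared by all points carrying that label.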

  1. Prediction of chemotherapeutic response in bladder cancer using k-means clustering of DCE-MRI pharmacokinetic parameters

    PubMed Central

    Nguyen, Huyen T.; Jia, Guang; Shah, Zarine K.; Pohar, Kamal; Mortazavi, Amir; Zynger, Debra L.; Wei, Lai; Yang, Xiangyu; Clark, Daniel; Knopp, Michael V.

    2015-01-01

    Purpose To apply k-means clustering of two pharmacokinetic parameters derived from 3T DCE-MRI to predict chemotherapeutic response in bladder cancer at the mid-cycle time-point. Materials and Methods With the pre-determined number of 3 clusters, k-means clustering was performed on non-dimensionalized Amp and kep estimates of each bladder tumor. Three cluster volume fractions (VFs) were calculated for each tumor at baseline and mid-cycle. The changes of three cluster VFs from baseline to mid-cycle were correlated with the tumor’s chemotherapeutic response. Receiver-operating-characteristics curve analysis was used to evaluate the performance of each cluster VF change as a biomarker of chemotherapeutic response in bladder cancer. Results k-means clustering partitioned each bladder tumor into cluster 1 (low kep and low Amp), cluster 2 (low kep and high Amp), cluster 3 (high kep and low Amp). The changes of all three cluster VFs were found to be associated with bladder tumor response to chemotherapy. The VF change of cluster 2 presented with the highest area-under-the-curve value (0.96) and the highest sensitivity/specificity/accuracy (96%/100%/97%) with a selected cutoff value. Conclusion k-means clustering of the two DCE-MRI pharmacokinetic parameters can characterize the complex microcirculatory changes within a bladder tumor to enable early prediction of the tumor’s chemotherapeutic response. PMID:24943272

  2. An additional k-means clustering step improves the biological features of WGCNA gene co-expression networks.

    PubMed

    Botía, Juan A; Vandrovcova, Jana; Forabosco, Paola; Guelfi, Sebastian; D'Sa, Karishma; Hardy, John; Lewis, Cathryn M; Ryten, Mina; Weale, Michael E

    2017-04-12

    Weighted Gene Co-expression Network Analysis (WGCNA) is a widely used R software package for the generation of gene co-expression networks (GCN). WGCNA generates both a GCN and a derived partitioning of clusters of genes (modules). We propose k-means clustering as an additional processing step to conventional WGCNA, which we have implemented in the R package km2gcn (k-means to gene co-expression network, https://github.com/juanbot/km2gcn). We assessed our method on networks created from UKBEC data (10 different human brain tissues), on networks created from GTEx data (42 human tissues, including 13 brain tissues), and on simulated networks derived from GTEx data. We observed substantially improved module properties, including: (1) few or zero misplaced genes; (2) increased counts of replicable clusters in alternate tissues (x3.1 on average); (3) improved enrichment of Gene Ontology terms (seen in 48/52 GCNs); (4) improved cell type enrichment signals (seen in 21/23 brain GCNs); and (5) more accurate partitions in simulated data according to a range of similarity indices. The results obtained from our investigations indicate that our k-means method, applied as an adjunct to standard WGCNA, results in better network partitions. These improved partitions enable more fruitful downstream analyses, as gene modules are more biologically meaningful.

  3. Implementation of K-Means Clustering Method for Electronic Learning Model

    NASA Astrophysics Data System (ADS)

    Latipa Sari, Herlina; Suranti Mrs., Dewi; Natalia Zulita, Leni

    2017-12-01

    The teaching and learning process at SMK Negeri 2 Bengkulu Tengah uses an e-learning system for teachers and students. The e-learning was based on the classification of normative, productive, and adaptive subjects. SMK Negeri 2 Bengkulu Tengah had 394 students and 60 teachers with 16 subjects. The e-learning database records were used in this research to observe students’ activity patterns in attending class. The K-Means algorithm was used to classify students’ learning activities in e-learning, in order to obtain clusters of student activity and improvement of student ability. Implementation of the K-Means Clustering method for the electronic learning model at SMK Negeri 2 Bengkulu Tengah was conducted by observing 10 student activities, namely participation in the classroom, submit assignment, view assignment, add discussion, view discussion, add comment, download course materials, view article, view test, and submit test. In the e-learning model, testing was conducted on 10 students and yielded 2 clusters of membership data (C1 and C2). Cluster 1: membership percentage of 70%, consisting of 6 members, namely 1112438 Anggi Julian, 1112439 Anis Maulita, 1112441 Ardi Febriansyah, 1112452 Berlian Sinurat, 1112460 Dewi Anugrah Anwar and 1112467 Eka Tri Oktavia Sari. Cluster 2: membership percentage of 30%, consisting of 4 members, namely 1112463 Dosita Afriyani, 1112471 Erda Novita, 1112474 Eskardi and 1112477 Fachrur Rozi.

  4. Analysis of k-means clustering approach on the breast cancer Wisconsin dataset.

    PubMed

    Dubey, Ashutosh Kumar; Gupta, Umesh; Jain, Sonal

    2016-11-01

    Breast cancer is one of the most common cancers found worldwide and most frequently found in women. An early detection of breast cancer provides the possibility of its cure; therefore, a large number of studies are currently going on to identify methods that can detect breast cancer in its early stages. This study was aimed to find the effects of k-means clustering algorithm with different computation measures like centroid, distance, split method, epoch, attribute, and iteration and to carefully consider and identify the combination of measures that has potential of highly accurate clustering accuracy. K-means algorithm was used to evaluate the impact of clustering using centroid initialization, distance measures, and split methods. The experiments were performed using breast cancer Wisconsin (BCW) diagnostic dataset. Foggy and random centroids were used for the centroid initialization. In foggy centroid, based on random values, the first centroid was calculated. For random centroid, the initial centroid was considered as (0, 0). The results were obtained by employing k-means algorithm and are discussed with different cases considering variable parameters. The calculations were based on the centroid (foggy/random), distance (Euclidean/Manhattan/Pearson), split (simple/variance), threshold (constant epoch/same centroid), attribute (2-9), and iteration (4-10). Approximately, 92 % average positive prediction accuracy was obtained with this approach. Better results were found for the same centroid and the highest variance. The results achieved using Euclidean and Manhattan were better than the Pearson correlation. The findings of this work provided extensive understanding of the computational parameters that can be used with k-means. The results indicated that k-means has a potential to classify BCW dataset.

  5. K-means clustering for support construction in diffractive imaging.

    PubMed

    Hattanda, Shunsuke; Shioya, Hiroyuki; Maehara, Yosuke; Gohara, Kazutoshi

    2014-03-01

    A method for constructing an object support based on K-means clustering of the object-intensity distribution is newly presented in diffractive imaging. This releases the adjustment of unknown parameters in the support construction, and it is well incorporated with the Gerchberg and Saxton diagram. A simple numerical simulation reveals that the proposed method is effective for dynamically constructing the support without an initial prior support.

  6. A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality.

    PubMed

    Wang, Xueyi

    2012-02-08

    The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses k-means clustering and the triangle inequality to accelerate the search for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-trees, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10^6 records and 10^4 dimensions, kMkNN shows a 2- to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significantly better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces.
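
    The pruning idea behind the two stages can be sketched for the single nearest neighbor case. This is a simplified illustration, not the kMkNN code: the function names are hypothetical, the cluster centers are assumed to have been produced beforehand (for example by k-means), and only one neighbor is returned. It requires Python 3.8 or later for math.dist. The key fact is the triangle inequality: for a point p in a cluster with center c, d(q, p) >= |d(q, c) - d(p, c)|.

```python
import math

def build_index(points, centers):
    """Buildup: assign each point to its nearest center and store d(p, c)."""
    clusters = [[] for _ in centers]
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        j = dists.index(min(dists))
        clusters[j].append((dists[j], p))
    return clusters

def nearest_neighbor(query, centers, clusters):
    """Search: exact 1-NN; returns (point, distance, distances computed)."""
    best_d, best_p, computed = math.inf, None, 0
    dqc = [math.dist(query, c) for c in centers]
    # Visit clusters starting from the one whose center is closest to the query.
    for j in sorted(range(len(centers)), key=lambda i: dqc[i]):
        for dpc, p in clusters[j]:
            # Lower bound on d(query, p) from the triangle inequality.
            if abs(dqc[j] - dpc) >= best_d:
                continue  # p cannot beat the current best, skip it
            computed += 1
            d = math.dist(query, p)
            if d < best_d:
                best_d, best_p = d, p
    return best_p, best_d, computed

# Hypothetical data: a 10 x 10 grid and four pre-computed centers.
points = [(x, y) for x in range(10) for y in range(10)]
centers = [(2, 2), (7, 7), (2, 7), (7, 2)]
index = build_index(points, centers)
print(nearest_neighbor((6.2, 6.9), centers, index))
```

    Because a point is skipped only when its lower bound already rules it out, the result equals that of a brute-force scan while computing fewer point-to-query distances.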

  7. Cluster analysis of polymers using laser-induced breakdown spectroscopy with K-means

    NASA Astrophysics Data System (ADS)

    Yangmin, GUO; Yun, TANG; Yu, DU; Shisong, TANG; Lianbo, GUO; Xiangyou, LI; Yongfeng, LU; Xiaoyan, ZENG

    2018-06-01

    Laser-induced breakdown spectroscopy (LIBS) combined with the K-means algorithm was employed to automatically differentiate industrial polymers under atmospheric conditions. The unsupervised learning algorithm K-means was utilized for the clustering of a LIBS dataset measured from twenty kinds of industrial polymers. To prevent the interference from metallic elements, three atomic emission lines (C I 247.86 nm, H I 656.3 nm, and O I 777.3 nm) and one molecular line C–N (0, 0) 388.3 nm were used. The cluster analysis results were obtained through an iterative process. The Davies–Bouldin index was employed to determine the initial number of clusters. The average relative standard deviation values of characteristic spectral lines were used as the iterative criterion. With the proposed approach, the classification accuracy for twenty kinds of industrial polymers achieved 99.6%. The results demonstrated that this approach has great potential for industrial polymers recycling by LIBS.
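
    The Davies–Bouldin index mentioned above can be computed directly from a candidate partition. The sketch below uses the common definition with Euclidean distances and mean within-cluster scatter; the function names and sample clusters are hypothetical, and the paper's exact variant is not specified in the abstract. It requires Python 3.8 or later for math.dist, at least two clusters, and distinct centroids.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition given as a list of point lists."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance from the points of a cluster to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i
        )
    return total / k

tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(davies_bouldin(tight))
```

    Lower values indicate more compact and better separated clusters, so one common way to use the index is to evaluate it for several candidate cluster counts and prefer the count that gives the smallest value.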

  8. What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm.

    PubMed

    Raykov, Yordan P; Boukouvalas, Alexis; Baig, Fahd; Little, Max A

    The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.

  9. What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm

    PubMed Central

    Baig, Fahd; Little, Max A.

    2016-01-01

    The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism. PMID:27669525

  10. Effects of Group Size and Lack of Sphericity on the Recovery of Clusters in K-Means Cluster Analysis

    ERIC Educational Resources Information Center

    de Craen, Saskia; Commandeur, Jacques J. F.; Frank, Laurence E.; Heiser, Willem J.

    2006-01-01

    K-means cluster analysis is known for its tendency to produce spherical and equally sized clusters. To assess the magnitude of these effects, a simulation study was conducted, in which populations were created with varying departures from sphericity and group sizes. An analysis of the recovery of clusters in the samples taken from these…

  11. Detection of maize kernels breakage rate based on K-means clustering

    NASA Astrophysics Data System (ADS)

    Yang, Liang; Wang, Zhuo; Gao, Lei; Bai, Xiaoping

    2017-04-01

    To improve the recognition accuracy and efficiency of maize kernel breakage detection, this paper uses computer vision technology to detect broken maize kernels with the K-means clustering algorithm. First, the collected RGB images are converted into Lab images; then the clarity of the original images is evaluated with the energy function of the Sobel 8 gradient. Finally, breakage detection is carried out on images taken with different pixel acquisition equipment and at different shooting angles. Broken maize kernels are identified by the color difference between intact kernels and broken kernels. The clarity evaluation and the different shooting angles are used to verify that image clarity and shooting angle have a direct influence on feature extraction. The results show that the K-means clustering algorithm can distinguish broken maize kernels effectively.

  12. An adaptive enhancement algorithm for infrared video based on modified k-means clustering

    NASA Astrophysics Data System (ADS)

    Zhang, Linze; Wang, Jingqi; Wu, Wen

    2016-09-01

    In this paper, we propose a video enhancement algorithm to improve the output video of an infrared camera. Sometimes the video obtained by an infrared camera is very dark because there is no clear target. In this case, the infrared video is divided into frame images by frame extraction so that image enhancement can be carried out. The first frame image is divided into k sub-images by K-means clustering according to the gray interval each occupies, and histogram equalization is then applied to the k sub-images according to the amount of information in each; we also use a method to handle the cases in which the final cluster centers lie close to each other. For the other frame images, the initial cluster centers are determined by the final cluster centers of the previous frame, and histogram equalization of each sub-image is carried out after image segmentation based on K-means clustering. Histogram equalization spreads the gray values of the image over the whole gray-level range, and the gray-level range of each sub-image is determined by its share of the pixels in the frame image. Experimental results show that this algorithm can improve the contrast of infrared video in which a night target is not obvious, leading to a dim scene, and can adaptively reduce the negative effect of overexposed pixels within a certain range.

  13. Automatic classification of canine PRG neuronal discharge patterns using K-means clustering.

    PubMed

    Zuperku, Edward J; Prkic, Ivana; Stucke, Astrid G; Miller, Justin R; Hopp, Francis A; Stuth, Eckehard A

    2015-02-01

    Respiratory-related neurons in the parabrachial-Kölliker-Fuse (PB-KF) region of the pons play a key role in the control of breathing. The neuronal activities of these pontine respiratory group (PRG) neurons exhibit a variety of inspiratory (I), expiratory (E), phase spanning and non-respiratory related (NRM) discharge patterns. Due to the variety of patterns, it can be difficult to classify them into distinct subgroups according to their discharge contours. This report presents a method that automatically classifies neurons according to their discharge patterns and derives an average subgroup contour of each class. It is based on the K-means clustering technique and is implemented via SigmaPlot User-Defined transform scripts. The discharge patterns of 135 canine PRG neurons were classified into seven distinct subgroups. Additional methods for choosing the optimal number of clusters are described. Analysis of the results suggests that the K-means clustering method offers a robust objective means of both automatically categorizing neuron patterns and establishing the underlying archetypical contours of subtypes based on the discharge patterns of a group of neurons. Published by Elsevier B.V.

  14. K-mean clustering algorithm for processing signals from compound semiconductor detectors

    NASA Astrophysics Data System (ADS)

    Tada, Tsutomu; Hitomi, Keitaro; Wu, Yan; Kim, Seong-Yun; Yamazaki, Hiromichi; Ishii, Keizo

    2011-12-01

    The K-mean clustering algorithm was employed for processing signal waveforms from TlBr detectors. The signal waveforms were classified based on their shape, which reflects the charge collection process in the detector. The classified signal waveforms were processed individually to suppress the pulse height variation of signals due to the charge collection loss. The energy resolution obtained for a 137Cs spectrum measured with a 0.5 mm thick TlBr detector was 1.3% FWHM when 500 clusters were employed.

  15. Nucleus and cytoplasm segmentation in microscopic images using K-means clustering and region growing.

    PubMed

    Sarrafzadeh, Omid; Dehnavi, Alireza Mehri

    2015-01-01

    Segmentation of leukocytes acts as the foundation for all automated image-based hematological disease recognition systems. Most of the time, hematologists are interested in evaluation of white blood cells only. Digital image processing techniques can help them in their analysis and diagnosis. The main objective of this paper is to detect leukocytes from a blood smear microscopic image and segment them into their two dominant elements, nucleus and cytoplasm. The segmentation is conducted using two stages of applying K-means clustering. First, the nuclei are segmented using K-means clustering. Then, a proposed method based on region growing is applied to separate the connected nuclei. Next, the nuclei are subtracted from the original image. Finally, the cytoplasm is segmented using the second stage of K-means clustering. The results indicate that the proposed method is able to extract the nucleus and cytoplasm regions accurately and works well even though there is no significant contrast between the components in the image. In this paper, a method based on K-means clustering and region growing is proposed in order to detect leukocytes from a blood smear microscopic image and segment their components, the nucleus and the cytoplasm. As the region growing step of the algorithm relies on edge information, it will not be able to separate connected nuclei accurately where edges are poor, and it requires at least a weak edge to exist between the nuclei. The nucleus and cytoplasm segments of a leukocyte can be used for feature extraction and classification, which leads to automated leukemia detection.

  16. Anthropometric typology of male and female rowers using k-means clustering.

    PubMed

    Forjasz, Justyna

    2011-06-01

    The aim of this paper is to present the morphological features of rowers. The objective is to establish the type of body build best suited to the present requirements of this sports discipline through the determination of the most important morphological features in rowing with regard to the type of racing boat. The subjects of this study included competitors who practise rowing and were members of the Junior National Team. The considered variables included a group of 32 anthropometric measurements of body composition determined using the BIA method among male and female athletes, while also including rowing boat categories. In order to determine the analysed structures of male and female rowers, an observation analysis was taken into consideration and performed by the k-means clustering method. In the group of male and female rowers using long paddles, higher mean values for the analysed features were observed, with the exception of fat-free mass, and water content in both genders, and trunk length and horizontal reach in women who achieved higher means in the short-paddle group. On the men's team, both groups differed significantly in body mass, longitudinal features, horizontal reach, hand width and body circumferences, while on the women's, they differed in body mass, width and length of the chest, body circumferences and fat content. The method of grouping used in this paper confirmed morphological differences in the competitors with regard to the type of racing boat.

  17. Anthropometric Typology of Male and Female Rowers Using K-Means Clustering

    PubMed Central

    Forjasz, Justyna

    2011-01-01

    The aim of this paper is to present the morphological features of rowers. The objective is to establish the type of body build best suited to the present requirements of this sports discipline through the determination of the most important morphological features in rowing with regard to the type of racing boat. The subjects of this study included competitors who practise rowing and were members of the Junior National Team. The considered variables included a group of 32 anthropometric measurements of body composition determined using the BIA method among male and female athletes, while also including rowing boat categories. In order to determine the analysed structures of male and female rowers, an observation analysis was taken into consideration and performed by the k-means clustering method. In the group of male and female rowers using long paddles, higher mean values for the analysed features were observed, with the exception of fat-free mass, and water content in both genders, and trunk length and horizontal reach in women who achieved higher means in the short-paddle group. On the men’s team, both groups differed significantly in body mass, longitudinal features, horizontal reach, hand width and body circumferences, while on the women’s, they differed in body mass, width and length of the chest, body circumferences and fat content. The method of grouping used in this paper confirmed morphological differences in the competitors with regard to the type of racing boat. PMID:23486287

  18. Classification of Two Class Motor Imagery Tasks Using Hybrid GA-PSO Based K-Means Clustering.

    PubMed

    Suraj; Tiwari, Purnendu; Ghosh, Subhojit; Sinha, Rakesh Kumar

    2015-01-01

    Transferring the brain computer interface (BCI) from laboratory conditions to real-world applications requires the BCI to be applied asynchronously without any time constraint. The high level of dynamism in the electroencephalogram (EEG) signal motivates us to look toward evolutionary algorithms (EA). Motivated by these two facts, in this work a hybrid GA-PSO based K-means clustering technique has been used to distinguish two-class motor imagery (MI) tasks. The proposed hybrid GA-PSO based K-means clustering is found to outperform genetic algorithm (GA) and particle swarm optimization (PSO) based K-means clustering techniques in terms of both accuracy and execution time. The shorter execution time of the hybrid GA-PSO technique makes it suitable for real-time BCI application. Time frequency representation (TFR) techniques have been used to extract the features of the signal under investigation. TFR-based features are extracted, and the feature vector is formed relying on the concept of event-related synchronization (ERS) and desynchronization (ERD).

  19. Classification of Two Class Motor Imagery Tasks Using Hybrid GA-PSO Based K-Means Clustering

    PubMed Central

    Suraj; Tiwari, Purnendu; Ghosh, Subhojit; Sinha, Rakesh Kumar

    2015-01-01

    Transferring the brain computer interface (BCI) from laboratory conditions to real-world applications requires the BCI to be applied asynchronously without any time constraint. The high level of dynamism in the electroencephalogram (EEG) signal motivates us to look toward evolutionary algorithms (EA). Motivated by these two facts, in this work a hybrid GA-PSO based K-means clustering technique has been used to distinguish two-class motor imagery (MI) tasks. The proposed hybrid GA-PSO based K-means clustering is found to outperform genetic algorithm (GA) and particle swarm optimization (PSO) based K-means clustering techniques in terms of both accuracy and execution time. The shorter execution time of the hybrid GA-PSO technique makes it suitable for real-time BCI application. Time frequency representation (TFR) techniques have been used to extract the features of the signal under investigation. TFR-based features are extracted, and the feature vector is formed relying on the concept of event-related synchronization (ERS) and desynchronization (ERD). PMID:25972896

  20. [Research on K-means clustering segmentation method for MRI brain image based on selecting multi-peaks in gray histogram].

    PubMed

    Chen, Zhaoxue; Yu, Haizhong; Chen, Hao

    2013-12-01

    To solve the problem that traditional K-means clustering selects its initial cluster centers randomly, we propose a new K-means segmentation algorithm based on robustly selecting the 'peaks' that stand for white matter, gray matter and cerebrospinal fluid in the multi-peak gray histogram of an MRI brain image. The new algorithm takes the gray values of the selected histogram 'peaks' as the initial K-means cluster centers and can segment the MRI brain image into three tissue types more effectively, accurately and stably. Extensive experiments have shown that the proposed algorithm can overcome many shortcomings of the traditional K-means clustering method, such as low efficiency, low accuracy, poor robustness and long running time. The histogram 'peak' selection idea of the proposed segmentation method is more widely applicable.
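
    The idea described above, seeding K-means with histogram peaks instead of random centers, can be sketched for one-dimensional gray values as follows. This is a simplified illustration under stated assumptions, not the authors' peak-selection procedure: a "peak" here is just a gray level that is more populated than its neighbouring occupied levels, and the function names and toy gray values are hypothetical.

```python
from collections import Counter


def histogram_peaks(values, k):
    """Return the k most populated local maxima of the gray-level histogram."""
    hist = Counter(values)
    levels = sorted(hist)
    peaks = []
    for idx, level in enumerate(levels):
        # Neighbours are the adjacent occupied gray levels (a simplification).
        left = hist[levels[idx - 1]] if idx > 0 else 0
        right = hist[levels[idx + 1]] if idx < len(levels) - 1 else 0
        if hist[level] > left and hist[level] >= right:
            peaks.append(level)
    peaks.sort(key=lambda level: hist[level], reverse=True)
    return sorted(peaks[:k])


def kmeans_1d(values, centers, iterations=20):
    """Plain 1-D K-means started from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


# Hypothetical gray values with three intensity modes.
pixels = ([10] * 5 + [11] * 9 + [12] * 4 + [50] * 3 + [51] * 8 + [52] * 2
          + [90] * 6 + [91] * 10 + [92] * 5)
seeds = histogram_peaks(pixels, 3)
print(seeds, kmeans_1d(pixels, seeds))
```

    Because the seeds already sit near the intensity modes, the iteration in this toy case settles after the first update; a random start would offer no such guarantee.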

  1. Sleep stages identification in patients with sleep disorder using k-means clustering

    NASA Astrophysics Data System (ADS)

    Fadhlullah, M. U.; Resahya, A.; Nugraha, D. F.; Yulita, I. N.

    2018-05-01

    Data mining is a computational intelligence discipline in which a large dataset is processed using a certain method to look for patterns within it. These patterns are then used for real-time applications or to develop new knowledge. It is a valuable tool for solving complex problems, discovering new knowledge, analyzing data, and making decisions. To extract the patterns that lie inside a large dataset, a clustering method is used. Clustering basically groups data that look similar so that a pattern can be seen in the dataset, and several algorithms exist for assigning the data to the corresponding clusters. This research used data from patients who suffer from sleep disorders and aims to help medical practitioners reduce the time required to classify the sleep stages of such patients. This study used the K-means algorithm and silhouette evaluation and found that 3 clusters are optimal for this dataset, which means that the data can be divided into 3 sleep stages.
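
    Silhouette evaluation, as mentioned above, scores a clustering by comparing each point's mean distance to its own cluster with its mean distance to the nearest other cluster; the cluster count with the highest mean score is usually preferred. The sketch below is a minimal standard-library version. The points and the two candidate labelings are invented for illustration and are assumed to stand in for the output of K-means runs with k = 2 and k = 3; they are not the sleep data from the study.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # By convention a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical 1-D data with three visible groups.
data = [(0,), (1,), (10,), (11,), (20,), (21,)]
labels_k2 = [0, 0, 0, 0, 1, 1]
labels_k3 = [0, 0, 1, 1, 2, 2]
print(silhouette_score(data, labels_k2), silhouette_score(data, labels_k3))
```

    On this toy input the three-cluster labeling scores higher than the two-cluster one, which is the kind of comparison used to pick the number of clusters.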

  2. Nucleus and cytoplasm segmentation in microscopic images using K-means clustering and region growing

    PubMed Central

    Sarrafzadeh, Omid; Dehnavi, Alireza Mehri

    2015-01-01

    Background: Segmentation of leukocytes acts as the foundation for all automated image-based hematological disease recognition systems. Most of the time, hematologists are interested in evaluation of white blood cells only. Digital image processing techniques can help them in their analysis and diagnosis. Materials and Methods: The main objective of this paper is to detect leukocytes from a blood smear microscopic image and segment them into their two dominant elements, nucleus and cytoplasm. The segmentation is conducted using two stages of applying K-means clustering. First, the nuclei are segmented using K-means clustering. Then, a proposed method based on region growing is applied to separate the connected nuclei. Next, the nuclei are subtracted from the original image. Finally, the cytoplasm is segmented using the second stage of K-means clustering. Results: The results indicate that the proposed method is able to extract the nucleus and cytoplasm regions accurately and works well even though there is no significant contrast between the components in the image. Conclusions: In this paper, a method based on K-means clustering and region growing is proposed in order to detect leukocytes from a blood smear microscopic image and segment their components, the nucleus and the cytoplasm. As the region growing step of the algorithm relies on edge information, it will not be able to separate connected nuclei accurately where edges are poor, and it requires at least a weak edge to exist between the nuclei. The nucleus and cytoplasm segments of a leukocyte can be used for feature extraction and classification, which leads to automated leukemia detection. PMID:26605213

  3. Prediction of chemotherapeutic response in bladder cancer using K-means clustering of dynamic contrast-enhanced (DCE)-MRI pharmacokinetic parameters.

    PubMed

    Nguyen, Huyen T; Jia, Guang; Shah, Zarine K; Pohar, Kamal; Mortazavi, Amir; Zynger, Debra L; Wei, Lai; Yang, Xiangyu; Clark, Daniel; Knopp, Michael V

    2015-05-01

    To apply k-means clustering of two pharmacokinetic parameters derived from 3T dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) to predict the chemotherapeutic response in bladder cancer at the mid-cycle timepoint. With the predetermined number of three clusters, k-means clustering was performed on nondimensionalized Amp and kep estimates of each bladder tumor. Three cluster volume fractions (VFs) were calculated for each tumor at baseline and mid-cycle. The changes of three cluster VFs from baseline to mid-cycle were correlated with the tumor's chemotherapeutic response. Receiver-operating-characteristics curve analysis was used to evaluate the performance of each cluster VF change as a biomarker of chemotherapeutic response in bladder cancer. The k-means clustering partitioned each bladder tumor into cluster 1 (low kep and low Amp), cluster 2 (low kep and high Amp), cluster 3 (high kep and low Amp). The changes of all three cluster VFs were found to be associated with bladder tumor response to chemotherapy. The VF change of cluster 2 presented with the highest area-under-the-curve value (0.96) and the highest sensitivity/specificity/accuracy (96%/100%/97%) with a selected cutoff value. The k-means clustering of the two DCE-MRI pharmacokinetic parameters can characterize the complex microcirculatory changes within a bladder tumor to enable early prediction of the tumor's chemotherapeutic response. © 2014 Wiley Periodicals, Inc.

  4. [Automatic Sleep Stage Classification Based on an Improved K-means Clustering Algorithm].

    PubMed

    Xiao, Shuyuan; Wang, Bei; Zhang, Jian; Zhang, Qunfeng; Zou, Junzhong

    2016-10-01

    Sleep stage scoring is a hotspot in the field of medicine and neuroscience. Visual inspection of sleep is laborious and the results may be subjective to different clinicians. An automatic sleep stage classification algorithm can be used to reduce the manual workload. However, there are still limitations when it encounters complicated and changeable clinical cases. The purpose of this paper is to develop an automatic sleep staging algorithm based on the characteristics of actual sleep data. In the proposed improved K-means clustering algorithm, points were selected as the initial centers by using a concept of density to avoid the randomness of the original K-means algorithm. Meanwhile, the cluster centers were updated according to the 'Three-Sigma Rule' during the iteration to abate the influence of the outliers. The proposed method was tested and analyzed on the overnight sleep data of healthy persons and patients with sleep disorders after continuous positive airway pressure (CPAP) treatment. The automatic sleep stage classification results were compared with the visual inspection by qualified clinicians and the averaged accuracy reached 76%. With the analysis of morphological diversity of sleep data, it was proved that the proposed improved K-means algorithm was feasible and valid for clinical practice.
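
    The abstract does not spell out how the 'Three-Sigma Rule' enters the center update. One plausible reading, shown here purely as an assumption, is that members lying more than three standard deviations beyond the mean member-to-center distance are left out when the center is recomputed. The function name and the toy cluster are hypothetical.

```python
import math
import statistics


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def three_sigma_center(points):
    """Recompute a cluster center while ignoring three-sigma outliers.

    Assumed interpretation: a member is an outlier when its distance to the
    ordinary mean exceeds (mean distance + 3 * standard deviation of distances).
    """
    center = mean_point(points)
    distances = [math.dist(p, center) for p in points]
    limit = statistics.mean(distances) + 3 * statistics.pstdev(distances)
    kept = [p for p, d in zip(points, distances) if d <= limit]
    return mean_point(kept)


# Hypothetical 1-D cluster: nineteen values near 0 and one extreme value.
cluster = [(0.0,)] * 10 + [(1.0,)] * 9 + [(100.0,)]
print(mean_point(cluster), three_sigma_center(cluster))
```

    With the extreme value excluded, the updated center stays inside the bulk of the cluster instead of being dragged toward the outlier. Note that a single outlier can only exceed a three-sigma limit when the cluster has enough members, so small clusters are effectively updated with the ordinary mean.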

  5. Elastic K-means using posterior probability

    PubMed Central

    Zheng, Aihua; Jiang, Bo; Li, Yan; Zhang, Xuehan; Ding, Chris

    2017-01-01

    The widely used K-means clustering is a hard clustering algorithm. Here we propose an Elastic K-means clustering model (EKM) that uses posterior probability to provide soft assignment, so that each data point can belong to multiple clusters fractionally, and we show the benefit of the proposed Elastic K-means. Furthermore, in many applications, besides vector attribute information, pairwise relations (graph information) are also available. Thus we integrate EKM with Normalized Cut graph clustering into a single clustering formulation. Finally, we provide several matrix inequalities that are useful for matrix formulations of learning models. Based on these results, we prove the correctness and the convergence of the EKM algorithms. Experimental results on six benchmark datasets demonstrate the effectiveness of the proposed EKM and its integrated model. PMID:29240756

  6. Elastic K-means using posterior probability.

    PubMed

    Zheng, Aihua; Jiang, Bo; Li, Yan; Zhang, Xuehan; Ding, Chris

    2017-01-01

    The widely used K-means clustering is a hard clustering algorithm. Here we propose an Elastic K-means clustering model (EKM) that uses posterior probability to provide soft assignment, so that each data point can belong to multiple clusters fractionally, and we show the benefit of the proposed Elastic K-means. Furthermore, in many applications, besides vector attribute information, pairwise relations (graph information) are also available. Thus we integrate EKM with Normalized Cut graph clustering into a single clustering formulation. Finally, we provide several matrix inequalities that are useful for matrix formulations of learning models. Based on these results, we prove the correctness and the convergence of the EKM algorithms. Experimental results on six benchmark datasets demonstrate the effectiveness of the proposed EKM and its integrated model.

  7. Utility of the k-means clustering algorithm in differentiating apparent diffusion coefficient values of benign and malignant neck pathologies.

    PubMed

    Srinivasan, A; Galbán, C J; Johnson, T D; Chenevert, T L; Ross, B D; Mukherji, S K

    2010-04-01

    Does the K-means algorithm do a better job of differentiating benign and malignant neck pathologies compared to only mean ADC? The objective of our study was to analyze the differences between ADC partitions to evaluate whether the K-means technique can be of additional benefit to whole-lesion mean ADC alone in distinguishing benign and malignant neck pathologies. MR imaging studies of 10 benign and 10 malignant proved neck pathologies were postprocessed on a PC by using in-house software developed in Matlab. Two neuroradiologists manually contoured the lesions, with the ADC values within each lesion clustered into 2 (low, ADC-ADC(L); high, ADC-ADC(H)) and 3 partitions (ADC(L); intermediate, ADC-ADC(I); ADC(H)) by using the K-means clustering algorithm. An unpaired 2-tailed Student t test was performed for all metrics to determine statistical differences in the means of the benign and malignant pathologies. A statistically significant difference between the mean ADC(L) clusters in benign and malignant pathologies was seen in the 3-cluster models of both readers (P = .03 and .022, respectively) and the 2-cluster model of reader 2 (P = .04), with the other metrics (ADC(H), ADC(I); whole-lesion mean ADC) not revealing any significant differences. ROC curves demonstrated the quantitative differences in mean ADC(H) and ADC(L) in both the 2- and 3-cluster models to be predictive of malignancy (2 clusters: P = .008, area under curve = 0.850; 3 clusters: P = .01, area under curve = 0.825). The K-means clustering algorithm that generates partitions of large datasets may provide a better characterization of neck pathologies and may be of additional benefit in distinguishing benign and malignant neck pathologies compared with whole-lesion mean ADC alone.

  8. Utility of K-Means clustering algorithm in differentiating apparent diffusion coefficient values between benign and malignant neck pathologies

    PubMed Central

    Srinivasan, A.; Galbán, C.J.; Johnson, T.D.; Chenevert, T.L.; Ross, B.D.; Mukherji, S.K.

    2014-01-01

    Purpose The objective of our study was to analyze the differences between apparent diffusion coefficient (ADC) partitions (created using the K-Means algorithm) between benign and malignant neck lesions and evaluate its benefit in distinguishing these entities. Material and methods MRI studies of 10 benign and 10 malignant proven neck pathologies were post-processed on a PC using in-house software developed in MATLAB (The MathWorks, Inc., Natick, MA). Lesions were manually contoured by two neuroradiologists with the ADC values within each lesion clustered into two (low ADC-ADCL, high ADC-ADCH) and three partitions (ADCL, intermediate ADC-ADCI, ADCH) using the K-Means clustering algorithm. An unpaired two-tailed Student’s t-test was performed for all metrics to determine statistical differences in the means between the benign and malignant pathologies. Results Statistically significant difference between the mean ADCL clusters in benign and malignant pathologies was seen in the 3 cluster models of both readers (p=0.03, 0.022 respectively) and the 2 cluster model of reader 2 (p=0.04) with the other metrics (ADCH, ADCI, whole lesion mean ADC) not revealing any significant differences. Receiver operating characteristics curves demonstrated the quantitative difference in mean ADCH and ADCL in both the 2 and 3 cluster models to be predictive of malignancy (2 clusters: p=0.008, area under curve=0.850, 3 clusters: p=0.01, area under curve=0.825). Conclusion The K-Means clustering algorithm that generates partitions of large datasets may provide a better characterization of neck pathologies and may be of additional benefit in distinguishing benign and malignant neck pathologies compared to whole lesion mean ADC alone. PMID:20007723

  9. An improved initialization center k-means clustering algorithm based on distance and density

    NASA Astrophysics Data System (ADS)

    Duan, Yanling; Liu, Qun; Xia, Shuyin

    2018-04-01

    To address the problem that the random initial cluster centers of the k-means algorithm make the clustering results sensitive to outlier samples and unstable across repeated runs, a center initialization method based on larger distance and higher density is proposed. The reciprocal of the weighted average of distance is used to represent the sample density, and the data samples with larger distance and higher density are selected as the initial cluster centers to optimize the clustering results. Then, a clustering evaluation method based on distance and density is designed to verify the feasibility and practicality of the algorithm. The experimental results on UCI data sets show that the algorithm has a certain stability and practicality.

  10. Predicting the random drift of MEMS gyroscope based on K-means clustering and OLS RBF Neural Network

    NASA Astrophysics Data System (ADS)

    Wang, Zhen-yu; Zhang, Li-jie

    2017-10-01

    Measurement error of a sensor can be effectively compensated with prediction. Aiming at the large random drift error of the MEMS (Micro Electro Mechanical System) gyroscope, an improved learning algorithm for a Radial Basis Function (RBF) Neural Network (NN) based on K-means clustering and Orthogonal Least-Squares (OLS) is proposed in this paper. The algorithm first selects typical samples as the initial cluster centers of the RBF NN, then obtains candidate centers with the K-means algorithm, and finally optimizes the candidate centers with the OLS algorithm, which makes the network structure simpler and the prediction performance better. Experimental results show that the proposed K-means clustering OLS learning algorithm can predict the random drift of the MEMS gyroscope effectively, with a prediction error of 9.8019e-007°/s and a prediction time of 2.4169e-006 s.

  11. Reducing the Time Requirement of k-Means Algorithm

    PubMed Central

    Osamor, Victor Chukwudi; Adebiyi, Ezekiel Femi; Oyelade, Jelilli Olarenwaju; Doumbia, Seydou

    2012-01-01

    Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARI_HA). We found that when k is close to d, the quality is good (ARI_HA > 0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI_HA > 0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data. PMID:23239974

  12. Reducing the time requirement of k-means algorithm.

    PubMed

    Osamor, Victor Chukwudi; Adebiyi, Ezekiel Femi; Oyelade, Jelilli Olarenwaju; Doumbia, Seydou

    2012-01-01

    Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R(d) and an integer k. The problem is to determine a set of k points in R(d), called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARI(HA)). We found that when k is close to d, the quality is good (ARI(HA)>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI(HA)>0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data.
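
    The Hubert-Arabie adjusted Rand index used in this abstract to score recovered clusters against a known structure can be computed from pair counts alone. The following is a minimal sketch of the standard formula; the function name and the handling of the degenerate case are assumptions, and it is not taken from the authors' code.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    # Pairs of items placed together in both partitions (contingency-table cells).
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each partition separately (row and column totals).
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2          # largest attainable agreement
    if maximum == expected:
        return 1.0  # assumption: treat degenerate (trivial) partitions as identical
    return (pairs_both - expected) / (maximum - expected)
```

    The index is 1 for identical partitions (label names do not matter), close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.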

  13. Performance Analysis of Combined Methods of Genetic Algorithm and K-Means Clustering in Determining the Value of Centroid

    NASA Astrophysics Data System (ADS)

    Adya Zizwan, Putra; Zarlis, Muhammad; Budhiarti Nababan, Erna

    2017-12-01

    The determination of Centroid on K-Means Algorithm directly affects the quality of the clustering results. Determination of centroid by using random numbers has many weaknesses. The GenClust algorithm that combines the use of Genetic Algorithms and K-Means uses a genetic algorithm to determine the centroid of each cluster. The use of the GenClust algorithm uses 50% chromosomes obtained through deterministic calculations and 50% is obtained from the generation of random numbers. This study will modify the use of the GenClust algorithm in which the chromosomes used are 100% obtained through deterministic calculations. The results of this study resulted in performance comparisons expressed in Mean Square Error influenced by centroid determination on K-Means method by using GenClust method, modified GenClust method and also classic K-Means.

  14. Automated spike sorting algorithm based on Laplacian eigenmaps and k-means clustering.

    PubMed

    Chah, E; Hok, V; Della-Chiesa, A; Miller, J J H; O'Mara, S M; Reilly, R B

    2011-02-01

    This study presents a new automatic spike sorting method based on feature extraction by Laplacian eigenmaps combined with k-means clustering. The performance of the proposed method was compared against previously reported algorithms such as principal component analysis (PCA) and amplitude-based feature extraction. Two types of classifier (namely k-means and classification expectation-maximization) were incorporated within the spike sorting algorithms, in order to find a suitable classifier for the feature sets. Simulated data sets and in-vivo tetrode multichannel recordings were employed to assess the performance of the spike sorting algorithms. The results show that the proposed algorithm yields significantly improved performance with mean sorting accuracy of 73% and sorting error of 10% compared to PCA which combined with k-means had a sorting accuracy of 58% and sorting error of 10%. A correction was made to this article on 22 February 2011. The spacing of the title was amended on the abstract page. No changes were made to the article PDF and the print version was unaffected.

  15. An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data.

    PubMed

    Nidheesh, N; Abdul Nazeer, K A; Ameer, P M

    2017-12-01

    Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids. We propose an improved, density based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select data points which belong to dense regions and which are adequately separated in feature space as the initial centroids. We compared the proposed algorithm to a set of eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm which is being used for cancer data classification, based on the performances on a set of datasets comprising ten cancer gene expression datasets. The proposed algorithm has shown better overall performance than the others. There is a pressing need in the Biomedical domain for simple, easy-to-use and more accurate Machine Learning tools for cancer subtype prediction. The proposed algorithm is simple, easy-to-use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data. Copyright © 2017 Elsevier Ltd. All rights reserved.

  16. Order-Constrained Solutions in K-Means Clustering: Even Better than Being Globally Optimal

    ERIC Educational Resources Information Center

    Steinley, Douglas; Hubert, Lawrence

    2008-01-01

    This paper proposes an order-constrained K-means cluster analysis strategy, and implements that strategy through an auxiliary quadratic assignment optimization heuristic that identifies an initial object order. A subsequent dynamic programming recursion is applied to optimally subdivide the object set subject to the order constraint. We show that…

  17. Enhanced K-means clustering with encryption on cloud

    NASA Astrophysics Data System (ADS)

    Singh, Iqjot; Dwivedi, Prerna; Gupta, Taru; Shynu, P. G.

    2017-11-01

    This paper tries to solve the problem of storing and managing big files over the cloud by implementing hashing on Hadoop in big data, and to ensure security while uploading and downloading files. Cloud computing is a term that emphasizes sharing data and facilitates sharing of infrastructure and resources.[10] Hadoop is open source software that lets us store and manage big files on the cloud according to our needs. The K-means clustering algorithm is an algorithm used to calculate the distance between the centroid of the cluster and the data points. Hashing is an algorithm in which we store and retrieve data with hash keys. The hashing algorithm is called a hash function, which is used to portray the original data and later to fetch the data stored at a specific key. [17] Encryption is a process that transforms electronic data into a non-readable form known as cipher text. Decryption is the opposite of encryption: it transforms the cipher text into plain text that the end user can read and understand. For encryption and decryption we use a symmetric key cryptographic algorithm. In symmetric key cryptography we use the DES algorithm for secure storage of the files. [3

  18. A K-means multivariate approach for clustering independent components from magnetoencephalographic data.

    PubMed

    Spadone, Sara; de Pasquale, Francesco; Mantini, Dante; Della Penna, Stefania

    2012-09-01

    Independent component analysis (ICA) is typically applied on functional magnetic resonance imaging, electroencephalographic and magnetoencephalographic (MEG) data due to its data-driven nature. In these applications, ICA needs to be extended from single to multi-session and multi-subject studies for interpreting and assigning a statistical significance at the group level. Here a novel strategy for analyzing MEG independent components (ICs) is presented, Multivariate Algorithm for Grouping MEG Independent Components K-means based (MAGMICK). The proposed approach is able to capture spatio-temporal dynamics of brain activity in MEG studies by running ICA at subject level and then clustering the ICs across sessions and subjects. Distinctive features of MAGMICK are: i) the implementation of an efficient set of "MEG fingerprints" designed to summarize properties of MEG ICs as they are built on spatial, temporal and spectral parameters; ii) the implementation of a modified version of the standard K-means procedure to improve its data-driven character. This algorithm groups the obtained ICs automatically estimating the number of clusters through an adaptive weighting of the parameters and a constraint on the ICs independence, i.e. components coming from the same session (at subject level) or subject (at group level) cannot be grouped together. The performances of MAGMICK are illustrated by analyzing two sets of MEG data obtained during a finger tapping task and median nerve stimulation. The results demonstrate that the method can extract consistent patterns of spatial topography and spectral properties across sessions and subjects that are in good agreement with the literature. In addition, these results are compared to those from a modified version of affinity propagation clustering method. The comparison, evaluated in terms of different clustering validity indices, shows that our methodology often outperforms the clustering algorithm. Eventually, these results are

  19. Automatic video shot boundary detection using k-means clustering and improved adaptive dual threshold comparison

    NASA Astrophysics Data System (ADS)

    Sa, Qila; Wang, Zhihui

    2018-03-01

    At present, content-based video retrieval (CBVR) is the most mainstream video retrieval method; it uses a video's own features to perform automatic identification and retrieval. This method involves a key technology, namely shot segmentation. In this paper, a method of automatic video shot boundary detection with K-means clustering and improved adaptive dual threshold comparison is proposed. First, the visual features of every frame are extracted and divided into two categories using the K-means clustering algorithm, namely, one with significant change and one with no significant change. Then, based on the classification results, the improved adaptive dual threshold comparison method is used to determine the abrupt as well as gradual shot boundaries. Finally, an automatic video shot boundary detection system is achieved.

  20. Effects of Group Size and Lack of Sphericity on the Recovery of Clusters in K-means Cluster Analysis.

    PubMed

    Craen, Saskia de; Commandeur, Jacques J F; Frank, Laurence E; Heiser, Willem J

    2006-06-01

    K-means cluster analysis is known for its tendency to produce spherical and equally sized clusters. To assess the magnitude of these effects, a simulation study was conducted, in which populations were created with varying departures from sphericity and group sizes. An analysis of the recovery of clusters in the samples taken from these populations showed a significant effect of lack of sphericity and group size. This effect was, however, not as large as expected, with still a recovery index of more than 0.5 in the "worst case scenario." An interaction effect between the two data aspects was also found. The decreasing trend in the recovery of clusters for increasing departures from sphericity is different for equal and unequal group sizes.

  1. A Parametric k-Means Algorithm

    PubMed Central

    Tarpey, Thaddeus

    2007-01-01

    Summary The k points that optimally represent a distribution (usually in terms of a squared error loss) are called the k principal points. This paper presents a computationally intensive method that automatically determines the principal points of a parametric distribution. Cluster means from the k-means algorithm are nonparametric estimators of principal points. A parametric k-means approach is introduced for estimating principal points by running the k-means algorithm on a very large simulated data set from a distribution whose parameters are estimated using maximum likelihood. Theoretical and simulation results are presented comparing the parametric k-means algorithm to the usual k-means algorithm and an example on determining sizes of gas masks is used to illustrate the parametric k-means algorithm. PMID:17917692
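
    The parametric k-means idea in this abstract (estimate the distribution's parameters by maximum likelihood, simulate a very large sample from the fitted distribution, then run k-means on the simulated sample) can be sketched for the simplest case. The example below assumes a univariate normal model; the function names, the quantile-based starting centers, and the modest default simulation size are illustrative assumptions and not the author's implementation.

```python
import math
import random

def kmeans_1d(values, k, iterations=100):
    """Plain k-means on numbers, started from evenly spaced sample quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iterations):
        sums = [0.0] * k
        counts = [0] * k
        for v in ordered:
            j = min(range(k), key=lambda c: (v - centers[c]) ** 2)
            sums[j] += v
            counts[j] += 1
        new_centers = [sums[j] / counts[j] if counts[j] else centers[j] for j in range(k)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

def parametric_kmeans_normal(data, k, n_sim=20000, seed=0):
    """Estimate k principal points of a normal model fitted to `data`."""
    # Maximum likelihood estimates for a normal distribution
    # (the variance estimate divides by n, not n - 1).
    mu = sum(data) / len(data)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))
    # Simulate from the fitted distribution; a real application would use
    # a much larger n_sim than this illustrative default.
    rng = random.Random(seed)
    simulated = [rng.gauss(mu, sigma) for _ in range(n_sim)]
    return kmeans_1d(simulated, k)
```

    For a normal distribution with k = 2, the two principal points are known in closed form, mu ± sigma·sqrt(2/π), which gives a convenient check on the simulated estimate.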

  2. Comparison of five cluster validity indices performance in brain [18F]FET-PET image segmentation using k-means.

    PubMed

    Abualhaj, Bedor; Weng, Guoyang; Ong, Melissa; Attarwala, Ali Asgar; Molina, Flavia; Büsing, Karen; Glatting, Gerhard

    2017-01-01

    Dynamic [18F]fluoro-ethyl-L-tyrosine positron emission tomography ([18F]FET-PET) is used to identify tumor lesions for radiotherapy treatment planning, to differentiate glioma recurrence from radiation necrosis and to classify gliomas grading. To segment different regions in the brain k-means cluster analysis can be used. The main disadvantage of k-means is that the number of clusters must be pre-defined. In this study, we therefore compared different cluster validity indices for automated and reproducible determination of the optimal number of clusters based on the dynamic PET data. The k-means algorithm was applied to dynamic [18F]FET-PET images of 8 patients. Akaike information criterion (AIC), WB, I, modified Dunn's and Silhouette indices were compared on their ability to determine the optimal number of clusters based on requirements for an adequate cluster validity index. To check the reproducibility of k-means, the coefficients of variation (CVs) of the objective function values (OFVs; sum of squared Euclidean distances within each cluster) were calculated using 100 random centroid initialization replications (RCI100) for 2 to 50 clusters. k-means was performed independently on three neighboring slices containing tumor for each patient to investigate the stability of the optimal number of clusters within them. To check the independence of the validity indices on the number of voxels, cluster analysis was applied after duplication of a slice selected from each patient. CVs of index values were calculated at the optimal number of clusters using RCI100 to investigate the reproducibility of the validity indices. To check if the indices have a single extremum, visual inspection was performed on the replication with minimum OFV from RCI100. The maximum CV of OFVs was 2.7 × 10^-2 from all patients. The optimal number of clusters given by modified Dunn's and Silhouette indices was 2 or 3 leading to a very poor segmentation. WB and I indices suggested in

  3. Cluster Analysis of Indonesian Province Based on Household Primary Cooking Fuel Using K-Means

    NASA Astrophysics Data System (ADS)

    Huda, S. N.

    2017-03-01

    Every household has some installation for cooking. Kerosene, which is refined from petroleum, once dominated as the primary cooking fuel in Indonesia, although it is expensive and inefficient. Other households use LPG as their primary cooking fuel; however, LPG supply is also limited. In addition, the very diverse environments and cultures in Indonesia have led to diverse cooking installations, such as the wood-burning stove brazier. The government is also promoting alternative fuels, such as charcoal briquettes and fuel from biomass. The use of other fuels is part of an energy diversification that is expected to reduce community dependence on petroleum-based fuels. The fuels used for cooking vary from one region to another, reflecting the distribution of basic fuel use by households. By knowing the characteristics of each province, the government can adopt policies appropriate to each province. A cluster analysis of all provinces in Indonesia based on the type of primary household cooking fuel would therefore be valuable. Cluster analysis is done using the K-Means method with K ranging from 2 to 5. Cluster results are validated using the Silhouette Coefficient (SC). The results show that the highest SC is achieved at K = 2, with an SC value of 0.39135818388151. The two clusters divide the provinces of Indonesia into a cluster of more traditional provinces and a cluster of more modern provinces. The cluster results are then shown on a map using Google Map API.
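
    The validation step described in this abstract, scoring each candidate K by its Silhouette Coefficient and keeping the K with the highest value, relies on the following standard computation. This is a minimal sketch assuming Euclidean distance on coordinate tuples; the function name and the conventions for singleton clusters and coincident points are assumptions, and the study's own data and code are not reproduced here.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points for one clustering."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

    To choose K as in the abstract, one would cluster the data once for each K from 2 to 5, call `silhouette_coefficient` on each labeling, and keep the K with the largest value. Values near 1 indicate compact, well-separated clusters, so a maximum of about 0.39 suggests only moderate separation.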

  4. Clustering Educational Digital Library Usage Data: A Comparison of Latent Class Analysis and K-Means Algorithms

    ERIC Educational Resources Information Center

    Xu, Beijie; Recker, Mimi; Qi, Xiaojun; Flann, Nicholas; Ye, Lei

    2013-01-01

    This article examines clustering as an educational data mining method. In particular, two clustering algorithms, the widely used K-means and the model-based Latent Class Analysis, are compared, using usage data from an educational digital library service, the Instructional Architect (IA.usu.edu). Using a multi-faceted approach and multiple data…

  5. Vertebra identification using template matching modelmp and K-means clustering.

    PubMed

    Larhmam, Mohamed Amine; Benjelloun, Mohammed; Mahmoudi, Saïd

    2014-03-01

    Accurate vertebra detection and segmentation are essential steps for automating the diagnosis of spinal disorders. This study is dedicated to vertebra alignment measurement, the first step in a computer-aided diagnosis tool for cervical spine trauma. Automated vertebral segment alignment determination is a challenging task due to low contrast imaging and noise. A software tool for segmenting vertebrae and detecting subluxations has clinical significance. A robust method was developed and tested for cervical vertebra identification and segmentation that extracts parameters used for vertebra alignment measurement. Our contribution involves a novel combination of a template matching method and an unsupervised clustering algorithm. In this method, we build a geometric vertebra mean model. To achieve vertebra detection, manual selection of the region of interest is performed initially on the input image. Subsequent preprocessing is done to enhance image contrast and detect edges. Candidate vertebra localization is then carried out by using a modified generalized Hough transform (GHT). Next, an adapted cost function is used to compute local voted centers and filter boundary data. Thereafter, a K-means clustering algorithm is applied to obtain clusters distribution corresponding to the targeted vertebrae. These clusters are combined with the vote parameters to detect vertebra centers. Rigid segmentation is then carried out by using GHT parameters. Finally, cervical spine curves are extracted to measure vertebra alignment. The proposed approach was successfully applied to a set of 66 high-resolution X-ray images. Robust detection was achieved in 97.5 % of the 330 tested cervical vertebrae. An automated vertebral identification method was developed and demonstrated to be robust to noise and occlusion. This work presents a first step toward an automated computer-aided diagnosis system for cervical spine trauma detection.

  6. Surface EMG decomposition based on K-means clustering and convolution kernel compensation.

    PubMed

    Ning, Yong; Zhu, Xiangjun; Zhu, Shanan; Zhang, Yingchun

    2015-03-01

    A new approach has been developed by combining the K-mean clustering (KMC) method and a modified convolution kernel compensation (CKC) method for multichannel surface electromyogram (EMG) decomposition. The KMC method was first utilized to cluster vectors of observations at different time instants and then estimate the initial innervation pulse train (IPT). The CKC method, modified with a novel multistep iterative process, was conducted to update the estimated IPT. The performance of the proposed K-means clustering-Modified CKC (KmCKC) approach was evaluated by reconstructing IPTs from both simulated and experimental surface EMG signals. The KmCKC approach successfully reconstructed all 10 IPTs from the simulated surface EMG signals with true positive rates (TPR) of over 90% with a low signal-to-noise ratio (SNR) of -10 dB. More than 10 motor units were also successfully extracted from the 64-channel experimental surface EMG signals of the first dorsal interosseous (FDI) muscles when a contraction force was held at 8 N by using the KmCKC approach. A "two-source" test was further conducted with 64-channel surface EMG signals. The high percentage of common MUs and common pulses (over 92% at all force levels) between the IPTs reconstructed from the two independent groups of surface EMG signals demonstrates the reliability and capability of the proposed KmCKC approach in multichannel surface EMG decomposition. Results from both simulated and experimental data are consistent and confirm that the proposed KmCKC approach can successfully reconstruct IPTs with high accuracy at different levels of contraction.

  7. Are judgments a form of data clustering? Reexamining contrast effects with the k-means algorithm.

    PubMed

    Boillaud, Eric; Molina, Guylaine

    2015-04-01

    A number of theories have been proposed to explain in precise mathematical terms how statistical parameters and sequential properties of stimulus distributions affect category ratings. Various contextual factors such as the mean, the midrange, and the median of the stimuli; the stimulus range; the percentile rank of each stimulus; and the order of appearance have been assumed to influence judgmental contrast. A data clustering reinterpretation of judgmental relativity is offered wherein the influence of the initial choice of centroids on judgmental contrast involves 2 combined frequency and consistency tendencies. Accounts of the k-means algorithm are provided, showing good agreement with effects observed on multiple distribution shapes and with a variety of interaction effects relating to the number of stimuli, the number of response categories, and the method of skewing. Experiment 1 demonstrates that centroid initialization accounts for contrast effects obtained with stretched distributions. Experiment 2 demonstrates that the iterative convergence inherent to the k-means algorithm accounts for the contrast reduction observed across repeated blocks of trials. The concept of within-cluster variance minimization is discussed, as is the applicability of a backward k-means calculation method for inferring, from empirical data, the values of the centroids that would serve as a representation of the judgmental context. (c) 2015 APA, all rights reserved.

  8. Stroke localization and classification using microwave tomography with k-means clustering and support vector machine.

    PubMed

    Guo, Lei; Abbosh, Amin

    2018-05-01

    For any chance for stroke patients to survive, the stroke type should be classified to enable giving medication within a few hours of the onset of symptoms. In this paper, a microwave-based stroke localization and classification framework is proposed. It is based on microwave tomography, k-means clustering, and a support vector machine (SVM) method. The dielectric profile of the brain is first calculated using the Born iterative method, whereas the amplitude of the dielectric profile is then taken as the input to k-means clustering. The cluster is selected as the feature vector for constructing and testing the SVM. A database of MRI-derived realistic head phantoms at different signal-to-noise ratios is used in the classification procedure. The performance of the proposed framework is evaluated using the receiver operating characteristic (ROC) curve. The results based on a two-dimensional framework show that 88% classification accuracy, with a sensitivity of 91% and a specificity of 87%, can be achieved. Bioelectromagnetics. 39:312-324, 2018. © 2018 Wiley Periodicals, Inc.

  9. Text grouping in patent analysis using adaptive K-means clustering algorithm

    NASA Astrophysics Data System (ADS)

    Shanie, Tiara; Suprijadi, Jadi; Zulhanif

    2017-03-01

    Patents are one form of intellectual property. Analyzing patents is one requirement for understanding the development of technology in each country and in the world today. This study uses patent documents about green tea obtained from the Espacenet server. Patent documents related to technology in the field of tea are still widely scattered, which makes information retrieval (IR) difficult for users. It is therefore necessary to categorize documents into specific groups according to the related terms they contain. This study applies statistical text mining to the title text of green tea patents in two phases: data preparation and data analysis. The data preparation phase uses text mining methods, and the data analysis phase uses statistics. The statistical analysis in this study uses a cluster analysis algorithm, the Adaptive K-Means Clustering Algorithm. Based on the maximum Silhouette value, the analysis generated 87 clusters associated with fifteen terms, which can be used in information retrieval.

  10. The cascaded moving k-means and fuzzy c-means clustering algorithms for unsupervised segmentation of malaria images

    NASA Astrophysics Data System (ADS)

    Abdul-Nasir, Aimi Salihah; Mashor, Mohd Yusoff; Halim, Nurul Hazwani Abd; Mohamed, Zeehaida

    2015-05-01

    Malaria is a life-threatening parasitic infectious disease that accounts for nearly one million deaths each year. Due to the requirement of prompt and accurate diagnosis of malaria, the current study has proposed an unsupervised pixel segmentation based on a clustering algorithm in order to obtain the fully segmented red blood cells (RBCs) infected with malaria parasites based on thin blood smear images of the P. vivax species. In order to obtain the segmented infected cell, the malaria images are first enhanced by using a modified global contrast stretching technique. Then, an unsupervised segmentation technique based on a clustering algorithm is applied to the intensity component of the malaria image in order to segment the infected cell from its blood cell background. In this study, cascaded moving k-means (MKM) and fuzzy c-means (FCM) clustering algorithms have been proposed for malaria slide image segmentation. After that, a median filter algorithm is applied to smooth the image as well as to remove any unwanted regions such as small background pixels from the image. Finally, a seeded region growing area extraction algorithm is applied in order to remove large unwanted regions that still appear in the image because they are too large to be cleaned by the median filter. The effectiveness of the proposed cascaded MKM and FCM clustering algorithms has been analyzed qualitatively and quantitatively by comparing the proposed cascaded clustering algorithm with the MKM and FCM clustering algorithms. Overall, the results indicate that segmentation using the proposed cascaded clustering algorithm has produced the best segmentation performance by achieving acceptable sensitivity as well as high specificity and accuracy values compared to the segmentation results provided by the MKM and FCM algorithms.

  11. Extending the Functionality of Behavioural Change-Point Analysis with k-Means Clustering: A Case Study with the Little Penguin (Eudyptula minor)

    PubMed Central

    Zhang, Jingjing; Dennis, Todd E.

    2015-01-01

    We present a simple framework for classifying mutually exclusive behavioural states within the geospatial lifelines of animals. This method involves use of three sequentially applied statistical procedures: (1) behavioural change point analysis to partition movement trajectories into discrete bouts of same-state behaviours, based on abrupt changes in the spatio-temporal autocorrelation structure of movement parameters; (2) hierarchical multivariate cluster analysis to determine the number of different behavioural states; and (3) k-means clustering to classify inferred bouts of same-state location observations into behavioural modes. We demonstrate application of the method by analysing synthetic trajectories of known ‘artificial behaviours’ comprised of different correlated random walks, as well as real foraging trajectories of little penguins (Eudyptula minor) obtained by global-positioning-system telemetry. Our results show that the modelling procedure correctly classified 92.5% of all individual location observations in the synthetic trajectories, demonstrating reasonable ability to successfully discriminate behavioural modes. Most individual little penguins were found to exhibit three unique behavioural states (resting, commuting/active searching, area-restricted foraging), with variation in the timing and locations of observations apparently related to ambient light, bathymetry, and proximity to coastlines and river mouths. Addition of k-means clustering extends the utility of behavioural change point analysis, by providing a simple means through which the behaviours inferred for the location observations comprising individual movement trajectories can be objectively classified. PMID:25922935

  12. Extending the Functionality of Behavioural Change-Point Analysis with k-Means Clustering: A Case Study with the Little Penguin (Eudyptula minor).

    PubMed

    Zhang, Jingjing; O'Reilly, Kathleen M; Perry, George L W; Taylor, Graeme A; Dennis, Todd E

    2015-01-01

    We present a simple framework for classifying mutually exclusive behavioural states within the geospatial lifelines of animals. This method involves use of three sequentially applied statistical procedures: (1) behavioural change point analysis to partition movement trajectories into discrete bouts of same-state behaviours, based on abrupt changes in the spatio-temporal autocorrelation structure of movement parameters; (2) hierarchical multivariate cluster analysis to determine the number of different behavioural states; and (3) k-means clustering to classify inferred bouts of same-state location observations into behavioural modes. We demonstrate application of the method by analysing synthetic trajectories of known 'artificial behaviours' comprised of different correlated random walks, as well as real foraging trajectories of little penguins (Eudyptula minor) obtained by global-positioning-system telemetry. Our results show that the modelling procedure correctly classified 92.5% of all individual location observations in the synthetic trajectories, demonstrating reasonable ability to successfully discriminate behavioural modes. Most individual little penguins were found to exhibit three unique behavioural states (resting, commuting/active searching, area-restricted foraging), with variation in the timing and locations of observations apparently related to ambient light, bathymetry, and proximity to coastlines and river mouths. Addition of k-means clustering extends the utility of behavioural change point analysis, by providing a simple means through which the behaviours inferred for the location observations comprising individual movement trajectories can be objectively classified.

  13. Developing cluster strategy of apples dodol SMEs by integration K-means clustering and analytical hierarchy process method

    NASA Astrophysics Data System (ADS)

    Mustaniroh, S. A.; Effendi, U.; Silalahi, R. L. R.; Sari, T.; Ala, M.

    2018-03-01

    The purposes of this research were to determine the grouping of apples dodol small and medium enterprises (SMEs) in Batu City and to determine an appropriate development strategy for each cluster. The method used for clustering the SMEs was k-means. The Analytical Hierarchy Process (AHP) approach was then applied to determine the development strategy priority for each cluster. The variables used in grouping include production capacity per month, length of operation, investment value, average sales revenue per month, amount of SME assets, and the number of workers. The factors considered in AHP include industry cluster, government, and related and supporting industries. Data were collected using questionnaires and interviews. SME respondents were selected among apples dodol SMEs in Batu City using purposive sampling. The results showed that two clusters were formed from five apples dodol SMEs. The 1st cluster of apples dodol SMEs, classified as small enterprises, included SME A, SME C, and SME D. The 2nd cluster of apples dodol SMEs, classified as medium enterprises, consisted of SME B and SME E. The AHP results indicated that the priority development strategy for the 1st cluster of apples dodol SMEs was improving quality and product standardisation, while for the 2nd cluster it was increasing marketing access.

  14. Identify High-Quality Protein Structural Models by Enhanced K-Means.

    PubMed

    Wu, Hongjie; Li, Haiou; Jiang, Min; Chen, Cheng; Lv, Qiang; Wu, Chuang

    2017-01-01

    Background. One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance is dependent upon different conformational decoys, and, for some algorithms, the accuracy declines when the decoy population increases. Results. Here, we proposed two enhanced K-means clustering algorithms capable of robustly identifying high-quality protein structural models. The first one employs the clustering algorithm SPICKER to determine the initial centroids for basic K-means clustering (SK-means), whereas the other employs squared distance to optimize the initial centroids (K-means++). Our results showed that SK-means and K-means++ were more robust as compared with SPICKER alone, detecting 33 (59%) and 42 (75%) of 56 targets, respectively, with template modeling scores better than or equal to those of SPICKER. Conclusions. We observed that the classic K-means algorithm showed a similar performance to that of SPICKER, which is a widely used algorithm for protein-structure identification. Both SK-means and K-means++ demonstrated substantial improvements relative to results from SPICKER and classical K-means.
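
    The squared-distance seeding mentioned for K-means++ can be sketched as follows. This is a minimal illustration of the general D^2 sampling idea, not the authors' implementation; the function names and the toy 2-D points are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """Pick k initial centroids by D^2 sampling (k-means++ style seeding)."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a centroid; fall back to a uniform pick.
            centroids.append(rng.choice(points))
            continue
        # Sample the next centroid with probability proportional to d2,
        # so points far from the current centroids are favoured.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Hypothetical toy data: three loose groups of distinct 2-D points.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (5.0, 5.0), (5.3, 4.8), (4.9, 5.4),
          (9.0, 0.0), (9.2, 0.3), (8.8, 0.1)]
seeds = kmeans_pp_init(points, 3, random.Random(42))
```

    An already chosen point has zero weight in the next draw, so with distinct input points the seeds are distinct as well.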

  15. Identify High-Quality Protein Structural Models by Enhanced K-Means

    PubMed Central

    Li, Haiou; Chen, Cheng; Lv, Qiang; Wu, Chuang

    2017-01-01

    Background. One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance is dependent upon different conformational decoys, and, for some algorithms, the accuracy declines when the decoy population increases. Results. Here, we proposed two enhanced K-means clustering algorithms capable of robustly identifying high-quality protein structural models. The first one employs the clustering algorithm SPICKER to determine the initial centroids for basic K-means clustering (SK-means), whereas the other employs squared distance to optimize the initial centroids (K-means++). Our results showed that SK-means and K-means++ were more robust as compared with SPICKER alone, detecting 33 (59%) and 42 (75%) of 56 targets, respectively, with template modeling scores better than or equal to those of SPICKER. Conclusions. We observed that the classic K-means algorithm showed a similar performance to that of SPICKER, which is a widely used algorithm for protein-structure identification. Both SK-means and K-means++ demonstrated substantial improvements relative to results from SPICKER and classical K-means. PMID:28421198

  16. Accuracies of genomic breeding values in American Angus beef cattle using K-means clustering for cross-validation.

    PubMed

    Saatchi, Mahdi; McClure, Mathew C; McKay, Stephanie D; Rolf, Megan M; Kim, JaeWoo; Decker, Jared E; Taxis, Tasia M; Chapple, Richard H; Ramey, Holly R; Northcutt, Sally L; Bauck, Stewart; Woodward, Brent; Dekkers, Jack C M; Fernando, Rohan L; Schnabel, Robert D; Garrick, Dorian J; Taylor, Jeremy F

    2011-11-28

    Genomic selection is a recently developed technology that is beginning to revolutionize animal breeding. The objective of this study was to estimate marker effects to derive prediction equations for direct genomic values for 16 routinely recorded traits of American Angus beef cattle and quantify corresponding accuracies of prediction. Deregressed estimated breeding values were used as observations in a weighted analysis to derive direct genomic values for 3570 sires genotyped using the Illumina BovineSNP50 BeadChip. These bulls were clustered into five groups using K-means clustering on pedigree estimates of additive genetic relationships between animals, with the aim of increasing within-group and decreasing between-group relationships. All five combinations of four groups were used for model training, with cross-validation performed in the group not used in training. Bivariate animal models were used for each trait to estimate the genetic correlation between deregressed estimated breeding values and direct genomic values. Accuracies of direct genomic values ranged from 0.22 to 0.69 for the studied traits, with an average of 0.44. Predictions were more accurate when animals within the validation group were more closely related to animals in the training set. When training and validation sets were formed by random allocation, the accuracies of direct genomic values ranged from 0.38 to 0.85, with an average of 0.65, reflecting the greater relationship between animals in training and validation. The accuracies of direct genomic values obtained from training on older animals and validating in younger animals were intermediate to the accuracies obtained from K-means clustering and random clustering for most traits. The genetic correlation between deregressed estimated breeding values and direct genomic values ranged from 0.15 to 0.80 for the traits studied. These results suggest that genomic estimates of genetic merit can be produced in beef cattle at a young age but
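
    The "all five combinations of four groups" scheme amounts to leave-one-cluster-out cross-validation. A minimal sketch of how such folds could be formed is given below; the label list is a hypothetical stand-in for the K-means group assignments, and the function name is an assumption for illustration.

```python
def leave_one_cluster_out(labels):
    """Yield (held_out, train_idx, valid_idx) once for each cluster label."""
    for held_out in sorted(set(labels)):
        # Train on every animal outside the held-out cluster ...
        train = [i for i, g in enumerate(labels) if g != held_out]
        # ... and validate on the animals inside it.
        valid = [i for i, g in enumerate(labels) if g == held_out]
        yield held_out, train, valid

# Hypothetical group labels for ten animals (five clusters, 0..4).
labels = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
folds = list(leave_one_cluster_out(labels))
```

    Because clusters are built to maximise within-group relationships, each validation group is less related to its training set than under random allocation, which is consistent with the lower accuracies reported for the K-means folds.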

  17. Accuracies of genomic breeding values in American Angus beef cattle using K-means clustering for cross-validation

    PubMed Central

    2011-01-01

    Background: Genomic selection is a recently developed technology that is beginning to revolutionize animal breeding. The objective of this study was to estimate marker effects to derive prediction equations for direct genomic values for 16 routinely recorded traits of American Angus beef cattle and quantify corresponding accuracies of prediction. Methods: Deregressed estimated breeding values were used as observations in a weighted analysis to derive direct genomic values for 3570 sires genotyped using the Illumina BovineSNP50 BeadChip. These bulls were clustered into five groups using K-means clustering on pedigree estimates of additive genetic relationships between animals, with the aim of increasing within-group and decreasing between-group relationships. All five combinations of four groups were used for model training, with cross-validation performed in the group not used in training. Bivariate animal models were used for each trait to estimate the genetic correlation between deregressed estimated breeding values and direct genomic values. Results: Accuracies of direct genomic values ranged from 0.22 to 0.69 for the studied traits, with an average of 0.44. Predictions were more accurate when animals within the validation group were more closely related to animals in the training set. When training and validation sets were formed by random allocation, the accuracies of direct genomic values ranged from 0.38 to 0.85, with an average of 0.65, reflecting the greater relationship between animals in training and validation. The accuracies of direct genomic values obtained from training on older animals and validating in younger animals were intermediate to the accuracies obtained from K-means clustering and random clustering for most traits. The genetic correlation between deregressed estimated breeding values and direct genomic values ranged from 0.15 to 0.80 for the traits studied. Conclusions: These results suggest that genomic estimates of genetic merit can be

  18. Technical Note: Using k-means clustering to determine the number and position of isocenters in MLC-based multiple target intracranial radiosurgery.

    PubMed

    Yock, Adam D; Kim, Gwe-Ya

    2017-09-01

    To present the k-means clustering algorithm as a tool to address treatment planning considerations characteristic of stereotactic radiosurgery using a single isocenter for multiple targets. For 30 patients treated with stereotactic radiosurgery for multiple brain metastases, the geometric centroids and radii of each met were determined from the treatment planning system. In-house software used this as well as weighted and unweighted versions of the k-means clustering algorithm to group the targets to be treated with a single isocenter, and to position each isocenter. The algorithm results were evaluated using within-cluster sum of squares as well as a minimum target coverage metric that considered the effect of target size. Both versions of the algorithm were applied to an example patient to demonstrate the prospective determination of the appropriate number and location of isocenters. Both weighted and unweighted versions of the k-means algorithm were applied successfully to determine the number and position of isocenters. Comparing the two, both the within-cluster sum of squares metric and the minimum target coverage metric resulting from the unweighted version were less than those from the weighted version. The average magnitudes of the differences were small (-0.2 cm² and 0.1% for the within-cluster sum of squares and minimum target coverage, respectively) but statistically significant (Wilcoxon signed-rank test, P < 0.01). The differences between the versions of the k-means clustering algorithm represented an advantage of the unweighted version for the within-cluster sum of squares metric, and an advantage of the weighted version for the minimum target coverage metric. While additional treatment planning considerations have a large influence on the final treatment plan quality, both versions of the k-means algorithm provide automatic, consistent, quantitative, and objective solutions to the tasks associated with SRS treatment planning using a single isocenter
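
    A weighted variant of k-means differs from the unweighted one only in the update step, where each centre becomes a weighted mean of its members. The sketch below illustrates that difference under stated assumptions: the abstract does not say how targets were weighted, so weighting by a per-target value such as radius is hypothetical here, as are the function names and the toy coordinates.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def weighted_kmeans(points, weights, centroids, n_iter=100):
    """Lloyd iterations in which each centroid is a weighted mean."""
    centroids = [tuple(map(float, c)) for c in centroids]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for j, c in enumerate(centroids):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            if not members:
                updated.append(c)  # keep an empty cluster's centre unchanged
                continue
            total = sum(w for _, w in members)
            updated.append(tuple(sum(w * p[d] for p, w in members) / total
                                 for d in range(len(c))))
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical target centroids (cm) and weights (e.g. target radii).
targets = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
start = [targets[0], targets[3]]
iso_w, groups_w = weighted_kmeans(targets, [1, 3, 1, 1], start)
iso_u, groups_u = weighted_kmeans(targets, [1, 1, 1, 1], start)
```

    In this toy case the weighted isocenter of the first group is pulled toward the heavier target (x = 1.5 rather than 1.0), which is the kind of shift that trades within-cluster sum of squares for coverage of larger targets.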

  19. Optimized data fusion for K-means Laplacian clustering

    PubMed Central

    Yu, Shi; Liu, Xinhai; Tranchevent, Léon-Charles; Glänzel, Wolfgang; Suykens, Johan A. K.; De Moor, Bart; Moreau, Yves

    2011-01-01

    Motivation: We propose a novel algorithm to combine multiple kernels and Laplacians for clustering analysis. The new algorithm is formulated on a Rayleigh quotient objective function and is solved as a bi-level alternating minimization procedure. Using the proposed algorithm, the coefficients of kernels and Laplacians can be optimized automatically. Results: Three variants of the algorithm are proposed. The performance is systematically validated on two real-life data fusion applications. The proposed Optimized Kernel Laplacian Clustering (OKLC) algorithms perform significantly better than other methods. Moreover, the coefficients of kernels and Laplacians optimized by OKLC show some correlation with the rank of performance of individual data source. Though in our evaluation the K values are predefined, in practical studies, the optimal cluster number can be consistently estimated from the eigenspectrum of the combined kernel Laplacian matrix. Availability: The MATLAB code of algorithms implemented in this paper is downloadable from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/oklc.html. Contact: shiyu@uchicago.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20980271

  20. A hybrid sales forecasting scheme by combining independent component analysis with K-means clustering and support vector regression.

    PubMed

    Lu, Chi-Jie; Chang, Chi-Chang

    2014-01-01

    Sales forecasting plays an important role in operating a business since it can be used to determine the required inventory level to meet consumer demand and avoid the problem of under/overstocking. Improving the accuracy of sales forecasting has become an important issue of operating a business. This study proposes a hybrid sales forecasting scheme by combining independent component analysis (ICA) with K-means clustering and support vector regression (SVR). The proposed scheme first uses the ICA to extract hidden information from the observed sales data. The extracted features are then applied to K-means algorithm for clustering the sales data into several disjoined clusters. Finally, the SVR forecasting models are applied to each group to generate final forecasting results. Experimental results from information technology (IT) product agent sales data reveal that the proposed sales forecasting scheme outperforms the three comparison models and hence provides an efficient alternative for sales forecasting.

  1. Conveyor Performance based on Motor DC 12 Volt Eg-530ad-2f using K-Means Clustering

    NASA Astrophysics Data System (ADS)

    Arifin, Zaenal; Artini, Sri DP; Much Ibnu Subroto, Imam

    2017-04-01

    Producing goods in industry requires a controlled tool to improve production. Separation has become part of the production process and is carried out according to certain criteria to obtain an optimum result. Knowing the performance characteristics of the controlled tool used in the separation process also makes it possible to obtain that optimum result. Cluster analysis is a popular method for dividing data into smaller segments: it divides a group of objects into k groups whose members have homogeneous or similar values, with similarity defined by certain criteria. The work in this paper uses the K-Means method to cluster the loading performance of a conveyor driven by a 12 volt DC motor (eg-530-2f). This technique gives a complete clustering of the data for a prototype conveyor, driven by a DC motor, that separates goods by height. The parameters involved are voltage, current, and travel time. These parameters give two clusters: an optimal cluster with center 10.50 volts, 0.3 amperes, and 10.58 seconds, and a non-optimal cluster with center 10.88 volts, 0.28 amperes, and 40.43 seconds.

  2. Determining the Number of Instars in Simulium quinquestriatum (Diptera: Simuliidae) Using k-Means Clustering via the Canberra Distance.

    PubMed

    Yang, Yao Ming; Jia, Ruo; Xun, Hui; Yang, Jie; Chen, Qiang; Zeng, Xiang Guang; Yang, Ming

    2018-02-21

    Simulium quinquestriatum Shiraki (Diptera: Simuliidae), a human-biting fly that is distributed widely across Asia, is a vector for multiple pathogens. However, the larval development of this species is poorly understood. In this study, we determined the number of instars in this pest using three batches of field-collected larvae from Guiyang, Guizhou, China. The postgenal length, head capsule width, mandibular phragma length, and body length of 773 individuals were measured, and k-means clustering was used for instar grouping. Four distance measures (Manhattan, Euclidean, Chebyshev, and Canberra) were determined. The reported instar numbers, ranging from 4 to 11, were set as initial cluster centers for k-means clustering. The Canberra distance yielded reliable instar grouping, which was consistent with the first instar, as characterized by egg bursters and prepupae with dark histoblasts. Females and males of the last cluster of larvae were identified using Feulgen-stained gonads. Morphometric differences between the two sexes were not significant. Validation was performed using the Brooks-Dyar and Crosby rules, revealing that the larval stage of S. quinquestriatum is composed of eight instars.
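
    For reference, the Canberra distance sums the per-feature ratio |x - y| / (|x| + |y|), so each measurement contributes on a relative scale. The sketch below shows the distance and a nearest-centre assignment step only; the measurements and centres are hypothetical values, and how the study updated cluster centres is not stated in the abstract.

```python
def canberra(x, y):
    """Canberra distance: sum of |a - b| / (|a| + |b|) over features."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # a 0/0 term is conventionally counted as 0
            total += abs(a - b) / denom
    return total

def nearest_centre(x, centres):
    """Index of the centre closest to x under the Canberra distance."""
    return min(range(len(centres)), key=lambda j: canberra(x, centres[j]))

# Hypothetical (head capsule width, postgenal length) values in mm.
centres = [(0.10, 0.12), (0.20, 0.25), (0.40, 0.50)]
larva = (0.21, 0.24)
label = nearest_centre(larva, centres)
```

    Because every term is bounded by 1, small early-instar measurements are not swamped by large late-instar ones, which may be one reason a relative measure suits size data spanning several instars.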

  3. White blood cell segmentation by color-space-based k-means clustering.

    PubMed

    Zhang, Congcong; Xiao, Xiaoyan; Li, Xiaomei; Chen, Ying-Jie; Zhen, Wu; Chang, Jun; Zheng, Chengyun; Liu, Zhi

    2014-09-01

    White blood cell (WBC) segmentation, which is important for cytometry, is a challenging issue because of the morphological diversity of WBCs and the complex and uncertain background of blood smear images. This paper proposes a novel method for the nucleus and cytoplasm segmentation of WBCs for cytometry. A color adjustment step was also introduced before segmentation. Color space decomposition and k-means clustering were combined for segmentation. A database of 300 microscopic blood smear images was used to evaluate the performance of our method. The proposed segmentation method achieves 95.7% and 91.3% overall accuracy for nucleus segmentation and cytoplasm segmentation, respectively. Experimental results demonstrate that the proposed method can segment WBCs effectively with high accuracy.

  4. Interactive K-Means Clustering Method Based on User Behavior for Different Analysis Target in Medicine.

    PubMed

    Lei, Yang; Yu, Dai; Bin, Zhang; Yang, Yang

    2017-01-01

    Clustering algorithms, as a basis of data analysis, are widely used in analysis systems. However, because of the high dimensionality of the data, a clustering algorithm may overlook the business relations between dimensions, especially in medical fields. As a result, the clustering result often does not meet the users' business goals. If the clustering process can incorporate the users' knowledge, that is, the doctor's knowledge or the analysis intent, the clustering result can be more satisfactory. In this paper, we propose an interactive K-means clustering method to improve the user's satisfaction with the result. The core of this method is to obtain the user's feedback on the clustering result and use it to optimize that result. A particle swarm optimization algorithm is then used to optimize the parameters, especially the weight settings in the clustering algorithm, so that they reflect the user's business preference as far as possible. After this parameter optimization and adjustment, the clustering result can be closer to the user's requirement. Finally, we use a breast cancer example to test our method. The experiments show the better performance of our algorithm.

  5. Mitigate the impact of transmitter finite extinction ratio using K-means clustering algorithm for 16QAM signal

    NASA Astrophysics Data System (ADS)

    Yu, Miao; Li, Yan; Shu, Tong; Zhang, Yifan; Hong, Xiaobin; Qiu, Jifang; Zuo, Yong; Guo, Hongxiang; Li, Wei; Wu, Jian

    2018-02-01

    A method of recognizing 16QAM signals based on the k-means clustering algorithm is proposed to mitigate the impact of transmitter finite extinction ratio (ER). Pilot symbols with 0.39% overhead are used as the initial centroids of the k-means clustering algorithm. Simulation results in a 10 GBaud 16QAM system show that the proposed method obtains higher identification precision than the traditional decision method for finite ER and IQ mismatch. Specifically, the proposed method improves the required OSNR by 5.5 dB, 4.5 dB, 4 dB and 3 dB at the FEC limit with ER = 12 dB, 16 dB, 20 dB and 24 dB, respectively, and the acceptable bias error and IQ mismatch range is widened by 767% and 360% with ER = 16 dB, respectively.
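
    The idea of seeding k-means with received pilot symbols can be sketched on a toy 16QAM link. Everything about the channel below (gain, constant bias, bounded noise) is a hypothetical distortion chosen for illustration and is not the paper's finite-ER model; the function names are likewise assumptions.

```python
import random

# Ideal 16QAM constellation: real and imaginary parts in {-3, -1, 1, 3}.
LEVELS = (-3, -1, 1, 3)
CONSTELLATION = [complex(i, q) for i in LEVELS for q in LEVELS]

def channel(symbol, rng):
    """Hypothetical distortion: gain 0.9, constant bias, bounded noise."""
    noise = complex(rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2))
    return 0.9 * symbol + complex(0.4, 0.3) + noise

def kmeans_decide(received, init_centroids, n_iter=20):
    """Cluster received samples, starting from pilot-derived centroids."""
    centroids = list(init_centroids)
    labels = []
    for _ in range(n_iter):
        # Assign each sample to the nearest centroid in the complex plane.
        labels = [min(range(len(centroids)), key=lambda j: abs(r - centroids[j]))
                  for r in received]
        updated = []
        for j, c in enumerate(centroids):
            members = [r for r, l in zip(received, labels) if l == j]
            updated.append(sum(members) / len(members) if members else c)
        if updated == centroids:
            break
        centroids = updated
    return labels

rng = random.Random(0)
sent = [rng.randrange(16) for _ in range(400)]       # transmitted symbol indices
received = [channel(CONSTELLATION[s], rng) for s in sent]
pilots = [channel(c, rng) for c in CONSTELLATION]    # one received pilot per symbol
decided = kmeans_decide(received, pilots)
errors = sum(d != s for d, s in zip(decided, sent))
```

    Because centroid j starts at the received pilot for constellation point j, the cluster index doubles as the symbol decision, and the centroids track the distorted constellation rather than the ideal grid.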

  6. A Hybrid Sales Forecasting Scheme by Combining Independent Component Analysis with K-Means Clustering and Support Vector Regression

    PubMed Central

    2014-01-01

    Sales forecasting plays an important role in operating a business since it can be used to determine the required inventory level to meet consumer demand and avoid the problem of under/overstocking. Improving the accuracy of sales forecasting has become an important issue of operating a business. This study proposes a hybrid sales forecasting scheme by combining independent component analysis (ICA) with K-means clustering and support vector regression (SVR). The proposed scheme first uses the ICA to extract hidden information from the observed sales data. The extracted features are then applied to K-means algorithm for clustering the sales data into several disjoined clusters. Finally, the SVR forecasting models are applied to each group to generate final forecasting results. Experimental results from information technology (IT) product agent sales data reveal that the proposed sales forecasting scheme outperforms the three comparison models and hence provides an efficient alternative for sales forecasting. PMID:25045738

  7. A comparative study of DIGNET, average, complete, single hierarchical and k-means clustering algorithms in 2D face image recognition

    NASA Astrophysics Data System (ADS)

    Thanos, Konstantinos-Georgios; Thomopoulos, Stelios C. A.

    2014-06-01

    The study in this paper belongs to a more general research of discovering facial sub-clusters in different ethnicity face databases. These new sub-clusters along with other metadata (such as race, sex, etc.) lead to a vector for each face in the database where each vector component represents the likelihood of participation of a given face to each cluster. This vector is then used as a feature vector in a human identification and tracking system based on face and other biometrics. The first stage in this system involves a clustering method which evaluates and compares the clustering results of five different clustering algorithms (average, complete, single hierarchical algorithm, k-means and DIGNET), and selects the best strategy for each data collection. In this paper we present the comparative performance of clustering results of DIGNET and four clustering algorithms (average, complete, single hierarchical and k-means) on fabricated 2D and 3D samples, and on actual face images from various databases, using four different standard metrics. These metrics are the silhouette figure, the mean silhouette coefficient, the Hubert test Γ coefficient, and the classification accuracy for each clustering result. The results showed that, in general, DIGNET gives more trustworthy results than the other algorithms when the metrics values are above a specific acceptance threshold. However when the evaluation results metrics have values lower than the acceptance threshold but not too low (too low corresponds to ambiguous results or false results), then it is necessary for the clustering results to be verified by the other algorithms.

  8. Findings in resting-state fMRI by differences from K-means clustering.

    PubMed

    Chyzhyk, Darya; Graña, Manuel

    2014-01-01

    Resting state fMRI has a growing number of studies with diverse aims, always centered on some kind of functional connectivity biomarker obtained from correlation regarding seed regions, or by analytical decomposition of the signal towards the localization of the spatial distribution of functional connectivity patterns. In general, studies are computationally costly and very sensitive to noise and preprocessing of data. In this paper we consider clustering by K-means as an exploratory procedure which can provide some results with little computational effort, due to efficient implementations that are readily available. We demonstrate the approach on a dataset of schizophrenia patients, finding differences between patients with and without auditory hallucinations.

  9. A Simple but Powerful Heuristic Method for Accelerating k-Means Clustering of Large-Scale Data in Life Science.

    PubMed

    Ichikawa, Kazuki; Morishita, Shinichi

    2014-01-01

    K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10-2001 (~400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan's and Hamerly's algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/~ichikawa/boostKCP/.
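
    The link between the two distance measures can be checked numerically. If each vector of dimension n is z-scored (mean 0, population standard deviation 1), then the squared Euclidean distance between two standardized vectors equals 2n(1 - r), where r is their Pearson correlation. The sketch below verifies this pairwise identity on hypothetical vectors; it assumes that "standardized" means per-vector z-scoring and does not reproduce the pruning heuristic of the paper.

```python
import math

def zscore(v):
    """Standardize a vector to mean 0 and population standard deviation 1."""
    n = len(v)
    mean = sum(v) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
    return [(x - mean) / sd for x in v]

def pearson(x, y):
    """Pearson correlation computed from z-scored vectors."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def sq_euclid(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Hypothetical expression-like vectors of dimension n = 4.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 3.0]
lhs = sq_euclid(zscore(x), zscore(y))       # standardized Euclidean, squared
rhs = 2 * len(x) * (1 - pearson(x, y))      # 2n times the correlation distance
```

    Since the two quantities differ only by the constant factor 2n, ranking candidate centroids by one gives the same order as ranking by the other.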

  10. A Modified MinMax k-Means Algorithm Based on PSO.

    PubMed

    Wang, Xiaoyan; Bai, Yanping

    The MinMax k-means algorithm is widely used to tackle the effect of bad initialization by minimizing the maximum intraclustering errors. Two parameters, including the exponent parameter and memory parameter, are involved in the executive process. Since different parameters have different clustering errors, it is crucial to choose appropriate parameters. In the original algorithm, a practical framework is given. Such framework extends the MinMax k-means to automatically adapt the exponent parameter to the data set. It has been believed that if the maximum exponent parameter has been set, then the programme can reach the lowest intraclustering errors. However, our experiments show that this is not always correct. In this paper, we modified the MinMax k-means algorithm by PSO to determine the proper values of parameters which can subject the algorithm to attain the lowest clustering errors. The proposed clustering method is tested on some favorite data sets in several different initial situations and is compared to the k-means algorithm and the original MinMax k-means algorithm. The experimental results indicate that our proposed algorithm can reach the lowest clustering errors automatically.

  11. A Modified MinMax k-Means Algorithm Based on PSO

    PubMed Central

    2016-01-01

    The MinMax k-means algorithm is widely used to tackle the effect of bad initialization by minimizing the maximum intraclustering errors. Two parameters, including the exponent parameter and memory parameter, are involved in the executive process. Since different parameters have different clustering errors, it is crucial to choose appropriate parameters. In the original algorithm, a practical framework is given. Such framework extends the MinMax k-means to automatically adapt the exponent parameter to the data set. It has been believed that if the maximum exponent parameter has been set, then the programme can reach the lowest intraclustering errors. However, our experiments show that this is not always correct. In this paper, we modified the MinMax k-means algorithm by PSO to determine the proper values of parameters which can subject the algorithm to attain the lowest clustering errors. The proposed clustering method is tested on some favorite data sets in several different initial situations and is compared to the k-means algorithm and the original MinMax k-means algorithm. The experimental results indicate that our proposed algorithm can reach the lowest clustering errors automatically. PMID:27656201

  12. Variance-Based Cluster Selection Criteria in a K-Means Framework for One-Mode Dissimilarity Data.

    PubMed

    Vera, J Fernando; Macías, Rodrigo

    2017-06-01

    One of the main problems in cluster analysis is that of determining the number of groups in the data. In general, the approach taken depends on the cluster method used. For K-means, some of the most widely employed criteria are formulated in terms of the decomposition of the total point scatter, regarding a two-mode data set of N points in p dimensions, which are optimally arranged into K classes. This paper addresses the formulation of criteria to determine the number of clusters, in the general situation in which the available information for clustering is a one-mode [Formula: see text] dissimilarity matrix describing the objects. In this framework, p and the coordinates of points are usually unknown, and the application of criteria originally formulated for two-mode data sets is dependent on their possible reformulation in the one-mode situation. The decomposition of the variability of the clustered objects is proposed in terms of the corresponding block-shaped partition of the dissimilarity matrix. Within-block and between-block dispersion values for the partitioned dissimilarity matrix are derived, and variance-based criteria are subsequently formulated in order to determine the number of groups in the data. A Monte Carlo experiment was carried out to study the performance of the proposed criteria. For simulated clustered points in p dimensions, greater efficiency in recovering the number of clusters is obtained when the criteria are calculated from the related Euclidean distances instead of the known two-mode data set, in general, for unequal-sized clusters and for low dimensionality situations. For simulated dissimilarity data sets, the proposed criteria always outperform the results obtained when these criteria are calculated from their original formulation, using dissimilarities instead of distances.
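
    One reason variance-based criteria can be carried over to one-mode data is a standard identity: for Euclidean distances, the within-cluster sum of squares of a cluster with n_k members equals the sum of its squared pairwise distances (over ordered pairs) divided by 2 n_k, so no coordinates are needed. The sketch below checks this on hypothetical points; it illustrates the identity only and is not the set of criteria proposed in the paper.

```python
import math

def within_ss_from_dissimilarities(d, labels):
    """Sum over clusters of (1 / (2 n_k)) * sum of d[i][j]**2 for i, j in cluster."""
    total = 0.0
    for g in set(labels):
        idx = [i for i, l in enumerate(labels) if l == g]
        pair_sum = sum(d[i][j] ** 2 for i in idx for j in idx)
        total += pair_sum / (2 * len(idx))
    return total

def within_ss_from_points(points, labels):
    """Classical within-cluster sum of squares around each cluster mean."""
    total = 0.0
    for g in set(labels):
        members = [p for p, l in zip(points, labels) if l == g]
        dims = range(len(members[0]))
        mean = [sum(p[k] for p in members) / len(members) for k in dims]
        total += sum((p[k] - mean[k]) ** 2 for p in members for k in dims)
    return total

# Hypothetical two-mode data and the one-mode distance matrix derived from it.
points = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (10.0, 1.0), (10.0, 3.0)]
labels = [0, 0, 0, 1, 1]
d = [[math.dist(p, q) for q in points] for p in points]
w_dis = within_ss_from_dissimilarities(d, labels)
w_pts = within_ss_from_points(points, labels)
```

    For general (non-Euclidean) dissimilarities the first function can still be evaluated, which is what makes a block-wise decomposition of the dissimilarity matrix usable when the coordinates are unknown.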

  13. Identification of spatiotemporal nutrient patterns in a coastal bay via an integrated k-means clustering and gravity model.

    PubMed

    Chang, Ni-Bin; Wimberly, Brent; Xuan, Zhemin

    2012-03-01

    This study presents an integrated k-means clustering and gravity model (IKCGM) for investigating the spatiotemporal patterns of nutrient and associated dissolved oxygen levels in Tampa Bay, Florida. By using a k-means clustering analysis to first partition the nutrient data into a user-specified number of subsets, it is possible to discover the spatiotemporal patterns of nutrient distribution in the bay and capture the inherent linkages of hydrodynamic and biogeochemical features. Such patterns may then be combined with a gravity model to link the nutrient source contribution from each coastal watershed to the generated clusters in the bay to aid in the source proportion analysis for environmental management. The clustering analysis was carried out based on 1 year (2008) water quality data composed of 55 sample stations throughout Tampa Bay collected by the Environmental Protection Commission of Hillsborough County. In addition, hydrological and river water quality data of the same year were acquired from the United States Geological Survey's National Water Information System to support the gravity modeling analysis. The results show that the k-means model with 8 clusters is the optimal choice, in which cluster 2 at Lower Tampa Bay had the minimum values of total nitrogen (TN) concentrations, chlorophyll a (Chl-a) concentrations, and ocean color values in every season as well as the minimum concentration of total phosphorus (TP) in three consecutive seasons in 2008. The datasets indicate that Lower Tampa Bay is an area with limited nutrient input throughout the year. Cluster 5, located in Middle Tampa Bay, displayed elevated TN concentrations, ocean color values, and Chl-a concentrations, suggesting that high values of colored dissolved organic matter are linked with some nutrient sources. The data presented by the gravity modeling analysis indicate that the Alafia River Basin is the major contributor of nutrients in terms of both TP and TN values in all seasons

  14. Functional connectivity analysis of the neural bases of emotion regulation: A comparison of independent component method with density-based k-means clustering method.

    PubMed

    Zou, Ling; Guo, Qian; Xu, Yi; Yang, Biao; Jiao, Zhuqing; Xiang, Jianbo

    2016-04-29

    Functional magnetic resonance imaging (fMRI) is an important tool in neuroscience for assessing connectivity and interactions between distant areas of the brain. To find and characterize the coherent patterns of brain activity as a means of identifying brain systems for the cognitive reappraisal of the emotion task, both density-based k-means clustering and independent component analysis (ICA) methods can be applied to characterize the interactions between brain regions involved in cognitive reappraisal of emotion. Our results reveal that compared with the ICA method, the density-based k-means clustering method provides a higher sensitivity of polymerization. In addition, it is more sensitive to those relatively weak functional connection regions. Thus, the study concludes that in the process of receiving emotional stimuli, the relatively obvious activation areas are mainly distributed in the frontal lobe, cingulum and near the hypothalamus. Furthermore, density-based k-means clustering method creates a more reliable method for follow-up studies of brain functional connectivity.

  15. Parallel k-means++ for Multiple Shared-Memory Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mackey, Patrick S.; Lewis, Robert R.

    2016-09-22

    In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.
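
    The exact k-means++ seeding referred to in this record can be sketched sequentially as follows. This is a minimal single-threaded illustration of the standard D² sampling rule, not the authors' parallel implementation; the names `sq_dist` and `kmeanspp_seeds`, the fixed random seed, and the toy points are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers with k-means++ (D^2) sampling.

    Hypothetical helper for illustration: sequential, O(n * k) time.
    """
    rng = rng or random.Random(0)  # fixed seed is an assumption, for repeatability
    # The first center is drawn uniformly at random.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] holds the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Draw a point with probability proportional to d2[i].
            r = rng.random() * total
            acc = 0.0
            idx = len(points) - 1
            for i, w in enumerate(d2):
                acc += w
                if acc > r:  # strict, so zero-weight points are never drawn
                    idx = i
                    break
        centers.append(points[idx])
        # Refresh nearest-center distances against the newly added center.
        d2 = [min(old, sq_dist(p, centers[-1])) for old, p in zip(d2, points)]
    return centers


toy_points = [(0, 0), (0, 0), (10, 10), (10, 10), (20, 0), (20, 0)]
seeds = kmeanspp_seeds(toy_points, 3)
```

    Because points that coincide with an already chosen center have zero sampling weight, the three seeds drawn from the toy data always land on three distinct locations, whichever point is drawn first.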

  16. Quantitative Volumetric K-Means Cluster Segmentation of Fibroglandular Tissue and Skin in Breast MRI.

    PubMed

    Niukkanen, Anton; Arponen, Otso; Nykänen, Aki; Masarwah, Amro; Sutela, Anna; Liimatainen, Timo; Vanninen, Ritva; Sudah, Mazen

    2017-10-18

    Mammographic breast density (MBD) is the most commonly used method to assess the volume of fibroglandular tissue (FGT). However, MRI could provide a clinically feasible and more accurate alternative. There were three aims in this study: (1) to evaluate a clinically feasible method to quantify FGT with MRI, (2) to assess the inter-rater agreement of MRI-based volumetric measurements and (3) to compare them to measurements acquired using digital mammography and 3D tomosynthesis. This retrospective study examined 72 women (mean age 52.4 ± 12.3 years) with 105 disease-free breasts undergoing diagnostic 3.0-T breast MRI and either digital mammography or tomosynthesis. Two observers analyzed MRI images for breast and FGT volumes and FGT-% from T1-weighted images (0.7-, 2.0-, and 4.0-mm-thick slices) using K-means clustering, data from histogram, and active contour algorithms. Reference values were obtained with Quantra software. Inter-rater agreement for MRI measurements made with 2-mm-thick slices was excellent: for FGT-%, r = 0.994 (95% CI 0.990-0.997); for breast volume, r = 0.985 (95% CI 0.934-0.994); and for FGT volume, r = 0.979 (95% CI 0.958-0.989). MRI-based FGT-% correlated strongly with MBD in mammography (r = 0.819-0.904, P < 0.001) and moderately to high with MBD in tomosynthesis (r = 0.630-0.738, P < 0.001). K-means clustering-based assessments of the proportion of the fibroglandular tissue in the breast at MRI are highly reproducible. In the future, quantitative assessment of FGT-% to complement visual estimation of FGT should be performed on a more regular basis as it provides a component which can be incorporated into the individual's breast cancer risk stratification.

  17. Quick detection of QRS complexes and R-waves using a wavelet transform and K-means clustering.

    PubMed

    Xia, Yong; Han, Junze; Wang, Kuanquan

    2015-01-01

    Based on the idea of telemedicine, 24-hour uninterrupted monitoring of electrocardiograms (ECG) has started to be implemented. To create an intelligent ECG monitoring system, an efficient and quick detection algorithm for the characteristic waveforms is needed. This paper aims to give a quick and effective method for detecting QRS-complexes and R-waves in ECGs. The real ECG signal from the MIT-BIH Arrhythmia Database is used for the performance evaluation. The proposed method combines a wavelet transform and the K-means clustering algorithm. A wavelet transform is adopted in the data analysis and preprocessing. Then, based on the slope information of the filtered data, a segmented K-means clustering method is adopted to detect the QRS region. Detection of the R-peak is based on comparing the local amplitudes in each QRS region, which is different from other approaches, and the time cost of R-wave detection is reduced. Of the tested 8 records (total 18201 beats) from the MIT-BIH Arrhythmia Database, an average R-peak detection sensitivity of 99.72% and a positive predictive value of 99.80% are gained; the average time consumed detecting a 30-min original signal is 5.78 s, which is competitive with other methods.

  18. Segmentation by fusion of histogram-based k-means clusters in different color spaces.

    PubMed

    Mignotte, Max

    2008-05-01

    This paper presents a new, simple, and efficient segmentation approach, based on a fusion procedure which aims at combining several segmentation maps associated with simpler partition models in order to finally get a more reliable and accurate segmentation result. The different label fields to be fused in our application are given by the same simple (K-means based) clustering technique on an input image expressed in different color spaces. Our fusion strategy aims at combining these segmentation maps with a final clustering procedure that uses as input features the local histogram of the class labels, previously estimated and associated with each site for all these initial partitions. This fusion framework remains simple to implement, fast, general enough to be applied to various computer vision applications (e.g., motion detection and segmentation), and has been successfully applied on the Berkeley image database. The experiments reported in this paper illustrate the potential of this approach compared to the state-of-the-art segmentation methods recently proposed in the literature.

  19. Application of constrained k-means clustering in ground motion simulation validation

    NASA Astrophysics Data System (ADS)

    Khoshnevis, N.; Taborda, R.

    2017-12-01

    The validation of ground motion synthetics has received increased attention over the last few years due to the advances in physics-based deterministic and hybrid simulation methods. Unlike for low frequency simulations (f ≤ 0.5 Hz), for which it has become reasonable to expect a good match between synthetics and data, in the case of high-frequency simulations (f ≥ 1 Hz) it is not possible to match results on a wiggle-by-wiggle basis. This is mostly due to the various complexities and uncertainties involved in earthquake ground motion modeling. Therefore, in order to compare synthetics with data we turn to different time series metrics, which are used as a means to characterize how the synthetics match the data in a qualitative and statistical sense. In general, these metrics provide GOF scores that measure the level of similarity in the time and frequency domains. It is common for these scores to be scaled from 0 to 10, with 10 representing a perfect match. Although using individual metrics for particular applications is considered more adequate, there is no consensus or unified method to classify the comparison between a set of synthetic and recorded seismograms when the various metrics offer different scores. We study the relationship among these metrics through a constrained k-means clustering approach. We define 4 hypothetical stations with scores 3, 5, 7, and 9 for all metrics. We put these stations in the category of cannot-link constraints. We generate the dataset through the validation of the results from a deterministic (physics-based) ground motion simulation for a moderate magnitude earthquake in the greater Los Angeles basin using three velocity models. The maximum frequency of the simulation is 4 Hz. The dataset involves over 300 stations and 11 metrics, or features, as they are understood in the clustering process, where the metrics form a multi-dimensional space. We address the high-dimensional feature effects with a subspace-clustering analysis

  20. Load Weight Classification of The Quayside Container Crane Based On K-Means Clustering Algorithm

    NASA Astrophysics Data System (ADS)

    Zhang, Bingqian; Hu, Xiong; Tang, Gang; Wang, Yide

    2017-07-01

    Precise knowledge of the load weight of each operation of the quayside container crane is important for accurately assessing the service life of the crane. The load weight is directly related to the vibration intensity. By studying the vibration of the hoist motor of the crane in the radial and axial directions, we can classify the load using the K-means clustering algorithm and quantitative statistical analysis. Correlation analysis shows that vibration in the radial direction is significantly and positively correlated with that in the axial direction, which means that we can use the data from only one of the directions to carry out the study, thereby improving efficiency without degrading the accuracy of load classification. The proposed method can well represent the real-time working condition of the crane.

  1. Countries population determination to test rice crisis indicator at national level using k-means cluster analysis

    NASA Astrophysics Data System (ADS)

    Hidayat, Y.; Purwandari, T.; Sukono; Ariska, Y. D.

    2017-01-01

    This study aimed to identify the population of countries that are similar to Indonesia on three characteristics: democratic atmosphere, rice consumption, and purchasing power for rice. This population is useful as reference material for research testing the strength and predictability of the rice crisis indicator Unprecedented Restlessness (UR). Countries similar to Indonesia were identified using a multivariate method, non-hierarchical k-means cluster analysis, with 38 countries as the data population. The analysis was repeated until it yielded a number of clusters that shows the differentiating power of the three characteristics and high similarity within clusters. The results show that 6 clusters can describe the differentiating power of the characteristics of the formed clusters. However, to meet the purpose of the study, only one cluster is taken, according to the following criteria for a population of countries similar to Indonesia: the cluster contains Indonesia, it includes countries that did and did not experience a rice crisis in 2008, and it has the largest membership. These criteria are met by cluster 2, which consists of 22 countries, namely Indonesia, Brazil, Costa Rica, Djibouti, Dominican Republic, Ecuador, Fiji, Guinea-Bissau, Haiti, India, Jamaica, Japan, South Korea, Madagascar, Malaysia, Mali, Nicaragua, Panama, Peru, Senegal, Sierra Leone and Suriname.

  2. The implementation of two stages clustering (k-means clustering and adaptive neuro fuzzy inference system) for prediction of medicine need based on medical data

    NASA Astrophysics Data System (ADS)

    Husein, A. M.; Harahap, M.; Aisyah, S.; Purba, W.; Muhazir, A.

    2018-03-01

    Medication planning aims to determine the types and amounts of medicine needed and to avoid stock-outs, based on patterns of disease. Medicine planning still relies on the ability and experience of leadership: it takes a long time, requires skill, suffers from difficulty in obtaining definite disease data, needs good record keeping and reporting, and depends on the budget. As a result, planning does not go well and leads to frequent shortages and surpluses of medicines. In this research, we propose the Adaptive Neuro Fuzzy Inference System (ANFIS) method to predict medication needs in 2016 and 2017 based on medical data from 2015 and 2016 from two hospital sources. The analysis framework uses two approaches. The first applies ANFIS directly to a data source; the second also uses ANFIS, but after clustering with the K-Means algorithm. For both approaches, Root Mean Square Error (RMSE) values are calculated for training and testing. According to the testing results, the proposed method gives better prediction rates than existing systems, based on quantitative and qualitative evaluation; however, applying the K-Means algorithm before ANFIS affects the duration of the training process and gives significantly better classification accuracy than ANFIS without clustering.

  3. Application of k-means clustering algorithm in grouping the DNA sequences of hepatitis B virus (HBV)

    NASA Astrophysics Data System (ADS)

    Bustamam, A.; Tasman, H.; Yuniarti, N.; Frisca, Mursidah, I.

    2017-07-01

    Based on WHO data, an estimated 15 million people worldwide who are infected with hepatitis B (HBsAg+), which is caused by the HBV virus, are also infected with hepatitis D, which is caused by the HDV virus. Hepatitis D infection can occur simultaneously with hepatitis B (co-infection) or after a person is exposed to chronic hepatitis B (superinfection). Since HDV cannot live without HBV, HDV infection is closely related to HBV infection, so every effort to prevent hepatitis B can indirectly prevent hepatitis D. This paper presents clustering of HBV DNA sequences using the k-means clustering algorithm and R programming. The clustering process starts with collecting HBV DNA sequences from GenBank, then extracting features from the sequences using n-mer frequencies; the extraction results are collected as a matrix and normalized using min-max normalization to the interval [0, 1], and the normalized matrix is used as input data. The number of clusters is two, and the initial centroid of each cluster is chosen randomly. In each iteration, the distance of every object to each centroid is calculated using the Euclidean distance, and the minimum distance determines cluster membership, until two convergent clusters are created. As a result, the HBV viruses in the first cluster are more virulent than the HBV viruses in the second cluster, so the HBV viruses in the first cluster can potentially evolve with HDV viruses that cause hepatitis D.
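
    The pipeline described in this record (n-mer frequency extraction, min-max normalization to [0, 1], then k-means with two randomly initialized centroids and Euclidean distance) can be sketched as follows. The record states that the authors used R; this Python version is only an illustrative sketch. The function names, the choice n = 2, the fixed random seed, and the four toy sequences are assumptions and are not taken from the paper.

```python
import itertools
import math
import random


def kmer_frequencies(seq, n=2):
    """Relative frequency of every length-n word over the alphabet ACGT."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=n)]
    counts = {kmer: 0 for kmer in kmers}
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if window in counts:  # skip windows with ambiguous bases
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in kmers]


def min_max_normalize(rows):
    """Rescale each column to [0, 1]; constant columns become 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def kmeans(rows, k=2, max_iter=100, seed=0):
    """Plain k-means: random initial centroids, Euclidean distance."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every object to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:  # memberships stopped changing: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy sequences (hypothetical, not HBV data): two AT-rich, two GC-rich.
toy_seqs = ["ATATATATAT", "ATATATATATA", "GCGCGCGCGC", "GCGCGCGCGCG"]
features = min_max_normalize([kmer_frequencies(s) for s in toy_seqs])
labels, centroids = kmeans(features, k=2)
```

    On the toy data the two AT-rich sequences end up in one cluster and the two GC-rich sequences in the other. Real sequence data would need longer n-mers and a check that the result is stable across random initializations.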

  4. Prediction of settled water turbidity and optimal coagulant dosage in drinking water treatment plant using a hybrid model of k-means clustering and adaptive neuro-fuzzy inference system

    NASA Astrophysics Data System (ADS)

    Kim, Chan Moon; Parnichkun, Manukid

    2017-11-01

    Coagulation is an important process in drinking water treatment to attain acceptable treated water quality. However, the determination of coagulant dosage is still a challenging task for operators, because coagulation is a nonlinear and complicated process. Feedback control to achieve the desired treated water quality is difficult due to lengthy process time. In this research, a hybrid of k-means clustering and adaptive neuro-fuzzy inference system (k-means-ANFIS) is proposed for the settled water turbidity prediction and the optimal coagulant dosage determination using full-scale historical data. To build a model that adapts well to different process states of the influent water, raw water quality data are classified into four clusters according to their properties by a k-means clustering technique. The sub-models are developed individually on the basis of each clustered data set. Results reveal that the sub-models constructed by a hybrid k-means-ANFIS perform better than not only a single ANFIS model, but also seasonal models by artificial neural network (ANN). The final model, consisting of the sub-models, shows more accurate and consistent prediction ability than a single ANFIS model and a single ANN model on all five evaluation indices. Therefore, the hybrid model of k-means-ANFIS can be employed as a robust tool for managing both treated water quality and production costs simultaneously.

  5. Linear regression models and k-means clustering for statistical analysis of fNIRS data.

    PubMed

    Bonomini, Viola; Zucchelli, Lucia; Re, Rebecca; Ieva, Francesca; Spinelli, Lorenzo; Contini, Davide; Paganoni, Anna; Torricelli, Alessandro

    2015-02-01

    We propose a new algorithm, based on a linear regression model, to statistically estimate the hemodynamic activations in fNIRS data sets. The main concern guiding the algorithm development was the minimization of assumptions and approximations made on the data set for the application of statistical tests. Further, we propose a K-means method to cluster fNIRS data (i.e. channels) as activated or not activated. The methods were validated both on simulated and in vivo fNIRS data. A time domain (TD) fNIRS technique was preferred because of its high performances in discriminating cortical activation and superficial physiological changes. However, the proposed method is also applicable to continuous wave or frequency domain fNIRS data sets.

  6. Clumpak: a program for identifying clustering modes and packaging population structure inferences across K.

    PubMed

    Kopelman, Naama M; Mayzel, Jonathan; Jakobsson, Mattias; Rosenberg, Noah A; Mayrose, Itay

    2015-09-01

    The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present Clumpak (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model-based population structure analyses. For analysing multiple independent runs at a single K value, Clumpak identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software Clumpp. Next, Clumpak identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in Clumpp and simplifying the comparison of clustering results across different K values. Clumpak incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. Clumpak, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology. © 2015 John Wiley & Sons Ltd.

  7. Magnetic resonance imaging with k-means clustering objectively measures whole muscle volume compartments in sarcopenia/cancer cachexia.

    PubMed

    Gray, Calum; MacGillivray, Thomas J; Eeley, Clare; Stephens, Nathan A; Beggs, Ian; Fearon, Kenneth C; Greig, Carolyn A

    2011-02-01

    Sarcopenia and cachexia are characterized by infiltration of non-contractile tissue within muscle which influences area and volume measurements. We applied a statistical clustering (k-means) technique to magnetic resonance (MR) images of the quadriceps of young and elderly healthy women and women with cancer to objectively separate the contractile and non-contractile tissue compartments. MR scans of the thigh were obtained for 34 women (n = 16 young, (median) age 26 y; n = 9 older, age 80 y; n = 9 upper gastrointestinal cancer patients, age 65 y). Segmented regions of consecutive axial images were used to calculate cross-sectional area and (gross) volume. The k-means unsupervised algorithm was subsequently applied to the MR binary mask image array data with resultant volumes compared between groups. Older women and women with cancer had 37% and 48% less quadriceps muscle respectively than young women (p < 0.001). Application of k-means subtracted a significant 9%, 14% and 20% non-contractile tissue from the quadriceps of young, older and patient groups respectively (p < 0.001). There was a significant effect of group (i.e., cancer vs healthy) when controlling for age as a covariate (p = 0.003). K-means objectively separates contractile and non-contractile tissue components. Women with upper GI cancer have significant fatty infiltration throughout whole muscle groups which is maintained when controlling for age. Copyright © 2010 Elsevier Ltd and European Society for Clinical Nutrition and Metabolism. All rights reserved.

  8. A New Approach to Identify High Burnout Medical Staffs by Kernel K-Means Cluster Analysis in a Regional Teaching Hospital in Taiwan

    PubMed Central

    Lee, Yii-Ching; Huang, Shian-Chang; Huang, Chih-Hsuan; Wu, Hsin-Hung

    2016-01-01

    This study uses kernel k-means cluster analysis to identify medical staffs with high burnout. The data collected in October to November 2014 are from the emotional exhaustion dimension of the Chinese version of Safety Attitudes Questionnaire in a regional teaching hospital in Taiwan. The number of effective questionnaires including the entire staffs such as physicians, nurses, technicians, pharmacists, medical administrators, and respiratory therapists is 680. The results show that 8 clusters are generated by kernel k-means method. Employees in clusters 1, 4, and 5 are relatively in good conditions, whereas employees in clusters 2, 3, 6, 7, and 8 need to be closely monitored from time to time because they have relatively higher degree of burnout. When employees with higher degree of burnout are identified, the hospital management can take actions to improve the resilience, reduce the potential medical errors, and, eventually, enhance the patient safety. This study also suggests that the hospital management needs to keep track of medical staffs’ fatigue conditions and provide timely assistance for burnout recovery through employee assistance programs, mindfulness-based stress reduction programs, positivity currency buildup, and forming appreciative inquiry groups. PMID:27895218

  9. A New Approach to Identify High Burnout Medical Staffs by Kernel K-Means Cluster Analysis in a Regional Teaching Hospital in Taiwan.

    PubMed

    Lee, Yii-Ching; Huang, Shian-Chang; Huang, Chih-Hsuan; Wu, Hsin-Hung

    2016-01-01

    This study uses kernel k-means cluster analysis to identify medical staffs with high burnout. The data collected in October to November 2014 are from the emotional exhaustion dimension of the Chinese version of Safety Attitudes Questionnaire in a regional teaching hospital in Taiwan. The number of effective questionnaires including the entire staffs such as physicians, nurses, technicians, pharmacists, medical administrators, and respiratory therapists is 680. The results show that 8 clusters are generated by kernel k-means method. Employees in clusters 1, 4, and 5 are relatively in good conditions, whereas employees in clusters 2, 3, 6, 7, and 8 need to be closely monitored from time to time because they have relatively higher degree of burnout. When employees with higher degree of burnout are identified, the hospital management can take actions to improve the resilience, reduce the potential medical errors, and, eventually, enhance the patient safety. This study also suggests that the hospital management needs to keep track of medical staffs' fatigue conditions and provide timely assistance for burnout recovery through employee assistance programs, mindfulness-based stress reduction programs, positivity currency buildup, and forming appreciative inquiry groups. © The Author(s) 2016.

  10. A Genetic Algorithm That Exchanges Neighboring Centers for Fuzzy c-Means Clustering

    ERIC Educational Resources Information Center

    Chahine, Firas Safwan

    2012-01-01

    Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited for applications with large datasets than hierarchical clustering algorithms. K-means is among the most popular partitional clustering algorithm, but has a major…

  11. Linear regression models and k-means clustering for statistical analysis of fNIRS data

    PubMed Central

    Bonomini, Viola; Zucchelli, Lucia; Re, Rebecca; Ieva, Francesca; Spinelli, Lorenzo; Contini, Davide; Paganoni, Anna; Torricelli, Alessandro

    2015-01-01

    We propose a new algorithm, based on a linear regression model, to statistically estimate the hemodynamic activations in fNIRS data sets. The main concern guiding the algorithm development was the minimization of assumptions and approximations made on the data set for the application of statistical tests. Further, we propose a K-means method to cluster fNIRS data (i.e. channels) as activated or not activated. The methods were validated both on simulated and in vivo fNIRS data. A time domain (TD) fNIRS technique was preferred because of its high performances in discriminating cortical activation and superficial physiological changes. However, the proposed method is also applicable to continuous wave or frequency domain fNIRS data sets. PMID:25780751

  12. K-Means Algorithm Performance Analysis With Determining The Value Of Starting Centroid With Random And KD-Tree Method

    NASA Astrophysics Data System (ADS)

    Sirait, Kamson; Tulus; Budhiarti Nababan, Erna

    2017-12-01

    Clustering methods that have high accuracy and time efficiency are necessary for the filtering process. One well-known and widely applied clustering method is K-Means Clustering. In its application, the choice of the initial cluster centers greatly affects the results of the K-Means algorithm. This research discusses the results of K-Means Clustering with the starting centroids determined by a random method and by a KD-Tree method. On a data set of 1000 student academic records, used to classify students at risk of dropping out, random initial centroid selection gives an SSE value of 952972 for the quality variable and 232.48 for the GPA variable, whereas initial centroid selection by KD-Tree gives an SSE value of 504302 for the quality variable and 214.37 for the GPA variable. The smaller SSE values indicate that K-Means Clustering with initial KD-Tree centroid selection has better accuracy than K-Means Clustering with random initial centroid selection.
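
    The comparison in this record rests on the sum of squared errors (SSE): the total squared distance from each point to the centroid of its assigned cluster. The sketch below shows how two different starting centroids can converge to solutions with different SSE. It does not reproduce the KD-Tree initialization or the student data; the function names, the four toy points, and both sets of starting centroids are assumptions chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, init_centroids, max_iter=100):
    """Run k-means (Lloyd's iterations) from the given starting centroids."""
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no membership change: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Corners of a 4 x 1 rectangle: the natural split is left pair vs right pair.
toy_points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good_labels, good_centroids = lloyd(toy_points, [(0.0, 0.5), (4.0, 0.5)])
bad_labels, bad_centroids = lloyd(toy_points, [(2.0, 0.0), (2.0, 1.0)])

sse_good = sse(toy_points, good_labels, good_centroids)
sse_bad = sse(toy_points, bad_labels, bad_centroids)
```

    The second start is already a fixed point of the iterations, so it stays at a split into bottom and top pairs with a much larger SSE than the left and right split. This is the kind of initialization sensitivity the record describes.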

  13. Classifying epileptic EEG signals with delay permutation entropy and Multi-Scale K-means.

    PubMed

    Zhu, Guohun; Li, Yan; Wen, Peng Paul; Wang, Shuaifang

    2015-01-01

    Most epileptic EEG classification algorithms are supervised and require large training datasets, which hinders their use in real-time applications. This chapter proposes an unsupervised Multi-Scale K-means (MSK-means) algorithm to distinguish epileptic EEG signals and identify epileptic zones. The random initialization of the K-means algorithm can lead to wrong clusters. Based on the characteristics of EEGs, the MSK-means algorithm initializes the coarse-scale centroid of a cluster with a suitable scale factor. In this chapter, the MSK-means algorithm is proved theoretically superior to the K-means algorithm in efficiency. In addition, three classifiers, K-means, MSK-means, and support vector machine (SVM), are used to identify seizures and localize the epileptogenic zone using delay permutation entropy features. The experimental results demonstrate that identifying seizures with the MSK-means algorithm and delay permutation entropy achieves 4.7% higher accuracy than K-means and 0.7% higher accuracy than the SVM.

  14. K-means cluster analysis of rehabilitation service users in the Home Health Care System of Ontario: examining the heterogeneity of a complex geriatric population.

    PubMed

    Armstrong, Joshua J; Zhu, Mu; Hirdes, John P; Stolee, Paul

    2012-12-01

    To examine the heterogeneity of home care clients who use rehabilitation services by using the K-means algorithm to identify previously unknown patterns of clinical characteristics. Observational study of secondary data. Home care system. Assessment information was collected on 150,253 home care clients using the provincially mandated Resident Assessment Instrument-Home Care (RAI-HC) data system. Not applicable. Assessment information from every long-stay (>60 d) home care client that entered the home care system between 2005 and 2008 and used rehabilitation services within 3 months of their initial assessment was analyzed. The K-means clustering algorithm was applied using 37 variables from the RAI-HC assessment. The K-means cluster analysis resulted in the identification of 7 relatively homogeneous subgroups that differed on characteristics such as age, sex, cognition, and functional impairment. Client profiles were created to illustrate the diversity of this geriatric population. The K-means algorithm provided a useful way to segment a heterogeneous rehabilitation client population into more homogeneous subgroups. This analysis provides an enhanced understanding of client characteristics and needs, and could enable more appropriate targeting of rehabilitation services for home care clients. Copyright © 2012 American Congress of Rehabilitation Medicine. Published by Elsevier Inc. All rights reserved.

  15. Integrating K-means Clustering with Kernel Density Estimation for the Development of a Conditional Weather Generation Downscaling Model

    NASA Astrophysics Data System (ADS)

    Chen, Y.; Ho, C.; Chang, L.

    2011-12-01

    In recent decades, climate change caused by global warming has increased the frequency of extreme hydrological events. Water supply shortages caused by extreme events create great challenges for water resource management. To evaluate future climate variations, general circulation models (GCMs) are the most widely known tools; they show possible weather conditions under pre-defined CO2 emission scenarios announced by the IPCC. Because the study area of GCMs is the entire earth, the grid sizes of GCMs are much larger than the basin scale. To bridge this gap, a statistical downscaling technique can transform regional-scale weather factors into basin-scale precipitation. Statistical downscaling techniques can be divided into three categories: transfer function, weather generator, and weather type. The first two categories describe the relationships between weather factors and precipitation based, respectively, on deterministic algorithms, such as linear or nonlinear regression and ANN, and on stochastic approaches, such as Markov chain theory and statistical distributions. The weather type method clusters weather factors, which are high-dimensional and continuous variables, into weather types, which are a limited number of discrete states. In this study, the proposed downscaling model integrates the weather type method, using the K-means clustering algorithm, and the weather generator, using kernel density estimation. The study area is the Shihmen basin in northern Taiwan. The research process contains two steps, a calibration step and a synthesis step. Three sub-steps were used in the calibration step. First, weather factors, such as pressure, humidity, and wind speed, obtained from NCEP and the precipitation observed at rainfall stations were collected for downscaling. Second, K-means clustering grouped the weather factors into four weather types. Third, the Markov chain transition matrixes and the

  16. Complex time series analysis of PM10 and PM2.5 for a coastal site using artificial neural network modelling and k-means clustering

    NASA Astrophysics Data System (ADS)

    Elangasinghe, M. A.; Singhal, N.; Dirks, K. N.; Salmond, J. A.; Samarasinghe, S.

    2014-09-01

    This paper uses artificial neural networks (ANN), combined with k-means clustering, to understand the complex time series of PM10 and PM2.5 concentrations at a coastal location of New Zealand based on data from a single site. Out of available meteorological parameters from the network (wind speed, wind direction, solar radiation, temperature, relative humidity), key factors governing the pattern of the time series concentrations were identified through input sensitivity analysis performed on the trained neural network model. The transport pathways of particulate matter under these key meteorological parameters were further analysed through bivariate concentration polar plots and k-means clustering techniques. The analysis shows that the external sources such as marine aerosols and local sources such as traffic and biomass burning contribute equally to the particulate matter concentrations at the study site. These results are in agreement with the results of receptor modelling by the Auckland Council based on Positive Matrix Factorization (PMF). Our findings also show that contrasting concentration-wind speed relationships exist between marine aerosols and local traffic sources resulting in very noisy and seemingly large random PM10 concentrations. The inclusion of cluster rankings as an input parameter to the ANN model showed a statistically significant (p < 0.005) improvement in the performance of the ANN time series model and also showed better performance in picking up high concentrations. For the presented case study, the correlation coefficient between observed and predicted concentrations improved from 0.77 to 0.79 for PM2.5 and from 0.63 to 0.69 for PM10 and reduced the root mean squared error (RMSE) from 5.00 to 4.74 for PM2.5 and from 6.77 to 6.34 for PM10. The techniques presented here enable the user to obtain an understanding of potential sources and their transport characteristics prior to the implementation of costly chemical analysis techniques or

  17. Comparison of K-means and fuzzy c-means algorithm performance for automated determination of the arterial input function.

    PubMed

    Yin, Jiandong; Sun, Hongzan; Yang, Jiawen; Guo, Qiyong

    2014-01-01

    The arterial input function (AIF) plays a crucial role in the quantification of cerebral perfusion parameters. The traditional method for AIF detection is based on manual operation, which is time-consuming and subjective. Two automatic methods have been reported that are based on two frequently used clustering algorithms: fuzzy c-means (FCM) and K-means. However, it is still not clear which is better for AIF detection. Hence, we compared the performance of these two clustering methods using both simulated and clinical data. The results demonstrate that K-means analysis can yield more accurate and robust AIF results, although it takes longer to execute than the FCM method. We consider that this longer execution time is trivial relative to the total time required for image manipulation in a PACS setting, and is acceptable if an ideal AIF is obtained. Therefore, the K-means method is preferable to FCM in AIF detection.

  18. A Fast SVM-Based Tongue's Colour Classification Aided by k-Means Clustering Identifiers and Colour Attributes as Computer-Assisted Tool for Tongue Diagnosis.

    PubMed

    Kamarudin, Nur Diyana; Ooi, Chia Yee; Kawanabe, Tadaaki; Odaguchi, Hiroshi; Kobayashi, Fuminori

    2017-01-01

    In tongue diagnosis, colour information of the tongue body holds valuable information regarding the state of disease and its correlation with the internal organs. Qualitatively, practitioners may have difficulty in their judgement due to unstable lighting conditions and the limits of the naked eye's ability to capture the exact colour distribution on the tongue, especially a tongue with multicolour substance. To overcome this ambiguity, this paper presents a two-stage tongue's multicolour classification based on a support vector machine (SVM) whose support vectors are reduced by our proposed k-means clustering identifiers and red colour range for precise tongue colour diagnosis. In the first stage, k-means clustering is used to cluster a tongue image into four clusters of image background (black), deep red region, red/light red region, and transitional region. In the second-stage classification, red/light red tongue images are further classified into red tongue or light red tongue based on the red colour range derived in our work. Overall, true rate classification accuracy of the proposed two-stage classification to diagnose red, light red, and deep red tongue colours is 94%. The number of support vectors in SVM is improved by 41.2%, and the execution time for one image is recorded as 48 seconds.

  19. A Fast SVM-Based Tongue's Colour Classification Aided by k-Means Clustering Identifiers and Colour Attributes as Computer-Assisted Tool for Tongue Diagnosis

    PubMed Central

    Ooi, Chia Yee; Kawanabe, Tadaaki; Odaguchi, Hiroshi; Kobayashi, Fuminori

    2017-01-01

    In tongue diagnosis, colour information of the tongue body holds valuable information regarding the state of disease and its correlation with the internal organs. Qualitatively, practitioners may have difficulty in their judgement due to unstable lighting conditions and the limits of the naked eye's ability to capture the exact colour distribution on the tongue, especially a tongue with multicolour substance. To overcome this ambiguity, this paper presents a two-stage tongue's multicolour classification based on a support vector machine (SVM) whose support vectors are reduced by our proposed k-means clustering identifiers and red colour range for precise tongue colour diagnosis. In the first stage, k-means clustering is used to cluster a tongue image into four clusters of image background (black), deep red region, red/light red region, and transitional region. In the second-stage classification, red/light red tongue images are further classified into red tongue or light red tongue based on the red colour range derived in our work. Overall, true rate classification accuracy of the proposed two-stage classification to diagnose red, light red, and deep red tongue colours is 94%. The number of support vectors in SVM is improved by 41.2%, and the execution time for one image is recorded as 48 seconds. PMID:29065640

  20. 2-Way k-Means as a Model for Microbiome Samples.

    PubMed

    Jackson, Weston J; Agarwal, Ipsita; Pe'er, Itsik

    2017-01-01

    Motivation. Microbiome sequencing allows defining clusters of samples with shared composition. However, this paradigm poorly accounts for samples whose composition is a mixture of cluster-characterizing ones and which therefore lie in between them in the cluster space. This paper addresses unsupervised learning of 2-way clusters. It defines a mixture model that allows 2-way cluster assignment and describes a variant of generalized k-means for learning such a model. We demonstrate applicability to microbial 16S rDNA sequencing data from the Human Vaginal Microbiome Project.

  1. 2-Way k-Means as a Model for Microbiome Samples

    PubMed Central

    2017-01-01

    Motivation. Microbiome sequencing allows defining clusters of samples with shared composition. However, this paradigm poorly accounts for samples whose composition is a mixture of cluster-characterizing ones and which therefore lie in between them in the cluster space. This paper addresses unsupervised learning of 2-way clusters. It defines a mixture model that allows 2-way cluster assignment and describes a variant of generalized k-means for learning such a model. We demonstrate applicability to microbial 16S rDNA sequencing data from the Human Vaginal Microbiome Project. PMID:29177026

  2. Star clusters and K2

    NASA Astrophysics Data System (ADS)

    Dotson, Jessie; Barentsen, Geert; Cody, Ann Marie

    2018-01-01

    The K2 survey has expanded the Kepler legacy by using the repurposed spacecraft to observe over 20 star clusters. The sample includes open and globular clusters at all ages, including very young (1-10 Myr, e.g. Taurus, Upper Sco, NGC 6530), moderately young (0.1-1 Gyr, e.g. M35, M44, Pleiades, Hyades), middle-aged (e.g. M67, Ruprecht 147, NGC 2158), and old globular clusters (e.g. M9, M19, Terzan 5). K2 observations of stellar clusters are exploring the rotation period-mass relationship to significantly lower masses than was previously possible, shedding light on the angular momentum budget and its dependence on mass and circumstellar disk properties, and illuminating the role of multiplicity in stellar angular momentum. Exoplanets discovered by K2 in stellar clusters provide planetary systems ripe for modeling given the extensive information available about their ages and environment. I will review the star clusters sampled by K2 across 16 fields so far, highlighting several characteristics, caveats, and unexplored uses of the public data set along the way. With fuel expected to run out in 2018, I will discuss the closing Campaigns, highlight the final target selection opportunities, and explain the data archive and TESS-compatible software tools the K2 mission intends to leave behind for posterity.

  3. Comparison of K-Means and Fuzzy c-Means Algorithm Performance for Automated Determination of the Arterial Input Function

    PubMed Central

    Yin, Jiandong; Sun, Hongzan; Yang, Jiawen; Guo, Qiyong

    2014-01-01

    The arterial input function (AIF) plays a crucial role in the quantification of cerebral perfusion parameters. The traditional method for AIF detection is based on manual operation, which is time-consuming and subjective. Two automatic methods have been reported that are based on two frequently used clustering algorithms: fuzzy c-means (FCM) and K-means. However, it is still not clear which is better for AIF detection. Hence, we compared the performance of these two clustering methods using both simulated and clinical data. The results demonstrate that K-means analysis can yield more accurate and robust AIF results, although it takes longer to execute than the FCM method. We consider that this longer execution time is trivial relative to the total time required for image manipulation in a PACS setting, and is acceptable if an ideal AIF is obtained. Therefore, the K-means method is preferable to FCM in AIF detection. PMID:24503700

  4. Noise reduction and functional maps image quality improvement in dynamic CT perfusion using a new k-means clustering guided bilateral filter (KMGB).

    PubMed

    Pisana, Francesco; Henzler, Thomas; Schönberg, Stefan; Klotz, Ernst; Schmidt, Bernhard; Kachelrieß, Marc

    2017-07-01

    Dynamic CT perfusion (CTP) consists of repeated acquisitions of the same volume at different time steps, slightly before, during and slightly after the injection of contrast media. Important functional information can be derived for each voxel, which reflects the local hemodynamic properties and hence the metabolism of the tissue. Different approaches are being investigated to exploit data redundancy and prior knowledge for noise reduction of such datasets, ranging from iterative reconstruction schemes to high dimensional filters. We propose a new spatial bilateral filter which makes use of the k-means clustering algorithm and of an optimally calculated guiding image. We named the proposed filter the k-means clustering guided bilateral filter (KMGB). In this study, the KMGB filter is compared with the partial temporal non-local means filter (PATEN), with the time-intensity profile similarity (TIPS) filter, and with a new version derived from it by introducing the guiding image (GB-TIPS). All the filters were tested on a digital in-house developed brain CTP phantom, where noise was added to simulate 80 kV and 200 mAs (default scanning parameters), 100 mAs and 30 mAs. Moreover, the filters' performance was tested on 7 noisy clinical datasets with different pathologies in different body regions. The original contribution of our work is two-fold: first, we propose an efficient algorithm to calculate a guiding image to improve the results of the TIPS filter; second, we propose the introduction of the k-means clustering step and demonstrate how this can potentially replace the TIPS part of the filter, obtaining better results at lower computational effort. As expected, in the GB-TIPS, the introduction of the guiding image limits the over-smoothing of the TIPS filter, improving spatial resolution by more than 50%. Furthermore, replacing the time-intensity profile similarity calculation with a fuzzy k-means clustering strategy (KMGB) allows to control the edge preserving

  5. A diabetic retinopathy detection method using an improved pillar K-means algorithm.

    PubMed

    Gogula, Susmitha Valli; Divakar, Ch; Satyanarayana, Ch; Rao, Allam Appa

    2014-01-01

    The paper presents a new approach for medical image segmentation. Exudates are a visible sign of diabetic retinopathy, which is the major cause of vision loss in patients with diabetes. If the exudates extend into the macular area, blindness may occur. Automated detection of exudates will assist ophthalmologists in early diagnosis. This segmentation process includes a new mechanism for clustering the elements of high-resolution images in order to improve precision and reduce computation time. The system applies K-means clustering to the image segmentation after optimization by the Pillar algorithm; pillars are constructed in such a way that they can withstand the pressure. The improved pillar algorithm can optimize the K-means clustering for image segmentation in terms of precision and computation time. The proposed approach for image segmentation is evaluated by comparing it with K-means and Fuzzy C-means on a medical image. Using this method, identification of dark spots in the retina becomes easier, and the proposed algorithm is applied to diabetic retinal images of all stages to identify hard and soft exudates, whereas the existing pillar K-means is more appropriate for brain MRI images. The proposed system helps doctors to identify the problem at an early stage and can suggest a better drug for preventing further retinal damage.

  6. Valley and channel networks extraction based on local topographic curvature and k-means clustering of contours

    NASA Astrophysics Data System (ADS)

    Hooshyar, Milad; Wang, Dingbao; Kim, Seoyoung; Medeiros, Stephen C.; Hagen, Scott C.

    2016-10-01

    A method for automatic extraction of valley and channel networks from high-resolution digital elevation models (DEMs) is presented. This method utilizes both positive (i.e., convergent topography) and negative (i.e., divergent topography) curvature to delineate the valley network. The valley and ridge skeletons are extracted using the pixels' curvature and the local terrain conditions. The valley network is generated by checking the terrain for the existence of at least one ridge between two intersecting valleys. The transition from unchannelized to channelized sections (i.e., channel head) in each first-order valley tributary is identified independently by categorizing the corresponding contours using an unsupervised approach based on k-means clustering. The method does not require a spatially constant channel initiation threshold (e.g., curvature or contributing area). Moreover, instead of a point attribute (e.g., curvature), the proposed clustering method utilizes the shape of contours, which reflects the entire cross-sectional profile including possible banks. The method was applied to three catchments: Indian Creek and Mid Bailey Run in Ohio and Feather River in California. The accuracy of channel head extraction from the proposed method is comparable to state-of-the-art channel extraction methods.

  7. Integrating an artificial intelligence approach with k-means clustering to model groundwater salinity: the case of Gaza coastal aquifer (Palestine)

    NASA Astrophysics Data System (ADS)

    Alagha, Jawad S.; Seyam, Mohammed; Md Said, Md Azlin; Mogheir, Yunes

    2017-12-01

    Artificial intelligence (AI) techniques have increasingly become efficient alternative modeling tools in the water resources field, particularly when the modeled process is influenced by complex and interrelated variables. In this study, two AI techniques, artificial neural networks (ANNs) and support vector machine (SVM), were employed to achieve deeper understanding of the salinization process (represented by chloride concentration) in complex coastal aquifers influenced by various salinity sources. Both models were trained using 11 years of groundwater quality data from 22 municipal wells in Khan Younis Governorate, Gaza, Palestine. Both techniques showed satisfactory prediction performance, where the mean absolute percentage error (MAPE) and correlation coefficient (R) for the test data set were, respectively, about 4.5 and 99.8% for the ANNs model, and 4.6 and 99.7% for SVM model. The performances of the developed models were further noticeably improved through preprocessing the wells data set using a k-means clustering method, then conducting AI techniques separately for each cluster. The developed models with clustered data were associated with higher performance, easiness and simplicity. They can be employed as an analytical tool to investigate the influence of input variables on coastal aquifer salinity, which is of great importance for understanding salinization processes, leading to more effective water-resources-related planning and decision making.

  8. Genetic k-Means Clustering Approach for Mapping Human Vulnerability to Chemical Hazards in the Industrialized City: A Case Study of Shanghai, China

    PubMed Central

    Shi, Weifang; Zeng, Weihua

    2013-01-01

    Reducing human vulnerability to chemical hazards in the industrialized city is a matter of great urgency. Vulnerability mapping is an alternative approach for providing vulnerability-reducing interventions in a region. This study presents a method for mapping human vulnerability to chemical hazards by using clustering analysis for effective vulnerability reduction. Taking the city of Shanghai as the study area, we measure human exposure to chemical hazards by using the proximity model with additionally considering the toxicity of hazardous substances, and capture the sensitivity and coping capacity with corresponding indicators. We perform an improved k-means clustering approach on the basis of genetic algorithm by using a 500 m × 500 m geographical grid as basic spatial unit. The sum of squared errors and silhouette coefficient are combined to measure the quality of clustering and to determine the optimal clustering number. Clustering result reveals a set of six typical human vulnerability patterns that show distinct vulnerability dimension combinations. The vulnerability mapping of the study area reflects cluster-specific vulnerability characteristics and their spatial distribution. Finally, we suggest specific points that can provide new insights in rationally allocating the limited funds for the vulnerability reduction of each cluster. PMID:23787337
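
    The two quality measures named in this abstract, the sum of squared errors (SSE) and the silhouette coefficient, can be illustrated with a minimal sketch. The abstract does not say how the authors combine them, so the code below only computes each measure for a given partition; the function names and the toy data are hypothetical, and the silhouette uses its standard textbook definition rather than anything taken from the paper.

```python
import math


def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient of a partition (needs at least two clusters)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two tight groups far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
toy_centroids = [(0.0, 0.5), (10.0, 0.5)]
print(sse(toy_points, toy_labels, toy_centroids))
print(silhouette(toy_points, toy_labels))
```

    A lower SSE always results from adding clusters, whereas the silhouette coefficient penalizes splits that do not separate the data, which is one plausible reason to look at both when choosing the number of clusters.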

  9. An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China.

    PubMed

    Zou, Hui; Zou, Zhihong; Wang, Xiaojing

    2015-11-12

    The increase and the complexity of data caused by the uncertain environment is today's reality. In order to identify water quality effectively and reliably, this paper presents a modified fast clustering algorithm for water quality analysis. The algorithm has adopted a varying weights K-means cluster algorithm to analyze water monitoring data. The varying weights scheme was the best weighting indicator selected by a modified indicator weight self-adjustment algorithm based on K-means, which is named MIWAS-K-means. The new clustering algorithm avoids the margin of the iteration not being calculated in some cases. With the fast clustering analysis, we can identify the quality of water samples. The algorithm is applied in water quality analysis of the Haihe River (China) data obtained by the monitoring network over a period of eight years (2006-2013) with four indicators at seven different sites (2078 samples). Both the theoretical and simulated results demonstrate that the algorithm is efficient and reliable for water quality analysis of the Haihe River. In addition, the algorithm can be applied to more complex data matrices with high dimensionality.

  10. Multivariate Spatial Condition Mapping Using Subtractive Fuzzy Cluster Means

    PubMed Central

    Sabit, Hakilo; Al-Anbuky, Adnan

    2014-01-01

    Wireless sensor networks are usually deployed for monitoring given physical phenomena taking place in a specific space and over a specific duration of time. The spatio-temporal distribution of these phenomena often correlates to certain physical events. To appropriately characterise these events-phenomena relationships over a given space for a given time frame, we require continuous monitoring of the conditions. WSNs are perfectly suited for these tasks, due to their inherent robustness. This paper presents a subtractive fuzzy cluster means algorithm and its application in data stream mining for wireless sensor systems over a cloud-computing-like architecture, which we call sensor cloud data stream mining. Benchmarking on standard mining algorithms, the k-means and the FCM algorithms, we have demonstrated that the subtractive fuzzy cluster means model can perform high quality distributed data stream mining tasks comparable to centralised data stream mining. PMID:25313495

  11. Location and Size Planning of Distributed Photovoltaic Generation in Distribution network System Based on K-means Clustering Analysis

    NASA Astrophysics Data System (ADS)

    Lu, Siqi; Wang, Xiaorong; Wu, Junyong

    2018-01-01

    The paper presents a method to generate the planning scenarios, which is based on a data-driven K-means clustering analysis algorithm, for the location and size planning of distributed photovoltaic (PV) units in the network. Taking the power losses of the network, the installation and maintenance costs of distributed PV, the profit of distributed PV and the voltage offset as objectives, and the locations and sizes of distributed PV as decision variables, the Pareto optimal front is obtained through the self-adaptive genetic algorithm (GA), and solutions are ranked by a method called technique for order preference by similarity to an ideal solution (TOPSIS). Finally, the planning schemes at the top of the ranking list are selected based on different planning emphases after detailed analysis. The proposed method is applied to a 10-kV distribution network in Gansu Province, China and the results are discussed.

  12. Iris recognition using image moments and k-means algorithm.

    PubMed

    Khan, Yaser Daanial; Khan, Sher Afzal; Ahmad, Farooq; Islam, Saeed

    2014-01-01

    This paper presents a biometric technique for identification of a person using the iris image. The iris is first segmented from the acquired image of an eye using an edge detection algorithm. The disk shaped area of the iris is transformed into a rectangular form. Described moments are extracted from the grayscale image which yields a feature vector containing scale, rotation, and translation invariant moments. Images are clustered using the k-means algorithm and centroids for each cluster are computed. An arbitrary image is assumed to belong to the cluster whose centroid is the nearest to the feature vector in terms of Euclidean distance computed. The described model exhibits an accuracy of 98.5%.
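
    The final classification step described here, assigning an image to the cluster whose centroid is nearest in Euclidean distance, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the feature vectors are made-up numbers, and the moment extraction itself is not shown.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def nearest_centroid(feature_vector, centroids):
    """Index of the centroid closest to feature_vector in Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(feature_vector, centroids[k]))


# Hypothetical feature vectors already grouped into two clusters.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(9.0, 9.0), (11.0, 11.0)]
centroids = [centroid(cluster_a), centroid(cluster_b)]
print(nearest_centroid((1.5, 0.5), centroids))
```

    Ties are resolved in favour of the lower cluster index, which is an arbitrary choice made for this sketch.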

  13. Iris Recognition Using Image Moments and k-Means Algorithm

    PubMed Central

    Khan, Yaser Daanial; Khan, Sher Afzal; Ahmad, Farooq; Islam, Saeed

    2014-01-01

    This paper presents a biometric technique for identification of a person using the iris image. The iris is first segmented from the acquired image of an eye using an edge detection algorithm. The disk shaped area of the iris is transformed into a rectangular form. Described moments are extracted from the grayscale image which yields a feature vector containing scale, rotation, and translation invariant moments. Images are clustered using the k-means algorithm and centroids for each cluster are computed. An arbitrary image is assumed to belong to the cluster whose centroid is the nearest to the feature vector in terms of Euclidean distance computed. The described model exhibits an accuracy of 98.5%. PMID:24977221

  14. Research on hotspot discovery in internet public opinions based on improved K-means.

    PubMed

    Wang, Gensheng

    2013-01-01

    How to effectively discover hotspots in Internet public opinion is an active research topic that plays a key role in helping governments and corporations find useful information in the mass of data on the Internet. An improved K-means algorithm for hotspot discovery in internet public opinions is presented based on the analysis of existing defects and calculation principle of original K-means algorithm. First, some new methods are designed to preprocess website texts, select and express the characteristics of website texts, and define the similarity between two website texts, respectively. Second, clustering principle and the method of initial classification centers selection are analyzed and improved in order to overcome the limitations of original K-means algorithm. Finally, the experimental results verify that the improved algorithm can improve the clustering stability and classification accuracy of hotspot discovery in internet public opinions when used in practice.

  15. Research on Hotspot Discovery in Internet Public Opinions Based on Improved K-Means

    PubMed Central

    2013-01-01

    How to effectively discover hotspots in Internet public opinion is an active research topic that plays a key role in helping governments and corporations find useful information in the mass of data on the Internet. An improved K-means algorithm for hotspot discovery in internet public opinions is presented based on the analysis of existing defects and calculation principle of original K-means algorithm. First, some new methods are designed to preprocess website texts, select and express the characteristics of website texts, and define the similarity between two website texts, respectively. Second, clustering principle and the method of initial classification centers selection are analyzed and improved in order to overcome the limitations of original K-means algorithm. Finally, the experimental results verify that the improved algorithm can improve the clustering stability and classification accuracy of hotspot discovery in internet public opinions when used in practice. PMID:24106496

  16. Applications of cluster analysis to the creation of perfectionism profiles: a comparison of two clustering approaches.

    PubMed

    Bolin, Jocelyn H; Edwards, Julianne M; Finch, W Holmes; Cassady, Jerrell C

    2014-01-01

    Although traditional clustering methods (e.g., K-means) have been shown to be useful in the social sciences it is often difficult for such methods to handle situations where clusters in the population overlap or are ambiguous. Fuzzy clustering, a method already recognized in many disciplines, provides a more flexible alternative to these traditional clustering methods. Fuzzy clustering differs from other traditional clustering methods in that it allows for a case to belong to multiple clusters simultaneously. Unfortunately, fuzzy clustering techniques remain relatively unused in the social and behavioral sciences. The purpose of this paper is to introduce fuzzy clustering to these audiences who are currently relatively unfamiliar with the technique. In order to demonstrate the advantages associated with this method, cluster solutions of a common perfectionism measure were created using both fuzzy clustering and K-means clustering, and the results compared. Results of these analyses reveal that different cluster solutions are found by the two methods, and the similarity between the different clustering solutions depends on the amount of cluster overlap allowed for in fuzzy clustering.
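
    The contrast drawn here between hard and fuzzy assignment can be made concrete with a small sketch. The abstract does not name the fuzzy algorithm used, so this assumes the standard fuzzy c-means membership formula with fuzzifier m; the function names and the example centroids are hypothetical.

```python
import math


def hard_assignment(point, centroids):
    """K-means style: the point belongs to exactly one cluster, the nearest."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(point, centroids[k]))


def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy c-means style: one membership degree per cluster, summing to 1.

    m > 1 is the fuzzifier; larger values allow more overlap between clusters.
    """
    dists = [math.dist(point, c) for c in centroids]
    # A point lying exactly on one or more centroids belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
            for d_i in dists]


centroids = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster centres
print(hard_assignment((1.0, 0.0), centroids))
print(fuzzy_memberships((1.0, 0.0), centroids))
```

    Raising m spreads the memberships more evenly across clusters, which corresponds to the "amount of cluster overlap allowed" that the abstract says drives the differences between the two solutions.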

  17. Applications of cluster analysis to the creation of perfectionism profiles: a comparison of two clustering approaches

    PubMed Central

    Bolin, Jocelyn H.; Edwards, Julianne M.; Finch, W. Holmes; Cassady, Jerrell C.

    2014-01-01

    Although traditional clustering methods (e.g., K-means) have been shown to be useful in the social sciences it is often difficult for such methods to handle situations where clusters in the population overlap or are ambiguous. Fuzzy clustering, a method already recognized in many disciplines, provides a more flexible alternative to these traditional clustering methods. Fuzzy clustering differs from other traditional clustering methods in that it allows for a case to belong to multiple clusters simultaneously. Unfortunately, fuzzy clustering techniques remain relatively unused in the social and behavioral sciences. The purpose of this paper is to introduce fuzzy clustering to these audiences who are currently relatively unfamiliar with the technique. In order to demonstrate the advantages associated with this method, cluster solutions of a common perfectionism measure were created using both fuzzy clustering and K-means clustering, and the results compared. Results of these analyses reveal that different cluster solutions are found by the two methods, and the similarity between the different clustering solutions depends on the amount of cluster overlap allowed for in fuzzy clustering. PMID:24795683

  18. An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China

    PubMed Central

    Zou, Hui; Zou, Zhihong; Wang, Xiaojing

    2015-01-01

    The increase and the complexity of data caused by the uncertain environment is today’s reality. In order to identify water quality effectively and reliably, this paper presents a modified fast clustering algorithm for water quality analysis. The algorithm has adopted a varying weights K-means cluster algorithm to analyze water monitoring data. The varying weights scheme was the best weighting indicator selected by a modified indicator weight self-adjustment algorithm based on K-means, which is named MIWAS-K-means. The new clustering algorithm avoids the margin of the iteration not being calculated in some cases. With the fast clustering analysis, we can identify the quality of water samples. The algorithm is applied in water quality analysis of the Haihe River (China) data obtained by the monitoring network over a period of eight years (2006–2013) with four indicators at seven different sites (2078 samples). Both the theoretical and simulated results demonstrate that the algorithm is efficient and reliable for water quality analysis of the Haihe River. In addition, the algorithm can be applied to more complex data matrices with high dimensionality. PMID:26569283

  19. Distributed k-Means Algorithm and Fuzzy c-Means Algorithm for Sensor Networks Based on Multiagent Consensus Theory.

    PubMed

    Qin, Jiahu; Fu, Weiming; Gao, Huijun; Zheng, Wei Xing

    2016-03-03

    This paper is concerned with developing a distributed k-means algorithm and a distributed fuzzy c-means algorithm for wireless sensor networks (WSNs) where each node is equipped with sensors. The underlying topology of the WSN is supposed to be strongly connected. The consensus algorithm in multiagent consensus theory is utilized to exchange the measurement information of the sensors in WSN. To obtain a faster convergence speed as well as a higher possibility of having the global optimum, a distributed k-means++ algorithm is first proposed to find the initial centroids before executing the distributed k-means algorithm and the distributed fuzzy c-means algorithm. The proposed distributed k-means algorithm is capable of partitioning the data observed by the nodes into measure-dependent groups which have small in-group and large out-group distances, while the proposed distributed fuzzy c-means algorithm is capable of partitioning the data observed by the nodes into different measure-dependent groups with degrees of membership values ranging from 0 to 1. Simulation results show that the proposed distributed algorithms can achieve almost the same results as those given by the centralized clustering algorithms.
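
    For orientation, the k-means++ seeding that this abstract builds on can be sketched in its ordinary centralized form. This is not the distributed, consensus-based variant the authors propose; it is a minimal single-machine sketch of the standard seeding rule, with hypothetical function and variable names.

```python
import math
import random


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centroids with the standard (centralized) k-means++ rule."""
    rng = rng or random.Random()
    # First centroid: chosen uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        if sum(d2) == 0.0:
            # Degenerate case: every point coincides with a chosen centroid.
            centroids.append(rng.choice(points))
            continue
        # Sample the next centroid with probability proportional to d2,
        # so points far from the existing centroids are favoured.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


data = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]  # hypothetical
print(kmeans_pp_init(data, 2, rng=random.Random(0)))
```

    Spreading the initial centroids in this way is what the abstract credits for faster convergence and a better chance of reaching the global optimum.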

  20. Identification of column edges of DNA fragments by using K-means clustering and mean algorithm on lane histograms of DNA agarose gel electrophoresis images

    NASA Astrophysics Data System (ADS)

    Turan, Muhammed K.; Sehirli, Eftal; Elen, Abdullah; Karas, Ismail R.

    2015-07-01

    Gel electrophoresis (GE) is one of the most widely used methods for separating DNA, RNA, and protein molecules according to size, weight, and quantity in many areas such as genetics, molecular biology, biochemistry, and microbiology. The main way to separate each molecule is to find the borders of each molecule fragment. This paper presents a software application that shows the column edges of DNA fragments in three steps. In the first step, the application obtains lane histograms of agarose gel electrophoresis images by projection along the x-axis. In the second step, it uses the k-means clustering algorithm to classify the point values of the lane histogram into left-side values, right-side values, and undesired values. In the third step, the column edges of DNA fragments are shown by using a mean algorithm and mathematical processes to separate DNA fragments from the background in a fully automated way. In addition, the application reports the locations of DNA fragments and how many DNA fragments exist in images captured by a scientific camera.

  1. Research on classified real-time flood forecasting framework based on K-means cluster and rough set.

    PubMed

    Xu, Wei; Peng, Yong

    2015-01-01

    This research presents a new classified real-time flood forecasting framework. In this framework, historical floods are classified by a K-means cluster according to the spatial and temporal distribution of precipitation, the time variance of precipitation intensity and other hydrological factors. Based on the classified results, a rough set is used to extract the identification rules for real-time flood forecasting. Then, the parameters of different categories within the conceptual hydrological model are calibrated using a genetic algorithm. In real-time forecasting, the corresponding category of parameters is selected for flood forecasting according to the obtained flood information. This research tests the new classified framework on Guanyinge Reservoir and compares the framework with the traditional flood forecasting method. It finds that the performance of the new classified framework is significantly better in terms of accuracy. Furthermore, the framework can be considered in a catchment with fewer historical floods.

  2. Spectrally efficient digitized radio-over-fiber system with k-means clustering-based multidimensional quantization.

    PubMed

    Zhang, Lu; Pang, Xiaodan; Ozolins, Oskars; Udalcovs, Aleksejs; Popov, Sergei; Xiao, Shilin; Hu, Weisheng; Chen, Jiajia

    2018-04-01

    We propose a spectrally efficient digitized radio-over-fiber (D-RoF) system by grouping highly correlated neighboring samples of the analog signals into multidimensional vectors, where the k-means clustering algorithm is adopted for adaptive quantization. A 30 Gbit/s D-RoF system is experimentally demonstrated to validate the proposed scheme, reporting a carrier aggregation of up to 40 100-MHz orthogonal frequency division multiplexing (OFDM) channels with a quadrature amplitude modulation (QAM) order of 4 and an aggregation of 10 100-MHz OFDM channels with a QAM order of 16384. Equivalent common public radio interface rates from 37 to 150 Gbit/s are supported. In addition, an error vector magnitude (EVM) of 8% is achieved with 4 quantization bits, and the EVM can be further reduced to 1% by increasing the number of quantization bits to 7. Compared with conventional pulse code modulation-based D-RoF systems, the proposed D-RoF system improves the signal-to-noise ratio by up to ∼9 dB and greatly reduces the EVM, given the same number of quantization bits.
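
    The quantization idea in this abstract (grouping neighboring samples into vectors and learning a codebook with k-means) can be sketched as follows. This is a generic vector-quantization illustration under simplifying assumptions, not the authors' D-RoF implementation; the function names, the sine test signal, the vector dimension of 2, and the 16-entry codebook are all hypothetical choices.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def group_samples(samples, dim):
    """Group consecutive samples into dim-dimensional vectors (a ragged tail is dropped)."""
    n = len(samples) // dim
    return [tuple(samples[i * dim:(i + 1) * dim]) for i in range(n)]


def nearest(v, codebook):
    """Index of the codeword closest to vector v."""
    return min(range(len(codebook)), key=lambda j: sq_dist(v, codebook[j]))


def train_codebook(vectors, k, iters=20, seed=0):
    """Learn k codewords with Lloyd's k-means iterations."""
    rng = random.Random(seed)
    codebook = rng.sample(vectors, k)  # initial codewords drawn from the data
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for v in vectors:
            cells[nearest(v, codebook)].append(v)
        for j, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous codeword
                codebook[j] = tuple(sum(c) / len(cell) for c in zip(*cell))
    return codebook


def quantize(vectors, codebook):
    """Replace each vector by the index of its nearest codeword."""
    return [nearest(v, codebook) for v in vectors]


# Hypothetical signal: correlated neighboring samples of a sine wave.
signal = [math.sin(0.1 * t) for t in range(400)]
vectors = group_samples(signal, 2)
codebook = train_codebook(vectors, 16)
indices = quantize(vectors, codebook)
print(len(vectors), len(codebook), max(indices) < 16)
```

    With 16 codewords, each index needs 4 bits and covers two samples at once. Exploiting the correlation between neighboring samples in this way is the motivation the abstract gives for multidimensional quantization.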

  3. Segmentation of Brain Lesions in MRI and CT Scan Images: A Hybrid Approach Using k-Means Clustering and Image Morphology

    NASA Astrophysics Data System (ADS)

    Agrawal, Ritu; Sharma, Manisha; Singh, Bikesh Kumar

    2018-04-01

    Manual segmentation and analysis of lesions in medical images is time consuming and subject to human error. Automated segmentation has thus gained significant attention in recent years. This article presents a hybrid approach for brain lesion segmentation in different imaging modalities by combining a median filter, k-means clustering, Sobel edge detection and morphological operations. The median filter is an essential pre-processing step and is used to remove impulsive noise from the acquired brain images, followed by k-means segmentation, Sobel edge detection and morphological processing. The performance of the proposed automated system is tested on standard datasets using performance measures such as segmentation accuracy and execution time. The proposed method achieves a high accuracy of 94% when compared with manual delineation performed by an expert radiologist. Furthermore, the statistical significance test between lesions segmented using the automated approach and those obtained by expert delineation, using ANOVA and the correlation coefficient, achieved high significance values of 0.986 and 1 respectively. The experimental results obtained are discussed in light of some recently reported studies.
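
    The first two stages of the pipeline described above (median filtering followed by k-means on pixel intensities) can be sketched on a toy image. This is only an illustration under simplifying assumptions: it omits the Sobel edge detection and morphological stages, and the 3x3 window, the evenly spaced initial centres, and the 6x6 test image are hypothetical choices rather than details taken from the article.

```python
from statistics import median


def median_filter(img):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out


def kmeans_1d(values, k=2, iters=20):
    """Lloyd's algorithm on scalar intensities, seeded with evenly spaced centres."""
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda j: abs(v - centres[j]))].append(v)
        # An empty group keeps its previous centre.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres


def segment(img, k=2):
    """Denoise, cluster intensities, and label each pixel with its cluster index."""
    smooth = median_filter(img)
    centres = kmeans_1d([v for row in smooth for v in row], k)
    return [[min(range(k), key=lambda j: abs(v - centres[j])) for v in row]
            for row in smooth]


# Hypothetical 6x6 image: dark background, bright 3x3 "lesion", one impulse-noise pixel.
image = [[10] * 6 for _ in range(6)]
for y in range(2, 5):
    for x in range(2, 5):
        image[y][x] = 200
image[0][0] = 255  # impulse noise in the background

labels = segment(image)
for row in labels:
    print(row)
```

    In this toy case the median filter suppresses the isolated bright pixel before clustering, so it is labelled as background, while the interior of the bright block falls in the other cluster. The corners of the block are eroded by the filter, which is one reason later morphological processing can be useful.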

  4. A new locally weighted K-means for cancer-aided microarray data analysis.

    PubMed

    Iam-On, Natthakan; Boongoen, Tossapon

    2012-11-01

    Cancer has been identified as the leading cause of death. It is predicted that around 20-26 million people will be diagnosed with cancer by 2020. With this alarming rate, there is an urgent need for a more effective methodology to understand, prevent and cure cancer. Microarray technology provides a useful basis of achieving this goal, with cluster analysis of gene expression data leading to the discrimination of patients, identification of possible tumor subtypes and individualized treatment. Amongst clustering techniques, k-means is normally chosen for its simplicity and efficiency. However, it does not account for the different importance of data attributes. This paper presents a new locally weighted extension of k-means, which has proven more accurate across many published datasets than the original and other extensions found in the literature.

  5. Fast segmentation of industrial quality pavement images using Laws texture energy measures and k-means clustering

    NASA Astrophysics Data System (ADS)

    Mathavan, Senthan; Kumar, Akash; Kamal, Khurram; Nieminen, Michael; Shah, Hitesh; Rahman, Mujib

    2016-09-01

    Thousands of pavement images are collected by road authorities daily for condition monitoring surveys. These images typically have intensity variations and texture nonuniformities that make their segmentation challenging. The automated segmentation of such pavement images is crucial for accurate, thorough, and expedited health monitoring of roads. In the pavement monitoring area, well-known texture descriptors, such as gray-level co-occurrence matrices and local binary patterns, are often used for surface segmentation and identification. These, despite being the established methods for texture discrimination, are inherently slow. This work evaluates Laws texture energy measures as a viable alternative for pavement images for the first time. k-means clustering is used to partition the feature space, limiting the human subjectivity in the process. Data classification, hence image segmentation, is performed by the k-nearest neighbor method. Laws texture energy masks are shown to perform well with resulting accuracy and precision values of more than 80%. The implementations of the algorithm, in both MATLAB® and OpenCV/C++, are extensively compared against the state of the art for execution speed, clearly showing the advantages of the proposed method. Furthermore, the OpenCV-based segmentation shows a 100% increase in processing speed when compared to the fastest algorithm available in literature.

  6. Recognizing upper limb movements with wrist worn inertial sensors using k-means clustering classification.

    PubMed

    Biswas, Dwaipayan; Cranny, Andy; Gupta, Nayaab; Maharatna, Koushik; Achner, Josy; Klemke, Jasmin; Jöbges, Michael; Ortmann, Steffen

    2015-04-01

    In this paper we present a methodology for recognizing three fundamental movements of the human forearm (extension, flexion and rotation) using pattern recognition applied to the data from a single wrist-worn, inertial sensor. We propose that this technique could be used as a clinical tool to assess rehabilitation progress in neurodegenerative pathologies such as stroke or cerebral palsy by tracking the number of times a patient performs specific arm movements (e.g. prescribed exercises) with their paretic arm throughout the day. We demonstrate this with healthy subjects and stroke patients in a simple proof of concept study in which these arm movements are detected during an archetypal activity of daily-living (ADL) - 'making-a-cup-of-tea'. Data is collected from a tri-axial accelerometer and a tri-axial gyroscope located proximal to the wrist. In a training phase, movements are initially performed in a controlled environment which are represented by a ranked set of 30 time-domain features. Using a sequential forward selection technique, for each set of feature combinations three clusters are formed using k-means clustering followed by 10 runs of 10-fold cross validation on the training data to determine the best feature combinations. For the testing phase, movements performed during the ADL are associated with each cluster label using a minimum distance classifier in a multi-dimensional feature space, comprised of the best ranked features, using Euclidean or Mahalanobis distance as the metric. Experiments were performed with four healthy subjects and four stroke survivors and our results show that the proposed methodology can detect the three movements performed during the ADL with an overall average accuracy of 88% using the accelerometer data and 83% using the gyroscope data across all healthy subjects and arm movement types. The average accuracy across all stroke survivors was 70% using accelerometer data and 66% using gyroscope data. We also use a Linear

  7. Contributions to "k"-Means Clustering and Regression via Classification Algorithms

    ERIC Educational Resources Information Center

    Salman, Raied

    2012-01-01

    The dissertation deals with clustering algorithms and transforming regression problems into classification problems. The main contributions of the dissertation are twofold; first, to improve (speed up) the clustering algorithms and second, to develop a strict learning environment for solving regression problems as classification tasks by using…

  8. Fuzzy Document Clustering Approach using WordNet Lexical Categories

    NASA Astrophysics Data System (ADS)

    Gharib, Tarek F.; Fouad, Mohammed M.; Aref, Mostafa M.

    Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. This area is growing rapidly, mainly because of the strong need to analyse the very large amount of textual data that resides on internal file systems and the Web. Text document clustering provides an effective navigation mechanism to organize this large amount of data by grouping documents into a small number of meaningful classes. In this paper we propose a fuzzy text document clustering approach using WordNet lexical categories and the Fuzzy c-Means algorithm. Experiments are performed to compare the efficiency of the proposed approach with recently reported approaches. Experimental results show that fuzzy clustering leads to strong performance. The Fuzzy c-Means algorithm outperforms other classical clustering algorithms such as k-means and bisecting k-means in both clustering quality and running-time efficiency.

  9. Prediction of forced expiratory volume in pulmonary function test using radial basis neural networks and k-means clustering.

    PubMed

    Manoharan, Sujatha C; Ramakrishnan, Swaminathan

    2009-10-01

    In this work, prediction of forced expiratory volume in the pulmonary function test, carried out using spirometry and neural networks, is presented. The pulmonary function data were recorded from volunteers using a commercially available flow-volume spirometer with a standard acquisition protocol. Radial basis function neural networks were used to predict forced expiratory volume in 1 s (FEV1) from the recorded flow-volume curves. The optimal centres of the hidden layer of the radial basis function network were determined by the k-means clustering algorithm. The performance of the neural network model was evaluated by computing prediction error statistics (average value, standard deviation, and root mean square) and the correlation with the true data for normal, restrictive and obstructive cases. Results show that the adopted neural networks are capable of predicting FEV1 in both normal and abnormal cases. Prediction accuracy was higher in obstructive abnormality than in restrictive cases. It appears that this method of assessment is useful in diagnosing pulmonary abnormalities with incomplete data and data with poor recording.

  10. Hierarchical Adaptive Means (HAM) clustering for hardware-efficient, unsupervised and real-time spike sorting.

    PubMed

    Paraskevopoulou, Sivylla E; Wu, Di; Eftekhar, Amir; Constandinou, Timothy G

    2014-09-30

    This work presents a novel unsupervised algorithm for real-time adaptive clustering of neural spike data (spike sorting). The proposed Hierarchical Adaptive Means (HAM) clustering method combines centroid-based clustering with hierarchical cluster connectivity to classify incoming spikes using groups of clusters. The paper describes how the proposed method can adaptively track the incoming spike data without requiring any past history, iteration or training, and how it autonomously determines the number of spike classes. Its performance (classification accuracy) has been tested using multiple datasets (both simulated and recorded), achieving near-identical accuracy compared to k-means (using 10 iterations and provided with the number of spike classes). Also, its robustness when applied to different feature extraction methods has been demonstrated by achieving classification accuracies above 80% across multiple datasets. Last but crucially, its low complexity, which has been quantified through both memory and computation requirements, makes this method hugely attractive for future hardware implementation.

  11. Update of membership and mean proper motion of open clusters from UCAC5 catalog

    NASA Astrophysics Data System (ADS)

    Dias, W. S.; Monteiro, H.; Assafin, M.

    2018-06-01

    We present mean proper motions and membership probabilities of individual stars for optically visible open clusters, which have been determined using data from the UCAC5 catalog. This follows our previous studies with the UCAC2 and UCAC4 catalogs, but now using improved proper motions in the GAIA reference frame. In the present study results were obtained for a sample of 1108 open clusters. For five clusters, this is the first determination of mean proper motion, and for the whole sample, we present results with a much larger number of identified astrometric member stars than on previous studies. It is the last update of our Open cluster Catalog based on proper motion data only. Future updates will count on astrometric, photometric and spectroscopic GAIA data as input for analyses.

  12. Electrical Load Profile Analysis Using Clustering Techniques

    NASA Astrophysics Data System (ADS)

    Damayanti, R.; Abdullah, A. G.; Purnama, W.; Nandiyanto, A. B. D.

    2017-03-01

    Data mining is one of the data processing techniques used to extract information from a set of stored data. Every day, electricity load consumption is recorded by the electrical company, usually at intervals of 15 or 30 minutes. This paper uses clustering, one of the data mining techniques, to analyse the electrical load profiles during 2014. Three clustering methods were compared, namely K-Means (KM), Fuzzy C-Means (FCM), and K-Harmonic Means (KHM). The results show that KHM is the most appropriate method for classifying the electrical load profile. The optimum number of clusters is determined using the Davies-Bouldin Index. By grouping the load profiles, demand variation analysis and estimation of energy loss can be carried out for each group of load profiles with a similar pattern. From the groups of electric load profiles, the cluster load factor and a range of the cluster loss factor can be obtained, which helps to find the range of coefficient values for estimating energy loss without performing load flow studies.
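
    The Davies-Bouldin Index mentioned in this abstract can be computed as in the minimal sketch below: for each cluster, take the worst ratio of summed within-cluster scatter to between-centroid distance, then average over clusters. Lower values indicate more compact, better separated clusters, so the candidate number of clusters with the smallest index would be preferred. The function names and the two toy partitions are hypothetical, not data from the paper.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points (tuples)."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition given as a list of point lists (needs >= 2 clusters)."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's points to its own centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k


# Two hypothetical partitions of the same four points.
compact = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
stretched = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
print(davies_bouldin(compact))    # 0.2
print(davies_bouldin(stretched))  # 5.0
```

    In practice the index would be evaluated on the partition produced for each candidate number of clusters, and the candidate giving the minimum would be kept.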

  13. Individual participant data meta-analyses should not ignore clustering

    PubMed Central

    Abo-Zaid, Ghada; Guo, Boliang; Deeks, Jonathan J.; Debray, Thomas P.A.; Steyerberg, Ewout W.; Moons, Karel G.M.; Riley, Richard David

    2013-01-01

    Objectives Individual participant data (IPD) meta-analyses often analyze their IPD as if coming from a single study. We compare this approach with analyses that rather account for clustering of patients within studies. Study Design and Setting Comparison of effect estimates from logistic regression models in real and simulated examples. Results The estimated prognostic effect of age in patients with traumatic brain injury is similar, regardless of whether clustering is accounted for. However, a family history of thrombophilia is found to be a diagnostic marker of deep vein thrombosis [odds ratio, 1.30; 95% confidence interval (CI): 1.00, 1.70; P = 0.05] when clustering is accounted for but not when it is ignored (odds ratio, 1.06; 95% CI: 0.83, 1.37; P = 0.64). Similarly, the treatment effect of nicotine gum on smoking cessation is severely attenuated when clustering is ignored (odds ratio, 1.40; 95% CI: 1.02, 1.92) rather than accounted for (odds ratio, 1.80; 95% CI: 1.29, 2.52). Simulations show models accounting for clustering perform consistently well, but downwardly biased effect estimates and low coverage can occur when ignoring clustering. Conclusion Researchers must routinely account for clustering in IPD meta-analyses; otherwise, misleading effect estimates and conclusions may arise. PMID:23651765

  14. Bayesian hierarchical models for cost-effectiveness analyses that use data from cluster randomized trials.

    PubMed

    Grieve, Richard; Nixon, Richard; Thompson, Simon G

    2010-01-01

    Cost-effectiveness analyses (CEA) may be undertaken alongside cluster randomized trials (CRTs) where randomization is at the level of the cluster (for example, the hospital or primary care provider) rather than the individual. Costs (and outcomes) within clusters may be correlated so that the assumption made by standard bivariate regression models, that observations are independent, is incorrect. This study develops a flexible modeling framework to acknowledge the clustering in CEA that use CRTs. The authors extend previous Bayesian bivariate models for CEA of multicenter trials to recognize the specific form of clustering in CRTs. They develop new Bayesian hierarchical models (BHMs) that allow mean costs and outcomes, and also variances, to differ across clusters. They illustrate how each model can be applied using data from a large (1732 cases, 70 primary care providers) CRT evaluating alternative interventions for reducing postnatal depression. The analyses compare cost-effectiveness estimates from BHMs with standard bivariate regression models that ignore the data hierarchy. The BHMs show high levels of cost heterogeneity across clusters (intracluster correlation coefficient, 0.17). Compared with standard regression models, the BHMs yield substantially increased uncertainty surrounding the cost-effectiveness estimates, and altered point estimates. The authors conclude that ignoring clustering can lead to incorrect inferences. The BHMs that they present offer a flexible modeling framework that can be applied more generally to CEA that use CRTs.

  15. CLASSIFICATION OF IRANIAN NURSES ACCORDING TO THEIR MENTAL HEALTH OUTCOMES USING GHQ-12 QUESTIONNAIRE: A COMPARISON BETWEEN LATENT CLASS ANALYSIS AND K-MEANS CLUSTERING WITH TRADITIONAL SCORING METHOD

    PubMed Central

    Jamali, Jamshid; Ayatollahi, Seyyed Mohammad Taghi

    2015-01-01

    Background: Nurses constitute the largest group of providers in health care systems. Their mental health can affect the quality of services and patients’ satisfaction. The General Health Questionnaire (GHQ-12) is a general screening tool used to detect mental disorders. The scoring method and the thresholds for this questionnaire are debatable, and the cut-off points can vary from sample to sample. This study was conducted to estimate the prevalence of mental disorders among Iranian nurses using GHQ-12 and also to compare Latent Class Analysis (LCA) and K-means clustering with the traditional scoring method. Methodology: A cross-sectional study was carried out in Fars and Bushehr provinces of southern Iran in 2014. Participants were 771 Iranian nurses, who filled out the GHQ-12 questionnaire. The traditional scoring method, LCA and K-means were used to estimate the prevalence of mental disorder among Iranian nurses. Cohen’s kappa statistic was applied to assess the agreement of LCA and K-means with the traditional scoring method of GHQ-12. Results: The proportions of nurses with mental disorder according to the scoring method, LCA and K-means were 36.3% (n=280), 32.2% (n=248), and 26.5% (n=204), respectively. LCA and logistic regression revealed that the prevalence of mental disorder in females was significantly higher than in males. Conclusion: Mental disorder in nurses was at a medium level compared to other people living in Iran. There was a small difference between the prevalence of mental disorder estimated by the scoring method, K-means and LCA. Given the advantages of LCA over K-means and the different results of the scoring method, we suggest LCA for classification of Iranian nurses according to their mental health outcomes using the GHQ-12 questionnaire. PMID:26622202
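
    Cohen's kappa, which this abstract uses to compare classifications, corrects the observed agreement between two labelings for the agreement expected by chance. A minimal sketch follows; the function name and the two label lists are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. when both sequences use the same single label throughout.
    """
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum(count_a[label] * count_b[label]
                   for label in set(a) | set(b)) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical labels: 1 = flagged by the method, 0 = not flagged.
scoring = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
kmeans = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(round(cohens_kappa(scoring, kmeans), 4))  # 0.5238
```

    Here the two labelings agree on 8 of 10 items, but because 58% agreement would be expected by chance from the marginals, kappa is about 0.52 rather than 0.8.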

  16. CLASSIFICATION OF IRANIAN NURSES ACCORDING TO THEIR MENTAL HEALTH OUTCOMES USING GHQ-12 QUESTIONNAIRE: A COMPARISON BETWEEN LATENT CLASS ANALYSIS AND K-MEANS CLUSTERING WITH TRADITIONAL SCORING METHOD.

    PubMed

    Jamali, Jamshid; Ayatollahi, Seyyed Mohammad Taghi

    2015-10-01

    Nurses constitute the largest group of providers in health care systems. Their mental health can affect the quality of services and patients' satisfaction. The General Health Questionnaire (GHQ-12) is a general screening tool used to detect mental disorders. The scoring method and the thresholds for this questionnaire are debatable, and the cut-off points can vary from sample to sample. This study was conducted to estimate the prevalence of mental disorders among Iranian nurses using GHQ-12 and also to compare Latent Class Analysis (LCA) and K-means clustering with the traditional scoring method. A cross-sectional study was carried out in Fars and Bushehr provinces of southern Iran in 2014. Participants were 771 Iranian nurses, who filled out the GHQ-12 questionnaire. The traditional scoring method, LCA and K-means were used to estimate the prevalence of mental disorder among Iranian nurses. Cohen's kappa statistic was applied to assess the agreement of LCA and K-means with the traditional scoring method of GHQ-12. The proportions of nurses with mental disorder according to the scoring method, LCA and K-means were 36.3% (n=280), 32.2% (n=248), and 26.5% (n=204), respectively. LCA and logistic regression revealed that the prevalence of mental disorder in females was significantly higher than in males. Mental disorder in nurses was at a medium level compared to other people living in Iran. There was a small difference between the prevalence of mental disorder estimated by the scoring method, K-means and LCA. Given the advantages of LCA over K-means and the different results of the scoring method, we suggest LCA for classification of Iranian nurses according to their mental health outcomes using the GHQ-12 questionnaire.

  17. AUTOMATED UNSUPERVISED CLASSIFICATION OF THE SLOAN DIGITAL SKY SURVEY STELLAR SPECTRA USING k-MEANS CLUSTERING

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanchez Almeida, J.; Allende Prieto, C., E-mail: jos@iac.es, E-mail: callende@iac.es

    2013-01-20

    Large spectroscopic surveys require automated methods of analysis. This paper explores the use of k-means clustering as a tool for automated unsupervised classification of massive stellar spectral catalogs. The classification criteria are defined by the data and the algorithm, with no prior physical framework. We work with a representative set of stellar spectra associated with the Sloan Digital Sky Survey (SDSS) SEGUE and SEGUE-2 programs, which consists of 173,390 spectra from 3800 to 9200 Å sampled on 3849 wavelengths. We classify the original spectra as well as the spectra with the continuum removed. The second set only contains spectral lines, and it is less dependent on uncertainties of the flux calibration. The classification of the spectra with continuum renders 16 major classes. Roughly speaking, stars are split according to their colors, with enough finesse to distinguish dwarfs from giants of the same effective temperature, but with difficulties to separate stars with different metallicities. There are classes corresponding to particular MK types, intrinsically blue stars, dust-reddened, stellar systems, and also classes collecting faulty spectra. Overall, there is no one-to-one correspondence between the classes we derive and the MK types. The classification of spectra without continuum renders 13 classes, the color separation is not so sharp, but it distinguishes stars of the same effective temperature and different metallicities. Some classes thus obtained present a fairly small range of physical parameters (200 K in effective temperature, 0.25 dex in surface gravity, and 0.35 dex in metallicity), so that the classification can be used to estimate the main physical parameters of some stars at a minimum computational cost. We also analyze the outliers of the classification. Most of them turn out to be failures of the reduction pipeline, but there are also high redshift QSOs, multiple stellar systems, dust-reddened stars, galaxies, and, finally

  18. Machine learning in APOGEE. Unsupervised spectral classification with K-means

    NASA Astrophysics Data System (ADS)

    Garcia-Dias, Rafael; Allende Prieto, Carlos; Sánchez Almeida, Jorge; Ordovás-Pascual, Ignacio

    2018-05-01

    Context. The volume of data generated by astronomical surveys is growing rapidly. Traditional analysis techniques in spectroscopy either demand intensive human interaction or are computationally expensive. In this scenario, machine learning, and unsupervised clustering algorithms in particular, offer interesting alternatives. The Apache Point Observatory Galactic Evolution Experiment (APOGEE) offers a vast data set of near-infrared stellar spectra, which is perfect for testing such alternatives. Aims: Our research applies an unsupervised classification scheme based on K-means to the massive APOGEE data set. We explore whether the data are amenable to classification into discrete classes. Methods: We apply the K-means algorithm to 153 847 high resolution spectra (R ≈ 22 500). We discuss the main virtues and weaknesses of the algorithm, as well as our choice of parameters. Results: We show that a classification based on normalised spectra captures the variations in stellar atmospheric parameters, chemical abundances, and rotational velocity, among other factors. The algorithm is able to separate the bulge and halo populations, and distinguish dwarfs, sub-giants, RC, and RGB stars. However, a discrete classification in flux space does not result in a neat organisation in the parameters' space. Furthermore, the lack of obvious groups in flux space causes the results to be fairly sensitive to the initialisation, and disrupts the efficiency of commonly-used methods to select the optimal number of clusters. Our classification is publicly available, including extensive online material associated with the APOGEE Data Release 12 (DR12). Conclusions: Our description of the APOGEE database can help greatly with the identification of specific types of targets for various applications. We find a lack of obvious groups in flux space, and identify limitations of the K-means algorithm in dealing with this kind of data. Full Tables B.1-B.4 are only available at the CDS via

  19. Integrative Sparse K-Means With Overlapping Group Lasso in Genomic Applications for Disease Subtype Discovery

    PubMed Central

    Huo, Zhiguang; Tseng, George

    2017-01-01

    Cancer subtype discovery is the first step in delivering personalized medicine to cancer patients. With the accumulation of massive multi-level omics datasets and established biological knowledge databases, omics data integration that incorporates rich existing biological knowledge is essential for deciphering the biological mechanisms behind complex diseases. In this manuscript, we propose an integrative sparse K-means (is-K means) approach to discover disease subtypes with the guidance of prior biological knowledge via sparse overlapping group lasso. An algorithm using the alternating direction method of multipliers (ADMM) will be applied for fast optimization. Simulation and three real applications in breast cancer and leukemia will be used to compare is-K means with existing methods and demonstrate its superior clustering accuracy, feature selection, functional annotation of detected molecular features and computing efficiency. PMID:28959370

  20. Integrative Sparse K-Means With Overlapping Group Lasso in Genomic Applications for Disease Subtype Discovery.

    PubMed

    Huo, Zhiguang; Tseng, George

    2017-06-01

    Cancer subtype discovery is the first step in delivering personalized medicine to cancer patients. With the accumulation of massive multi-level omics datasets and established biological knowledge databases, omics data integration that incorporates rich existing biological knowledge is essential for deciphering the biological mechanisms behind complex diseases. In this manuscript, we propose an integrative sparse K-means (is-K means) approach to discover disease subtypes with the guidance of prior biological knowledge via sparse overlapping group lasso. An algorithm using the alternating direction method of multipliers (ADMM) will be applied for fast optimization. Simulation and three real applications in breast cancer and leukemia will be used to compare is-K means with existing methods and demonstrate its superior clustering accuracy, feature selection, functional annotation of detected molecular features and computing efficiency.

  1. Business and Marketing Cluster. Task Analyses.

    ERIC Educational Resources Information Center

    Henrico County Public Schools, Glen Allen, VA. Virginia Vocational Curriculum and Resource Center.

    Developed in Virginia, this publication contains task analysis guides to support selected tech prep programs that prepare students for careers in the business and marketing cluster. Guides are included for accounting systems, legal systems administration, office systems technology, and retail marketing. Each task analyses guide has the following…

  2. K-means cluster analysis of tourist destination in special region of Yogyakarta using spatial approach and social network analysis (a case study: post of @explorejogja instagram account in 2016)

    NASA Astrophysics Data System (ADS)

    Iswandhani, N.; Muhajir, M.

    2018-03-01

    This research was conducted in the Department of Statistics, Islamic University of Indonesia. The data are primary data obtained from posts of the @explorejogja Instagram account from January to December 2016. The @explorejogja Instagram account features many tourist destinations that can be visited by both domestic and foreign tourists. It is therefore necessary to cluster the existing tourist destinations based on the number of likes from Instagram users, which is assumed to indicate popularity. The purpose of this research is to determine the distribution of the most popular tourist spots, the cluster formation of tourist destinations, and the central popularity of tourist destinations based on the @explorejogja Instagram account in 2016. The statistical analyses used are descriptive statistics, k-means clustering, and social network analysis. The results are the top 10 most popular destinations in Yogyakarta; an HTML-based map of the tourist destination distribution consisting of 121 tourist destination points; 3 clusters, with 52 destinations in cluster 1, 9 destinations in cluster 2 and 60 destinations in cluster 3; and the central popularity of tourist destinations in the Special Region of Yogyakarta by district.

  3. A spectral k-means approach to bright-field cell image segmentation.

    PubMed

    Bradbury, Laura; Wan, Justin W L

    2010-01-01

    Automatic segmentation of bright-field cell images is important to cell biologists, but difficult to complete due to the complex nature of the cells in bright-field images (poor contrast, broken halo, missing boundaries). Standard approaches such as level set segmentation and active contours work well for fluorescent images where cells appear as round shapes, but become less effective when optical artifacts such as halos exist in bright-field images. In this paper, we present a robust segmentation method which combines the spectral and k-means clustering techniques to locate cells in bright-field images. This approach models an image as a matrix graph and segments different regions of the image by computing the appropriate eigenvectors of the matrix graph and using the k-means algorithm. We illustrate the effectiveness of the method by segmentation results of C2C12 (muscle) cells in bright-field images.

  4. Classification of aquifer vulnerability using K-means cluster analysis

    NASA Astrophysics Data System (ADS)

    Javadi, S.; Hashemy, S. M.; Mohammadi, K.; Howard, K. W. F.; Neshat, A.

    2017-06-01

    Groundwater is one of the main sources of drinking and agricultural water in arid and semi-arid regions but is becoming increasingly threatened by contamination. Vulnerability mapping has been used for many years as an effective tool for assessing the potential for aquifer pollution and the most common method of intrinsic vulnerability assessment is DRASTIC (Depth to water table, net Recharge, Aquifer media, Soil media, Topography, Impact of vadose zone and hydraulic Conductivity). An underlying problem with the DRASTIC approach relates to the subjectivity involved in selecting relative weightings for each of the DRASTIC factors and assigning rating values to ranges or media types within each factor. In this study, a clustering technique is introduced that removes some of the subjectivity associated with the indexing method. It creates a vulnerability map that does not rely on fixed weights and ratings and thereby provides a more objective representation of the system's physical characteristics. This methodology was applied to an aquifer in Iran and compared with the standard DRASTIC approach using the water quality parameters nitrate, chloride and total dissolved solids (TDS) as surrogate indicators of aquifer vulnerability. The proposed method required only four of DRASTIC's seven factors (depth to groundwater, hydraulic conductivity, recharge value and the nature of the vadose zone) to produce a superior result. For nitrate, chloride, and TDS, respectively, the clustering approach delivered Pearson correlation coefficients that were 15, 22 and 5 percentage points higher than those obtained for the DRASTIC method.

  5. Detecting COPD exacerbations early using daily telemonitoring of symptoms and k-means clustering: a pilot study.

    PubMed

    Sanchez-Morillo, Daniel; Fernandez-Granero, Miguel Angel; Jiménez, Antonio León

    2015-05-01

    COPD places an enormous burden on healthcare systems and causes diminished health-related quality of life. The highest proportion of human and economic cost is associated with admissions for acute exacerbation of respiratory symptoms (AECOPD). Since prompt detection and treatment of exacerbations may improve outcomes, early detection of AECOPD is a critical issue. This pilot study aimed to determine whether a mobile health system could enable early detection of AECOPD on a day-to-day basis. A novel electronic questionnaire for the early detection of COPD exacerbations was evaluated during a 6-month field trial in a group of 16 patients. Pattern recognition techniques were applied. A k-means clustering algorithm was trained and validated, and its accuracy in detecting AECOPD was assessed. Sensitivity and specificity were 74.6 and 89.7 %, respectively, and area under the receiver operating characteristic curve was 0.84. Of 33 AECOPD, 31 were identified early, on average 4.5 ± 2.1 days prior to the onset of the exacerbation, which was taken to be the day of medical attendance. Based on the findings of this preliminary pilot study, the proposed electronic questionnaire and the applied methodology could help detect COPD exacerbations early on a day-to-day basis and therefore could provide support to patients and physicians.
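
    The sensitivity and specificity quoted above are ratios of confusion-matrix counts. The following is a minimal sketch of that calculation, assuming binary day-level labels (1 = exacerbation, 0 = stable) and that both classes occur; the function name and the example labels are hypothetical and are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = exacerbation day."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # exacerbations flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # exacerbations missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stable days left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    # Assumes at least one true positive-class and one true negative-class day.
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for eight patient-days; NOT data from the study.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

    In a clustering-based detector, y_pred would come from mapping each day's cluster assignment to "exacerbation" or "stable"; how that mapping was done in the study is not stated in the abstract.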

  6. Improved infrared precipitation estimation approaches based on k-means clustering: Application to north Algeria using MSG-SEVIRI satellite data

    NASA Astrophysics Data System (ADS)

    Mokdad, Fatiha; Haddad, Boualem

    2017-06-01

    In this paper, two new infrared precipitation estimation approaches based on the concept of k-means clustering are first proposed, named the NAW-Kmeans and the GPI-Kmeans methods. Then, they are adapted to the southern Mediterranean basin, where the subtropical climate prevails. The infrared data (10.8 μm channel) acquired by MSG-SEVIRI sensor in winter and spring 2012 are used. Tests are carried out in eight areas distributed over northern Algeria: Sebra, El Bordj, Chlef, Blida, Bordj Menael, Sidi Aich, Beni Ourthilane, and Beni Aziz. The validation is performed by a comparison of the estimated rainfalls to rain gauges observations collected by the National Office of Meteorology in Dar El Beida (Algeria). Despite the complexity of the subtropical climate, the obtained results indicate that the NAW-Kmeans and the GPI-Kmeans approaches gave satisfactory results for the considered rain rates. Also, the proposed schemes lead to improvement in precipitation estimation performance when compared to the original algorithms NAW (Nagri, Adler, and Wetzel) and GPI (GOES Precipitation Index).

  7. Inflation data clustering of some cities in Indonesia

    NASA Astrophysics Data System (ADS)

    Setiawan, Adi; Susanto, Bambang; Mahatma, Tundjung

    2017-06-01

    This paper presents how to cluster inflation data of cities in Indonesia using the k-means cluster method and the fuzzy c-means method. The data used are limited to the monthly inflation data from 15 cities across Indonesia which have highest weight of donations, supplemented with 5 cities used in the calculation of inflation in Indonesia. When the methods are applied with two clusters, with k = 2 for the k-means cluster method and c = 2, w = 1.25 for the fuzzy c-means cluster method, Ambon, Manado and Jayapura tend to form one cluster (high inflation), while the other cities tend to become members of the other cluster (low inflation). However, with two clusters and c = 2, w = 1.5, Surabaya, Medan, Makasar, Samarinda, Manado, Ambon and Jayapura tend to form one cluster (high inflation), while the other cities tend to become members of the other cluster (low inflation). Furthermore, when we use three clusters, with k = 3 for the k-means cluster method and c = 3, w = 1.25 for the fuzzy c-means cluster method, Ambon tends to become a member of the first cluster (high inflation), Manado and Jayapura tend to become members of the second cluster (moderate inflation), and the other cities tend to become members of the third cluster (low inflation). With c = 3, w = 1.5, Ambon, Manado and Jayapura tend to become members of the first cluster (high inflation), Surabaya, Bandung, Medan, Makasar, Banyuwangi, Denpasar, Samarinda and Mataram tend to become members of the second cluster (moderate inflation), while the other cities tend to become members of the third cluster (low inflation). A similar interpretation can be made of the results of applying 5 clusters.
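
    The dependence of the fuzzy c-means result on w can be illustrated with the standard membership formula, u_i = 1 / sum_k (d_i / d_k)^(2 / (w - 1)). The sketch below assumes that w in the abstract is the usual fuzzifier (often written m), uses one-dimensional data, and assumes distinct cluster centers; the function name, the centers and the test point are hypothetical and are not taken from the paper.

```python
def fcm_memberships(point, centers, w):
    """Fuzzy c-means memberships of a 1-D point for fixed centers and fuzzifier w > 1."""
    dists = [abs(point - c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster
    # (centers are assumed to be distinct).
    if 0.0 in dists:
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (w - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


# Hypothetical centers ("low" and "high" inflation) and one observation.
centers = [0.0, 1.0]
point = 0.4
for w in (1.25, 1.5):
    u = fcm_memberships(point, centers, w)
    print(w, [round(x, 3) for x in u])
```

    A smaller w gives memberships closer to 0 or 1, so the partition behaves more like k-means; a larger w spreads membership across clusters, which is one reason borderline cities can change cluster when w moves from 1.25 to 1.5.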

  8. Analyzing Flood Vulnerability Due to Sea Level Rise Using K-Means Clustering: Implications for Regional Flood Mitigation Planning

    NASA Astrophysics Data System (ADS)

    Hummel, M.; Wood, N. J.; Stacey, M. T.; Schweikert, A.; Barnard, P.; Erikson, L. H.

    2016-12-01

    The threat of tidal flooding in coastal regions is exacerbated by sea level rise (SLR), which can lead to more frequent and persistent nuisance flooding and permanent inundation of low-lying areas. When coupled with extreme storm events, SLR also increases the extent and depth of flooding due to storm surges. To mitigate these impacts, bayfront communities are considering a variety of options for shoreline protection, including restoration of natural features such as wetlands and hardening of the shoreline using levees and sea walls. These shoreline modifications can produce changes in the tidal dynamics in a basin, either by increasing dissipation of tidal energy or enhancing tidal amplification [1]. As a result, actions taken by individual communities not only impact local inundation, but can also have implications for flooding on a regional scale. However, regional collaboration is lacking in flood mitigation planning, which is often done on a community-by-community basis. This can lead to redundancy in planning efforts and can also have adverse effects on communities that are not included in discussions about shoreline infrastructure improvements. Using flooding extent outputs from a hydrodynamic model of San Francisco Bay, we performed a K-means clustering analysis to identify similarities between 65 bayfront communities in terms of the spatial, demographic, and economic characteristics of their vulnerable assets for a suite of SLR and storm scenarios. Our clustering analysis identifies communities with similar vulnerabilities and allows for more effective collaboration and decision-making at a regional level by encouraging comparable communities to work together and pool resources to find effective adaptation strategies as flooding becomes more frequent and severe. [1] Holleman RC, Stacey MT (2014) Coupling of sea level rise, tidal amplification, and inundation. Journal of Physical Oceanography 44:1439-1455.

  9. Uncertainty based modeling of rainfall-runoff: Combined differential evolution adaptive Metropolis (DREAM) and K-means clustering

    NASA Astrophysics Data System (ADS)

    Zahmatkesh, Zahra; Karamouz, Mohammad; Nazif, Sara

    2015-09-01

    Simulation of rainfall-runoff process in urban areas is of great importance considering the consequences and damages of extreme runoff events and floods. The first issue in flood hazard analysis is rainfall simulation. Large scale climate signals have been proved to be effective in rainfall simulation and prediction. In this study, an integrated scheme is developed for rainfall-runoff modeling considering different sources of uncertainty. This scheme includes three main steps of rainfall forecasting, rainfall-runoff simulation and future runoff prediction. In the first step, data driven models are developed and used to forecast rainfall using large scale climate signals as rainfall predictors. Due to high effect of different sources of uncertainty on the output of hydrologic models, in the second step uncertainty associated with input data, model parameters and model structure is incorporated in rainfall-runoff modeling and simulation. Three rainfall-runoff simulation models are developed for consideration of model conceptual (structural) uncertainty in real time runoff forecasting. To analyze the uncertainty of the model structure, streamflows generated by alternative rainfall-runoff models are combined, through developing a weighting method based on K-means clustering. Model parameters and input uncertainty are investigated using an adaptive Markov Chain Monte Carlo method. Finally, calibrated rainfall-runoff models are driven using the forecasted rainfall to predict future runoff for the watershed. The proposed scheme is employed in the case study of the Bronx River watershed, New York City. Results of uncertainty analysis of rainfall-runoff modeling reveal that simultaneous estimation of model parameters and input uncertainty significantly changes the probability distribution of the model parameters. It is also observed that by combining the outputs of the hydrological models using the proposed clustering scheme, the accuracy of runoff simulation in the

  10. Identification of new candidate drugs for lung cancer using chemical-chemical interactions, chemical-protein interactions and a K-means clustering algorithm.

    PubMed

    Lu, Jing; Chen, Lei; Yin, Jun; Huang, Tao; Bi, Yi; Kong, Xiangyin; Zheng, Mingyue; Cai, Yu-Dong

    2016-01-01

    Lung cancer, characterized by uncontrolled cell growth in the lung tissue, is the leading cause of global cancer deaths. Until now, effective treatment of this disease is limited. Many synthetic compounds have emerged with the advancement of combinatorial chemistry. Identification of effective lung cancer candidate drug compounds among them is a great challenge. Thus, it is necessary to build effective computational methods that can assist us in selecting for potential lung cancer drug compounds. In this study, a computational method was proposed to tackle this problem. The chemical-chemical interactions and chemical-protein interactions were utilized to select candidate drug compounds that have close associations with approved lung cancer drugs and lung cancer-related genes. A permutation test and K-means clustering algorithm were employed to exclude candidate drugs with low possibilities to treat lung cancer. The final analysis suggests that the remaining drug compounds have potential anti-lung cancer activities and most of them have structural dissimilarity with approved drugs for lung cancer.

  11. The Classification of Diabetes Mellitus Using Kernel k-means

    NASA Astrophysics Data System (ADS)

    Alamsyah, M.; Nafisah, Z.; Prayitno, E.; Afida, A. M.; Imah, E. M.

    2018-01-01

    Diabetes mellitus is a metabolic disorder characterized by chronic hyperglycemia (high blood glucose). Automatic detection of diabetes mellitus is still challenging. This study detected diabetes mellitus using the kernel k-means algorithm. Kernel k-means is an algorithm developed from the k-means algorithm. Unlike common k-means, kernel k-means uses kernel learning, which is able to handle data that are not linearly separable. The performance of kernel k-means in detecting diabetes mellitus is also compared with the SOM algorithm. The experimental results show that kernel k-means performs well and much better than SOM.

  12. "K"-Means Clustering and Mixture Model Clustering: Reply to McLachlan (2011) and Vermunt (2011)

    ERIC Educational Resources Information Center

    Steinley, Douglas; Brusco, Michael J.

    2011-01-01

    McLachlan (2011) and Vermunt (2011) each provided thoughtful replies to our original article (Steinley & Brusco, 2011). This response serves to incorporate some of their comments while simultaneously clarifying our position. We argue that greater caution against overparameterization must be taken when assuming that clusters are highly elliptical…

  13. The Effects of Including Observed Means or Latent Means as Covariates in Multilevel Models for Cluster Randomized Trials

    ERIC Educational Resources Information Center

    Aydin, Burak; Leite, Walter L.; Algina, James

    2016-01-01

    We investigated methods of including covariates in two-level models for cluster randomized trials to increase power to detect the treatment effect. We compared multilevel models that included either an observed cluster mean or a latent cluster mean as a covariate, as well as the effect of including Level 1 deviation scores in the model. A Monte…

  14. The ergot alkaloid gene cluster: functional analyses and evolutionary aspects.

    PubMed

    Lorenz, Nicole; Haarmann, Thomas; Pazoutová, Sylvie; Jung, Manfred; Tudzynski, Paul

    2009-01-01

    Ergot alkaloids and their derivatives have been traditionally used as therapeutic agents in migraine, blood pressure regulation and help in childbirth and abortion. Their production in submerged culture is a long established biotechnological process. Ergot alkaloids are produced mainly by members of the genus Claviceps, with Claviceps purpurea as the best investigated species concerning the biochemistry of ergot alkaloid synthesis (EAS). Genes encoding enzymes involved in EAS have been shown to be clustered; functional analyses of EAS cluster genes have allowed specific functions to be assigned to several gene products. Various Claviceps species differ with respect to their host specificity and their alkaloid content; comparison of the ergot alkaloid clusters in these species (and of clavine alkaloid clusters in other genera) yields interesting insights into the evolution of cluster structure. This review focuses on recently published and also yet unpublished data on the structure and evolution of the EAS gene cluster and on the function and regulation of cluster genes. These analyses also have significant biotechnological implications: the characterization of non-ribosomal peptide synthetases (NRPS) involved in the synthesis of the peptide moiety of ergopeptines opened interesting perspectives for the synthesis of ergot alkaloids; on the other hand, defined mutants could be generated producing interesting intermediates or only single peptide alkaloids (instead of the alkaloid mixtures usually produced by industrial strains).

  15. Fuzzy C-mean clustering on kinetic parameter estimation with generalized linear least square algorithm in SPECT

    NASA Astrophysics Data System (ADS)

    Choi, Hon-Chit; Wen, Lingfeng; Eberl, Stefan; Feng, Dagan

    2006-03-01

    Dynamic Single Photon Emission Computed Tomography (SPECT) has the potential to quantitatively estimate physiological parameters by fitting compartment models to the tracer kinetics. The generalized linear least square method (GLLS) is an efficient method to estimate unbiased kinetic parameters and parametric images. However, due to the low sensitivity of SPECT, noisy data can cause voxel-wise parameter estimation by GLLS to fail. Fuzzy C-Mean (FCM) clustering and modified FCM, which also utilizes information from the immediate neighboring voxels, are proposed to improve the voxel-wise parameter estimation of GLLS. Monte Carlo simulations were performed to generate dynamic SPECT data with different noise levels and processed by general and modified FCM clustering. Parametric images were estimated by Logan and Yokoi graphical analysis and GLLS. The influx rate (K_I) and volume of distribution (V_d) were estimated for the cerebellum, thalamus and frontal cortex. Our results show that (1) FCM reduces the bias and improves the reliability of parameter estimates for noisy data, (2) GLLS provides estimates of micro parameters (K_I-k_4) as well as macro parameters, such as volume of distribution (V_d) and binding potential (BP_I & BP_II) and (3) FCM clustering incorporating neighboring voxel information does not improve the parameter estimates, but improves noise in the parametric images. These findings indicate that pre-segmentation with traditional FCM clustering is desirable for generating voxel-wise parametric images with GLLS from dynamic SPECT data.

  16. ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.

    PubMed

    Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W; Francis, Suzanna C; Fraser, Louise J; Vehkaperä, Mikko; Lan, Yueheng; Corander, Jukka

    2015-01-01

    Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first order statistics instead of only single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
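
    The pre-processing idea described above (summarize each read by k-mer frequencies, then partition the reads with K-means so that each cluster yields its own mean frequency vector) can be sketched as follows. This is an illustrative sketch, not the ARK implementation: the function names, the choice k = 2, the deterministic starting centers and the example reads are all assumptions.

```python
from itertools import product


def kmer_frequencies(read, k=2, alphabet="ACGT"):
    """Normalized k-mer frequency vector of one read (k-mers with other symbols are skipped)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, iterations=20):
    """Plain Lloyd iterations from the given starting centers; returns (labels, centers)."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: sqdist(p, centers[c]))
            clusters[nearest].append(p)
        # Each center becomes the mean of its members; an empty cluster keeps its center.
        centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else old
            for members, old in zip(clusters, centers)
        ]
    labels = [min(range(len(centers)), key=lambda c: sqdist(p, centers[c])) for p in points]
    return labels, centers


# Hypothetical reads: two AT-rich and two GC-rich.
reads = ["AAAATTTA", "ATATAAAT", "GGCCGGCG", "CGCGGGCC"]
vectors = [kmer_frequencies(r) for r in reads]
labels, centers = kmeans(vectors, [vectors[0], vectors[2]])
weights = [labels.count(c) / len(labels) for c in range(len(centers))]
print(labels, weights)
```

    Each row of centers is one cluster-level k-mer frequency summary, and weights gives the fraction of reads in each cluster; how the downstream estimator consumes and combines these is described in the paper and is not reproduced here.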

  17. Generalized fuzzy C-means clustering algorithm with improved fuzzy partitions.

    PubMed

    Zhu, Lin; Chung, Fu-Lai; Wang, Shitong

    2009-06-01

    The fuzziness index m has important influence on the clustering result of fuzzy clustering algorithms, and it should not be forced to fix at the usual value m = 2. In view of its distinctive features in applications and its limitation in having m = 2 only, a recent advance of fuzzy clustering called fuzzy c-means clustering with improved fuzzy partitions (IFP-FCM) is extended in this paper, and a generalized algorithm called GIFP-FCM for more effective clustering is proposed. By introducing a novel membership constraint function, a new objective function is constructed, and furthermore, GIFP-FCM clustering is derived. Meanwhile, from the viewpoints of L(p) norm distance measure and competitive learning, the robustness and convergence of the proposed algorithm are analyzed. Furthermore, the classical fuzzy c-means algorithm (FCM) and IFP-FCM can be taken as two special cases of the proposed algorithm. Several experimental results including its application to noisy image texture segmentation are presented to demonstrate its average advantage over FCM and IFP-FCM in both clustering and robustness capabilities.

  18. Performance Assessment of Kernel Density Clustering for Gene Expression Profile Data

    PubMed Central

    Zeng, Beiyan; Chen, Yiping P.; Smith, Oscar H.

    2003-01-01

    Kernel density smoothing techniques have been used in classification or supervised learning of gene expression profile (GEP) data, but their applications to clustering or unsupervised learning of those data have not been explored and assessed. Here we report a kernel density clustering method for analysing GEP data and compare its performance with the three most widely-used clustering methods: hierarchical clustering, K-means clustering, and multivariate mixture model-based clustering. Using several methods to measure agreement, between-cluster isolation, and within-cluster coherence, such as the Adjusted Rand Index, the Pseudo F test, the r2 test, and the profile plot, we have assessed the effectiveness of kernel density clustering for recovering clusters, and its robustness against noise on clustering both simulated and real GEP data. Our results show that the kernel density clustering method has excellent performance in recovering clusters from simulated data and in grouping large real expression profile data sets into compact and well-isolated clusters, and that it is the most robust clustering method for analysing noisy expression profile data compared to the other three methods assessed. PMID:18629292
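
    Of the agreement measures named above, the Adjusted Rand Index is straightforward to compute from the contingency table of two partitions. The sketch below uses the standard pair-counting formula; the function name and the example labelings are hypothetical, and returning 1.0 for degenerate inputs is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # convention for degenerate input
    # Pairs of items placed together in both partitions, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # e.g. both partitions put everything in one cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical labelings of four items.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, labels swapped
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # groupings that disagree
```

    The index is 1 for identical groupings regardless of label names, is close to 0 for chance-level agreement, and can be negative when two partitions agree less than chance.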

  19. Visualizing Confidence in Cluster-Based Ensemble Weather Forecast Analyses.

    PubMed

    Kumpf, Alexander; Tost, Bianca; Baumgart, Marlene; Riemer, Michael; Westermann, Rudiger; Rautenhaus, Marc

    2018-01-01

    In meteorology, cluster analysis is frequently used to determine representative trends in ensemble weather predictions in a selected spatio-temporal region, e.g., to reduce a set of ensemble members to simplify and improve their analysis. Identified clusters (i.e., groups of similar members), however, can be very sensitive to small changes of the selected region, so that clustering results can be misleading and bias subsequent analyses. In this article, we, a team of visualization scientists and meteorologists, deliver visual analytics solutions to analyze the sensitivity of clustering results with respect to changes of a selected region. We propose an interactive visual interface that enables simultaneous visualization of a) the variation in composition of identified clusters (i.e., their robustness), b) the variability in cluster membership for individual ensemble members, and c) the uncertainty in the spatial locations of identified trends. We demonstrate that our solution shows meteorologists how representative a clustering result is, and with respect to which changes in the selected region it becomes unstable. Furthermore, our solution helps to identify those ensemble members which stably belong to a given cluster and can thus be considered similar. In a real-world application case we show how our approach is used to analyze the clustering behavior of different regions in a forecast of "Tropical Cyclone Karl", guiding the user towards the cluster robustness information required for subsequent ensemble analysis.

  20. Unemployment and sovereign debt crisis in the Eurozone: A k-means-r analysis

    NASA Astrophysics Data System (ADS)

    Dias, João

    2017-09-01

    Some southern countries in Europe, together with Ireland, were particularly affected by the sovereign debt crisis in the Eurozone and were obliged to implement tough corrective measures which proved to be strongly recessionary in nature. As a result, not only did GDP decline but unemployment jumped to very high levels as well. This paper uses a modified version of k-means (restricted k-means) to analyze the clustering of the Eurozone countries during the recent sovereign debt crisis, combining monthly data on unemployment and government bond yield rates. Our method shows that the separation of southern Europe from the rest of the Eurozone is not necessarily a good characterization of this area before the crisis, but the group of externally assisted countries plus Italy gained consistency as the crisis evolved, although there is no perfect homogeneity in this group, since the problems they faced, the type of response requested, the speed of reaction to the crisis and the lasting effects were not the same for all these countries.

  1. Segmentation of White Blood Cells From Microscopic Images Using a Novel Combination of K-Means Clustering and Modified Watershed Algorithm.

    PubMed

    Ghane, Narjes; Vard, Alireza; Talebi, Ardeshir; Nematollahy, Pardis

    2017-01-01

    Recognition of white blood cells (WBCs) is the first step in diagnosing some particular diseases such as acquired immune deficiency syndrome, leukemia, and other blood-related diseases, and is usually done by pathologists using an optical microscope. This process is time-consuming, extremely tedious, and expensive and needs experienced experts in this field. Thus, a computer-aided diagnosis system that assists pathologists in the diagnostic process can be very effective. Segmentation of WBCs is usually a first step in developing a computer-aided diagnosis system. The main purpose of this paper is to segment WBCs from microscopic images. For this purpose, we present a novel combination of thresholding, k-means clustering, and modified watershed algorithms in three stages including (1) segmentation of WBCs from a microscopic image, (2) extraction of nuclei from the cell's image, and (3) separation of overlapping cells and nuclei. The evaluation results of the proposed method show that similarity measures, precision, and sensitivity respectively were 92.07, 96.07, and 94.30% for nucleus segmentation and 92.93, 97.41, and 93.78% for cell segmentation. In addition, statistical analysis presents high similarity between manual segmentation and the results obtained by the proposed method.

  2. Breast Cancer Symptom Clusters Derived from Social Media and Research Study Data Using Improved K-Medoid Clustering.

    PubMed

    Ping, Qing; Yang, Christopher C; Marshall, Sarah A; Avis, Nancy E; Ip, Edward H

    2016-06-01

    Most cancer patients, including patients with breast cancer, experience multiple symptoms simultaneously while receiving active treatment. Some symptoms tend to occur together and may be related, such as hot flashes and night sweats. Co-occurring symptoms may have a multiplicative effect on patients' functioning, mental health, and quality of life. Symptom clusters in the context of oncology were originally described as groups of three or more related symptoms. Some authors have suggested symptom clusters may have practical applications, such as the formulation of more effective therapeutic interventions that address the combined effects of symptoms rather than treating each symptom separately. Most studies that have sought to identify clusters in breast cancer survivors have relied on traditional research studies. Social media, such as online health-related forums, contain a bevy of user-generated content in the form of threads and posts, and could be used as a data source to identify and characterize symptom clusters among cancer patients. The present study seeks to determine patterns of symptom clusters in breast cancer survivors derived from both social media and research study data using improved K-Medoid clustering. A total of 50,426 publicly available messages were collected from Medhelp.com and 653 questionnaires were collected as part of a research study. The network of symptoms built from social media was sparse compared to that of the research study data, making the social media data easier to partition. The proposed revised K-Medoid clustering helps to improve the clustering performance by re-assigning some of the negative-ASW (average silhouette width) symptoms to other clusters after initial K-Medoid clustering. This retains an overall non-decreasing ASW and avoids the problem of trapping in local optima. The overall ASW, individual ASW, and improved interpretation of the final clustering solution suggest improvement. The clustering results suggest

  3. Breast Cancer Symptom Clusters Derived from Social Media and Research Study Data Using Improved K-Medoid Clustering

    PubMed Central

    Ping, Qing; Yang, Christopher C.; Marshall, Sarah A.; Avis, Nancy E.; Ip, Edward H.

    2017-01-01

    Most cancer patients, including patients with breast cancer, experience multiple symptoms simultaneously while receiving active treatment. Some symptoms tend to occur together and may be related, such as hot flashes and night sweats. Co-occurring symptoms may have a multiplicative effect on patients’ functioning, mental health, and quality of life. Symptom clusters in the context of oncology were originally described as groups of three or more related symptoms. Some authors have suggested symptom clusters may have practical applications, such as the formulation of more effective therapeutic interventions that address the combined effects of symptoms rather than treating each symptom separately. Most studies that have sought to identify clusters in breast cancer survivors have relied on traditional research studies. Social media, such as online health-related forums, contain a bevy of user-generated content in the form of threads and posts, and could be used as a data source to identify and characterize symptom clusters among cancer patients. The present study seeks to determine patterns of symptom clusters in breast cancer survivors derived from both social media and research study data using improved K-Medoid clustering. A total of 50,426 publicly available messages were collected from Medhelp.com and 653 questionnaires were collected as part of a research study. The network of symptoms built from social media was sparse compared to that of the research study data, making the social media data easier to partition. The proposed revised K-Medoid clustering helps to improve the clustering performance by re-assigning some of the negative-ASW (average silhouette width) symptoms to other clusters after initial K-Medoid clustering. This retains an overall non-decreasing ASW and avoids the problem of trapping in local optima. The overall ASW, individual ASW, and improved interpretation of the final clustering solution suggest improvement. The clustering results suggest
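
    The reassignment step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `silhouette` and `reassign_negative`, the greedy acceptance rule (a move is kept only if the overall ASW strictly increases, which also guarantees termination), and the toy one-dimensional data are all assumptions made for illustration.

```python
def silhouette(dist, labels):
    """Silhouette width of every item, given a full distance matrix."""
    n = len(labels)
    widths = []
    for i in range(n):
        mean_to = {}
        for c in set(labels):
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist[i][j] for j in members) / len(members)
        own = labels[i]
        others = [v for c, v in mean_to.items() if c != own]
        if own not in mean_to or not others:
            widths.append(0.0)  # singleton cluster or only one cluster
            continue
        a, b = mean_to[own], min(others)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


def reassign_negative(dist, labels):
    """Greedily move negative-silhouette items while the overall ASW improves."""
    labels = list(labels)
    best = sum(silhouette(dist, labels)) / len(labels)
    while True:
        move = None
        for i, width in enumerate(silhouette(dist, labels)):
            if width >= 0:
                continue
            for c in sorted(set(labels) - {labels[i]}):
                trial = labels[:i] + [c] + labels[i + 1:]
                score = sum(silhouette(dist, trial)) / len(trial)
                if score > best:  # accept only strict improvements
                    move, best = trial, score
                    break
            if move is not None:
                break
        if move is None:
            return labels, best
        labels = move


# Hypothetical data: the item at 10 starts in the wrong cluster.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
labels, asw = reassign_negative(dist, [0, 0, 0, 0, 1, 1])
print(labels, round(asw, 3))
```

    In this toy case the only item with a negative silhouette width is the one at 10, and moving it to the second cluster raises the overall ASW, so the move is accepted.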

  4. Genomic analyses of bacterial porin-cytochrome gene clusters

    DOE PAGES

    Shi, Liang; Fredrickson, James K.; Zachara, John M.

    2014-11-26

    In this study, the porin-cytochrome (Pcc) protein complex is responsible for trans-outer membrane electron transfer during extracellular reduction of Fe(III) by the dissimilatory metal-reducing bacterium Geobacter sulfurreducens PCA. The identified and characterized Pcc complex of G. sulfurreducens PCA consists of a porin-like outer-membrane protein, a periplasmic 8-heme c-type cytochrome (c-Cyt) and an outer-membrane 12-heme c-Cyt, and the genes encoding the Pcc proteins are clustered in the same regions of the genome (i.e., the pcc gene clusters) of G. sulfurreducens PCA. A survey of additional microbial genomes has identified the pcc gene clusters in all sequenced Geobacter spp. and other bacteria from six different phyla, including Anaeromyxobacter dehalogenans 2CP-1, A. dehalogenans 2CP-C, Anaeromyxobacter sp. K, Candidatus Kuenenia stuttgartiensis, Denitrovibrio acetiphilus DSM 12809, Desulfurispirillum indicum S5, Desulfurivibrio alkaliphilus AHT2, Desulfurobacterium thermolithotrophum DSM 11699, Desulfuromonas acetoxidans DSM 684, Ignavibacterium album JCM 16511, and Thermovibrio ammonificans HB-1. The numbers of genes in the pcc gene clusters vary, ranging from two to nine. Similar to the metal-reducing (Mtr) gene clusters of other Fe(III)-reducing bacteria, such as Shewanella spp., additional genes that encode putative c-Cyts with predicted cellular localizations at the cytoplasmic membrane, periplasm and outer membrane often associate with the pcc gene clusters. This suggests that the Pcc-associated c-Cyts may be part of the pathways for extracellular electron transfer reactions. The presence of pcc gene clusters in the microorganisms that do not reduce solid-phase Fe(III) and Mn(IV) oxides, such as D. alkaliphilus AHT2 and I. album JCM 16511, also suggests that some of the pcc gene clusters may be involved in extracellular electron transfer reactions with the substrates other than Fe(III) and Mn(IV) oxides.

  5. K-Means Subject Matter Expert Refined Topic Model Methodology

    DTIC Science & Technology

    2017-01-01

    Refined Topic Model Methodology Topic Model Estimation via K-Means U.S. Army TRADOC Analysis Center-Monterey 700 Dyer Road...January 2017 K-means Subject Matter Expert Refined Topic Model Methodology Topic Model Estimation via K-Means Theodore T. Allen, Ph.D. Zhenhuan...Matter Expert Refined Topic Model Methodology Topic Model Estimation via K-means 5a. CONTRACT NUMBER W9124N-15-P-0022 5b. GRANT NUMBER 5c

  6. Fast Constrained Spectral Clustering and Cluster Ensemble with Random Projection

    PubMed Central

    Liu, Wenfen

    2017-01-01

    The constrained spectral clustering (CSC) method can greatly improve clustering accuracy by incorporating constraint information into spectral clustering, and it has therefore received wide academic attention. In this paper, we propose a fast CSC algorithm via encoding landmark-based graph construction into a new CSC model and applying random sampling to decrease the data size after spectral embedding. Compared with the original model, the new algorithm yields similar results asymptotically as its model size increases; compared with the most efficient CSC algorithm known, the new algorithm runs faster and suits a wider range of data sets. Meanwhile, a scalable semisupervised cluster ensemble algorithm is also proposed via the combination of our fast CSC algorithm and dimensionality reduction with random projection in the process of spectral ensemble clustering. We demonstrate by presenting theoretical analysis and empirical results that the new cluster ensemble algorithm has advantages in terms of efficiency and effectiveness. Furthermore, the approximate preservation of clustering accuracy under random projection, proved for the consensus clustering stage, also holds for weighted k-means clustering and thus gives a theoretical guarantee for this special kind of k-means clustering, in which each point has its corresponding weight. PMID:29312447
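
    The dimensionality-reduction step mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration of a generic Gaussian random projection, not the construction used in the paper: the function name `random_projection`, the Gaussian matrix with 1/sqrt(k) scaling, the seeds, and the toy data are all assumptions.

```python
import math
import random


def random_projection(points, target_dim, seed=0):
    """Project points to target_dim dimensions with a scaled Gaussian matrix."""
    rng = random.Random(seed)
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    # One random row per output dimension; scaling keeps expected squared
    # distances unchanged.
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [tuple(sum(r * x for r, x in zip(row, p)) for row in matrix)
            for p in points]


# Hypothetical data: five points in 300 dimensions, reduced to 100.
data_rng = random.Random(1)
points = [[data_rng.random() for _ in range(300)] for _ in range(5)]
projected = random_projection(points, target_dim=100)

# Ratio of a pairwise distance after and before projection (close to 1
# with high probability for a projection of this size).
ratio = math.dist(projected[0], projected[1]) / math.dist(points[0], points[1])
print(round(ratio, 2))
```

    Because pairwise distances are approximately preserved, a distance-based method such as k-means can then be run on the lower-dimensional points at reduced cost.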

  7. Assessment and application of clustering techniques to atmospheric particle number size distribution for the purpose of source apportionment

    NASA Astrophysics Data System (ADS)

    Salimi, F.; Ristovski, Z.; Mazaheri, M.; Laiman, R.; Crilley, L. R.; He, C.; Clifford, S.; Morawska, L.

    2014-06-01

    Long-term measurements of particle number size distribution (PNSD) produce a very large number of observations and their analysis requires an efficient approach in order to produce results in the least possible time and with maximum accuracy. Clustering techniques are a family of sophisticated methods which have been recently employed to analyse PNSD data, however, very little information is available comparing the performance of different clustering techniques on PNSD data. This study aims to apply several clustering techniques (i.e. K-means, PAM, CLARA and SOM) to PNSD data, in order to identify and apply the optimum technique to PNSD data measured at 25 sites across Brisbane, Australia. A new method, based on the Generalised Additive Model (GAM) with a basis of penalised B-splines, was proposed to parameterise the PNSD data and the temporal weight of each cluster was also estimated using the GAM. In addition, each cluster was associated with its possible source based on the results of this parameterisation, together with the characteristics of each cluster. The performances of four clustering techniques were compared using the Dunn index and silhouette width validation values and the K-means technique was found to have the highest performance, with five clusters being the optimum. Therefore, five clusters were found within the data using the K-means technique. The diurnal occurrence of each cluster was used together with other air quality parameters, temporal trends and the physical properties of each cluster, in order to attribute each cluster to its source and origin. The five clusters were attributed to three major sources and origins, including regional background particles, photochemically induced nucleated particles and vehicle generated particles. Overall, clustering was found to be an effective technique for attributing each particle size spectra to its source and the GAM was suitable to parameterise the PNSD data. These two techniques can help

  8. Assessment and application of clustering techniques to atmospheric particle number size distribution for the purpose of source apportionment

    NASA Astrophysics Data System (ADS)

    Salimi, F.; Ristovski, Z.; Mazaheri, M.; Laiman, R.; Crilley, L. R.; He, C.; Clifford, S.; Morawska, L.

    2014-11-01

    Long-term measurements of particle number size distribution (PNSD) produce a very large number of observations and their analysis requires an efficient approach in order to produce results in the least possible time and with maximum accuracy. Clustering techniques are a family of sophisticated methods that have been recently employed to analyse PNSD data; however, very little information is available comparing the performance of different clustering techniques on PNSD data. This study aims to apply several clustering techniques (i.e. K means, PAM, CLARA and SOM) to PNSD data, in order to identify and apply the optimum technique to PNSD data measured at 25 sites across Brisbane, Australia. A new method, based on the Generalised Additive Model (GAM) with a basis of penalised B-splines, was proposed to parameterise the PNSD data and the temporal weight of each cluster was also estimated using the GAM. In addition, each cluster was associated with its possible source based on the results of this parameterisation, together with the characteristics of each cluster. The performances of four clustering techniques were compared using the Dunn index and Silhouette width validation values and the K means technique was found to have the highest performance, with five clusters being the optimum. Therefore, five clusters were found within the data using the K means technique. The diurnal occurrence of each cluster was used together with other air quality parameters, temporal trends and the physical properties of each cluster, in order to attribute each cluster to its source and origin. The five clusters were attributed to three major sources and origins, including regional background particles, photochemically induced nucleated particles and vehicle generated particles. Overall, clustering was found to be an effective technique for attributing each particle size spectrum to its source and the GAM was suitable to parameterise the PNSD data. These two techniques can help
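
    As a rough illustration of one of the validation measures named in this abstract, the Dunn index can be computed as below. This is a minimal sketch assuming the classic definition (smallest distance between points of different clusters divided by the largest within-cluster diameter); the function name `dunn_index` and the toy points are hypothetical, and the study may have used a different variant.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Smallest between-cluster distance over largest within-cluster diameter."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    # Largest distance between two points of the same cluster.
    diameter = max(
        (math.dist(p, q)
         for members in clusters.values()
         for p, q in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters.values(), 2)
        for p in a
        for q in b
    )
    return separation / diameter if diameter > 0 else math.inf


# Hypothetical data: two tight, well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(dunn_index(points, [0, 0, 1, 1]))  # compact, separated partition
print(dunn_index(points, [0, 1, 0, 1]))  # poor partition scores lower
```

    Higher values indicate more compact and better separated clusters, which is why the index can be used to rank candidate techniques or cluster counts.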

  9. Fast clustering algorithm for large ECG data sets based on CS theory in combination with PCA and K-NN methods.

    PubMed

    Balouchestani, Mohammadreza; Krishnan, Sridhar

    2014-01-01

    Long-term recording of Electrocardiogram (ECG) signals plays an important role in health care systems for diagnostic and treatment purposes of heart diseases. Clustering and classification of collected data are essential for detecting concealed information of P-QRS-T waves in long-term ECG recordings. Currently used algorithms have their share of drawbacks: 1) clustering and classification cannot be done in real time; 2) they suffer from huge energy consumption and sampling load. These drawbacks motivated us to develop a novel optimized clustering algorithm that could easily scan large ECG datasets for establishing low-power long-term ECG recording. In this paper, we present an advanced K-means clustering algorithm based on Compressed Sensing (CS) theory as a random sampling procedure. Then, two dimensionality reduction methods, Principal Component Analysis (PCA) and Linear Correlation Coefficient (LCC), followed by sorting the data using the K-Nearest Neighbours (K-NN) and Probabilistic Neural Network (PNN) classifiers, are applied to the proposed algorithm. We show that our algorithm based on PCA features in combination with the K-NN classifier performs better than other methods. The proposed algorithm outperforms existing algorithms by increasing classification accuracy by 11%. In addition, the proposed algorithm illustrates classification accuracy for K-NN and PNN classifiers, and a Receiver Operating Characteristics (ROC) area of 99.98%, 99.83%, and 99.75% respectively.

  10. A New Variable Weighting and Selection Procedure for K-Means Cluster Analysis

    ERIC Educational Resources Information Center

    Steinley, Douglas; Brusco, Michael J.

    2008-01-01

    A variance-to-range ratio variable weighting procedure is proposed. We show how this weighting method is theoretically grounded in the inherent variability found in data exhibiting cluster structure. In addition, a variable selection procedure is proposed to operate in conjunction with the variable weighting technique. The performances of these…

  11. Properties of small Ar(N-1)K(+) ionic clusters

    NASA Technical Reports Server (NTRS)

    Etters, R. D.; Danilowicz, R.; Dugan, J.

    1977-01-01

    A self-consistent formalism is developed that, based upon a many-body potential, dynamically determines the thermodynamic properties of ionic clusters without an a priori designation of the equilibrium structures. Aggregates consisting of a single closed shell K(+) ion and N-1 isoelectronic argon atoms were studied. The clusters form crystallites at low temperatures, and melting transitions and spontaneous dissociations are indicated. The results confirm experimental evidence that shows that ionic clusters become less stable with increasing N. The crystallite structures formed by four different clusters are isosceles triangle, skewed form, octahedron with ion in the middle, and icosahedron with the ion in the middle.

  12. Research on Abnormal Detection Based on Improved Combination of K - means and SVDD

    NASA Astrophysics Data System (ADS)

    Hao, Xiaohong; Zhang, Xiaofeng

    2018-01-01

    In order to improve the efficiency of network intrusion detection and reduce the false alarm rate, this paper proposes an anomaly detection algorithm based on improved K-means and SVDD. The algorithm first uses the improved K-means algorithm to cluster the training samples of each class, so that each class is independent and internally compact. Then, the SVDD algorithm is used to construct minimum hyperspheres from the training samples. The class membership of a sample is determined by calculating its distance to the minimum hyperspheres constructed by SVDD: if the distance from the test sample to the center of a hypersphere is less than the radius, the test sample belongs to that class; otherwise it does not. After several such comparisons, the test sample is finally classified. We use the KDD CUP99 data set to simulate the proposed anomaly detection algorithm. The results show that the algorithm has a high detection rate and a low false alarm rate, making it an effective network security protection method.

  13. Percolation analyses of observed and simulated galaxy clustering

    NASA Astrophysics Data System (ADS)

    Bhavsar, S. P.; Barrow, J. D.

    1983-11-01

    A percolation cluster analysis is performed on equivalent regions of the CFA redshift survey of galaxies and the 4000 body simulations of gravitational clustering made by Aarseth, Gott and Turner (1979). The observed and simulated percolation properties are compared and, unlike correlation and multiplicity function analyses, favour high density (Omega = 1) models with n = - 1 initial data. The present results show that the three-dimensional data are consistent with the degree of filamentary structure present in isothermal models of galaxy formation at the level of percolation analysis. It is also found that the percolation structure of the CFA data is a function of depth. Percolation structure does not appear to be a sensitive probe of intrinsic filamentary structure.

  14. Simultaneous Two-Way Clustering of Multiple Correspondence Analysis

    ERIC Educational Resources Information Center

    Hwang, Heungsun; Dillon, William R.

    2010-01-01

    A 2-way clustering approach to multiple correspondence analysis is proposed to account for cluster-level heterogeneity of both respondents and variable categories in multivariate categorical data. Specifically, in the proposed method, multiple correspondence analysis is combined with k-means in a unified framework in which "k"-means is…

  15. Bearing performance degradation assessment based on a combination of empirical mode decomposition and k-medoids clustering

    NASA Astrophysics Data System (ADS)

    Rai, Akhand; Upadhyay, S. H.

    2017-09-01

    The bearing is the most critical component in rotating machinery since it is more susceptible to failure. The monitoring of degradation in bearings is therefore of great concern for averting sudden machinery breakdown. In this study, a novel method for bearing performance degradation assessment (PDA) based on a combination of empirical mode decomposition (EMD) and k-medoids clustering is proposed. The fault features are extracted from the bearing signals using the EMD process. The extracted features are then subjected to k-medoids based clustering to obtain the normal state and failure state cluster centres. A confidence value (CV) curve based on the dissimilarity of the test data object to the normal state is obtained and employed as the degradation indicator for assessing the health of bearings. The proposed approach is applied to the vibration signals collected in run-to-failure tests of bearings to assess its effectiveness in bearing PDA. To validate the superiority of the suggested approach, it is compared with the commonly used time-domain features RMS and kurtosis, the well-known fault diagnosis method envelope analysis (EA), and existing PDA classifiers, i.e. self-organizing maps (SOM) and Fuzzy c-means (FCM). The results demonstrate that the recommended method outperforms the time-domain features and SOM- and FCM-based PDA in detecting early stage degradation more precisely. Moreover, EA can be used as an accompanying method to confirm the early stage defect detected by the proposed bearing PDA approach. The study shows the potential application of k-medoids clustering as an effective tool for PDA of bearings.

  16. LOFT L2-3 blowdown experiment safety analyses D, E, and G; LOCA analyses H, K, K1

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Perryman, J.L.; Keeler, C.D.; Saukkoriipi, L.O.

    1978-12-01

    Three calculations using conservative off-nominal conditions and evaluation model options were made using RELAP4/MOD5 for blowdown-refill and RELAP4/MOD6 for reflood for Loss-of-Fluid Test Experiment L2-3 to support the experiment safety analysis effort. The three analyses are as follows: Analysis D: Loss of commercial power during Experiment L2-3; Analysis E: Hot leg quick-opening blowdown valve (QOBV) does not open during Experiment L2-3; and Analysis G: Cold leg QOBV does not open during Experiment L2-3. In addition, the results of three LOFT loss-of-coolant accident (LOCA) analyses using a power of 56.1 MW and a primary coolant system flow rate of 3.6 million lbm/hr are presented: Analysis H: Intact loop 200% hot leg break; emergency core cooling (ECC) system B unavailable; Analysis K: Pressurizer relief valve stuck in open position; ECC system B unavailable; and Analysis K1: Same as analysis K, but using a primary coolant system flow rate of 1.92 million lbm/hr (L2-4 pre-LOCE flow rate). For analysis D, the maximum cladding temperature reached was 1762°F, 22 sec into reflood. In analyses E and G, the blowdowns were slower due to one of the QOBVs not functioning. The maximum cladding temperature reached in analysis E was 1700°F, 64.7 sec into reflood; for analysis G, it was 1300°F at the start of reflood. For analysis H, the maximum cladding temperature reached was 1825°F, 0.01 sec into reflood. Analysis K was a very slow blowdown, and the cladding temperatures followed the saturation temperature of the system. The results of analysis K1 were nearly identical to those of analysis K; system depressurization was not affected by the primary coolant system flow rate.

  17. EXPLORING FUNCTIONAL CONNECTIVITY IN FMRI VIA CLUSTERING.

    PubMed

    Venkataraman, Archana; Van Dijk, Koene R A; Buckner, Randy L; Golland, Polina

    2009-04-01

    In this paper we investigate the use of data driven clustering methods for functional connectivity analysis in fMRI. In particular, we consider the K-Means and Spectral Clustering algorithms as alternatives to the commonly used Seed-Based Analysis. To enable clustering of the entire brain volume, we use the Nyström Method to approximate the necessary spectral decompositions. We apply K-Means, Spectral Clustering and Seed-Based Analysis to resting-state fMRI data collected from 45 healthy young adults. Without placing any a priori constraints, both clustering methods yield partitions that are associated with brain systems previously identified via Seed-Based Analysis. Our empirical results suggest that clustering provides a valuable tool for functional connectivity analysis.

  18. Prediction of Tibial Rotation Pathologies Using Particle Swarm Optimization and K-Means Algorithms.

    PubMed

    Sari, Murat; Tuna, Can; Akogul, Serkan

    2018-03-28

    The aim of this article is to investigate pathological subjects from a population through different physical factors. To achieve this, particle swarm optimization (PSO) and K-means (KM) clustering algorithms have been combined (PSO-KM). Datasets provided by the literature were divided into three clusters based on age and weight parameters and each one of right tibial external rotation (RTER), right tibial internal rotation (RTIR), left tibial external rotation (LTER), and left tibial internal rotation (LTIR) values were divided into three types as Type 1, Type 2 and Type 3 (Type 2 is non-pathological (normal) and the other two types are pathological (abnormal)), respectively. The rotation values of every subject in any cluster were noted. Then the algorithm was run and the produced values were also considered. The values produced by the PSO-KM algorithm have been compared with the real values. The hybrid PSO-KM algorithm has been very successful on the optimal clustering of the tibial rotation types through the physical criteria. In this investigation, Type 2 (pathological subjects) is of especially high predictability and the PSO-KM algorithm has been very successful as an operation system for clustering and optimizing the tibial motion data assessments. These research findings are expected to be very useful for health providers, such as physiotherapists and orthopedists, and may help clinicians design appropriate treatment schedules for patients.

  19. Are clusters of dietary patterns and cluster membership stable over time? Results of a longitudinal cluster analysis study.

    PubMed

    Walthouwer, Michel Jean Louis; Oenema, Anke; Soetens, Katja; Lechner, Lilian; de Vries, Hein

    2014-11-01

    Developing nutrition education interventions based on clusters of dietary patterns can only be done adequately when it is clear if distinctive clusters of dietary patterns can be derived and reproduced over time, if cluster membership is stable, and if it is predictable which type of people belong to a certain cluster. Hence, this study aimed to: (1) identify clusters of dietary patterns among Dutch adults, (2) test the reproducibility of these clusters and stability of cluster membership over time, and (3) identify sociodemographic predictors of cluster membership and cluster transition. This study had a longitudinal design with online measurements at baseline (N=483) and 6 months follow-up (N=379). Dietary intake was assessed with a validated food frequency questionnaire. A hierarchical cluster analysis was performed, followed by a K-means cluster analysis. Multinomial logistic regression analyses were conducted to identify the sociodemographic predictors of cluster membership and cluster transition. At baseline and follow-up, a comparable three-cluster solution was derived, distinguishing a healthy, moderately healthy, and unhealthy dietary pattern. Male and lower educated participants were significantly more likely to have a less healthy dietary pattern. Further, 251 (66.2%) participants remained in the same cluster, 45 (11.9%) participants changed to an unhealthier cluster, and 83 (21.9%) participants shifted to a healthier cluster. Men and people living alone were significantly more likely to shift toward a less healthy dietary pattern. Distinctive clusters of dietary patterns can be derived. Yet, cluster membership is unstable and only few sociodemographic factors were associated with cluster membership and cluster transition. These findings imply that clusters based on dietary intake may not be suitable as a basis for nutrition education interventions. Copyright © 2014 Elsevier Ltd. All rights reserved.
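
    The two-stage procedure described in this abstract (a hierarchical analysis followed by K-means) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's analysis: the abstract does not name the linkage, so a simple centroid-linkage agglomeration is used here, and the function names `centroid`, `agglomerate`, and `kmeans` as well as the toy two-dimensional points are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def agglomerate(points, k):
    """Stage 1: merge the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        _, i, j = min(
            (math.dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return [centroid(c) for c in clusters]


def kmeans(points, centers, iterations=100):
    """Stage 2: Lloyd's algorithm started from the hierarchical centroids."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:  # assignments have stabilised
            break
        centers = new_centers
    return labels, centers


# Hypothetical data with three obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (21, 0), (20, 1)]
seeds = agglomerate(points, k=3)
labels, centers = kmeans(points, seeds)
print(labels)
```

    Seeding K-means with the hierarchical solution is a common way to avoid arbitrary starting points; the hierarchical stage can also be used to choose the number of clusters before the K-means refinement.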

  20. Cluster K Mycobacteriophages: Insights into the Evolutionary Origins of Mycobacteriophage TM4

    PubMed Central

    Pope, Welkin H.; Ferreira, Christina M.; Jacobs-Sera, Deborah; Benjamin, Robert C.; Davis, Ariangela J.; DeJong, Randall J.; Elgin, Sarah C. R.; Guilfoile, Forrest R.; Forsyth, Mark H.; Harris, Alexander D.; Harvey, Samuel E.; Hughes, Lee E.; Hynes, Peter M.; Jackson, Arrykka S.; Jalal, Marilyn D.; MacMurray, Elizabeth A.; Manley, Coreen M.; McDonough, Molly J.; Mosier, Jordan L.; Osterbann, Larissa J.; Rabinowitz, Hannah S.; Rhyan, Corwin N.; Russell, Daniel A.; Saha, Margaret S.; Shaffer, Christopher D.; Simon, Stephanie E.; Sims, Erika F.; Tovar, Isabel G.; Weisser, Emilie G.; Wertz, John T.; Weston-Hafer, Kathleen A.; Williamson, Kurt E.; Zhang, Bo; Cresawn, Steven G.; Jain, Paras; Piuri, Mariana; Jacobs, William R.; Hendrix, Roger W.; Hatfull, Graham F.

    2011-01-01

    Five newly isolated mycobacteriophages –Angelica, CrimD, Adephagia, Anaya, and Pixie – have similar genomic architectures to mycobacteriophage TM4, a previously characterized phage that is widely used in mycobacterial genetics. The nucleotide sequence similarities warrant grouping these into Cluster K, with subdivision into three subclusters: K1, K2, and K3. Although the overall genome architectures of these phages are similar, TM4 appears to have lost at least two segments of its genome, a central region containing the integration apparatus, and a segment at the right end. This suggests that TM4 is a recent derivative of a temperate parent, resolving a long-standing conundrum about its biology, in that it was reportedly recovered from a lysogenic strain of Mycobacterium avium, but it is not capable of forming lysogens in any mycobacterial host. Like TM4, all of the Cluster K phages infect both fast- and slow-growing mycobacteria, and all of them – with the exception of TM4 – form stable lysogens in both Mycobacterium smegmatis and Mycobacterium tuberculosis; immunity assays show that all five of these phages share the same immune specificity. TM4 infects these lysogens suggesting that it was either derived from a heteroimmune temperate parent or that it has acquired a virulent phenotype. We have also characterized a widely-used conditionally replicating derivative of TM4 and identified mutations conferring the temperature-sensitive phenotype. All of the Cluster K phages contain a series of well conserved 13 bp repeats associated with the translation initiation sites of a subset of the genes; approximately one half of these contain an additional sequence feature composed of imperfectly conserved 17 bp inverted repeats separated by a variable spacer. The K1 phages integrate into the host tmRNA and the Cluster K phages represent potential new tools for the genetics of M. tuberculosis and related species. PMID:22053209

  1. Model selection for semiparametric marginal mean regression accounting for within-cluster subsampling variability and informative cluster size.

    PubMed

    Shen, Chung-Wei; Chen, Yi-Hau

    2018-03-13

    We propose a model selection criterion for semiparametric marginal mean regression based on generalized estimating equations. The work is motivated by a longitudinal study on the physical frailty outcome in the elderly, where the cluster size, that is, the number of the observed outcomes in each subject, is "informative" in the sense that it is related to the frailty outcome itself. The new proposal, called Resampling Cluster Information Criterion (RCIC), is based on the resampling idea utilized in the within-cluster resampling method (Hoffman, Sen, and Weinberg, 2001, Biometrika 88, 1121-1134) and accommodates informative cluster size. The implementation of RCIC, however, is free of performing actual resampling of the data and hence is computationally convenient. Compared with the existing model selection methods for marginal mean regression, the RCIC method incorporates an additional component accounting for variability of the model over within-cluster subsampling, and leads to remarkable improvements in selecting the correct model, regardless of whether the cluster size is informative or not. Applying the RCIC method to the longitudinal frailty study, we identify being female, old age, low income and life satisfaction, and chronic health conditions as significant risk factors for physical frailty in the elderly. © 2018, The International Biometric Society.

  2. Atmospheric effects on cluster analyses. [for remote sensing application

    NASA Technical Reports Server (NTRS)

    Kiang, R. K.

    1979-01-01

    Ground reflected radiance, from which information is extracted through techniques of cluster analyses for remote sensing application, is altered by the atmosphere when it reaches the satellite. Therefore it is essential to understand the effects of the atmosphere on Landsat measurements, cluster characteristics and analysis accuracy. A doubling model is employed to compute the effective reflectivity, observed from the satellite, as a function of ground reflectivity, solar zenith angle and aerosol optical thickness for standard atmosphere. The relation between the effective reflectivity and ground reflectivity is approximately linear. It is shown that for a horizontally homogeneous atmosphere, the classification statistics from a maximum likelihood classifier remains unchanged under these transforms. If inhomogeneity is present, the divergence between clusters is reduced, and correlation between spectral bands increases. Radiance reflected by the background area surrounding the target may also reach the satellite. The influence of background reflectivity on effective reflectivity is discussed.

  3. Categorizing document by fuzzy C-Means and K-nearest neighbors approach

    NASA Astrophysics Data System (ADS)

    Priandini, Novita; Zaman, Badrus; Purwanti, Endah

    2017-08-01

    Advances in technology and the growing number of documents have made document categorization increasingly important. Managing documents by categorizing them is an application of information retrieval, because the process involves text mining. Categorization can be performed with both the Fuzzy C-Means (FCM) and K-Nearest Neighbors (KNN) methods, and this experiment combines the two with the aim of improving document categorization performance. First, FCM is used to cluster the training documents. Second, KNN is used to categorize the testing documents until the categorization output is produced. In the experiment, 14 of the 20 testing documents were assigned to their relevant category, while the remaining 6 were assigned to an irrelevant category. The system evaluation shows that both precision and recall are 0.7.
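
    The reported figures can be checked with a short worked example. The abstract does not say how precision and recall were averaged, so the sketch below assumes micro-averaging over single-label predictions, under which 14 correct assignments out of 20 give 0.7 for both measures. The function name `micro_precision_recall` and the category names are hypothetical.

```python
def micro_precision_recall(true_labels, predicted_labels):
    """Micro-averaged precision and recall for single-label categorization."""
    tp = fp = fn = 0
    for category in set(true_labels) | set(predicted_labels):
        for truth, guess in zip(true_labels, predicted_labels):
            if guess == category and truth == category:
                tp += 1  # correctly assigned to this category
            elif guess == category:
                fp += 1  # assigned to this category by mistake
            elif truth == category:
                fn += 1  # belongs to this category but was missed
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical labels: 20 test documents, 14 categorized correctly.
true_labels = ["sports"] * 10 + ["politics"] * 10
predicted = (["sports"] * 7 + ["politics"] * 3
             + ["politics"] * 7 + ["sports"] * 3)
precision, recall = micro_precision_recall(true_labels, predicted)
print(precision, recall)
```

    With one predicted category per document, every wrong assignment counts once as a false positive and once as a false negative, which is why the two measures coincide.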

  4. A comparison of visual search strategies of elite and non-elite tennis players through cluster analysis.

    PubMed

    Murray, Nicholas P; Hunfalvay, Melissa

    2017-02-01

    Considerable research has documented that successful performance in interceptive tasks (such as return of serve in tennis) is based on the performers' capability to capture appropriate anticipatory information prior to the flight path of the approaching object. Athletes of higher skill tend to fixate on different locations in the playing environment prior to initiation of a skill than their lesser skilled counterparts. The purpose of this study was to examine visual search behaviour strategies of elite (world ranked) tennis players and non-ranked competitive tennis players (n = 43) utilising cluster analysis. The results of hierarchical (Ward's method) and nonhierarchical (k-means) cluster analyses revealed three different clusters. The clustering method distinguished visual behaviour of high-, middle- and low-ranked players. Specifically, high-ranked players demonstrated longer mean fixation duration and lower variation of visual search than middle- and low-ranked players. In conclusion, the results demonstrated that cluster analysis is a useful tool for detecting and analysing the areas of interest for use in experimental analysis of expertise and to distinguish visual search variables among participants.

  5. Finding gene clusters for a replicated time course study

    PubMed Central

    2014-01-01

    Background: Finding genes that share similar expression patterns across samples is an important question that is frequently asked in high-throughput microarray studies. Traditional clustering algorithms such as K-means clustering and hierarchical clustering base gene clustering directly on the observed measurements and do not take into account the specific experimental design under which the microarray data were collected. A new model-based clustering method, the clustering of regression models method, takes into account the specific design of the microarray study and bases the clustering on how genes are related to sample covariates. It can find useful gene clusters for studies from complicated study designs such as replicated time course studies. Findings: In this paper, we applied the clustering of regression models method to data from a time course study of yeast on two genotypes, wild type and YOX1 mutant, each with two technical replicates, and compared the clustering results with K-means clustering. We identified gene clusters that have similar expression patterns in wild type yeast, two of which were missed by K-means clustering. We further identified gene clusters whose expression patterns were changed in YOX1 mutant yeast compared to wild type yeast. Conclusions: The clustering of regression models method can be a valuable tool for identifying genes that are coordinately transcribed by a common mechanism. PMID:24460656

  6. Convalescing Cluster Configuration Using a Superlative Framework

    PubMed Central

    Sabitha, R.; Karthik, S.

    2015-01-01

    Competent data mining methods are vital to discover knowledge from databases which are built as a result of enormous growth of data. Various techniques of data mining are applied to obtain knowledge from these databases. Data clustering is one such descriptive data mining technique which guides in partitioning data objects into disjoint segments. K-means algorithm is a versatile algorithm among the various approaches used in data clustering. The algorithm and its diverse adaptation methods suffer certain problems in their performance. To overcome these issues a superlative algorithm has been proposed in this paper to perform data clustering. The specific feature of the proposed algorithm is discretizing the dataset, thereby improving the accuracy of clustering, and also adopting the binary search initialization method to generate cluster centroids. The generated centroids are fed as input to K-means approach which iteratively segments the data objects into respective clusters. The clustered results are measured for accuracy and validity. Experiments conducted by testing the approach on datasets from the UC Irvine Machine Learning Repository evidently show that the accuracy and validity measure is higher than the other two approaches, namely, simple K-means and Binary Search method. Thus, the proposed approach proves that discretization process will improve the efficacy of descriptive data mining tasks. PMID:26543895

  7. Glutamic acid promotes monacolin K production and monacolin K biosynthetic gene cluster expression in Monascus.

    PubMed

    Zhang, Chan; Liang, Jian; Yang, Le; Chai, Shiyuan; Zhang, Chenxi; Sun, Baoguo; Wang, Chengtao

    2017-12-01

    This study investigated the effects of glutamic acid on production of monacolin K and expression of the monacolin K biosynthetic gene cluster. When Monascus M1 was grown in glutamic medium instead of in the original medium, monacolin K production increased from 48.4 to 215.4 mg l⁻¹, an increase of 3.5 times. Glutamic acid enhanced monacolin K production by upregulating the expression of mokB-mokI; on day 8, the expression level of mokA, measured by reverse transcription-polymerase chain reaction, tended to decrease. Our findings demonstrated that mokA was not a key gene responsible for the quantity of monacolin K production in the presence of glutamic acid. Observation of Monascus mycelium morphology using a scanning electron microscope showed that glutamic acid significantly increased the content of Monascus mycelium, altered the permeability of Monascus mycelium, enhanced secretion of monacolin K from the cell, and reduced the monacolin K content in Monascus mycelium, thereby enhancing monacolin K production.

  8. A hybrid monkey search algorithm for clustering analysis.

    PubMed

    Chen, Xin; Zhou, Yongquan; Luo, Qifang

    2014-01-01

    Clustering is a popular data analysis and data mining technique. The k-means clustering algorithm is one of the most commonly used methods. However, it depends heavily on the initial solution and easily falls into a local optimum. In view of these disadvantages of the k-means method, this paper proposes a hybrid monkey algorithm based on the search operator of the artificial bee colony algorithm for clustering analysis. Experiments on synthetic and real-life datasets show that the algorithm performs better than the basic monkey algorithm for clustering analysis.

  9. Clustering Binary Data in the Presence of Masking Variables

    ERIC Educational Resources Information Center

    Brusco, Michael J.

    2004-01-01

    A number of important applications require the clustering of binary data sets. Traditional nonhierarchical cluster analysis techniques, such as the popular K-means algorithm, can often be successfully applied to these data sets. However, the presence of masking variables in a data set can impede the ability of the K-means algorithm to recover the…

  10. Dynamic Trajectory Extraction from Stereo Vision Using Fuzzy Clustering

    NASA Astrophysics Data System (ADS)

    Onishi, Masaki; Yoda, Ikushi

    In recent years, many human tracking methods have been proposed in order to analyze human dynamic trajectories. These methods are general technologies applicable to various fields, such as customer purchase analysis in a shopping environment and safety control at a (railroad) crossing. In this paper, we present a new approach for tracking human positions from stereo images. We use a framework of two-step clustering with the k-means method and fuzzy clustering to detect human regions. In the initial clustering, the k-means method forms intermediate clusters at high speed from objective features extracted by stereo vision. In the final clustering, the fuzzy c-means method groups the intermediate clusters into human regions based on their attributes. By expressing ambiguity through fuzzy clustering, the proposed method clusters correctly even when many people are close to each other. The validity of our technique was evaluated in an experiment on extracting the trajectories of doctors and nurses in a hospital emergency room.

  11. Evaluation of different approaches for identifying optimal sites to predict mean hillslope soil moisture content

    NASA Astrophysics Data System (ADS)

    Liao, Kaihua; Zhou, Zhiwen; Lai, Xiaoming; Zhu, Qing; Feng, Huihui

    2017-04-01

    The identification of representative soil moisture sampling sites is important for the validation of remotely sensed mean soil moisture in a certain area and ground-based soil moisture measurements in catchment or hillslope hydrological studies. Numerous approaches have been developed to identify optimal sites for predicting mean soil moisture. Each method has certain advantages and disadvantages, but they have rarely been evaluated and compared. In our study, surface (0-20 cm) soil moisture data from January 2013 to March 2016 (a total of 43 sampling days) were collected at 77 sampling sites on a mixed land-use (tea and bamboo) hillslope in the hilly area of Taihu Lake Basin, China. A total of 10 methods (temporal stability (TS) analyses based on 2 indices, K-means clustering based on 6 kinds of inputs and 2 random sampling strategies) were evaluated for determining optimal sampling sites for mean soil moisture estimation. They were TS analyses based on the smallest index of temporal stability (ITS, a combination of the mean relative difference and standard deviation of relative difference (SDRD)) and based on the smallest SDRD, K-means clustering based on soil properties and terrain indices (EFs), repeated soil moisture measurements (Theta), EFs plus one-time soil moisture data (EFsTheta), and the principal components derived from EFs (EFs-PCA), Theta (Theta-PCA), and EFsTheta (EFsTheta-PCA), and global and stratified random sampling strategies. Results showed that the TS based on the smallest ITS was better (RMSE = 0.023 m3 m-3) than that based on the smallest SDRD (RMSE = 0.034 m3 m-3). The K-means clustering based on EFsTheta (-PCA) was better (RMSE <0.020 m3 m-3) than these based on EFs (-PCA) and Theta (-PCA). The sampling design stratified by the land use was more efficient than the global random method. 
    Forty and 60 sampling sites are needed for stratified sampling and global sampling respectively to make their performances comparable to the best K-means

  12. Research on retailer data clustering algorithm based on Spark

    NASA Astrophysics Data System (ADS)

    Huang, Qiuman; Zhou, Feng

    2017-03-01

    Big data analysis is a hot topic in the IT field now. Spark is a high-reliability and high-performance distributed parallel computing framework for big data sets. K-means algorithm is one of the classical partition methods in clustering algorithm. In this paper, we study the k-means clustering algorithm on Spark. Firstly, the principle of the algorithm is analyzed, and then the clustering analysis is carried out on the supermarket customers through the experiment to find out the different shopping patterns. At the same time, this paper proposes the parallelization of k-means algorithm and the distributed computing framework of Spark, and gives the concrete design scheme and implementation scheme. This paper uses the two-year sales data of a supermarket to validate the proposed clustering algorithm and achieve the goal of subdividing customers, and then analyze the clustering results to help enterprises to take different marketing strategies for different customer groups to improve sales performance.
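
    As a rough illustration of how one k-means iteration can be parallelized, the sketch below emulates the map and reduce steps in plain Python. It does not use the Spark API and is not the authors' design; the function names map_partition and reduce_partials, the use of squared Euclidean distance, and the toy data are all assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_partition(partition, centroids):
    """Map step: per-centroid (coordinate sums, count) for one data partition."""
    partial = {}
    for p in partition:
        c = min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        sums, count = partial.get(c, ([0.0] * len(p), 0))
        partial[c] = ([s + x for s, x in zip(sums, p)], count + 1)
    return partial


def reduce_partials(partials, centroids):
    """Reduce step: merge the partial sums and return the updated centroids."""
    merged = {}
    for partial in partials:
        for c, (sums, count) in partial.items():
            old_sums, old_count = merged.get(c, ([0.0] * len(sums), 0))
            merged[c] = ([a + b for a, b in zip(old_sums, sums)], old_count + count)
    # A centroid that received no points keeps its previous position.
    return [
        tuple(s / merged[c][1] for s in merged[c][0]) if c in merged else centroids[c]
        for c in range(len(centroids))
    ]


# Two partitions stand in for data spread over two workers (hypothetical data).
partitions = [[(0.0, 0.0), (4.0, 0.0)], [(0.0, 1.0), (4.0, 1.0)]]
centroids = [(0.0, 0.0), (4.0, 1.0)]
centroids = reduce_partials([map_partition(p, centroids) for p in partitions], centroids)
print(centroids)  # [(0.0, 0.5), (4.0, 0.5)]
```

    Only the small per-centroid sums and counts travel between the map and reduce steps, which is why this decomposition suits a distributed framework.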

  13. Vessel Segmentation in Retinal Images Using Multi-scale Line Operator and K-Means Clustering.

    PubMed

    Saffarzadeh, Vahid Mohammadi; Osareh, Alireza; Shadgar, Bita

    2014-04-01

    Detecting blood vessels is a vital task in retinal image analysis. The task is more challenging with the presence of bright and dark lesions in retinal images. Here, a method is proposed to detect vessels in both normal and abnormal retinal fundus images based on their linear features. First, the negative impact of bright lesions is reduced by using K-means segmentation in a perceptive space. Then, a multi-scale line operator is utilized to detect vessels while ignoring some of the dark lesions, which have intensity structures different from the line-shaped vessels in the retina. The proposed algorithm is tested on two publicly available STARE and DRIVE databases. The performance of the method is measured by calculating the area under the receiver operating characteristic curve and the segmentation accuracy. The proposed method achieves 0.9483 and 0.9387 localization accuracy against STARE and DRIVE respectively.

  14. Formation of metallic clusters in oxide insulators by means of ion beam mixing

    NASA Astrophysics Data System (ADS)

    Talut, G.; Potzger, K.; Mücklich, A.; Zhou, Shengqiang

    2008-04-01

    The intermixing and near-interface cluster formation of Pt and FePt thin films deposited on different oxide surfaces by means of Pt+ ion irradiation and subsequent annealing was investigated. Irradiated as well as postannealed samples were investigated using high resolution transmission electron microscopy. In MgO and Y:ZrO2 covered with Pt, crystalline clusters with mean sizes of 2 and 3.5 nm were found after the Pt+ irradiations with 8×10¹⁵ and 2×10¹⁶ cm⁻² and subsequent annealing, respectively. In MgO samples covered with FePt, clusters with mean sizes of 1 and 2 nm were found after the Pt+ irradiations with 8×10¹⁵ and 2×10¹⁶ cm⁻² and subsequent annealing, respectively. In Y:ZrO2 samples covered with FePt, clusters up to 5 nm in size were found after the Pt+ irradiation with 2×10¹⁶ cm⁻² and subsequent annealing. In LaAlO3 the irradiation was accompanied by a full amorphization of the host matrix and appearance of embedded clusters of different sizes. The determination of the lattice constant and thus the kind of the clusters in samples covered by FePt was hindered due to strong deviation of the electron beam by the ferromagnetic FePt.

  15. Lowest-energy structures of (C60)nX (X=Li+,Na+,K+,Cl-) and (C60)nYCl (Y=Li,Na,K) clusters for n

    PubMed

    Hernández-Rojas, J; Bretón, J; Gomez Llorente, J M; Wales, D J

    2004-12-22

    Basin-hopping global optimization is used to find likely candidates for the lowest minima on the potential energy surface of (C(60))(n)X (X=Li(+),Na(+),K(+),Cl(-)) and (C(60))(n)YCl (Y=Li,Na,K) clusters with ncluster, the coordination shell being triangular for Li(+), tetrahedral for Na(+) and K(+), and octahedral for Cl(-). When the required coordination site does not exist in the corresponding (C(60))(n) global minimum, the lowest minimum of the doped system may be based on an alternative geometry. This situation is particularly common in the Cl(-) complexes, where the (C(60))(n) global minima with icosahedral packing change into decahedral or closed-packed forms for the ions. In all the ions we find a significant binding energy for the doped cluster. In the alkali chloride complexes the preferred coordination for the diatomic moiety is octahedral and is basically determined by the Cl(-) ion. However, the smaller polarization energies in this case mean that a change in structure from the (C(60))(n) global minimum does not necessarily occur if there is no octahedral site. (c) 2004 American Institute of Physics.

  16. Quantum annealing for combinatorial clustering

    NASA Astrophysics Data System (ADS)

    Kumar, Vaibhaw; Bass, Gideon; Tomlin, Casey; Dulny, Joseph

    2018-02-01

    Clustering is a powerful machine learning technique that groups "similar" data points based on their characteristics. Many clustering algorithms work by approximating the minimization of an objective function, namely the sum of within-the-cluster distances between points. The straightforward approach involves examining all the possible assignments of points to each of the clusters. This approach guarantees the solution will be a global minimum; however, the number of possible assignments scales quickly with the number of data points and becomes computationally intractable even for very small datasets. In order to circumvent this issue, cost function minima are found using popular local search-based heuristic approaches such as k-means and hierarchical clustering. Due to their greedy nature, such techniques do not guarantee that a global minimum will be found and can lead to sub-optimal clustering assignments. Other classes of global search-based techniques, such as simulated annealing, tabu search, and genetic algorithms, may offer better quality results but can be too time-consuming to implement. In this work, we describe how quantum annealing can be used to carry out clustering. We map the clustering objective to a quadratic binary optimization problem and discuss two clustering algorithms which are then implemented on commercially available quantum annealing hardware, as well as on a purely classical solver "qbsolv." The first algorithm assigns N data points to K clusters, and the second one can be used to perform binary clustering in a hierarchical manner. We present our results in the form of benchmarks against well-known k-means clustering and discuss the advantages and disadvantages of the proposed techniques.
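
    The exhaustive approach mentioned in this abstract can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' quantum annealing formulation: it takes the objective to be the sum of pairwise squared Euclidean distances within each cluster, and the names within_cluster_cost and brute_force_clustering are hypothetical.

```python
from itertools import product


def within_cluster_cost(points, labels):
    """Sum of pairwise squared distances between points sharing a label."""
    cost = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] == labels[j]:
                cost += sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    return cost


def brute_force_clustering(points, k):
    """Try every one of the k**n label assignments and keep the cheapest."""
    best_labels, best_cost = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:  # skip assignments that leave a cluster empty
            continue
        cost = within_cluster_cost(points, labels)
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, cost = brute_force_clustering(points, k=2)
print(labels, cost)  # (0, 0, 1, 1) 2.0
```

    The loop visits k**n assignments, so 30 points and 3 clusters would already require about 2 × 10^14 evaluations. That growth is what motivates the heuristic and annealing-based alternatives discussed in the abstract.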

  17. Genetic diversity of K-antigen gene clusters of Escherichia coli and their molecular typing using a suspension array.

    PubMed

    Yang, Shuang; Xi, Daoyi; Jing, Fuyi; Kong, Deju; Wu, Junli; Feng, Lu; Cao, Boyang; Wang, Lei

    2018-04-01

    Capsular polysaccharides (CPSs), or K-antigens, are the major surface antigens of Escherichia coli. More than 80 serologically unique K-antigens are classified into 4 groups (Groups 1-4) of capsules. Groups 1 and 4 contain the Wzy-dependent polymerization pathway and the gene clusters are in the order galF to gnd; Groups 2 and 3 contain the ABC-transporter-dependent pathway and the gene clusters consist of 3 regions, regions 1, 2 and 3. Little is known about the variations among the gene clusters. In this study, 9 serotypes of K-antigen gene clusters (K2ab, K11, K20, K24, K38, K84, K92, K96, and K102) were sequenced and correlated with their CPS chemical structures. On the basis of sequence data, a K-antigen-specific suspension array that detects 10 distinct CPSs, including the above 9 CPSs plus K30, was developed. This is the first report to catalog the genetic features of E. coli K-antigen variations and to develop a suspension array for their molecular typing. The method has a number of advantages over traditional bacteriophage and serum agglutination methods and lays the foundation for straightforward identification and detection of additional K-antigens in the future.

  18. Internal Cluster Validation on Earthquake Data in the Province of Bengkulu

    NASA Astrophysics Data System (ADS)

    Rini, D. S.; Novianti, P.; Fransiska, H.

    2018-04-01

    The K-means method is an algorithm for clustering n objects into k partitions based on their attributes, where k < n. A deficiency of the algorithm is that, before it is executed, k points are initialized randomly, so the resulting clustering can differ between runs. If the random initialization is poor, the clustering becomes less than optimal. Cluster validation is a technique for determining the optimum clustering without prior information about the data. There are two types of cluster validation: internal and external. This study aims to examine and apply several internal cluster validation indices, including the Calinski-Harabasz (CH) Index, Silhouette (S) Index, Davies-Bouldin (DB) Index, Dunn Index (D), and S-Dbw Index, to earthquake data in Bengkulu Province. Based on internal cluster validation, the CH index, S index, and S-Dbw index yield an optimum of k = 2, the DB Index k = 6, and the D Index k = 15. The optimum clustering based on the DB Index (k = 6) gives good results for clustering earthquakes in Bengkulu Province.
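
    The sensitivity to initialization described in this abstract is easy to reproduce on a toy example. The sketch below is a minimal pure-Python version of Lloyd's k-means, not the study's code; the function names kmeans and kmeans_restarts and the four-point dataset are assumptions chosen so that two starting configurations converge to different results.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm from the given starting centroids; returns (labels, wcss)."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wcss


def kmeans_restarts(points, k, n_restarts, seed=0):
    """Run k-means from several random starts and keep the lowest-cost result."""
    rng = random.Random(seed)
    runs = [kmeans(points, rng.sample(points, k)) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1])


# Corners of a 4 x 1 rectangle: the natural split is left versus right.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
print(kmeans(corners, [(0.0, 0.0), (0.0, 1.0)]))  # ([0, 1, 0, 1], 16.0)
print(kmeans(corners, [(0.0, 0.0), (4.0, 1.0)]))  # ([0, 0, 1, 1], 1.0)
```

    The first start converges to a top/bottom split whose within-cluster sum of squares is sixteen times that of the second. Repeating the run from several random starts, as kmeans_restarts does, is a common mitigation, and an internal validation index can then be used to compare the candidate clusterings.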

  19. Envelopment filter and K-means for the detection of QRS waveforms in electrocardiogram.

    PubMed

    Merino, Manuel; Gómez, Isabel María; Molina, Alberto J

    2015-06-01

    The electrocardiogram (ECG) is a well-established technique for determining the electrical activity of the heart and studying its diseases. One of the most common pieces of information that can be read from the ECG is the heart rate (HR) through the detection of its most prominent feature: the QRS complex. This paper describes an offline version and a real-time implementation of a new algorithm to determine QRS localization in the ECG signal based on its envelopment and K-means clustering algorithm. The envelopment is used to obtain a signal with only QRS complexes, deleting P, T, and U waves and baseline wander. Two moving average filters are applied to smooth data. The K-means algorithm classifies data into QRS and non-QRS. The technique is validated using 22 h of ECG data from five Physionet databases. These databases were arbitrarily selected to analyze different morphologies of QRS complexes: three stored data with cardiac pathologies, and two had data with normal heartbeats. The algorithm has a low computational load, with no decision thresholds. Furthermore, it does not require any additional parameter. Sensitivity, positive prediction and accuracy from results are over 99.7%. Copyright © 2015 IPEM. Published by Elsevier Ltd. All rights reserved.
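
    The threshold-free classification step can be illustrated with a heavily simplified sketch. This is not the published algorithm: the absolute value is used as a crude stand-in for the envelope step, a single moving average replaces the two filters, and the names moving_average and two_means_1d as well as the synthetic signal are assumptions.

```python
def moving_average(signal, window):
    """Centered moving average; the window shrinks at the edges of the signal."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def two_means_1d(values, max_iter=100):
    """1-D K-means with K = 2, started from the minimum and maximum values."""
    low, high = min(values), max(values)
    labels = []
    for _ in range(max_iter):
        # Label 1 marks samples closer to the high-amplitude centroid.
        labels = [1 if abs(v - high) < abs(v - low) else 0 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        new_low = sum(group0) / len(group0)
        new_high = sum(group1) / len(group1) if group1 else high
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    return labels


ecg = [0.1] * 20
ecg[5] = ecg[15] = 1.0            # two synthetic "beats"
envelope = [abs(x) for x in ecg]  # crude stand-in for the envelope step
smoothed = moving_average(envelope, window=3)
labels = two_means_1d(smoothed)
print([i for i, lab in enumerate(labels) if lab == 1])  # [4, 5, 6, 14, 15, 16]
```

    Because the two centroids adapt to the amplitudes present in the data, no fixed decision threshold has to be chosen, which matches the property highlighted in the abstract.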

  20. Semi-supervised clustering methods.

    PubMed

    Bair, Eric

    2013-01-01

    Cluster analysis methods seek to partition a data set into homogeneous subgroups. It is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as "semi-supervised clustering" methods) that can be applied in these situations. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided.
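
    One k-means modification of the kind this review covers handles the case where the cluster labels of some observations are known. The sketch below follows the idea usually called constrained (or seeded) k-means: labelled points initialize the centroids and never change cluster. It is a minimal illustration, not code from the review; the function name constrained_kmeans and the toy data are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(members):
    return tuple(sum(xs) / len(members) for xs in zip(*members))


def constrained_kmeans(points, known_labels, k, max_iter=100):
    """known_labels maps point index -> cluster id; every cluster needs a seed."""
    # Initialize each centroid from the labelled seeds of its cluster.
    centroids = [
        mean([points[i] for i, lab in known_labels.items() if lab == c])
        for c in range(k)
    ]
    labels = None
    for _ in range(max_iter):
        # Labelled points keep their label; the rest go to the nearest centroid.
        new_labels = [
            known_labels[i] if i in known_labels
            else min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for i, p in enumerate(points)
        ]
        if new_labels == labels:
            break
        labels = new_labels
        centroids = [
            mean([p for p, lab in zip(points, labels) if lab == c]) for c in range(k)
        ]
    return labels


corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
print(constrained_kmeans(corners, {0: 0, 2: 1}, k=2))  # [0, 0, 1, 1]
print(constrained_kmeans(corners, {0: 0, 1: 1}, k=2))  # [0, 1, 0, 1]
```

    The two calls use the same data but different known labels, and the partial supervision steers the algorithm to different partitions.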

  1. Small traveling clusters in attractive and repulsive Hamiltonian mean-field models.

    PubMed

    Barré, Julien; Yamaguchi, Yoshiyuki Y

    2009-03-01

    Long-lasting small traveling clusters are studied in the Hamiltonian mean-field model by comparing between attractive and repulsive interactions. Nonlinear Landau damping theory predicts that a Gaussian momentum distribution on a spatially homogeneous background permits the existence of traveling clusters in the repulsive case, as in plasma systems, but not in the attractive case. Nevertheless, extending the analysis to a two-parameter family of momentum distributions of Fermi-Dirac type, we theoretically predict the existence of traveling clusters in the attractive case; these findings are confirmed by direct N -body numerical simulations. The parameter region with the traveling clusters is much reduced in the attractive case with respect to the repulsive case.

  2. Mining the National Career Assessment Examination Result Using Clustering Algorithm

    NASA Astrophysics Data System (ADS)

    Pagudpud, M. V.; Palaoag, T. T.; Padirayon, L. M.

    2018-03-01

    Education is an essential process today which elicits authorities to discover and establish innovative strategies for educational improvement. This study applied data mining using clustering technique for knowledge extraction from the National Career Assessment Examination (NCAE) result in the Division of Quirino. The NCAE is an examination given to all grade 9 students in the Philippines to assess their aptitudes in the different domains. Clustering the students is helpful in identifying students’ learning considerations. With the use of the RapidMiner tool, clustering algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means, k-medoid, expectation maximization clustering, and support vector clustering algorithms were analyzed. The silhouette indexes of the said clustering algorithms were compared, and the result showed that the k-means algorithm with k = 3 and silhouette index equal to 0.196 is the most appropriate clustering algorithm to group the students. Three groups were formed having 477 students in the determined group (cluster 0), 310 proficient students (cluster 1) and 396 developing students (cluster 2). The data mining technique used in this study is essential in extracting useful information from the NCAE result to better understand the abilities of students which in turn is a good basis for adopting teaching strategies.
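
    The silhouette index used to compare the algorithms can be computed directly from its definition. The sketch below is a plain-Python illustration, not the RapidMiner implementation; Euclidean distance, the function name silhouette_index, and the toy data are assumptions.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_index(points, labels):
    """Mean silhouette width s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(round(silhouette_index(points, [0, 0, 1, 1]), 4))  # 0.8997
print(round(silhouette_index(points, [0, 1, 0, 1]), 4))  # -0.45
```

    Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.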

  3. GDPC: Gravitation-based Density Peaks Clustering algorithm

    NASA Astrophysics Data System (ADS)

    Jiang, Jianhua; Hao, Dehao; Chen, Yujun; Parmar, Milan; Li, Keqin

    2018-07-01

    The Density Peaks Clustering algorithm, which we refer to as DPC, is a novel and efficient density-based clustering approach that was published in Science in 2014. The DPC has the advantage of discovering clusters with varying sizes and varying densities, but has limitations in detecting the number of clusters and identifying anomalies. We develop an enhanced algorithm with an alternative decision graph based on gravitation theory and nearby distance to identify centroids and anomalies accurately. We apply our method to some UCI and synthetic data sets. We report comparative clustering performances using F-Measure and 2-dimensional vision. We also compare our method to other clustering algorithms, such as K-Means, Affinity Propagation (AP) and DPC. We present F-Measure scores and clustering accuracies of our GDPC algorithm compared to K-Means, AP and DPC on different data sets. We show that the GDPC has superior performance in its capability of: (1) detecting the number of clusters clearly; (2) aggregating clusters with varying sizes and varying densities efficiently; (3) identifying anomalies accurately.

  4. Basic firefly algorithm for document clustering

    NASA Astrophysics Data System (ADS)

    Mohammed, Athraa Jasim; Yusof, Yuhanis; Husni, Husniza

    2015-12-01

    Document clustering plays a significant role in Information Retrieval (IR), where it organizes documents prior to the retrieval process. To date, various clustering algorithms have been proposed, including K-means and Particle Swarm Optimization (PSO). Even though these algorithms have been widely applied in many disciplines due to their simplicity, they tend to become trapped in a local minimum during the search for an optimal solution. To address this shortcoming, this paper proposes a Basic Firefly (Basic FA) algorithm to cluster text documents. The algorithm employs the Average Distance to Document Centroid (ADDC) as the objective function of the search. Experiments utilizing the proposed algorithm were conducted on the 20Newsgroups benchmark dataset. Results demonstrate that the Basic FA generates more robust and compact clusters than the ones produced by K-means and PSO.
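
    The abstract does not spell out the ADDC formula. The sketch below assumes the definition commonly used in swarm-based document clustering: for each cluster, take the mean distance from its documents to the cluster centroid, then average those values over the clusters. Euclidean distance on small numeric vectors, the function name addc, and the toy data are assumptions as well.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def addc(clusters):
    """Average over clusters of the mean document-to-centroid distance."""
    per_cluster = []
    for docs in clusters:
        centroid = [sum(xs) / len(docs) for xs in zip(*docs)]
        per_cluster.append(sum(euclidean(d, centroid) for d in docs) / len(docs))
    return sum(per_cluster) / len(per_cluster)


# Each inner list is one cluster of (hypothetical) document vectors.
compact = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 9.0)]]
mixed = [[(0.0, 0.0), (5.0, 9.0)], [(2.0, 0.0), (5.0, 5.0)]]
print(addc(compact))  # 1.5
print(addc(compact) < addc(mixed))  # True
```

    A search procedure that uses ADDC as its objective prefers the first partition, because a lower value means documents lie closer to their cluster centroids.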

  5. COVARIATE-ADAPTIVE CLUSTERING OF EXPOSURES FOR AIR POLLUTION EPIDEMIOLOGY COHORTS*

    PubMed Central

    Keller, Joshua P.; Drton, Mathias; Larson, Timothy; Kaufman, Joel D.; Sandler, Dale P.; Szpiro, Adam A.

    2017-01-01

    Cohort studies in air pollution epidemiology aim to establish associations between health outcomes and air pollution exposures. Statistical analysis of such associations is complicated by the multivariate nature of the pollutant exposure data as well as the spatial misalignment that arises from the fact that exposure data are collected at regulatory monitoring network locations distinct from cohort locations. We present a novel clustering approach for addressing this challenge. Specifically, we present a method that uses geographic covariate information to cluster multi-pollutant observations and predict cluster membership at cohort locations. Our predictive k-means procedure identifies centers using a mixture model and is followed by multi-class spatial prediction. In simulations, we demonstrate that predictive k-means can reduce misclassification error by over 50% compared to ordinary k-means, with minimal loss in cluster representativeness. The improved prediction accuracy results in large gains of 30% or more in power for detecting effect modification by cluster in a simulated health analysis. In an analysis of the NIEHS Sister Study cohort using predictive k-means, we find that the association between systolic blood pressure (SBP) and long-term fine particulate matter (PM2.5) exposure varies significantly between different clusters of PM2.5 component profiles. Our cluster-based analysis shows that for subjects assigned to a cluster located in the Midwestern U.S., a 10 μg/m3 difference in exposure is associated with 4.37 mmHg (95% CI, 2.38, 6.35) higher SBP. PMID:28572869

  6. Mechanisms behind overshoots in mean cluster size profiles in aggregation-breakup processes.

    PubMed

    Sadegh-Vaziri, Ramiar; Ludwig, Kristin; Sundmacher, Kai; Babler, Matthaus U

    2018-05-26

    Aggregation and breakup of small particles in stirred suspensions often show an overshoot in the time evolution of the mean cluster size: starting from a suspension of primary particles, the mean cluster size first increases before going through a maximum beyond which a slow relaxation sets in. Such behavior was observed in various systems, including polymeric latices, inorganic colloids, asphaltenes, proteins, and, as shown by independent experiments in this work, in the flocculation of microalgae. This work aims at investigating possible mechanisms to explain this phenomenon using detailed population balance modeling that incorporates refined rate models for aggregation and breakup of small particles in turbulence. Four mechanisms are considered: (1) restructuring, (2) decay of aggregate strength, (3) deposition of large clusters, and (4) primary particle aggregation where only aggregation events between clusters and primary particles are permitted. We show that all four mechanisms can lead to an overshoot in the mean size profile, while in contrast, aggregation and breakup alone lead to a monotonic, "S"-shaped size evolution profile. In order to distinguish between the different mechanisms, simple protocols based on variations of the shear rate during the aggregation-breakup process are proposed. Copyright © 2018 Elsevier Inc. All rights reserved.

  7. [Cluster analysis in biomedical researches].

    PubMed

    Akopov, A S; Moskovtsev, A A; Dolenko, S A; Savina, G D

    2013-01-01

    Cluster analysis is one of the most popular methods for the analysis of multi-parameter data. Cluster analysis reveals the internal structure of the data and groups separate observations by their degree of similarity. The review defines the basic concepts of cluster analysis and discusses the most popular clustering algorithms: k-means, hierarchical algorithms, and Kohonen network algorithms. Examples of the use of these algorithms in biomedical research are given.

  8. Cluster mislocation in kinematic Sunyaev-Zel'dovich (kSZ) effect extraction

    NASA Astrophysics Data System (ADS)

    Calafut, Victoria Rose; Bean, Rachel; Yu, Byeonghee

    2018-01-01

    We investigate the impact of a variety of analysis assumptions that influence cluster identification and location on the kSZ pairwise momentum signal and covariance estimation. Photometric and spectroscopic galaxy tracers from SDSS, WISE, and DECaLs, spanning redshifts 0.05kSZ statistical error budget obtained with a jackknife estimator. We also show that jackknife covariance estimates are significantly more conservative than those obtained by CMB rotation methods. Using redMaPPer data, we concurrently compare uncertainties for photometric redshift errors and miscentering and find them comparable for separations ≲ 50 Mpc where the kSZ signal is largest. For the next generation of CMB and LSS surveys the statistical and photometric errors will shrink markedly. Our results demonstrate that uncertainties introduced through using galaxy proxies for cluster locations will need to be fully incorporated, and actively mitigated, for the kSZ to reach its full potential as a cosmological constraining tool for dark energy and neutrino physics.

  9. A comparison of heuristic and model-based clustering methods for dietary pattern analysis.

    PubMed

    Greve, Benjamin; Pigeot, Iris; Huybrechts, Inge; Pala, Valeria; Börnhorst, Claudia

    2016-02-01

    Cluster analysis is widely applied to identify dietary patterns. A new method based on Gaussian mixture models (GMM) seems to be more flexible compared with the commonly applied k-means and Ward's method. In the present paper, these clustering approaches are compared to find the most appropriate one for clustering dietary data. The clustering methods were applied to simulated data sets with different cluster structures to compare their performance knowing the true cluster membership of observations. Furthermore, the three methods were applied to FFQ data assessed in 1791 children participating in the IDEFICS (Identification and Prevention of Dietary- and Lifestyle-Induced Health Effects in Children and Infants) Study to explore their performance in practice. The GMM outperformed the other methods in the simulation study in 72 % up to 100 % of cases, depending on the simulated cluster structure. Comparing the computationally less complex k-means and Ward's methods, the performance of k-means was better in 64-100 % of cases. Applied to real data, all methods identified three similar dietary patterns which may be roughly characterized as a 'non-processed' cluster with a high consumption of fruits, vegetables and wholemeal bread, a 'balanced' cluster with only slight preferences of single foods and a 'junk food' cluster. The simulation study suggests that clustering via GMM should be preferred due to its higher flexibility regarding cluster volume, shape and orientation. The k-means seems to be a good alternative, being easier to use while giving similar results when applied to real data.

  10. Application of K-Mean Algorithm for Medicine Data Clustering in Puskesmas Rumbai

    NASA Astrophysics Data System (ADS)

    Taslim; Fajrizal; Toresa, Dafwen

    2017-12-01

    Through the government’s health insurance program, efforts are made to ensure the health of the community through Puskesmas, or community clinics. One of the most important components of health care is the availability of medicines, which should be well managed so that the medicines needed by the community are always available in sufficient quantities. Clustering, a data mining technique, can be used to analyze past medicine use at a Puskesmas as one consideration when the Puskesmas submits its medicine requests for the coming period. The results of this study are expected to classify the level of medicine use in the pharmacy of the Puskesmas in Rumbai Bukit, Pekanbaru.

  11. Self-organization and clustering algorithms

    NASA Technical Reports Server (NTRS)

    Bezdek, James C.

    1991-01-01

    Kohonen's feature maps approach to clustering is often likened to the k or c-means clustering algorithms. Here, the author identifies some similarities and differences between the hard and fuzzy c-Means (HCM/FCM) or ISODATA algorithms and Kohonen's self-organizing approach. The author concludes that some differences are significant, but at the same time there may be some important unknown relationships between the two methodologies. Several avenues of research are proposed.

  12. Crystal structure of the new A2SnTa6X18 (A = K, Rb, Cs; X = Cl, Br) cluster compounds

    NASA Astrophysics Data System (ADS)

    Lemoine, P.; Wilmet, M.; Malaman, B.; Paofai, S.; Dumait, N.; Cordier, S.

    2018-01-01

    The crystal structure of the new cluster compounds A2SnTa6X18 (with A = K, Rb, Cs, and X = Cl, Br) was determined by using single-crystal and powder X-ray diffraction, and 119Sn Mössbauer spectroscopy. These compounds crystallize in the Cs2EuNb6Br18-type structure of space group R-3. This type of structure is built up from discrete edge-bridged [M6Xi12Xa6]4- cluster units arranged according to a pseudo face-centered cubic stacking, where the octahedral and tetrahedral vacancies are fully occupied by divalent tin cations and monovalent alkaline cations, respectively. The influence of the tin cations on the halogen matrix and the electronic effects on the cluster units in the Cs2EuNb6Br18-type structure are discussed by comparison with isotypic compounds. From those analyses, the ionic radius of Sn2+ in coordination number VI is estimated to be 1.14(1) Å. Finally, K2SnTa6Br18 might be considered a new example of a compound containing a quite bare stannous ion (5s2 configuration).

  13. Efficient computation of k-Nearest Neighbour Graphs for large high-dimensional data sets on GPU clusters.

    PubMed

    Dashti, Ali; Komarov, Ivan; D'Souza, Roshan M

    2013-01-01

    This paper presents an implementation of the brute-force exact k-Nearest Neighbor Graph (k-NNG) construction for ultra-large high-dimensional data clouds. The proposed method uses Graphics Processing Units (GPUs) and is scalable with multiple levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU). The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing as compared with an optimized method running on a cluster of CPUs and bring a hitherto impossible k-NNG generation for a dataset of twenty million images with 15 k dimensionality into the realm of practical possibility.
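
    To make the term "brute-force exact k-NNG" concrete, the following is a minimal single-threaded sketch in plain Python. It is not the authors' GPU implementation; the function name `knn_graph` and the toy points are assumptions chosen only for illustration. It compares every point with every other point, which is the O(n^2) workload the record describes distributing across GPUs.

```python
import math


def knn_graph(points, k):
    """Brute-force exact k-nearest-neighbour graph (illustrative sketch).

    Returns a list whose entry i holds the indices of the k points
    closest to points[i] under Euclidean distance, nearest first.
    """
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    graph = []
    for i in range(n):
        # Distance from point i to every other point: O(n) per point,
        # O(n^2) distance evaluations overall.
        dists = [(math.dist(points[i], points[j]), j) for j in range(n) if j != i]
        dists.sort()
        graph.append([j for _, j in dists[:k]])
    return graph


# Hypothetical toy data: four points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(points, 2)
print(graph)
```

    Each row of the distance computation is independent of the others, which is why the brute-force approach parallelises well even though its total cost grows quadratically.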

  14. Fractal dimension to classify the heart sound recordings with KNN and fuzzy c-mean clustering methods

    NASA Astrophysics Data System (ADS)

    Juniati, D.; Khotimah, C.; Wardani, D. E. K.; Budayasa, K.

    2018-01-01

    The heart abnormalities can be detected from heart sound. A heart sound can be heard directly with a stethoscope or indirectly by a phonocardiograph, a machine of the heart sound recording. This paper presents the implementation of fractal dimension theory to make a classification of phonocardiograms into a normal heart sound, a murmur, or an extrasystole. The main algorithm used to calculate the fractal dimension was Higuchi’s Algorithm. There were two steps to make a classification of phonocardiograms, feature extraction, and classification. For feature extraction, we used Discrete Wavelet Transform to decompose the signal of heart sound into several sub-bands depending on the selected level. After the decomposition process, the signal was processed using Fast Fourier Transform (FFT) to determine the spectral frequency. The fractal dimension of the FFT output was calculated using Higuchi Algorithm. The classification of fractal dimension of all phonocardiograms was done with KNN and Fuzzy c-mean clustering methods. Based on the research results, the best accuracy obtained was 86.17%, the feature extraction by DWT decomposition level 3 with the value of kmax 50, using 5-fold cross validation and the number of neighbors was 5 at K-NN algorithm. Meanwhile, for fuzzy c-mean clustering, the accuracy was 78.56%.
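
    The record names Higuchi's algorithm and its parameter kmax but does not show the computation. The sketch below follows the standard textbook definition of the Higuchi fractal dimension in plain Python; the function name `higuchi_fd` and the test signal are assumptions, and the wavelet and FFT preprocessing described in the abstract is not reproduced here.

```python
import math


def higuchi_fd(x, kmax):
    """Estimate the Higuchi fractal dimension of a 1-D sequence (sketch)."""
    n = len(x)
    log_inv_k, log_len = [], []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):  # start offsets 0 .. k-1
            steps = (n - 1 - m) // k  # number of increments of size k from offset m
            if steps < 1:
                continue
            total = sum(abs(x[m + i * k] - x[m + (i - 1) * k])
                        for i in range(1, steps + 1))
            # Normalised curve length for this offset and delay k.
            lengths.append(total * (n - 1) / (steps * k) / k)
        if lengths:
            # Note: a constant signal gives a zero length and math.log fails.
            log_inv_k.append(math.log(1.0 / k))
            log_len.append(math.log(sum(lengths) / len(lengths)))
    if len(log_inv_k) < 2:
        raise ValueError("need at least two usable values of k")
    # Least-squares slope of log L(k) against log(1/k).
    mean_x = sum(log_inv_k) / len(log_inv_k)
    mean_y = sum(log_len) / len(log_len)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_inv_k, log_len))
    sxx = sum((a - mean_x) ** 2 for a in log_inv_k)
    return sxy / sxx


# A straight line is a smooth curve, so its estimate should be close to 1.
line_fd = higuchi_fd([float(i) for i in range(100)], 8)
print(line_fd)
```

    In a pipeline like the one described, one such value per recording (or per sub-band) would then serve as a feature for the KNN or fuzzy c-means step.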

  15. Semi-supervised clustering methods

    PubMed Central

    Bair, Eric

    2013-01-01

    Cluster analysis methods seek to partition a data set into homogeneous subgroups. It is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as “semi-supervised clustering” methods) that can be applied in these situations. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided. PMID:24729830

  16. FLO1K, global maps of mean, maximum and minimum annual streamflow at 1 km resolution from 1960 through 2015

    NASA Astrophysics Data System (ADS)

    Barbarossa, Valerio; Huijbregts, Mark A. J.; Beusen, Arthur H. W.; Beck, Hylke E.; King, Henry; Schipper, Aafke M.

    2018-03-01

    Streamflow data is highly relevant for a variety of socio-economic as well as ecological analyses or applications, but a high-resolution global streamflow dataset is yet lacking. We created FLO1K, a consistent streamflow dataset at a resolution of 30 arc seconds (~1 km) and global coverage. FLO1K comprises mean, maximum and minimum annual flow for each year in the period 1960-2015, provided as spatially continuous gridded layers. We mapped streamflow by means of artificial neural networks (ANNs) regression. An ensemble of ANNs were fitted on monthly streamflow observations from 6600 monitoring stations worldwide, i.e., minimum and maximum annual flows represent the lowest and highest mean monthly flows for a given year. As covariates we used the upstream-catchment physiography (area, surface slope, elevation) and year-specific climatic variables (precipitation, temperature, potential evapotranspiration, aridity index and seasonality indices). Confronting the maps with independent data indicated good agreement (R2 values up to 91%). FLO1K delivers essential data for freshwater ecology and water resources analyses at a global scale and yet high spatial resolution.

  17. Clustering Millions of Faces by Identity.

    PubMed

    Otto, Charles; Wang, Dayong; Jain, Anil K

    2018-02-01

    Given a large collection of unlabeled face images, we address the problem of clustering faces into an unknown number of identities. This problem is of interest in social media, law enforcement, and other applications, where the number of faces can be of the order of hundreds of million, while the number of identities (clusters) can range from a few thousand to millions. To address the challenges of run-time complexity and cluster quality, we present an approximate Rank-Order clustering algorithm that performs better than popular clustering algorithms (k-Means and Spectral). Our experiments include clustering up to 123 million face images into over 10 million clusters. Clustering results are analyzed in terms of external (known face labels) and internal (unknown face labels) quality measures, and run-time. Our algorithm achieves an F-measure of 0.87 on the LFW benchmark (13 K faces of 5,749 individuals), which drops to 0.27 on the largest dataset considered (13 K faces in LFW + 123M distractor images). Additionally, we show that frames in the YouTube benchmark can be clustered with an F-measure of 0.71. An internal per-cluster quality measure is developed to rank individual clusters for manual exploration of high quality clusters that are compact and isolated.

  18. K2: A NEW METHOD FOR THE DETECTION OF GALAXY CLUSTERS BASED ON CANADA-FRANCE-HAWAII TELESCOPE LEGACY SURVEY MULTICOLOR IMAGES

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Thanjavur, Karun; Willis, Jon; Crampton, David, E-mail: karun@uvic.c

    2009-11-20

    We have developed a new method, K2, optimized for the detection of galaxy clusters in multicolor images. Based on the Red Sequence approach, K2 detects clusters using simultaneous enhancements in both colors and position. The detection significance is robustly determined through extensive Monte Carlo simulations and through comparison with available cluster catalogs based on two different optical methods, and also on X-ray data. K2 also provides quantitative estimates of the candidate clusters' richness and photometric redshifts. Initially, K2 was applied to the two color (gri) 161 deg^2 images of the Canada-France-Hawaii Telescope Legacy Survey Wide (CFHTLS-W) data. Our simulations show that the false detection rate for these data, at our selected threshold, is only ~1%, and that the cluster catalogs are ~80% complete up to a redshift of z = 0.6 for Fornax-like and richer clusters and to z ~ 0.3 for poorer clusters. Based on the g-, r-, and i-band photometric catalogs of the Terapix T05 release, 35 clusters/deg^2 are detected, with 1-2 Fornax-like or richer clusters every 2 deg^2. Catalogs containing data for 6144 galaxy clusters have been prepared, of which 239 are rich clusters. These clusters, especially the latter, are being searched for gravitational lenses, one of our chief motivations for cluster detection in CFHTLS. The K2 method can be easily extended to use additional color information and thus improve overall cluster detection to higher redshifts. The complete set of K2 cluster catalogs, along with the supplementary catalogs for the member galaxies, are available on request from the authors.

  19. Image Segmentation Method Using Fuzzy C Mean Clustering Based on Multi-Objective Optimization

    NASA Astrophysics Data System (ADS)

    Chen, Jinlin; Yang, Chunzhi; Xu, Guangkui; Ning, Li

    2018-04-01

    Image segmentation is not only one of the hottest topics in digital image processing, but also an important part of computer vision applications. Among image segmentation algorithms, fuzzy C-means (FCM) clustering is effective and concise. However, the drawback of FCM is that it is sensitive to image noise. To solve this problem, this paper designs a novel fuzzy C-means clustering algorithm based on multi-objective optimization. We add a parameter λ to the fuzzy distance measurement formula to improve the multi-objective optimization. The parameter λ adjusts the weights of the pixel local information. In the algorithm, the local correlation of neighboring pixels is added to the improved multi-objective mathematical model to optimize the cluster centers. Two different experimental results show that the novel fuzzy C-means approach performs efficiently, in both segmentation quality and computational time, when segmenting images corrupted by different types of noise.

  20. A ground truth based comparative study on clustering of gene expression data.

    PubMed

    Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue

    2008-05-01

    Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.

  1. Transcriptional organization of the DNA region controlling expression of the K99 gene cluster.

    PubMed

    Roosendaal, B; Damoiseaux, J; Jordi, W; de Graaf, F K

    1989-01-01

    The transcriptional organization of the K99 gene cluster was investigated in two ways. First, the DNA region, containing the transcriptional signals was analyzed using a transcription vector system with Escherichia coli galactokinase (GalK) as assayable marker and second, an in vitro transcription system was employed. A detailed analysis of the transcription signals revealed that a strong promoter PA and a moderate promoter PB are located upstream of fanA and fanB, respectively. No promoter activity was detected in the intercistronic region between fanB and fanC. Factor-dependent terminators of transcription were detected and are probably located in the intercistronic region between fanA and fanB (T1), and between fanB and fanC (T2). A third terminator (T3) was observed between fanC and fanD and has an efficiency of 90%. Analysis of the regulatory region in an in vitro transcription system confirmed the location of the respective transcription signals. A model for the transcriptional organization of the K99 cluster is presented. Indications were obtained that the trans-acting regulatory polypeptides FanA and FanB both function as anti-terminators. A model for the regulation of expression of the K99 gene cluster is postulated.

  2. Application of clustering for customer segmentation in private banking

    NASA Astrophysics Data System (ADS)

    Yang, Xuan; Chen, Jin; Hao, Pengpeng; Wang, Yanbo J.

    2015-07-01

    With fierce competition in the banking industry, more and more banks have realised that accurate customer segmentation is of fundamental importance, especially for identifying high-value customers. To address this problem, we collected real data about private banking customers of a commercial bank in China and conducted an empirical analysis by applying the K-means clustering technique. To determine the value of K, we propose a mechanism that meets both academic requirements and practical needs. Through K-means clustering, we successfully segmented the customers into three categories, and the features of each group are illustrated in detail.
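
    As an illustration of the K-means step described above, here is a minimal sketch of Lloyd's algorithm in plain Python with K = 3. The two customer features, their values, and the choice of initial centroids are hypothetical; the record does not describe the bank's data or the mechanism used to choose K.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm from given initial centroids (illustrative sketch)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers described by two already-scaled features.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
             (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
             (15.0, 1.0), (16.0, 2.0), (15.5, 1.5)]
labels, centroids = kmeans(customers, [customers[0], customers[3], customers[6]])
print(labels)
```

    The result depends on the starting centroids, so in practice the algorithm is usually run from several initialisations and the partition with the lowest within-cluster sum of squares is kept.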

  3. Diary Data Subjected to Cluster Analysis of Intake/Output/Void Habits with Resulting Clusters Compared by Continence Status, Age, Race

    PubMed Central

    Miller, Janis M; Guo, Ying; Rodseth, Sarah Becker

    2011-01-01

    Background: Data that incorporate the full complexity of healthy beverage intake and voiding frequency do not exist; therefore, clinicians reviewing bladder habits or voiding diaries for continence care must rely on expert opinion recommendations. Objective: To use data-driven cluster analyses to reduce complex voiding diary variables into discrete patterns or data cluster profiles, descriptively name the clusters, and perform validity testing. Method: Participants were 352 community women who filled out a 3-day voiding diary. Six variables (void frequency during daytime hours, void frequency during nighttime hours, modal output, total output, total intake, and body mass index) were entered into cluster analyses. The clusters were analyzed for differences by continence status, age, race (Black women, n = 196; White women, n = 156), and, for those who were incontinent, by leakage episode severity. Results: Three clusters emerged, labeled descriptively as Conventional, Benchmark, and Superplus. The Conventional cluster (68% of the sample) demonstrated mean daily intake of 45 ± 13 ounces, mean daily output of 37 ± 15 ounces, mean daily voids 5 ± 2 times, mean modal daytime output 10 ± 0.5 ounces, and mean nighttime voids 1 ± 1 times. The Superplus cluster (7% of the sample) showed double or triple these values across the 5 variables, and the Benchmark cluster (25%) showed values consistent with current popular recommendations on intake and output (e.g., meeting or exceeding the 8 × 8 fluid intake rule of thumb). The clusters differed significantly (p < .05) by age, race, amount of irritating beverages consumed, and incontinence status. Discussion: Identification of three discrete clusters provides a potentially parsimonious but data-driven means of classifying individuals for additional epidemiological or clinical study. The clinical utility rests with the potential for intervening to move an individual from a high-risk to a low-risk cluster with regard to incontinence. PMID

  4. Application of cluster analysis to geochemical compositional data for identifying ore-related geochemical anomalies

    NASA Astrophysics Data System (ADS)

    Zhou, Shuguang; Zhou, Kefa; Wang, Jinlin; Yang, Genfang; Wang, Shanshan

    2017-12-01

    Cluster analysis is a well-known technique that is used to analyze various types of data. In this study, cluster analysis is applied to geochemical data that describe 1444 stream sediment samples collected in northwestern Xinjiang with a sample spacing of approximately 2 km. Three algorithms (the hierarchical, k-means, and fuzzy c-means algorithms) and six data transformation methods (the z-score standardization, ZST; the logarithmic transformation, LT; the additive log-ratio transformation, ALT; the centered log-ratio transformation, CLT; the isometric log-ratio transformation, ILT; and no transformation, NT) are compared in terms of their effects on the cluster analysis of the geochemical compositional data. The study shows that, on the one hand, the ZST does not affect the results of column- or variable-based (R-type) cluster analysis, whereas the other methods, including the LT, the ALT, and the CLT, have substantial effects on the results. On the other hand, the results of the row- or observation-based (Q-type) cluster analysis obtained from the geochemical data after applying NT and the ZST are relatively poor. However, we derive some improved results from the geochemical data after applying the CLT, the ILT, the LT, and the ALT. Moreover, the k-means and fuzzy c-means clustering algorithms are more reliable than the hierarchical algorithm when they are used to cluster the geochemical data. We apply cluster analysis to the geochemical data to explore for Au deposits within the study area, and we obtain a good correlation between the results retrieved by combining the CLT or the ILT with the k-means or fuzzy c-means algorithms and the potential zones of Au mineralization. Therefore, we suggest that the combination of the CLT or the ILT with the k-means or fuzzy c-means algorithms is an effective tool to identify potential zones of mineralization from geochemical data.
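
    The centered log-ratio transformation (CLT) recommended above maps each part of a composition to its logarithm minus the mean logarithm of all parts. The sketch below shows that definition in plain Python; the function name `clr` and the three-part sample are assumptions, and the clustering step itself is omitted.

```python
import math


def clr(composition):
    """Centered log-ratio transform of one strictly positive composition."""
    if any(v <= 0 for v in composition):
        raise ValueError("CLR requires strictly positive parts")
    logs = [math.log(v) for v in composition]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log is the same as dividing by the geometric mean.
    return [v - mean_log for v in logs]


# Hypothetical three-part composition (fractions summing to 1).
sample = [0.7, 0.2, 0.1]
transformed = clr(sample)
print(transformed)
```

    The transformed values sum to zero and do not change when the whole sample is rescaled, so the same sample expressed in percent or in ppm gives the same coordinates. This is one reason log-ratio transforms are commonly applied to compositional data before distance-based clustering such as k-means.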

  5. Inventory of File mean.sref.cluster1.f03.grib2

    Science.gov Websites

    Records: 40

    Number  Level/Layer       Parameter  Forecast Valid  Description
    001     2 m above ground  TMP        3 hour fcst     Temperature [K] wt ens-mean
    002     2 m above ground  TMP        3 hour fcst     Temperature [K] wt ens-mean
    003     2 m above ground  SPFH       3 hour fcst     Specific Humidity [kg/kg] wt ens-mean
    004     2 m above ground  RH         3 hour fcst

  6. Clustering "N" Objects into "K" Groups under Optimal Scaling of Variables.

    ERIC Educational Resources Information Center

    van Buuren, Stef; Heiser, Willem J.

    1989-01-01

    A method based on homogeneity analysis (multiple correspondence analysis or multiple scaling) is proposed to reduce many categorical variables to one variable with "k" categories. The method is a generalization of the sum of squared distances cluster analysis problem to the case of mixed measurement level variables. (SLD)

  7. Efficient similarity-based data clustering by optimal object to cluster reallocation.

    PubMed

    Rossignol, Mathias; Lagrange, Mathieu; Cont, Arshia

    2018-01-01

    We present an iterative flat hard clustering algorithm designed to operate on arbitrary similarity matrices, with the only constraint that these matrices be symmetrical. Although functionally very close to kernel k-means, our proposal performs a maximization of average intra-class similarity, instead of a squared distance minimization, in order to remain closer to the semantics of similarities. We show that this approach permits the relaxing of some conditions on usable affinity matrices like semi-positiveness, as well as opening possibilities for computational optimization required for large datasets. Systematic evaluation on a variety of data sets shows that compared with kernel k-means and the spectral clustering methods, the proposed approach gives equivalent or better performance, while running much faster. Most notably, it significantly reduces memory access, which makes it a good choice for large data collections. Material enabling the reproducibility of the results is made available online.

  8. The Exoplanet Migration Timescale from K2 Young Clusters

    NASA Astrophysics Data System (ADS)

    Rizzuto, Aaron

    A significant fraction of exoplanets orbit within 0.1 AU of their host star, with periods of <20 days. The discovery of these close-in planets has defied conventional models of planet formation and evolution based on our own solar system. It is widely accepted that these close-in planets did not form in such close proximity to their host stars (both rocky planets and hot Jupiters), but rather that dynamical or interactive processes caused them to migrate inwards from larger orbital semimajor axes and periods. There are multiple planet migration scenarios proposed in the literature, though it is unclear how much of the known planet population is attributable to each mechanism. Planetary migration models can be loosely divided into two categories: disk-driven migration and dynamical migration. Disk migration occurs over the lifetime of the protoplanetary disk (<5 Myr), while migration involving dynamical multi-body interactions operates on timescales of 100 Myr to 1 Gyr, a lengthier process than disk migration. The K2 mission has measured planet formation timescales and migration pathways by sampling groups of stars at key ages. Over the past 10 campaigns, multiple groups of young stars have been observed by K2, ranging from the 10 Myr Upper Scorpius OB association, through the <120 Myr Pleiades cluster, to the ~600-800 Myr Hyades and Praesepe clusters. Upcoming data from more recent campaigns include the 2 Myr Taurus region and significantly more Upper Scorpius members in C13 and 15. The frequency, orbital properties, and compositions of the exoplanet population in these samples of different age, with careful treatment of detection completeness, distinguish these scenarios of exoplanet migration as their host stars are settling onto the main sequence.
We have pioneered efforts to identify transiting exoplanets in the K2 data for young clusters and moving groups, and have developed a new, highly complete, detrending algorithm for rotational induced variability that is

  9. Spectroscopic Analyses of Neutron Capture Elements in Open Clusters

    NASA Astrophysics Data System (ADS)

    O'Connell, Julia E.

    The evolution of elements as a function of age throughout the Milky Way disk provides strong constraints for galaxy evolution models, and on star formation epochs. In an effort to provide such constraints, we conducted an investigation into r- and s-process elemental abundances for a large sample of open clusters as part of an optical follow-up to the SDSS-III/APOGEE-1 near infrared survey. To obtain data for neutron capture abundance analysis, we conducted a long-term observing campaign spanning three years (2013-2016) using the McDonald Observatory Otto Struve 2.1-meter telescope and Sandiford Cass Echelle Spectrograph (SES, R = λ/Δλ ~ 60,000). The SES provides a wavelength range of ~1400 Å, making it uniquely suited to investigate a number of other important chemical abundances as well as the neutron capture elements. For this study, we derive abundances for 18 elements covering four nucleosynthetic families (light, iron-peak, neutron capture and alpha-elements) for ~30 open clusters within 6 kpc of the Sun with ages ranging from ~80 Myr to ~10 Gyr. Both equivalent width (EW) measurements and spectral synthesis methods were employed to derive abundances for all elements. Initial estimates for model stellar atmospheres (effective temperature and surface gravity) were provided by the APOGEE data set, and then re-derived for our optical spectra by removing abundance trends as a function of excitation potential and reduced width log(EW/λ). With the exception of Ba II and Zr I, abundance analyses for all neutron capture elements were performed by generating synthetic spectra from the new stellar parameters. In order to remove molecular contamination, or blending from nearby atomic features, the synthetic spectra were modeled by a best-fit Gaussian to the observed data. Nd II shows a slight enhancement in all cluster stars, while other neutron capture elements follow solar abundance trends. Ba II shows a large cluster-to-cluster abundance spread

  10. The association between content of the elements S, Cl, K, Fe, Cu, Zn and Br in normal and cirrhotic liver tissue from Danes and Greenlandic Inuit examined by dual hierarchical clustering analysis.

    PubMed

    Laursen, Jens; Milman, Nils; Pind, Niels; Pedersen, Henrik; Mulvad, Gert

    2014-01-01

    Meta-analysis of previous studies evaluating associations between the content of the elements sulphur (S), chlorine (Cl), potassium (K), iron (Fe), copper (Cu), zinc (Zn) and bromine (Br) in normal and cirrhotic autopsy liver tissue samples. Normal liver samples were from 45 Greenlandic Inuit, median age 60 years, and from 71 Danes, median age 61 years. Cirrhotic liver samples were from 27 Danes, median age 71 years. Element content was measured using X-ray fluorescence spectrometry. Dual hierarchical clustering analysis was used to create a dual dendrogram, one clustering element contents according to calculated similarities and one clustering elements according to correlation coefficients between the element contents, both using Euclidean distance and the Ward procedure. One dendrogram separated subjects into 7 clusters showing no differences in ethnicity, gender or age. The analysis discriminated between elements in normal and cirrhotic livers. The other dendrogram grouped elements into four clusters: sulphur and chlorine; copper and bromine; potassium and zinc; iron. There were significant correlations between the elements in normal liver samples: S was associated with Cl, K, Br and Zn; Cl with S and Br; K with S, Br and Zn; Cu with Br; Zn with S and K; Br with S, Cl, K and Cu. Fe did not show significant associations with any other element. In contrast to simple statistical methods, which analyse the content of elements separately one by one, dual hierarchical clustering analysis incorporates all elements at the same time and can be used to examine the linkage and interplay between multiple elements in tissue samples. Copyright © 2013 Elsevier GmbH. All rights reserved.

  11. K2 and M4: A Unique Opportunity to Unlock the Mysteries of Globular Clusters

    NASA Astrophysics Data System (ADS)

    Kuehn, Charles A.; Stello, Dennis; Campbell, Simon; Drury, Jason; de Silva, Gayandhi; Maclean, Ben; Bedding, Timothy R.; Huber, Daniel

    2016-01-01

    One of the most exciting opportunities presented by K2 is the ability to study variable stars in globular clusters (GCs). The K2 observations allow us to perform ensemble asteroseismology of a population that is much older than that in the open clusters in the original Kepler field. This should help us answer long-standing questions concerning mass loss on the red giant branch and the spread in masses along the horizontal branch. By combining the asteroseismic data with chemical tagging of sub-populations from spectroscopy, we hope to better constrain stellar evolution models and potentially shed some light on the formation history of GCs. The very crowded nature of stars in GCs poses a challenge, however, due to Kepler's large pixels. M4, observed during K2's campaign 2, presents an excellent opportunity to study GCs with a combination of K2 and ground-based data. M4 is one of the two nearest GCs and thus should appear less crowded and brighter; in fact M4 is likely the only GC whose horizontal branch stars, other than RR Lyraes, will be accessible with K2. We discuss our method of obtaining photometry for the stars in M4 and present sample lightcurves for different classes of oscillating stars in the cluster. We also discuss efforts to use ground-based observations to increase the utility of the K2 dataset.

  12. Membership, binarity, and rotation of F-G-K stars in the open cluster Blanco 1

    NASA Astrophysics Data System (ADS)

    Mermilliod, J.-C.; Platais, I.; James, D. J.; Grenon, M.; Cargile, P. A.

    2008-07-01

    Context: The nearby open cluster Blanco 1 is of considerable astrophysical interest for formation and evolution studies of open clusters because it is the third highest Galactic latitude cluster known. It has been observed often, but so far no definitive and comprehensive membership determination is readily available. Aims: An observing programme was carried out to study the stellar population of Blanco 1, and especially the membership and binary frequency of the F5-K0 dwarfs. Methods: We obtained radial-velocities with the CORAVEL spectrograph in the field of Blanco 1 for a sample of 148 F-G-K candidate stars in the magnitude range 10 < V < 14. New proper motions and UBVI CCD photometric data from two extensive surveys were obtained independently and are used to establish reliable cluster membership assignments in concert with radial-velocity data. Results: The membership of 68 stars is confirmed on the basis of proper motion, radial velocity, and photometric criteria. Fourteen spectroscopic- and suspected binaries (2 SB2s, 9 SB1s, 3 SB?) have been discovered among the confirmed members. Thirteen additional stars are located above the main sequence or close to the binary ridge, with radial velocities and proper motions supporting their membership. These are probable binaries with wide separations. Nine binaries (7 SB1 and 2 SB2) were detected among the field stars. The spectroscopic binary frequency among members is 20% (14/68); however, the overall binary rate reaches 40% (27/68) if one includes the photometric binaries. The cluster mean heliocentric radial velocity is +5.53 ± 0.11 km s⁻¹ based on the most reliable 49 members. The V sin i distribution is similar to that of the Pleiades, confirming the age similarities between the two clusters. Conclusions: This study clearly demonstrates that, in spite of the cluster's high Galactic latitude, three membership criteria - radial velocity, proper motion, and photometry - are necessary for performing a reliable

  13. Orbit Clustering Based on Transfer Cost

    NASA Technical Reports Server (NTRS)

    Gustafson, Eric D.; Arrieta-Camacho, Juan J.; Petropoulos, Anastassios E.

    2013-01-01

    We propose using cluster analysis to perform quick screening for combinatorial global optimization problems. The key missing component currently preventing cluster analysis from use in this context is the lack of a useable metric function that defines the cost to transfer between two orbits. We study several proposed metrics and clustering algorithms, including k-means and the expectation maximization algorithm. We also show that proven heuristic methods such as the Q-law can be modified to work with cluster analysis.

  14. [Predicting Incidence of Hepatitis E in China using Fuzzy Time Series Based on Fuzzy C-Means Clustering Analysis].

    PubMed

    Luo, Yi; Zhang, Tao; Li, Xiao-song

    2016-05-01

    To explore the application of a fuzzy time series model based on fuzzy c-means clustering in forecasting the monthly incidence of Hepatitis E in mainland China. A predictive model (fuzzy time series method based on fuzzy c-means clustering) was developed using Hepatitis E incidence data in mainland China between January 2004 and July 2014. The incidence data from August 2014 to November 2014 were used to test the fitness of the predictive model. The forecasting results were compared with those from traditional fuzzy time series models. The fuzzy time series model based on fuzzy c-means clustering had a mean squared error (MSE) of 0.0011 for fitting and 6.9775 × 10⁻⁴ for forecasting, compared with 0.0017 and 0.0014 from the traditional forecasting model. The results indicate that the fuzzy time series model based on fuzzy c-means clustering performs better in forecasting the incidence of Hepatitis E.

  15. Noise-enhanced clustering and competitive learning algorithms.

    PubMed

    Osoba, Osonde; Kosko, Bart

    2013-01-01

    Noise can provably speed up convergence in many centroid-based clustering algorithms. This includes the popular k-means clustering algorithm. The clustering noise benefit follows from the general noise benefit for the expectation-maximization algorithm because many clustering algorithms are special cases of the expectation-maximization algorithm. Simulations show that noise also speeds up convergence in stochastic unsupervised competitive learning, supervised competitive learning, and differential competitive learning. Copyright © 2012 Elsevier Ltd. All rights reserved.

  16. Model-free data analysis for source separation based on Non-Negative Matrix Factorization and k-means clustering (NMFk)

    NASA Astrophysics Data System (ADS)

    Vesselinov, V. V.; Alexandrov, B.

    2014-12-01

    The identification of the physical sources causing spatial and temporal fluctuations of state variables such as river stage levels and aquifer hydraulic heads is challenging. The fluctuations can be caused by variations in natural and anthropogenic sources such as precipitation events, infiltration, groundwater pumping, barometric pressures, etc. The source identification and separation can be crucial for conceptualization of the hydrological conditions and characterization of system properties. If the original signals that cause the observed state-variable transients can be successfully "unmixed", decoupled physics models may then be applied to analyze the propagation of each signal independently. We propose a new model-free inverse analysis of transient data based on the Non-negative Matrix Factorization (NMF) method for Blind Source Separation (BSS) coupled with the k-means clustering algorithm, which we call NMFk. NMFk is capable of identifying a set of unique sources from a set of experimentally measured mixed signals, without any information about the sources, their transients, and the physical mechanisms and properties controlling the signal propagation through the system. A classical BSS conundrum is the so-called "cocktail-party" problem where several microphones are recording the sounds in a ballroom (music, conversations, noise, etc.). Each of the microphones is recording a mixture of the sounds. The goal of BSS is to "unmix" and reconstruct the original sounds from the microphone records. Similarly to the "cocktail-party" problem, our model-free analysis only requires information about state-variable transients at a number of observation points, m, where m > r, and r is the number of unknown unique sources causing the observed fluctuations. We apply the analysis on a dataset from the Los Alamos National Laboratory (LANL) site. We identify the sources as barometric pressure and water-supply pumping effects and estimate their impact. We also estimate the

  17. Exemplar-Based Clustering via Simulated Annealing

    ERIC Educational Resources Information Center

    Brusco, Michael J.; Kohn, Hans-Friedrich

    2009-01-01

    Several authors have touted the p-median model as a plausible alternative to within-cluster sums of squares (i.e., K-means) partitioning. Purported advantages of the p-median model include the provision of "exemplars" as cluster centers, robustness with respect to outliers, and the accommodation of a diverse range of similarity data. We developed…

  18. Data Clustering

    NASA Astrophysics Data System (ADS)

    Wagstaff, Kiri L.

    2012-03-01

    particular application involves considerations of the kind of data being analyzed, algorithm runtime efficiency, and how much prior knowledge is available about the problem domain, which can dictate the nature of clusters sought. Fundamentally, the clustering method and its representations of clusters carries with it a definition of what a cluster is, and it is important that this be aligned with the analysis goals for the problem at hand. In this chapter, I emphasize this point by identifying for each algorithm the cluster representation as a model, m_j , even for algorithms that are not typically thought of as creating a “model.” This chapter surveys a basic collection of clustering methods useful to any practitioner who is interested in applying clustering to a new data set. The algorithms include k-means (Section 25.2), EM (Section 25.3), agglomerative (Section 25.4), and spectral (Section 25.5) clustering, with side mentions of variants such as kernel k-means and divisive clustering. The chapter also discusses each algorithm’s strengths and limitations and provides pointers to additional in-depth reading for each subject. Section 25.6 discusses methods for incorporating domain knowledge into the clustering process. This chapter concludes with a brief survey of interesting applications of clustering methods to astronomy data (Section 25.7). The chapter begins with k-means because it is both generally accessible and so widely used that understanding it can be considered a necessary prerequisite for further work in the field. EM can be viewed as a more sophisticated version of k-means that uses a generative model for each cluster and probabilistic item assignments. Agglomerative clustering is the most basic form of hierarchical clustering and provides a basis for further exploration of algorithms in that vein. Spectral clustering permits a departure from feature-vector-based clustering and can operate on data sets instead represented as affinity, or similarity

  19. K2 eclipsing binaries in the benchmark open cluster Ruprecht 147

    NASA Astrophysics Data System (ADS)

    Torres, Guillermo

    spectroscopic investigations. It was observed photometrically by the K2 mission for 80 days in late 2015, enabling both asteroseismic and rotation period studies of dozens of members. What makes it truly unique, however, is that it has no less than five eclipsing binaries brighter than 13th magnitude that lend themselves to high-precision mass and radius determinations. No other open cluster has as many, let alone an old one. The brightest binary happens to be at the tip of the turnoff and provides an unusually strong constraint on age. A very special opportunity for study has thus presented itself. This is a proposal to analyze publicly available K2 photometry for the five bright eclipsing binaries discovered in Ruprecht 147, with the goal of fashioning the cluster into an important new benchmark for high-precision testing of stellar astrophysics. We will supplement the K2 light curves, processed with special detrending techniques, with ground-based spectroscopic observations yielding radial velocities for the stars. With these we will derive accurate masses, radii, and temperatures for the components of each binary using well-proven classical methodologies. The impact of the project is that the large number of binaries will allow for an unprecedented and extraordinarily strong test of stellar evolution theory over a range of masses, not available for any other open cluster. The ages we will infer are completely independent of, and of a different nature than other estimates in Ruprecht 147, coming from isochrone fitting in the color-magnitude diagram, asteroseismology of the brighter cluster members, or the use of gyrochronology relations. We will thus have a unique opportunity to cross-validate four different age-dating techniques in the same cluster. Additionally, our accurate eclipsing binary masses and radii will enable crucial tests of the asteroseismic scaling relations, which will improve their use for single stars.

  20. Hierarchical Aligned Cluster Analysis for Temporal Clustering of Human Motion.

    PubMed

    Zhou, Feng; De la Torre, Fernando; Hodgins, Jessica K

    2013-03-01

    Temporal segmentation of human motion into plausible motion primitives is central to understanding and building computational models of human motion. Several issues contribute to the challenge of discovering motion primitives: the exponential nature of all possible movement combinations, the variability in the temporal scale of human actions, and the complexity of representing articulated motion. We pose the problem of learning motion primitives as one of temporal clustering, and derive an unsupervised hierarchical bottom-up framework called hierarchical aligned cluster analysis (HACA). HACA finds a partition of a given multidimensional time series into m disjoint segments such that each segment belongs to one of k clusters. HACA combines kernel k-means with the generalized dynamic time alignment kernel to cluster time series data. Moreover, it provides a natural framework to find a low-dimensional embedding for time series. HACA is efficiently optimized with a coordinate descent strategy and dynamic programming. Experimental results on motion capture and video data demonstrate the effectiveness of HACA for segmenting complex motions and as a visualization tool. We also compare the performance of HACA to state-of-the-art algorithms for temporal clustering on data of a honey bee dance. The HACA code is available online.

  1. Clustering-based spot segmentation of cDNA microarray images.

    PubMed

    Uslan, Volkan; Bucak, Ihsan Ömür

    2010-01-01

    Microarrays are used because they provide useful information about thousands of gene expressions simultaneously. In this study, the segmentation step of microarray image processing has been implemented. Clustering-based methods, fuzzy c-means and k-means, have been applied in the segmentation step, which separates the spots from the background. The experiments show that fuzzy c-means segmented the spots of the microarray image more accurately than k-means.
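
    To make the spot/background separation concrete, here is a minimal sketch of fuzzy c-means on scalar pixel intensities with two clusters. The intensity values, the initial centers and the fuzzifier m = 2 are assumptions chosen for illustration; the study's actual images, parameters and preprocessing are not described in the abstract.

```python
def memberships_for(values, centers, m):
    """Fuzzy membership of every value in every cluster (each row sums to 1)."""
    power = 2.0 / (m - 1.0)
    rows = []
    for x in values:
        # A tiny floor avoids division by zero when x coincides with a center.
        d = [abs(x - c) or 1e-12 for c in centers]
        rows.append([
            1.0 / sum((d[i] / d[k]) ** power for k in range(len(centers)))
            for i in range(len(centers))
        ])
    return rows


def fuzzy_c_means_1d(values, init_centers, m=2.0, n_iter=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = list(init_centers)
    for _ in range(n_iter):
        u = memberships_for(values, centers, m)
        centers = [
            sum(row[i] ** m * x for row, x in zip(u, values))
            / sum(row[i] ** m for row in u)
            for i in range(len(centers))
        ]
    return centers, memberships_for(values, centers, m)


# Hypothetical pixel intensities: dark background with a few bright spot pixels.
pixels = [10, 12, 11, 13, 200, 210, 205, 12, 198]
centers, u = fuzzy_c_means_1d(pixels, init_centers=[min(pixels), max(pixels)])

# Harden the fuzzy partition: label 1 marks pixels assigned to the spot cluster.
spot_mask = [0 if row[0] >= row[1] else 1 for row in u]
print(centers, spot_mask)
```

    Unlike k-means, each pixel keeps a graded membership in both clusters until the final hardening step, which is where the two methods can differ on borderline pixels.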

  2. Two-Way Regularized Fuzzy Clustering of Multiple Correspondence Analysis.

    PubMed

    Kim, Sunmee; Choi, Ji Yeh; Hwang, Heungsun

    2017-01-01

    Multiple correspondence analysis (MCA) is a useful tool for investigating the interrelationships among dummy-coded categorical variables. MCA has been combined with clustering methods to examine whether there exist heterogeneous subclusters of a population, which exhibit cluster-level heterogeneity. These combined approaches aim to classify either observations only (one-way clustering of MCA) or both observations and variable categories (two-way clustering of MCA). The latter approach is favored because its solutions are easier to interpret by providing explicitly which subgroup of observations is associated with which subset of variable categories. Nonetheless, the two-way approach has been built on hard classification that assumes observations and/or variable categories to belong to only one cluster. To relax this assumption, we propose two-way fuzzy clustering of MCA. Specifically, we combine MCA with fuzzy k-means simultaneously to classify a subgroup of observations and a subset of variable categories into a common cluster, while allowing both observations and variable categories to belong partially to multiple clusters. Importantly, we adopt regularized fuzzy k-means, thereby enabling us to decide the degree of fuzziness in cluster memberships automatically. We evaluate the performance of the proposed approach through the analysis of simulated and real data, in comparison with existing two-way clustering approaches.

  3. Manifold Learning in MR spectroscopy using nonlinear dimensionality reduction and unsupervised clustering.

    PubMed

    Yang, Guang; Raschke, Felix; Barrick, Thomas R; Howe, Franklyn A

    2015-09-01

    To investigate whether nonlinear dimensionality reduction improves unsupervised classification of ¹H MRS brain tumor data compared with a linear method. In vivo single-voxel ¹H magnetic resonance spectroscopy (55 patients) and ¹H magnetic resonance spectroscopy imaging (MRSI) (29 patients) data were acquired from histopathologically diagnosed gliomas. Data reduction using Laplacian eigenmaps (LE) or independent component analysis (ICA) was followed by k-means clustering or agglomerative hierarchical clustering (AHC) for unsupervised learning to assess tumor grade and for tissue type segmentation of MRSI data. An accuracy of 93% in classification of glioma grade II and grade IV, with 100% accuracy in distinguishing tumor and normal spectra, was obtained by LE with unsupervised clustering, but not with the combination of k-means and ICA. With ¹H MRSI data, LE provided a more linear distribution of data for cluster analysis and better cluster stability than ICA. LE combined with k-means or AHC provided 91% accuracy for classifying tumor grade and 100% accuracy for identifying normal tissue voxels. Color-coded visualization of normal brain, tumor core, and infiltration regions was achieved with LE combined with AHC. The LE method is promising for unsupervised clustering to separate brain and tumor tissue with automated color-coding for visualization of ¹H MRSI data after cluster analysis. © 2014 Wiley Periodicals, Inc.

  4. A nonparametric clustering technique which estimates the number of clusters

    NASA Technical Reports Server (NTRS)

    Ramey, D. B.

    1983-01-01

    In applications of cluster analysis, one usually needs to determine the number of clusters, K, and the assignment of observations to each cluster. A clustering technique based on recursive application of a multivariate test of bimodality which automatically estimates both K and the cluster assignments is presented.

  5. fast_protein_cluster: parallel and optimized clustering of large-scale protein modeling data.

    PubMed

    Hung, Ling-Hong; Samudrala, Ram

    2014-06-15

    fast_protein_cluster is a fast, parallel and memory efficient package used to cluster 60 000 sets of protein models (with up to 550 000 models per set) generated by the Nutritious Rice for the World project. fast_protein_cluster is an optimized and extensible toolkit that supports Root Mean Square Deviation after optimal superposition (RMSD) and Template Modeling score (TM-score) as metrics. RMSD calculations using a laptop CPU are 60× faster than qcprot and 3× faster than current graphics processing unit (GPU) implementations. New GPU code further increases the speed of RMSD and TM-score calculations. fast_protein_cluster provides novel k-means and hierarchical clustering methods that are up to 250× and 2000× faster, respectively, than Clusco, and identify significantly more accurate models than Spicker and Clusco. fast_protein_cluster is written in C++ using OpenMP for multi-threading support. Custom streaming Single Instruction Multiple Data (SIMD) extensions and advanced vector extension intrinsics code accelerate CPU calculations, and OpenCL kernels support AMD and Nvidia GPUs. fast_protein_cluster is available under the M.I.T. license. (http://software.compbio.washington.edu/fast_protein_cluster) © The Author 2014. Published by Oxford University Press.

  6. Implementation of hierarchical clustering using k-mer sparse matrix to analyze MERS-CoV genetic relationship

    NASA Astrophysics Data System (ADS)

    Bustamam, A.; Ulul, E. D.; Hura, H. F. A.; Siswantining, T.

    2017-07-01

    Hierarchical clustering is one of the effective methods for creating a phylogenetic tree based on the distance matrix between DNA (deoxyribonucleic acid) sequences. One of the well-known methods for calculating the distance matrix is the k-mer method. Generally, k-mer is more efficient than some other distance matrix calculation techniques. The k-mer method starts by creating the k-mer sparse matrix, followed by creating k-mer singular value vectors. The last step is computing the distances among the vectors. In this paper, we analyze the sequences of MERS-CoV (Middle East Respiratory Syndrome - Coronavirus) DNA by implementing hierarchical clustering using the k-mer sparse matrix in order to perform the phylogenetic analysis. Our results show that the ancestor of our MERS-CoV comes from Egypt. Moreover, we found that a MERS-CoV infection that occurs in one country does not necessarily come from the same country of origin. This suggests that the process of MERS-CoV mutation might not be influenced only by geographical factors.
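
    The first and last steps of the k-mer pipeline described above (sparse k-mer counts, then distances between vectors) can be sketched as follows. The sequences, the choice k = 3 and the function names are hypothetical; the intermediate singular-value step mentioned in the abstract is deliberately omitted, so distances here are computed directly on the count vectors.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Sparse k-mer count vector: only k-mers that actually occur are stored."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Euclidean distance between the k-mer count vectors of two sequences."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for k-mers that are absent from one of the sequences.
    return sqrt(sum((ca[key] - cb[key]) ** 2 for key in set(ca) | set(cb)))


# Hypothetical toy sequences (not MERS-CoV data).
seqs = {
    "s1": "ATGCGATACG",
    "s2": "ATGCGATACC",  # differs from s1 in the last base only
    "s3": "TTTTGGGGCC",
}

names = sorted(seqs)
distance_matrix = [[kmer_distance(seqs[p], seqs[q]) for q in names] for p in names]
for name, row in zip(names, distance_matrix):
    print(name, [round(d, 3) for d in row])
```

    A distance matrix of this kind is the usual input to a hierarchical clustering routine that builds the tree.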

  7. Permutation Tests of Hierarchical Cluster Analyses of Carrion Communities and Their Potential Use in Forensic Entomology.

    PubMed

    van der Ham, Joris L

    2016-05-19

    Forensic entomologists can use carrion communities' ecological succession data to estimate the postmortem interval (PMI). Permutation tests of hierarchical cluster analyses of these data provide a conceptual method to estimate part of the PMI, the post-colonization interval (post-CI). This multivariate approach produces a baseline of statistically distinct clusters that reflect changes in the carrion community composition during the decomposition process. Carrion community samples of unknown post-CIs are compared with these baseline clusters to estimate the post-CI. In this short communication, I use data from previously published studies to demonstrate the conceptual feasibility of this multivariate approach. Analyses of these data produce series of significantly distinct clusters, which represent carrion communities during 1- to 20-day periods of the decomposition process. For 33 carrion community samples, collected over an 11-day period, this approach correctly estimated the post-CI within an average range of 3.1 days. © The Authors 2016. Published by Oxford University Press on behalf of Entomological Society of America. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  8. Estimating the concrete compressive strength using hard clustering and fuzzy clustering based regression techniques.

    PubMed

    Nagwani, Naresh Kumar; Deo, Shirish V

    2014-01-01

    Understanding the compressive strength of concrete is important for activities such as construction arrangement, prestressing operations, proportioning new mixtures, and quality assurance. Regression techniques are most widely used for prediction tasks in which the relationship between the independent variables and the dependent (prediction) variable is identified. The accuracy of regression techniques for prediction can be improved if clustering is used along with regression. Clustering along with regression will ensure a more accurate curve fitting between the dependent and independent variables. In this work, a cluster regression technique is applied to estimate the compressive strength of concrete, and a novel state-of-the-art approach is proposed for predicting the concrete compressive strength. The objective of this work is to demonstrate that clustering along with regression ensures fewer prediction errors when estimating the concrete compressive strength. The proposed technique consists of two major stages: in the first stage, clustering is used to group concrete data with similar characteristics, and in the second stage, regression techniques are applied to these clusters (groups) to predict the compressive strength from individual clusters. Experiments show that clustering along with regression techniques gives the minimum errors for predicting the compressive strength of concrete; also, the fuzzy c-means clustering algorithm performs better than the k-means algorithm.
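
    The two-stage idea (cluster first, then fit one regression per cluster) can be sketched with hard clustering as below. Everything in the example is an assumption made for illustration: the single input feature, the toy numbers, the initial centers and the use of a simple least-squares line; none of it comes from the concrete data set used in the study.

```python
def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)


def kmeans_1d(xs, init_centers, n_iter=20):
    """Plain k-means on scalar data; empty clusters keep their old center."""
    centers = list(init_centers)
    for _ in range(n_iter):
        labels = [nearest(x, centers) for x in xs]
        for j in range(len(centers)):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, [nearest(x, centers) for x in xs]


def fit_line(xs, ys):
    """Ordinary least-squares line; needs at least two distinct x values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def fit_cluster_regression(xs, ys, init_centers):
    """Stage 1: cluster the inputs. Stage 2: one regression line per cluster."""
    centers, labels = kmeans_1d(xs, init_centers)
    models = []
    for j in range(len(centers)):
        cx = [x for x, lab in zip(xs, labels) if lab == j]
        cy = [y for y, lab in zip(ys, labels) if lab == j]
        models.append(fit_line(cx, cy))
    return centers, models


def predict(x, centers, models):
    slope, intercept = models[nearest(x, centers)]
    return slope * x + intercept


# Hypothetical data with two regimes: y = 2x for small x, y = 51 - x for large x.
xs = [1, 2, 3, 4, 11, 12, 13, 14]
ys = [2, 4, 6, 8, 40, 39, 38, 37]
centers, models = fit_cluster_regression(xs, ys, init_centers=[1, 14])
print(centers, models, predict(2.5, centers, models), predict(12.5, centers, models))
```

    A fuzzy variant would weight each cluster's regression by the membership degrees instead of assigning every record to exactly one cluster.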

  9. Estimating the Concrete Compressive Strength Using Hard Clustering and Fuzzy Clustering Based Regression Techniques

    PubMed Central

    Nagwani, Naresh Kumar; Deo, Shirish V.

    2014-01-01

    Understanding the compressive strength of concrete is important for activities such as construction arrangement, prestressing operations, proportioning new mixtures, and quality assurance. Regression techniques are most widely used for prediction tasks in which the relationship between the independent variables and the dependent (prediction) variable is identified. The accuracy of regression techniques for prediction can be improved if clustering is used along with regression. Clustering along with regression will ensure a more accurate curve fitting between the dependent and independent variables. In this work, a cluster regression technique is applied to estimate the compressive strength of concrete, and a novel state-of-the-art approach is proposed for predicting the concrete compressive strength. The objective of this work is to demonstrate that clustering along with regression ensures fewer prediction errors when estimating the concrete compressive strength. The proposed technique consists of two major stages: in the first stage, clustering is used to group concrete data with similar characteristics, and in the second stage, regression techniques are applied to these clusters (groups) to predict the compressive strength from individual clusters. Experiments show that clustering along with regression techniques gives the minimum errors for predicting the compressive strength of concrete; also, the fuzzy c-means clustering algorithm performs better than the k-means algorithm. PMID:25374939

  10. Clustering approaches to feature change detection

    NASA Astrophysics Data System (ADS)

    G-Michael, Tesfaye; Gunzburger, Max; Peterson, Janet

    2018-05-01

    The automated detection of changes occurring between multi-temporal images is of significant importance in a wide range of medical, environmental, safety, as well as many other settings. The usage of k-means clustering is explored as a means for detecting objects added to a scene. The silhouette score for the clustering is used to define the optimal number of clusters that should be used. For simple images having a limited number of colors, new objects can be detected by examining the change between the optimal number of clusters for the original and modified images. For more complex images, new objects may need to be identified by examining the relative areas covered by corresponding clusters in the original and modified images. Which method is preferable depends on the composition and range of colors present in the images. In addition to describing the clustering and change detection methodology of our proposed approach, we provide some simple illustrations of its application.
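
    For the simple-image case described above, a change in the silhouette-optimal number of clusters signals a new object. The sketch below illustrates this on scalar gray levels. The pixel values, the candidate range of k and the deterministic k-means initialization are assumptions for the example and are not the authors' implementation.

```python
def kmeans_1d(xs, k, n_iter=50):
    """k-means on scalars with a deterministic start (evenly spaced order statistics)."""
    s = sorted(xs)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    labels = [0] * len(xs)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels


def silhouette(xs, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    scores = []
    for i, (x, li) in enumerate(zip(xs, labels)):
        own = [abs(x - y) for j, (y, lab) in enumerate(zip(xs, labels))
               if lab == li and j != i]
        if not own:  # singleton cluster: silhouette is 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(abs(x - y) for y, lab in zip(xs, labels) if lab == other)
            / sum(1 for lab in labels if lab == other)
            for other in set(labels) if other != li
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def best_k(xs, candidates):
    """Number of clusters with the highest mean silhouette score."""
    return max(candidates, key=lambda k: silhouette(xs, kmeans_1d(xs, k)))


# Hypothetical gray levels: the modified image contains a new, brighter object.
original = [10, 11, 12, 50, 51, 52]
modified = original + [90, 91, 92]

k_before = best_k(original, [2, 3, 4])
k_after = best_k(modified, [2, 3, 4])
object_added = k_after > k_before
print(k_before, k_after, object_added)
```

    For images with many colors the optimal k may not change when an object is added, which is why the abstract falls back on comparing the relative areas of corresponding clusters.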

  11. Model-based branching point detection in single-cell data by K-branches clustering

    PubMed Central

    Chlis, Nikolaos K.; Wolf, F. Alexander; Theis, Fabian J.

    2017-01-01

    Motivation: The identification of heterogeneities in cell populations by utilizing single-cell technologies such as single-cell RNA-Seq, enables inference of cellular development and lineage trees. Several methods have been proposed for such inference from high-dimensional single-cell data. They typically assign each cell to a branch in a differentiation trajectory. However, they commonly assume specific geometries such as tree-like developmental hierarchies and lack statistically sound methods to decide on the number of branching events. Results: We present K-Branches, a solution to the above problem by locally fitting half-lines to single-cell data, introducing a clustering algorithm similar to K-Means. These half-lines are proxies for branches in the differentiation trajectory of cells. We propose a modified version of the GAP statistic for model selection, in order to decide on the number of lines that best describe the data locally. In this manner, we identify the location and number of subgroups of cells that are associated with branching events and full differentiation, respectively. We evaluate the performance of our method on single-cell RNA-Seq data describing the differentiation of myeloid progenitors during hematopoiesis, single-cell qPCR data of mouse blastocyst development, single-cell qPCR data of human myeloid monocytic leukemia and artificial data. Availability and implementation: An R implementation of K-Branches is freely available at https://github.com/theislab/kbranches. Contact: fabian.theis@helmholtz-muenchen.de. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28582478

  12. Clusters of cultures: diversity in meaning of family value and gender role items across Europe.

    PubMed

    van Vlimmeren, Eva; Moors, Guy B D; Gelissen, John P T M

    2017-01-01

    Survey data are often used to map cultural diversity by aggregating scores of attitude and value items across countries. However, this procedure only makes sense if the same concept is measured in all countries. In this study we argue that when (co)variances among sets of items are similar across countries, these countries share a common way of assigning meaning to the items. Clusters of cultures can then be observed by doing a cluster analysis on the (co)variance matrices of sets of related items. This study focuses on family values and gender role attitudes. We find four clusters of cultures that assign a distinct meaning to these items, especially in the case of gender roles. Some of these differences reflect response style behavior in the form of acquiescence. Adjusting for this style effect impacts on country comparisons hence demonstrating the usefulness of investigating the patterns of meaning given to sets of items prior to aggregating scores into cultural characteristics.

  13. Selection of key ambient particulate variables for epidemiological studies - applying cluster and heatmap analyses as tools for data reduction.

    PubMed

    Gu, Jianwei; Pitz, Mike; Breitner, Susanne; Birmili, Wolfram; von Klot, Stephanie; Schneider, Alexandra; Soentgen, Jens; Reller, Armin; Peters, Annette; Cyrys, Josef

    2012-10-01

    The success of epidemiological studies depends on the use of appropriate exposure variables. The purpose of this study is to extract a relatively small selection of variables characterizing ambient particulate matter from a large measurement data set. The original data set comprised a total of 96 particulate matter variables that have been continuously measured since 2004 at an urban background aerosol monitoring site in the city of Augsburg, Germany. Many of the original variables were derived from measured particle size distribution (PSD) across the particle diameter range 3 nm to 10 μm, including size-segregated particle number concentration, particle length concentration, particle surface concentration and particle mass concentration. The data set was complemented by integral aerosol variables. These variables were measured by independent instruments, including black carbon, sulfate, particle active surface concentration and particle length concentration. It is obvious that such a large number of measured variables cannot be used in health effect analyses simultaneously. The aim of this study is a pre-screening and a selection of the key variables that will be used as input in forthcoming epidemiological studies. In this study, we present two methods of parameter selection and apply them to data from a two-year period from 2007 to 2008. We used the agglomerative hierarchical cluster method to find groups of similar variables. In total, we selected 15 key variables from 9 clusters which are recommended for epidemiological analyses. We also applied a two-dimensional visualization technique called "heatmap" analysis to the Spearman correlation matrix. 12 key variables were selected using this method. Moreover, the positive matrix factorization (PMF) method was applied to the PSD data to characterize the possible particle sources. Correlations between the variables and PMF factors were used to interpret the meaning of the cluster and the heatmap analyses

  14. Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets.

    PubMed

    Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil

    2009-07-01

    Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except generalised least squares multilevel modelling (ML GH 'xtlogit' in Stata) gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that these are accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children where there are few multiples) there appears to be less need to adjust for clustering.

  15. Recognition of acute lymphoblastic leukemia cells in microscopic images using k-means clustering and support vector machine classifier.

    PubMed

    Amin, Morteza Moradi; Kermani, Saeed; Talebi, Ardeshir; Oghli, Mostafa Ghelich

    2015-01-01

    Acute lymphoblastic leukemia (ALL) is the most common form of pediatric cancer. It is categorized into three sub-types, L1, L2, and L3, and can be detected by pathologists through screening of blood and bone marrow smears. Because the procedure is time-consuming and tedious, a computer-based system is required for convenient detection of ALL. Microscopic images are acquired from blood and bone marrow smears of patients with ALL and of normal cases. After image preprocessing, cell nuclei are segmented by the k-means algorithm. Geometric and statistical features are then extracted from the nuclei, and the cells are classified as cancerous or noncancerous by a support vector machine classifier with 10-fold cross validation. The cells are also classified into their sub-types by a multi-class support vector machine classifier. The classifier is evaluated by sensitivity, specificity, and accuracy, whose values for cancerous and noncancerous cells are 98%, 95%, and 97%, respectively. The same parameters are used to evaluate the cell sub-types, with mean values of 84.3%, 97.3%, and 95.6%, respectively. The results show that the proposed algorithm could achieve an acceptable performance for the diagnosis of ALL and its sub-types and can be used as an assistant diagnostic tool for pathologists.

  16. Solutions of Smoluchowski's coagulation equation at large cluster sizes

    NASA Astrophysics Data System (ADS)

    Van Dongen, P. G. J.

    1987-09-01

    In this paper we determine the behavior of solutions $c_k(t)$ of Smoluchowski's coagulation equation for cluster sizes much larger than the mean cluster size $s(t)$. We consider in general the homogeneous rate constants $K(i, j)$, behaving as $K(i, j) \sim i^{\mu} j^{\nu}$ as $j \to \infty$, where special attention is paid to models with an exponent $\nu = 1$. The behavior of $c_k(t)$ is studied in three different limits: (i) the short-time limit ($t \downarrow 0$), with $k \gg 1$, (ii) the limit $k \to \infty$, with $t > 0$ fixed, and (iii) the scaling limit, with $k \gg s(t)$. The two most important conclusions of this paper are, first, that the detailed behavior of $c_k(t)$ at large cluster sizes ($k \gg s(t)$) may be drastically different for different rate constants $K(i, j)$ and, secondly, that the results for $c_k(t)$, obtained in the limits (i), (ii) and (iii), are closely related.
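
    The abstract does not write the equation out. For orientation, a sketch of the standard discrete form of Smoluchowski's coagulation equation is given below, using the abstract's symbols for the cluster concentrations and rate constants and reading its two exponents as $\mu$ and $\nu$. The homogeneity degree $\lambda$ is a symbol introduced here for illustration only; it is an assumption about notation, not something stated in the abstract.

```latex
\frac{dc_k}{dt} = \frac{1}{2} \sum_{i+j=k} K(i,j)\, c_i(t)\, c_j(t)
                  \;-\; c_k(t) \sum_{j=1}^{\infty} K(k,j)\, c_j(t),
\qquad
K(ai, aj) = a^{\lambda} K(i,j), \quad \lambda = \mu + \nu .
```

    The first sum counts pairs of smaller clusters that merge into a cluster of size $k$; the second removes clusters of size $k$ that merge with any other cluster.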

  17. A Self-Adaptive Fuzzy c-Means Algorithm for Determining the Optimal Number of Clusters

    PubMed Central

    Wang, Zhihao; Yi, Jing

    2016-01-01

    To address the shortcoming that the fuzzy c-means algorithm (FCM) needs to know the number of clusters in advance, this paper proposed a new self-adaptive method to determine the optimal number of clusters. Firstly, a density-based algorithm was put forward. The algorithm, according to the characteristics of the dataset, automatically determined the possible maximum number of clusters instead of using the empirical rule √n and obtained the optimal initial cluster centroids, improving on the limitation of FCM that randomly selected cluster centroids lead the convergence result to a local minimum. Secondly, this paper, by introducing a penalty function, proposed a new fuzzy clustering validity index based on fuzzy compactness and separation, which ensured that when the number of clusters verged on that of objects in the dataset, the value of clustering validity index did not monotonically decrease and was close to zero, so that the optimal number of clusters lost robustness and decision function. Then, based on these studies, a self-adaptive FCM algorithm was put forward to estimate the optimal number of clusters by the iterative trial-and-error process. At last, experiments were done on the UCI, KDD Cup 1999, and synthetic datasets, which showed that the method not only effectively determined the optimal number of clusters, but also reduced the iteration of FCM with the stable clustering result. PMID:28042291
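
    As background for this record, the following is a minimal Python sketch of one iteration of the standard fuzzy c-means updates (memberships, then centroids). It is not the self-adaptive method of the paper: the function names, the fuzzifier default m = 2, and the handling of a point that coincides with a centroid are all assumptions made for illustration.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update; returns u[i][j] for point i, cluster j."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # Assumption: a point sitting exactly on one or more centroids
            # shares its full membership among those centroids.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        # u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        memberships.append(
            [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
        )
    return memberships


def fcm_centers(points, memberships, m=2.0):
    """Standard FCM centroid update: means weighted by u_ij ** m."""
    dims = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centers.append(
            [sum(w * p[d] for w, p in zip(weights, points)) / total
             for d in range(dims)]
        )
    return centers
```

    Alternating the two functions until the centroids stop moving gives the basic FCM loop; the number of clusters is fixed by the length of the initial centroid list, which is the limitation the paper sets out to remove.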

  18. A clustering method of Chinese medicine prescriptions based on modified firefly algorithm.

    PubMed

    Yuan, Feng; Liu, Hong; Chen, Shou-Qiang; Xu, Liang

    2016-12-01

    This paper aims to study the clustering method for Chinese medicine (CM) medical cases. The traditional K-means clustering algorithm had shortcomings, such as dependence of the results on the selection of initial values and trapping in local optima, when processing prescriptions from CM medical cases. Therefore, a new clustering method based on the collaboration of the firefly algorithm and the simulated annealing algorithm was proposed. This algorithm dynamically determined the iteration of the firefly algorithm and the sampling of the simulated annealing algorithm by fitness changes, and increased the diversity of the swarm through expansion of the scope of the sudden jump, thereby effectively avoiding the premature convergence problem. The results from confirmatory experiments for CM medical cases suggested that, compared with traditional K-means clustering algorithms, this method greatly improved the individual diversity and the obtained clustering results. The computing results from this method had a certain reference value for cluster analysis on CM prescriptions.

  19. [Study of the clinical phenotype of symptomatic chronic airways disease by hierarchical cluster analysis and two-step cluster analyses].

    PubMed

    Ning, P; Guo, Y F; Sun, T Y; Zhang, H S; Chai, D; Li, X M

    2016-09-01

    (SGRQ) score, acute exacerbation in the past one year, PEF variability and allergic dermatitis (P<0.05). (2) Four clusters were also identified by two-step cluster analysis as follows: cluster 1, COPD patients with moderate to severe airflow limitation; cluster 2, asthma and COPD patients with heavy smoking, airflow limitation and increased airways reversibility; cluster 3, patients having less smoking and normal pulmonary function with wheezing but no chronic cough; cluster 4, chronic bronchitis patients with normal pulmonary function and chronic cough. Significant differences were revealed regarding gender distribution, respiratory symptoms, pre-salbutamol FEV1/FVC%, pre-salbutamol FEV1% pred, post-salbutamol change in FEV1%, MMEF% pred, DLCO/VA% pred, RV% pred, PEF variability, total serum IgE level, cumulative tobacco cigarette consumption (pack-years), and SGRQ score (P<0.05). By different cluster analyses, distinct clinical phenotypes of chronic airway diseases are identified. Thus, doctors may be guided to provide individualized treatments based on different phenotypes.

  20. Meta-Analyses of the Effects of Tier 2 Type Reading Interventions in Grades K-3.

    PubMed

    Wanzek, Jeanne; Vaughn, Sharon; Scammacca, Nancy; Gatlin, Brandy; Walker, Melodee A; Capin, Philip

    2016-09-01

    This meta-analysis extends previous work on extensive Tier 3 type reading interventions (Wanzek & Vaughn, 2007; Wanzek et al., 2013) to Tier 2 type interventions by examining a non-overlapping set of studies addressing the effects of less extensive reading interventions for students with or at risk for reading difficulties in Grades K-3. We examined the overall effects of these interventions on students' foundational skills, language, and comprehension as well as the intervention features that may be associated with improved outcomes. We conducted four meta-analyses on 72 studies to examine effects on (1) standardized foundational skill measures (mean ES = 0.54), (2) not-standardized foundational skill measures (mean ES = 0.62), (3) standardized language/comprehension measures (mean ES = 0.36), and (4) not-standardized language/comprehension measures (mean ES = 1.02). There were no differences in effects related to intervention type, instructional group size, grade level, intervention implementer, or the number of intervention hours.

  1. The implementation of hybrid clustering using fuzzy c-means and divisive algorithm for analyzing DNA human Papillomavirus cause of cervical cancer

    NASA Astrophysics Data System (ADS)

    Andryani, Diyah Septi; Bustamam, Alhadi; Lestari, Dian

    2017-03-01

    Clustering aims to classify different patterns into groups called clusters. In this clustering method, we use n-mer frequencies to calculate the distance matrix, which is considered more accurate than using DNA alignment. The clustering results could be used to discover biologically important sub-sections and groups of genes. Many clustering methods have been developed, and hard clustering methods are considered less accurate than fuzzy clustering methods, especially for data with outliers. Among fuzzy clustering methods, fuzzy c-means is one of the best known for its accuracy and simplicity. Fuzzy c-means clustering uses a membership function variable, which refers to how likely a data point is to be a member of a cluster. Fuzzy c-means clustering works on the principle of minimizing an objective function. The parameter of the membership function is used as a weighting factor, also called the fuzzifier. In this study we implement hybrid clustering using fuzzy c-means and a divisive algorithm, which could improve the accuracy of cluster membership compared to a traditional partitional approach alone. Fuzzy c-means is used in the first step to find partition results. Divisive algorithms then run in the second step to find sub-clusters and the dendrogram of the phylogenetic tree. The best number of clusters is determined using the minimum value of the Davies-Bouldin Index (DBI) of the cluster results. The results show that the method introduced in this paper is better than other partitioning methods. We found 3 clusters with a DBI value of 1.126628 at the first step of clustering. Moreover, the second step of clustering always produces smaller DBI values than the first step alone. This indicates that the hybrid approach in this study produces better cluster results in terms of DBI values.
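
    The abstract selects the number of clusters by the minimum Davies-Bouldin Index. A minimal Python sketch of that index for a hard partition follows, assuming Euclidean distance and at least two clusters with distinct centroids. The function names are hypothetical, and the paper's own distance (based on n-mer frequencies) and implementation are not given in the abstract.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(clusters):
    """Davies-Bouldin Index for a list of clusters (each a list of points).

    Lower values indicate more compact, better separated clusters.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter S_i: mean distance of the members of cluster i to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity ratio of cluster i against every other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k
```

    Evaluating `davies_bouldin` on the partitions produced for several candidate numbers of clusters and keeping the smallest value mirrors the selection rule described above.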

  2. A Cluster Analytic Approach to Identifying Predictors and Moderators of Psychosocial Treatment for Bipolar Depression: Results from STEP-BD

    PubMed Central

    Deckersbach, Thilo; Peters, Amy T.; Sylvia, Louisa G.; Gold, Alexandra K.; da Silva Magalhaes, Pedro Vieira; Henry, David B.; Frank, Ellen; Otto, Michael W.; Berk, Michael; Dougherty, Darin D.; Nierenberg, Andrew A.; Miklowitz, David J.

    2016-01-01

    Background We sought to address how predictors and moderators of psychotherapy for bipolar depression – identified individually in prior analyses – can inform the development of a metric for prospectively classifying treatment outcome in intensive psychotherapy (IP) versus collaborative care (CC) adjunctive to pharmacotherapy in the Systematic Treatment Enhancement Program (STEP-BD) study. Methods We conducted post-hoc analyses on 135 STEP-BD participants using cluster analysis to identify subsets of participants with similar clinical profiles and investigated this combined metric as a moderator and predictor of response to IP. We used agglomerative hierarchical cluster analyses and k-means clustering to determine the content of the clinical profiles. Logistic regression and Cox proportional hazard models were used to evaluate whether the resulting clusters predicted or moderated likelihood of recovery or time until recovery. Results The cluster analysis yielded a two-cluster solution: 1) “less-recurrent/severe” and 2) “chronic/recurrent.” Rates of recovery in IP were similar for less-recurrent/severe and chronic/recurrent participants. Less-recurrent/severe patients were more likely than chronic/recurrent patients to achieve recovery in CC (p = .040, OR = 4.56). IP yielded a faster recovery for chronic/recurrent participants, whereas CC led to recovery sooner in the less-recurrent/severe cluster (p = .034, OR = 2.62). Limitations Cluster analyses require list-wise deletion of cases with missing data so we were unable to conduct analyses on all STEP-BD participants. Conclusions A well-powered, parametric approach can distinguish patients based on illness history and provide clinicians with symptom profiles of patients that confer differential prognosis in CC vs. IP. PMID:27289316

  3. Adaptive phase k-means algorithm for waveform classification

    NASA Astrophysics Data System (ADS)

    Song, Chengyun; Liu, Zhining; Wang, Yaojun; Xu, Feng; Li, Xingming; Hu, Guangmin

    2018-01-01

    Waveform classification is a powerful technique for seismic facies analysis that describes the heterogeneity and compartments within a reservoir. Horizon interpretation is a critical step in waveform classification. However, the horizon often produces inconsistent waveform phase, and thus results in an unsatisfactory classification. To alleviate this problem, an adaptive phase waveform classification method called the adaptive phase k-means is introduced in this paper. Our method improves the traditional k-means algorithm using an adaptive phase distance for waveform similarity measure. The proposed distance is a measure with variable phases as it moves from sample to sample along the traces. Model traces are also updated with the best phase interference in the iterative process. Therefore, our method is robust to phase variations caused by the interpretation horizon. We tested the effectiveness of our algorithm by applying it to synthetic and real data. The satisfactory results reveal that the proposed method tolerates certain waveform phase variation and is a good tool for seismic facies analysis.

  4. Open source clustering software.

    PubMed

    de Hoon, M J L; Imoto, S; Nolan, J; Miyano, S

    2004-06-12

    We have implemented k-means clustering, hierarchical clustering and self-organizing maps in a single multipurpose open-source library of C routines, callable from other C and C++ programs. Using this library, we have created an improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix. In addition, we generated a Python and a Perl interface to the C Clustering Library, thereby combining the flexibility of a scripting language with the speed of C. The C Clustering Library and the corresponding Python C extension module Pycluster were released under the Python License, while the Perl module Algorithm::Cluster was released under the Artistic License. The GUI code Cluster 3.0 for Windows, Macintosh and Linux/Unix, as well as the corresponding command-line program, were released under the same license as the original Cluster code. The complete source code is available at http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster. Alternatively, Algorithm::Cluster can be downloaded from CPAN, while Pycluster is also available as part of the Biopython distribution.

  5. Clustering approach for unsupervised segmentation of malarial Plasmodium vivax parasite

    NASA Astrophysics Data System (ADS)

    Abdul-Nasir, Aimi Salihah; Mashor, Mohd Yusoff; Mohamed, Zeehaida

    2017-10-01

    Malaria is a global health problem, particularly in Africa and south Asia where it causes countless deaths and morbidity cases. Efficient control and prompt treatment of this disease require early detection and accurate diagnosis due to the large number of cases reported yearly. To achieve this aim, this paper proposes an image segmentation approach via unsupervised pixel segmentation of malaria parasite to automate the diagnosis of malaria. In this study, a modified clustering algorithm, namely enhanced k-means (EKM) clustering, is proposed for malaria image segmentation. In the proposed EKM clustering, the concept of variance and a new version of the transferring process for clustered members are used to assist the assignment of data to the proper centre during the process of clustering, so that a well-segmented malaria image can be generated. The effectiveness of the proposed EKM clustering has been analyzed qualitatively and quantitatively by comparing this algorithm with two popular image segmentation techniques, namely Otsu's thresholding and k-means clustering. The experimental results show that the proposed EKM clustering has successfully segmented 100 malaria images of P. vivax species with segmentation accuracy, sensitivity and specificity of 99.20%, 87.53% and 99.58%, respectively. Hence, the proposed EKM clustering can be considered as an image segmentation tool for segmenting the malaria images.
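
    For context, the baseline that EKM is compared against, conventional k-means applied to pixel intensities, can be sketched in a few lines of Python. This is plain Lloyd-style k-means on a flat list of grey levels, not the EKM algorithm (whose variance and transfer steps are not specified in the abstract); the function name, the caller-supplied initial centres, and the sample values are hypothetical.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd-style k-means on scalar values such as pixel intensities.

    `centers` holds the initial centres; returns (final_centers, labels).
    """
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centre.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: each centre moves to the mean of its members;
        # an empty cluster keeps its previous centre.
        new_centers = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [
        min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        for v in values
    ]
    return centers, labels


# Hypothetical grey levels: dark parasite pixels versus bright background.
pixels = [10, 12, 11, 200, 205, 210]
final_centers, pixel_labels = kmeans_1d(pixels, centers=[0, 255])
```

    Reshaping the label list back to the image dimensions yields the segmentation mask; results of this baseline depend on the initial centres.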

  6. Convex Clustering: An Attractive Alternative to Hierarchical Clustering

    PubMed Central

    Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth

    2015-01-01

    The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340

  7. Convex clustering: an attractive alternative to hierarchical clustering.

    PubMed

    Chen, Gary K; Chi, Eric C; Ranola, John Michael O; Lange, Kenneth

    2015-05-01

    The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/.

  8. A Comparative Analysis of DBSCAN, K-Means, and Quadratic Variation Algorithms for Automatic Identification of Swallows from Swallowing Accelerometry Signals

    PubMed Central

    Dudik, Joshua M.; Kurosu, Atsuko; Coyle, James L

    2015-01-01

    Background Cervical auscultation with high resolution sensors is currently under consideration as a method of automatically screening for specific swallowing abnormalities. To be clinically useful without human involvement, any devices based on cervical auscultation should be able to detect specified swallowing events in an automatic manner. Methods In this paper, we comparatively analyze the density-based spatial clustering of applications with noise algorithm (DBSCAN), a k-means based algorithm, and an algorithm based on quadratic variation as methods of differentiating periods of swallowing activity from periods of time without swallows. These algorithms utilized swallowing vibration data exclusively and compared the results to a gold standard measure of swallowing duration. Data was collected from 23 subjects that were actively suffering from swallowing difficulties. Results Comparing the performance of the DBSCAN algorithm with a proven segmentation algorithm that utilizes k-means clustering demonstrated that the DBSCAN algorithm had a higher sensitivity and correctly segmented more swallows. Comparing its performance with a threshold-based algorithm that utilized the quadratic variation of the signal showed that the DBSCAN algorithm offered no direct increase in performance. However, it offered several other benefits including a faster run time and more consistent performance between patients. All algorithms showed noticeable differentiation from the endpoints provided by a videofluoroscopy examination as well as reduced sensitivity. Conclusions In summary, we showed that the DBSCAN algorithm is a viable method for detecting the occurrence of a swallowing event using cervical auscultation signals, but significant work must be done to improve its performance before it can be implemented in an unsupervised manner. PMID:25658505

  9. Hybrid Radar Emitter Recognition Based on Rough k-Means Classifier and Relevance Vector Machine

    PubMed Central

    Yang, Zhutian; Wu, Zhilu; Yin, Zhendong; Quan, Taifan; Sun, Hongjian

    2013-01-01

    Due to the increasing complexity of electromagnetic signals, there exists a significant challenge for recognizing radar emitter signals. In this paper, a hybrid recognition approach is presented that classifies radar emitter signals by exploiting the different separability of samples. The proposed approach comprises two steps, namely the primary signal recognition and the advanced signal recognition. In the former step, a novel rough k-means classifier, which comprises three regions, i.e., certain area, rough area and uncertain area, is proposed to cluster the samples of radar emitter signals. In the latter step, the samples within the rough boundary are used to train the relevance vector machine (RVM). Then RVM is used to recognize the samples in the uncertain area; therefore, the classification accuracy is improved. Simulation results show that, for recognizing radar emitter signals, the proposed hybrid recognition approach is more accurate, and presents lower computational complexity than traditional approaches. PMID:23344380

  10. Multilevel models for cost-effectiveness analyses that use cluster randomised trial data: An approach to model choice.

    PubMed

    Ng, Edmond S-W; Diaz-Ordaz, Karla; Grieve, Richard; Nixon, Richard M; Thompson, Simon G; Carpenter, James R

    2016-10-01

    Multilevel models provide a flexible modelling framework for cost-effectiveness analyses that use cluster randomised trial data. However, there is a lack of guidance on how to choose the most appropriate multilevel models. This paper illustrates an approach for deciding what level of model complexity is warranted; in particular how best to accommodate complex variance-covariance structures, right-skewed costs and missing data. Our proposed models differ according to whether or not they allow individual-level variances and correlations to differ across treatment arms or clusters and by the assumed cost distribution (Normal, Gamma, Inverse Gaussian). The models are fitted by Markov chain Monte Carlo methods. Our approach to model choice is based on four main criteria: the characteristics of the data, model pre-specification informed by the previous literature, diagnostic plots and assessment of model appropriateness. This is illustrated by re-analysing a previous cost-effectiveness analysis that uses data from a cluster randomised trial. We find that the most useful criterion for model choice was the deviance information criterion, which distinguishes amongst models with alternative variance-covariance structures, as well as between those with different cost distributions. This strategy for model choice can help cost-effectiveness analyses provide reliable inferences for policy-making when using cluster trials, including those with missing data. © The Author(s) 2013.

  11. Learner Typologies Development Using OIndex and Data Mining Based Clustering Techniques

    ERIC Educational Resources Information Center

    Luan, Jing

    2004-01-01

    This explorative data mining project used distance based clustering algorithm to study 3 indicators, called OIndex, of student behavioral data and stabilized at a 6-cluster scenario following an exhaustive explorative study of 4, 5, and 6 cluster scenarios produced by K-Means and TwoStep algorithms. Using principles in data mining, the study…

  12. An AK-LDMeans algorithm based on image clustering

    NASA Astrophysics Data System (ADS)

    Chen, Huimin; Li, Xingwei; Zhang, Yongbin; Chen, Nan

    2018-03-01

    Clustering is an effective analytical technique for handling unmarked data for value mining. Its ultimate goal is to mark unclassified data quickly and correctly. We use the roadmap for the current image processing as the experimental background. In this paper, we propose an AK-LDMeans algorithm that automatically locks the K value by designing the Kcost fold line, and then uses the long-distance high-density method to select the clustering centers, replacing the traditional initial clustering center selection method, which further improves the efficiency and accuracy of the traditional K-Means algorithm. The experimental results are compared with those of current clustering algorithms. The algorithm can provide effective reference value in the fields of image processing, machine vision and data mining.

  13. Stability-based validation of dietary patterns obtained by cluster analysis.

    PubMed

    Sauvageot, Nicolas; Schritz, Anna; Leite, Sonia; Alkerwi, Ala'a; Stranges, Saverio; Zannad, Faiez; Streel, Sylvie; Hoge, Axelle; Donneau, Anne-Françoise; Albert, Adelin; Guillaume, Michèle

    2017-01-14

    Cluster analysis is a data-driven method used to create clusters of individuals sharing similar dietary habits. However, this method requires specific choices from the user which have an influence on the results. Therefore, there is a need for an objective methodology helping researchers in their decisions during cluster analysis. The objective of this study was to use such a methodology based on stability of clustering solutions to select the most appropriate clustering method and number of clusters for describing dietary patterns in the NESCAV study (Nutrition, Environment and Cardiovascular Health), a large population-based cross-sectional study in the Greater Region (N = 2298). Clustering solutions were obtained with K-means, K-medians and Ward's method and a number of clusters varying from 2 to 6. Their stability was assessed with three indices: adjusted Rand index, Cramer's V and misclassification rate. The most stable solution was obtained with K-means method and a number of clusters equal to 3. The "Convenient" cluster characterized by the consumption of convenient foods was the most prevalent with 46% of the population having this dietary behaviour. In addition, "Prudent" and "Non-Prudent" patterns associated respectively with healthy and non-healthy dietary habits were adopted by 25% and 29% of the population. The "Convenient" and "Non-Prudent" clusters were associated with higher cardiovascular risk whereas the "Prudent" pattern was associated with a decreased cardiovascular risk. Associations with other factors showed that the choice of a specific dietary pattern is part of a wider lifestyle profile. This study is of interest for both researchers and public health professionals. From a methodological standpoint, we showed that using stability of clustering solutions could help researchers in their choices. From a public health perspective, this study showed the need for targeted health promotion campaigns describing the benefits of healthy
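
    One of the three stability indices named in this record, the adjusted Rand index, compares two labelings of the same individuals and is unaffected by how the clusters are numbered. A minimal Python sketch follows; the function name is hypothetical, and the resampling scheme the study used to generate the pairs of clusterings is not described in the abstract, so it is not reproduced here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items.

    1.0 means identical partitions (up to renaming of clusters); values
    near 0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    # Pairs of items placed together in both partitions (contingency cells).
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together within each partition on its own.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

    In a stability analysis, the index would typically be averaged over many pairs of clusterings obtained from perturbed versions of the data, with higher averages indicating a more stable method and number of clusters.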

  14. Sensitivity evaluation of dynamic speckle activity measurements using clustering methods.

    PubMed

    Etchepareborda, Pablo; Federico, Alejandro; Kaufmann, Guillermo H

    2010-07-01

    We evaluate and compare the use of competitive neural networks, self-organizing maps, the expectation-maximization algorithm, K-means, and fuzzy C-means techniques as partitional clustering methods, when the sensitivity of the activity measurement of dynamic speckle images needs to be improved. The temporal history of the acquired intensity generated by each pixel is analyzed in a wavelet decomposition framework, and it is shown that the mean energy of its corresponding wavelet coefficients provides a suitable feature space for clustering purposes. The sensitivity obtained by using the evaluated clustering techniques is also compared with the well-known methods of Konishi-Fujii, weighted generalized differences, and wavelet entropy. The performance of the partitional clustering approach is evaluated using simulated dynamic speckle patterns and also experimental data.

  15. Validating Clusters with the Lower Bound for Sum-of-Squares Error

    ERIC Educational Resources Information Center

    Steinley, Douglas

    2007-01-01

    Given that a minor condition holds (e.g., the number of variables is greater than the number of clusters), a nontrivial lower bound for the sum-of-squares error criterion in K-means clustering is derived. By calculating the lower bound for several different situations, a method is developed to determine the adequacy of cluster solution based on…

  16. A new collaborative recommendation approach based on users clustering using artificial bee colony algorithm.

    PubMed

    Ju, Chunhua; Xu, Chonghuan

    2013-01-01

    Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods.

  17. A New Collaborative Recommendation Approach Based on Users Clustering Using Artificial Bee Colony Algorithm

    PubMed Central

    Ju, Chunhua

    2013-01-01

    Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods. PMID:24381525

  18. Machine-learned cluster identification in high-dimensional data.

    PubMed

    Ultsch, Alfred; Lötsch, Jörn

    2017-02-01

    High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the cluster algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM). Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM the distance structure in the high dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms including single linkage, Ward and k-means. Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. However, ESOM/U-matrix was correct in identifying clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high dimensional biomedical data. The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a

  19. Clustering of Variables for Mixed Data

    NASA Astrophysics Data System (ADS)

    Saracco, J.; Chavent, M.

    2016-05-01

    This chapter presents clustering of variables, whose aim is to lump together strongly related variables. The proposed approach works on a mixed data set, i.e. on a data set which contains numerical variables and categorical variables. Two algorithms for clustering of variables are described: a hierarchical clustering and a k-means type clustering. A brief description of the PCAmix method (a principal component analysis for mixed data) is provided, since the calculation of the synthetic variables summarizing the obtained clusters of variables is based on this multivariate method. Finally, the R packages ClustOfVar and PCAmixdata are illustrated on real mixed data. The PCAmix and ClustOfVar approaches are first used for dimension reduction (step 1) before applying in step 2 a standard clustering method to obtain groups of individuals.

  20. Modified fuzzy c-means applied to a Bragg grating-based spectral imager for material clustering

    NASA Astrophysics Data System (ADS)

    Rodríguez, Aida; Nieves, Juan Luis; Valero, Eva; Garrote, Estíbaliz; Hernández-Andrés, Javier; Romero, Javier

    2012-01-01

    We have modified the Fuzzy C-Means algorithm for an application related to segmentation of hyperspectral images. The classical fuzzy c-means algorithm uses Euclidean distance for computing sample membership to each cluster. We have introduced a different distance metric, Spectral Similarity Value (SSV), in order to have a more convenient similarity measure for reflectance information. The SSV distance metric considers both magnitude difference (by the use of Euclidean distance) and spectral shape (by the use of Pearson correlation). Experiments confirmed that the introduction of this metric improves the quality of hyperspectral image segmentation, creating spectrally more dense clusters and increasing the number of correctly classified pixels.
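
    The abstract above does not state the SSV formula. As a rough illustration only, the sketch below assumes one commonly cited form, in which a root-mean-square Euclidean term and a shape term (1 - r^2, with r the Pearson correlation) are combined in quadrature. The function names are hypothetical and this is not necessarily the authors' exact metric.

```python
import math


def euclidean_term(x, y):
    """Root-mean-square difference between two spectra (magnitude term)."""
    n = len(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / n)


def pearson_r(x, y):
    """Pearson correlation of two equal-length spectra.

    Raises ZeroDivisionError for a constant spectrum (zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ssv(x, y):
    """Assumed SSV form: sqrt(d_e^2 + (1 - r^2)^2)."""
    d_e = euclidean_term(x, y)           # magnitude difference
    shape = 1.0 - pearson_r(x, y) ** 2   # spectral-shape difference
    return math.sqrt(d_e ** 2 + shape ** 2)
```

    Under this assumed form, two spectra with identical shape but different magnitude differ only through the Euclidean term, which is the behaviour the abstract attributes to combining the two components.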

  1. Properties of polycyclic aromatic hydrocarbons in the northwest photon dominated region of NGC 7023. II. Traditional PAH analysis using k-means as a visualization tool

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Boersma, C.; Bregman, J.; Allamandola, L. J., E-mail: Christiaan.Boersma@nasa.gov

    2014-11-10

    Polycyclic aromatic hydrocarbon (PAH) emission in the Spitzer-IRS spectral map of the northwest photon dominated region (PDR) in NGC 7023 is analyzed using the 'traditional' approach in which the PAH bands and plateaus between 5.2-19.5 μm are isolated by subtracting the underlying continuum and removing H₂ emission lines. The spectra are organized into seven spectroscopic bins by using k-means clustering. Each cluster corresponds to, and reveals, a morphological zone within NGC 7023. The zones self-organize parallel to the well-defined PDR front that coincides with an increase in intensity of the H₂ emission lines. PAH band profiles and integrated strengths are measured, classified, and mapped. The morphological zones revealed by the k-means clustering provide deeper insight into the conditions that drive variations in band strength ratios and evolution of the PAH population that otherwise would be lost. For example, certain band-band relations are bifurcated, revealing two limiting cases; one associated with the PDR, the other with the diffuse medium. Traditionally, PAH band strength ratios are used to gain insight into the properties of the emitting PAH population, i.e., charge, size, structure, and composition. Insights inferred from this work are compared and contrasted to those from Boersma et al. (first paper in this series), where the PAH emission in NGC 7023 is decomposed exclusively using the PAH spectra and tools made available through the NASA Ames PAH IR Spectroscopic Database.

  2. Confidence intervals for a difference between lognormal means in cluster randomization trials.

    PubMed

    Poirier, Julia; Zou, G Y; Koval, John

    2017-04-01

    Cluster randomization trials, in which intact social units are randomized to different interventions, have become popular in the last 25 years. Outcomes from these trials in many cases are positively skewed, following approximately lognormal distributions. When inference is focused on the difference between treatment arm arithmetic means, existent confidence interval procedures either make restricting assumptions or are complex to implement. We approach this problem by assuming log-transformed outcomes from each treatment arm follow a one-way random effects model. The treatment arm means are functions of multiple parameters for which separate confidence intervals are readily available, suggesting that the method of variance estimates recovery may be applied to obtain closed-form confidence intervals. A simulation study showed that this simple approach performs well in small sample sizes in terms of empirical coverage, relatively balanced tail errors, and interval widths as compared to existing methods. The methods are illustrated using data arising from a cluster randomization trial investigating a critical pathway for the treatment of community acquired pneumonia.

  3. Radiation-induced segregation and precipitation behaviours around cascade clusters under electron irradiation.

    PubMed

    Sueishi, Yuichiro; Sakaguchi, Norihito; Shibayama, Tamaki; Kinoshita, Hiroshi; Takahashi, Heishichiro

    2003-01-01

    We have investigated the formation of cascade clusters and structural changes in them by means of electron irradiation following ion irradiation in an austenitic stainless steel. Almost all of the cascade clusters, which were introduced by the ion irradiation, grew to form interstitial-type dislocation loops or vacancy-type stacking fault tetrahedra after electron irradiation at 623 K, whereas a few of the dot-type clusters remained in the matrix. It was possible to recognize the concentration of Ni and Si by radiation-induced segregation around the dot-type clusters. After electron irradiation at 773 K, we found that some cascade clusters became precipitates (delta-Ni2Si) due to radiation-induced precipitation. This suggests that the cascade clusters could directly become precipitation sites during irradiation.

  4. Fast large-scale clustering of protein structures using Gauss integrals.

    PubMed

    Harder, Tim; Borg, Mikael; Boomsma, Wouter; Røgen, Peter; Hamelryck, Thomas

    2012-02-15

    Clustering protein structures is an important task in structural bioinformatics. De novo structure prediction, for example, often involves a clustering step for finding the best prediction. Other applications include assigning proteins to fold families and analyzing molecular dynamics trajectories. We present Pleiades, a novel approach to clustering protein structures with a rigorous mathematical underpinning. The method approximates clustering based on the root mean square deviation by first mapping structures to Gauss integral vectors--which were introduced by Røgen and co-workers--and subsequently performing K-means clustering. Compared to current methods, Pleiades dramatically improves on the time needed to perform clustering, and can cluster a significantly larger number of structures, while providing state-of-the-art results. The number of low energy structures generated in a typical folding study, which is in the order of 50,000 structures, can be clustered within seconds to minutes.
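
    The second stage described above, K-means on fixed-length vectors, can be sketched as a plain Lloyd iteration. This is a generic illustration, not the Pleiades implementation: the Gauss integral mapping is not shown, the function names are hypothetical, and seeding the centroids with the first k vectors is an assumption made to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(vectors, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    # Assumption: seed with the first k vectors (real code would usually
    # use random or k-means++ style seeding).
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = mean_vector(members)
    return labels, centroids
```

    Because each structure is reduced to a vector once, every iteration costs time proportional to the number of structures times k, with no pairwise structure comparisons.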

  5. Cerebellar Functional Parcellation Using Sparse Dictionary Learning Clustering.

    PubMed

    Wang, Changqing; Kipping, Judy; Bao, Chenglong; Ji, Hui; Qiu, Anqi

    2016-01-01

    The human cerebellum has recently been discovered to contribute to cognition and emotion beyond the planning and execution of movement, suggesting its functional heterogeneity. We aimed to identify the functional parcellation of the cerebellum using information from resting-state functional magnetic resonance imaging (rs-fMRI). For this, we introduced a new data-driven decomposition-based functional parcellation algorithm, called Sparse Dictionary Learning Clustering (SDLC). SDLC integrates dictionary learning, sparse representation of rs-fMRI, and k-means clustering into one optimization problem. The dictionary is comprised of an over-complete set of time course signals, with which a sparse representation of rs-fMRI signals can be constructed. Cerebellar functional regions were then identified using k-means clustering based on the sparse representation of rs-fMRI signals. We solved SDLC using a multi-block hybrid proximal alternating method that guarantees strong convergence. We evaluated the reliability of SDLC and benchmarked its classification accuracy against other clustering techniques using simulated data. We then demonstrated that SDLC can identify biologically reasonable functional regions of the cerebellum as estimated by their cerebello-cortical functional connectivity. We further provided new insights into the cerebello-cortical functional organization in children.

  6. A comparative analysis of DBSCAN, K-means, and quadratic variation algorithms for automatic identification of swallows from swallowing accelerometry signals.

    PubMed

    Dudik, Joshua M; Kurosu, Atsuko; Coyle, James L; Sejdić, Ervin

    2015-04-01

    Cervical auscultation with high resolution sensors is currently under consideration as a method of automatically screening for specific swallowing abnormalities. To be clinically useful without human involvement, any devices based on cervical auscultation should be able to detect specified swallowing events in an automatic manner. In this paper, we comparatively analyze the density-based spatial clustering of applications with noise algorithm (DBSCAN), a k-means based algorithm, and an algorithm based on quadratic variation as methods of differentiating periods of swallowing activity from periods of time without swallows. These algorithms utilized swallowing vibration data exclusively and compared the results to a gold standard measure of swallowing duration. Data was collected from 23 subjects that were actively suffering from swallowing difficulties. Comparing the performance of the DBSCAN algorithm with a proven segmentation algorithm that utilizes k-means clustering demonstrated that the DBSCAN algorithm had a higher sensitivity and correctly segmented more swallows. Comparing its performance with a threshold-based algorithm that utilized the quadratic variation of the signal showed that the DBSCAN algorithm offered no direct increase in performance. However, it offered several other benefits including a faster run time and more consistent performance between patients. All algorithms showed noticeable differentiation from the endpoints provided by a videofluoroscopy examination as well as reduced sensitivity. In summary, we showed that the DBSCAN algorithm is a viable method for detecting the occurrence of a swallowing event using cervical auscultation signals, but significant work must be done to improve its performance before it can be implemented in an unsupervised manner. Copyright © 2015 Elsevier Ltd. All rights reserved.

  7. Biased phylodynamic inferences from analysing clusters of viral sequences

    PubMed Central

    Xiang, Fei; Frost, Simon D. W.

    2017-01-01

    Phylogenetic methods are being increasingly used to help understand the transmission dynamics of measurably evolving viruses, including HIV. Clusters of highly similar sequences are often observed, which appear to follow a ‘power law’ behaviour, with a small number of very large clusters. These clusters may help to identify subpopulations in an epidemic, and inform where intervention strategies should be implemented. However, clustering of samples does not necessarily imply the presence of a subpopulation with high transmission rates, as groups of closely related viruses can also occur due to non-epidemiological effects such as over-sampling. It is important to ensure that observed phylogenetic clustering reflects true heterogeneity in the transmitting population, and is not being driven by non-epidemiological effects. We qualify the effect of using a falsely identified ‘transmission cluster’ of sequences to estimate phylodynamic parameters including the effective population size and exponential growth rate under several demographic scenarios. Our simulation studies show that taking the maximum size cluster to re-estimate parameters from trees simulated under a randomly mixing, constant population size coalescent process systematically underestimates the overall effective population size. In addition, the transmission cluster wrongly resembles an exponential or logistic growth model 99% of the time. We also illustrate the consequences of false clusters in exponentially growing coalescent and birth-death trees, where again, the growth rate is skewed upwards. This has clear implications for identifying clusters in large viral databases, where a false cluster could result in wasted intervention resources. PMID:28852573

  8. A Spatiotemporal Clustering Approach to Maritime Domain Awareness

    DTIC Science & Technology

    2013-09-01

    1997. [25] M. E. Celebi, “Effective initialization of k-means for color quantization,” 16th IEEE International Conference on Image Processing (ICIP... Spatiotemporal clustering is the process of grouping...

  9. Dynamic behaviour of nanometre-sized defect clusters emitted from an atomic displacement cascade in Au at 50 K

    NASA Astrophysics Data System (ADS)

    Ono, K.; Miyamoto, M.; Arakawa, K.; Birtcher, R. C.

    2017-09-01

    We demonstrate the emission of nanometre-sized defect clusters from an isolated displacement cascade formed by irradiation of high-energy self-ions and their subsequent 1-D motion in Au at 50 K, using in situ electron microscopy. The small defect clusters emitted from a displacement cascade exhibited correlated back-and-forth 1-D motion along the [-1 1 0] direction and coalescence which results in their growth and reduction of their mobility. From the analysis of the random 1-D motion, the diffusivity of the small cluster was evaluated. Correlated 1-D motion and coalescence of clusters were understood via elastic interaction between small clusters. These results provide direct experimental evidence of the migration of small defect clusters and defect cascade evolution at low temperature.

  10. Inference from clustering with application to gene-expression microarrays.

    PubMed

    Dougherty, Edward R; Barrera, Junior; Brun, Marcel; Kim, Seungchan; Cesar, Roberto M; Chen, Yidong; Bittner, Michael; Trent, Jeffrey M

    2002-01-01

    There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics. A

  11. Mean Comparison: Manifest Variable versus Latent Variable

    ERIC Educational Resources Information Center

    Yuan, Ke-Hai; Bentler, Peter M.

    2006-01-01

    An extension of multiple correspondence analysis is proposed that takes into account cluster-level heterogeneity in respondents' preferences/choices. The method involves combining multiple correspondence analysis and k-means in a unified framework. The former is used for uncovering a low-dimensional space of multivariate categorical variables…

  12. Sequence spaces [Formula: see text] and [Formula: see text] with application in clustering.

    PubMed

    Khan, Mohd Shoaib; Alamri, Badriah As; Mursaleen, M; Lohani, Qm Danish

    2017-01-01

    Distance measures play a central role in evolving the clustering technique. Due to the rich mathematical background and natural implementation of [Formula: see text] distance measures, researchers were motivated to use them in almost every clustering process. Besides [Formula: see text] distance measures, there exist several other distance measures. Sargent introduced a special type of distance measures [Formula: see text] and [Formula: see text] which is closely related to [Formula: see text]. In this paper, we generalized the Sargent sequence spaces through introduction of [Formula: see text] and [Formula: see text] sequence spaces. Moreover, it is shown that both spaces are BK-spaces, and one is a dual of another. Further, we have clustered the two-moon dataset by using an induced [Formula: see text]-distance measure (induced by the Sargent sequence space [Formula: see text]) in the k-means clustering algorithm. The clustering result established the efficacy of replacing the Euclidean distance measure by the [Formula: see text]-distance measure in the k-means algorithm.

  13. "K"-Means May Perform as well as Mixture Model Clustering but May Also Be Much Worse: Comment on Steinley and Brusco (2011)

    ERIC Educational Resources Information Center

    Vermunt, Jeroen K.

    2011-01-01

    Steinley and Brusco (2011) presented the results of a huge simulation study aimed at evaluating cluster recovery of mixture model clustering (MMC) both for the situation where the number of clusters is known and is unknown. They derived rather strong conclusions on the basis of this study, especially with regard to the good performance of…

  14. Usage of K-cluster and factor analysis for grouping and evaluation the quality of olive oil in accordance with physico-chemical parameters

    NASA Astrophysics Data System (ADS)

    Milev, M.; Nikolova, Kr.; Ivanova, Ir.; Dobreva, M.

    2015-11-01

    Twenty-five olive oils, differing in origin and method of extraction, were studied with respect to 17 physico-chemical parameters: the color parameters a and b, light, fluorescence peaks, the pigments chlorophyll and β-carotene, and fatty-acid content. The goals of the current study were: (1) to conduct a correlation analysis to find the inner relation between the studied indices; (2) to apply factor analysis by the method of principal components (PCA) to reduce the large number of variables to a few factors that are of main importance for distinguishing the different types of olive oil; and (3) to use K-means clustering to compare and group the tested olive oils based on their similarity. The inner relation between the studied indices was found by applying correlation analysis. A factor analysis using PCA was applied on the basis of the resulting correlation matrix. The number of studied indices was thus reduced to 4 factors, which explained 79.3% of the total variation. The first factor unified the color parameters, β-carotene, and the fluorescence peak at about 520 nm, which is related to oxidation products. The second was determined mainly by the chlorophyll content and its related fluorescence peak at about 670 nm. The third and fourth factors were determined by the fatty-acid content of the samples. The third unified the fatty acids that make it possible to distinguish olive oil from other plant oils: oleic, linoleic and stearic acids. The fourth factor included fatty acids with a much lower content in the studied samples. K-means cluster analysis requires the number of clusters to be specified in advance. The variant K = 3 was chosen because there were three types of olive oil. The first cluster unified all salad and pomace olive oils; the second unified the samples of extra virgin oils taken as controls from producers, which were bought from the trade network. The third cluster unified samples from

  15. Exploring the individual patterns of spiritual well-being in people newly diagnosed with advanced cancer: a cluster analysis.

    PubMed

    Bai, Mei; Dixon, Jane; Williams, Anna-Leila; Jeon, Sangchoon; Lazenby, Mark; McCorkle, Ruth

    2016-11-01

    Research shows that spiritual well-being correlates positively with quality of life (QOL) for people with cancer, whereas contradictory findings are frequently reported with respect to the differentiated associations between dimensions of spiritual well-being, namely peace, meaning and faith, and QOL. This study aimed to examine individual patterns of spiritual well-being among patients newly diagnosed with advanced cancer. Cluster analysis was based on the twelve items of the 12-item Functional Assessment of Chronic Illness Therapy-Spiritual Well-Being Scale at Time 1. A combination of hierarchical and k-means (non-hierarchical) clustering methods was employed to jointly determine the number of clusters. Self-rated health, depressive symptoms, peace, meaning and faith, and overall QOL were compared at Time 1 and Time 2. Hierarchical and k-means clustering methods both suggested four clusters. Comparison of the four clusters supported statistically significant and clinically meaningful differences in QOL outcomes among clusters while revealing contrasting relations of faith with QOL. Cluster 1, Cluster 3, and Cluster 4 represented high, medium, and low levels of overall QOL, respectively, with correspondingly high, medium, and low levels of peace, meaning, and faith. Cluster 2 was distinguished from other clusters by its medium levels of overall QOL, peace, and meaning and low level of faith. This study provides empirical support for individual difference in response to a newly diagnosed cancer and brings into focus conceptual and methodological challenges associated with the measure of spiritual well-being, which may partly contribute to the attenuated relation between faith and QOL.

  16. Detection of Functional Change Using Cluster Trend Analysis in Glaucoma.

    PubMed

    Gardiner, Stuart K; Mansberger, Steven L; Demirel, Shaban

    2017-05-01

    Global analyses using mean deviation (MD) assess visual field progression, but can miss localized changes. Pointwise analyses are more sensitive to localized progression, but more variable so require confirmation. This study assessed whether cluster trend analysis, averaging information across subsets of locations, could improve progression detection. A total of 133 test-retest eyes were tested 7 to 10 times. Rates of change and P values were calculated for possible re-orderings of these series to generate global analysis ("MD worsening faster than x dB/y with P < y"), pointwise and cluster analyses ("n locations [or clusters] worsening faster than x dB/y with P < y") with specificity exactly 95%. These criteria were applied to 505 eyes tested over a mean of 10.5 years, to find how soon each detected "deterioration," and compared using survival models. This was repeated including two subsequent visual fields to determine whether "deterioration" was confirmed. The best global criterion detected deterioration in 25% of eyes in 5.0 years (95% confidence interval [CI], 4.7-5.3 years), compared with 4.8 years (95% CI, 4.2-5.1) for the best cluster analysis criterion, and 4.1 years (95% CI, 4.0-4.5) for the best pointwise criterion. However, for pointwise analysis, only 38% of these changes were confirmed, compared with 61% for clusters and 76% for MD. The time until 25% of eyes showed subsequently confirmed deterioration was 6.3 years (95% CI, 6.0-7.2) for global, 6.3 years (95% CI, 6.0-7.0) for pointwise, and 6.0 years (95% CI, 5.3-6.6) for cluster analyses. Although the specificity is still suboptimal, cluster trend analysis detects subsequently confirmed deterioration sooner than either global or pointwise analyses.

  17. Stream Clustering of Growing Objects

    NASA Astrophysics Data System (ADS)

    Siddiqui, Zaigham Faraz; Spiliopoulou, Myra

    We study incremental clustering of objects that grow and accumulate over time. The objects come from a multi-table stream, e.g. streams of Customer and Transaction. As the Transactions stream accumulates, the Customers’ profiles grow. First, we use an incremental propositionalisation to convert the multi-table stream into a single-table stream upon which we apply clustering. For this purpose, we develop an online version of the K-Means algorithm that can handle these swelling objects and any new objects that arrive. The algorithm also monitors the quality of the model and performs re-clustering when it deteriorates. We evaluate our method on the PKDD Challenge 1999 dataset.
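
    For orientation, a textbook sequential (MacQueen-style) K-means update is sketched below. It is a simplified stand-in, not the authors' algorithm: it handles only newly arriving points, and the growing objects, propositionalisation, and quality monitoring described above are not modeled. The class name and its methods are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


class OnlineKMeans:
    """Sequential K-means: one centroid update per arriving point."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        # Counts start at zero, so the first point assigned to a cluster
        # replaces that cluster's initial centroid entirely.
        self.counts = [0] * len(self.centroids)

    def update(self, x):
        """Assign x to its nearest centroid and shift that centroid."""
        j = min(
            range(len(self.centroids)),
            key=lambda i: squared_distance(x, self.centroids[i]),
        )
        self.counts[j] += 1
        rate = 1.0 / self.counts[j]  # running-mean learning rate
        self.centroids[j] = [
            c + rate * (xi - c) for c, xi in zip(self.centroids[j], x)
        ]
        return j
```

    With the 1/count learning rate, each centroid equals the running mean of the points assigned to it so far, so no past points need to be stored.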

  18. Alternative Parameterizations for Cluster Editing

    NASA Astrophysics Data System (ADS)

    Komusiewicz, Christian; Uhlmann, Johannes

    Given an undirected graph G and a nonnegative integer k, the NP-hard Cluster Editing problem asks whether G can be transformed into a disjoint union of cliques by applying at most k edge modifications. In the field of parameterized algorithmics, Cluster Editing has almost exclusively been studied parameterized by the solution size k. Contrastingly, in many real-world instances it can be observed that the parameter k is not really small. This observation motivates our investigation of parameterizations of Cluster Editing different from the solution size k. Our results are as follows. Cluster Editing is fixed-parameter tractable with respect to the parameter "size of a minimum cluster vertex deletion set of G", a typically much smaller parameter than k. Cluster Editing remains NP-hard on graphs with maximum degree six. A restricted but practically relevant version of Cluster Editing is fixed-parameter tractable with respect to the combined parameter "number of clusters in the target graph" and "maximum number of modified edges incident to any vertex in G". Many of our results also transfer to the NP-hard Cluster Deletion problem, where only edge deletions are allowed.
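
    The target structure of Cluster Editing, a disjoint union of cliques, can be tested locally: a graph has this form exactly when every vertex shares its closed neighbourhood with each of its neighbours. The sketch below checks that property and counts the edge modifications between two graphs. It does not solve the NP-hard problem itself, and the function names `is_cluster_graph` and `edits_between` are hypothetical.

```python
def is_cluster_graph(adj):
    """Return True if the undirected graph is a disjoint union of cliques.

    adj maps each vertex to the set of its neighbours (no self-loops).
    """
    for v, neighbours in adj.items():
        closed_v = neighbours | {v}
        for u in neighbours:
            # Adjacent vertices of a clique have identical closed neighbourhoods.
            if adj[u] | {u} != closed_v:
                return False
    return True


def edge_set(adj):
    """Set of undirected edges, each stored as a frozenset of two vertices."""
    return {frozenset((u, v)) for u in adj for v in adj[u]}


def edits_between(adj_before, adj_after):
    """Number of edge insertions plus deletions between two graphs."""
    return len(edge_set(adj_before) ^ edge_set(adj_after))


# A path a-b-c is not a cluster graph; adding the edge a-c makes it a triangle.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
```

    Here a single edge modification (k = 1) suffices to turn the path into a cluster graph.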

  19. Study on text mining algorithm for ultrasound examination of chronic liver diseases based on spectral clustering

    NASA Astrophysics Data System (ADS)

    Chang, Bingguo; Chen, Xiaofei

    2018-05-01

    Ultrasonography is an important examination for the diagnosis of chronic liver disease. The doctor reports the liver indicators and assesses the patient's condition according to the description in the ultrasound report. With the rapid increase in the volume of ultrasound report data, the workload of physicians who manually interpret ultrasound results increases significantly. In this paper, we use the spectral clustering method to cluster the descriptions in ultrasound reports and automatically generate the ultrasound diagnosis by machine learning. A total of 110 groups of ultrasound examination reports of chronic liver disease were selected as test samples in this experiment, and the results were validated by spectral clustering and compared with those of the k-means clustering algorithm. The results show that the accuracy of spectral clustering is 92.73%, which is higher than that of the k-means clustering algorithm, providing powerful ultrasound-assisted diagnosis for patients with chronic liver disease.

  20. Energy Aware Cluster-Based Routing in Flying Ad-Hoc Networks.

    PubMed

    Aadil, Farhan; Raza, Ali; Khan, Muhammad Fahad; Maqsood, Muazzam; Mehmood, Irfan; Rho, Seungmin

    2018-05-03

    Flying ad-hoc networks (FANETs) are a very vibrant research area nowadays. They have many military and civil applications. Limited battery energy and the high mobility of micro unmanned aerial vehicles (UAVs) represent their two main problems, i.e., short flight time and inefficient routing. In this paper, we try to address both of these problems by means of efficient clustering. First, we adjust the transmission power of the UAVs by anticipating their operational requirements. An optimal transmission range will have a minimum packet loss ratio (PLR) and better link quality, which ultimately saves the energy consumed during communication. Second, we use a variant of the K-Means Density clustering algorithm for selection of cluster heads. Optimal cluster heads enhance the cluster lifetime and reduce the routing overhead. The proposed model outperforms state-of-the-art artificial intelligence techniques such as the Ant Colony Optimization-based clustering algorithm and the Grey Wolf Optimization-based clustering algorithm. The performance of the proposed algorithm is evaluated in terms of number of clusters, cluster building time, cluster lifetime and energy consumption.

  1. Automatic detection of multiple UXO-like targets using magnetic anomaly inversion and self-adaptive fuzzy c-means clustering

    NASA Astrophysics Data System (ADS)

    Yin, Gang; Zhang, Yingtang; Fan, Hongbo; Ren, Guoquan; Li, Zhining

    2017-12-01

    We have developed a method for automatically detecting UXO-like targets based on magnetic anomaly inversion and self-adaptive fuzzy c-means clustering. Magnetic anomaly inversion methods are used to estimate the initial locations of multiple UXO-like sources. Although these initial locations have some errors with respect to the real positions, they form dense clouds around the actual positions of the magnetic sources. Then we use the self-adaptive fuzzy c-means clustering algorithm to cluster these initial locations. The estimated number of cluster centroids represents the number of targets and the cluster centroids are regarded as the locations of magnetic targets. Effectiveness of the method has been demonstrated using synthetic datasets. Computational results show that the proposed method can be applied to the case of several UXO-like targets that are randomly scattered within a confined, shallow subsurface volume. A field test was carried out to test the validity of the proposed method and the experimental results show that the prearranged magnets can be detected unambiguously and located precisely.

  2. Exploring the nature and synchronicity of early cluster formation in the Large Magellanic Cloud - II. Relative ages and distances for six ancient globular clusters

    NASA Astrophysics Data System (ADS)

    Wagner-Kaiser, R.; Mackey, Dougal; Sarajedini, Ata; Chaboyer, Brian; Cohen, Roger E.; Yang, Soung-Chul; Cummings, Jeffrey D.; Geisler, Doug; Grocholski, Aaron J.

    2017-11-01

    We analyse Hubble Space Telescope observations of six globular clusters in the Large Magellanic Cloud (LMC) from programme GO-14164 in Cycle 23. These are the deepest available observations of the LMC globular cluster population; their uniformity facilitates a precise comparison with globular clusters in the Milky Way. Measuring the magnitude of the main-sequence turn-off point relative to template Galactic globular clusters allows the relative ages of the clusters to be determined with a mean precision of 8.4 per cent, and down to 6 per cent for individual objects. We find that the mean age of our LMC cluster ensemble is identical to the mean age of the oldest metal-poor clusters in the Milky Way halo to 0.2 ± 0.4 Gyr. This provides the most sensitive test to date of the synchronicity of the earliest epoch of globular cluster formation in two independent galaxies. Horizontal branch magnitudes and subdwarf fitting to the main sequence allow us to determine distance estimates for each cluster and examine their geometric distribution in the LMC. Using two different methods, we find an average distance to the LMC of 18.52 ± 0.05.

  3. Cluster analysis of dynamic contrast enhanced MRI reveals tumor subregions related to locoregional relapse for cervical cancer patients.

    PubMed

    Torheim, Turid; Groendahl, Aurora R; Andersen, Erlend K F; Lyng, Heidi; Malinen, Eirik; Kvaal, Knut; Futsaether, Cecilia M

    2016-11-01

    Solid tumors are known to be spatially heterogeneous. Detection of treatment-resistant tumor regions can improve clinical outcome, by enabling implementation of strategies targeting such regions. In this study, K-means clustering was used to group voxels in dynamic contrast enhanced magnetic resonance images (DCE-MRI) of cervical cancers. The aim was to identify clusters reflecting treatment resistance that could be used for targeted radiotherapy with a dose-painting approach. Eighty-one patients with locally advanced cervical cancer underwent DCE-MRI prior to chemoradiotherapy. The resulting image time series were fitted to two pharmacokinetic models, the Tofts model (yielding parameters K^trans and ν_e) and the Brix model (A_Brix, k_ep and k_el). K-means clustering was used to group similar voxels based on either the pharmacokinetic parameter maps or the relative signal increase (RSI) time series. The associations between voxel clusters and treatment outcome (measured as locoregional control) were evaluated using the volume fraction or the spatial distribution of each cluster. One voxel cluster based on the RSI time series was significantly related to locoregional control (adjusted p-value 0.048). This cluster consisted of low-enhancing voxels. We found that tumors with poor prognosis had this RSI-based cluster gathered into few patches, making this cluster a potential candidate for targeted radiotherapy. None of the voxel clusters based on Tofts or Brix parameter maps were significantly related to treatment outcome. We identified one group of tumor voxels significantly associated with locoregional relapse that could potentially be used for dose painting. This tumor voxel cluster was identified using the raw MRI time series rather than the pharmacokinetic maps.

  4. Geology and 40Ar/39Ar geochronology of the medium- to high-K Tanaga volcanic cluster, western Aleutians

    USGS Publications Warehouse

    Jicha, Brian R.; Coombs, Michelle L.; Calvert, Andrew T.; Singer, Brad S.

    2012-01-01

    We used geologic mapping and geochemical data augmented by 40Ar/39Ar dating to establish an eruptive chronology for the Tanaga volcanic cluster in the western Aleutian arc. The Tanaga volcanic cluster is unique in comparison to other central and western Aleutian volcanoes in that it consists of three closely spaced, active, volumetrically significant edifices (Sajaka, Tanaga, and Takawangha), the eruptive products of which have unusually high K2O contents. Thirty-five new 40Ar/39Ar ages obtained in two different laboratories constrain the duration of Pleistocene–Holocene subaerial volcanism to younger than 295 ka. The eruptive activity has been mostly continuous for the last 150 k.y., unlike most other well-characterized arc volcanoes, which tend to grow in discrete pulses. More than half of the analyzed Tanaga volcanic cluster lavas are basalts that have erupted throughout the lifetime of the cluster, although a considerable amount of basaltic andesite and basaltic trachyandesite has also been produced since 200 ka. Major- and trace-element variations suggest that magmas from Sajaka and Tanaga volcanoes are likely to have crystallized pyroxene and/or amphibole at greater depths than the older Takawangha magmas, which experienced a larger percentage of plagioclase-dominated fractionation at shallower depths. Magma output from Takawangha has declined over the last 86 k.y. At ca. 19 ka, the focus of magma flux shifted to the west beneath Tanaga and Sajaka volcanoes, where hotter, more mafic magma erupted.

  5. Zodiacal Exoplanets in Time (ZEIT). V. A Uniform Search for Transiting Planets in Young Clusters Observed by K2

    NASA Astrophysics Data System (ADS)

    Rizzuto, Aaron C.; Mann, Andrew W.; Vanderburg, Andrew; Kraus, Adam L.; Covey, Kevin R.

    2017-12-01

    Detection of transiting exoplanets around young stars is more difficult than for older systems owing to increased stellar variability. Nine young open cluster planets have been found in the K2 data, but no single analysis pipeline identified all planets. We have developed a transit search pipeline for young stars that uses a transit-shaped notch and quadratic continuum in a 12 or 24 hr window to fit both the stellar variability and the presence of a transit. In addition, for the most rapid rotators (P_rot < 2 days) we model the variability using a linear combination of observed rotations of each star. To maximally exploit our new pipeline, we update the membership for four stellar populations observed by K2 (Upper Scorpius, Pleiades, Hyades, Praesepe) and conduct a uniform search of the members. We identify all known transiting exoplanets in the clusters, 17 eclipsing binaries, one transiting planet candidate orbiting a potential Pleiades member, and three orbiting unlikely members of the young clusters. Limited injection recovery testing on the known planet hosts indicates that for the older Praesepe systems we are sensitive to additional exoplanets as small as 1-2 R⊕, and for the larger Upper Scorpius planet host (K2-33) our pipeline is sensitive to ~4 R⊕ transiting planets. The lack of detected multiple systems in the young clusters is consistent with the expected frequency from the original Kepler sample, within our detection limits. With a robust pipeline that detects all known planets in the young clusters, occurrence rate testing at young ages is now possible.

  6. Adsorption of thiophene on silica-supported Mo clusters

    NASA Astrophysics Data System (ADS)

    Komarneni, M.; Kadossov, E.; Justin, J.; Lu, M.; Burghaus, U.

    2010-07-01

    The adsorption/decomposition kinetics/dynamics of thiophene has been studied on silica-supported Mo and MoSx clusters. Two-dimensional cluster formation at small Mo exposures and three-dimensional cluster growth at larger exposures would be consistent with the Auger electron spectroscopy (AES) data. Thermal desorption spectroscopy (TDS) indicates two reaction pathways. H4C4S desorbs molecularly at 190-400 K. Two TDS features were evident and could be assigned to thiophene molecularly adsorbed on Mo sites and on S sites. Assuming a standard preexponential factor (ν = 1 × 10^13/s) for first-order kinetics, the binding energies for adsorption on Mo (sulfur) sites amount to 90 (65) kJ/mol for 0.4 ML Mo exposure and 76 (63) kJ/mol for 2 ML Mo. Thus, smaller clusters are more reactive than larger clusters for molecular adsorption of H4C4S. The second reaction pathway, the decomposition of thiophene, starts at 250 K. Utilizing multimass TDS, H2, H2S, and mostly alkynes are detected in the gas phase as decomposition products. H4C4S bond activation results in partially sulfided Mo clusters as well as S and C residuals on the surface. S and C poison the catalyst. As a result, with an increasing number of H4C4S adsorption/desorption cycles, the uptake of molecular thiophene decreases and the H2 and H2S production ceases. Thus, silica-supported sulfided Mo clusters are less reactive than metallic clusters. The poisoned catalyst can be partially reactivated by annealing in O2. However, Mo oxides also appear to form, which passivate the catalyst further. On the other hand, while annealing a used catalyst in H/H2, it is poisoned even more (i.e., the S AES signal increases). By means of adsorption transients, the initial adsorption probability, S0, of C4H4S has been determined. At thermal impact energies (Ei = 0.04 eV), S0 for molecular adsorption amounts to 0.43 ± 0.03 for a surface temperature of 200 K. S0 increases with Mo cluster size, obeying the

  7. Faradaurate nanomolecules: a superstable plasmonic 76.3 kDa cluster.

    PubMed

    Dass, Amala

    2011-12-07

    Information on the emergence of the characteristic plasmonic optical properties of nanoscale noble-metal particles has been limited, due in part to the problem of preparing homogeneous material for ensemble measurements. Here, we report the identification, isolation, and mass spectrometric and optical characterization of a 76.3 kDa thiolate-protected gold nanoparticle. This giant molecule is far larger than any metal-cluster compound, those with direct metal-to-metal bonding, previously known as homogeneous molecular substances, and is the first to exhibit clear plasmonic properties. The observed plasmon emergence phenomena in nanomolecules are of great interest, and the availability of absolutely homogeneous and characterized samples is thus critical to establishing their origin. © 2011 American Chemical Society

  8. Risk assessment of water pollution sources based on an integrated k-means clustering and set pair analysis method in the region of Shiyan, China.

    PubMed

    Li, Chunhui; Sun, Lian; Jia, Junxiang; Cai, Yanpeng; Wang, Xuan

    2016-07-01

    Source water areas are facing many potential water pollution risks. Risk assessment is an effective method to evaluate such risks. In this paper, an integrated model based on k-means clustering analysis and set pair analysis was established to evaluate the risks associated with water pollution in source water areas, in which the weights of indicators were determined through the entropy weight method. The proposed model was then applied to assess water pollution risks in the region of Shiyan, in which China's key source water area, Danjiangkou Reservoir, the water source of the middle route of the South-to-North Water Diversion Project, is located. The results showed that eleven sources with relatively high risk values were identified. At the regional scale, Shiyan City and Danjiangkou City would have a high risk value in terms of industrial discharge. Comparatively, Danjiangkou City and Yunxian County would have a high risk value in terms of agricultural pollution. Overall, the risk values of the northern regions close to the main stream and reservoir of the region of Shiyan were higher than those in the south. The results of risk level indicated that five sources were in the lower risk level (i.e., level II), two in the moderate risk level (i.e., level III), one in the higher risk level (i.e., level IV) and three in the highest risk level (i.e., level V). The risks of industrial discharge are also higher than those of the agricultural sector. It is thus essential to manage the pillar industry of the region of Shiyan and certain agricultural companies in the vicinity of the reservoir to reduce water pollution risks of source water areas. Copyright © 2016 Elsevier B.V. All rights reserved.
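
    The entropy weight method mentioned in this abstract is commonly formulated as follows: each indicator column is converted to shares p_ij, its normalised Shannon entropy e_j = -(1/ln n) Σ_i p_ij ln p_ij is computed, and the weight is w_j = (1 - e_j) / Σ_j (1 - e_j). The sketch below implements that textbook form. It is not taken from the paper; it assumes non-negative indicator values, a positive sum in every column and at least one non-uniform column, and the function name `entropy_weights` is hypothetical.

```python
import math


def entropy_weights(matrix):
    """Entropy weights for the columns (indicators) of a non-negative matrix.

    matrix: list of rows, one row per assessed object, one column per indicator.
    """
    n = len(matrix)
    m = len(matrix[0])
    diversities = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        # Share of each object in indicator j; 0 * ln(0) is treated as 0.
        shares = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(n)
        # Indicators that vary more across objects carry more information.
        diversities.append(1.0 - entropy)
    total_diversity = sum(diversities)
    return [d / total_diversity for d in diversities]


# Two objects, two indicators: the first indicator is uniform, the second is not.
weights = entropy_weights([[1.0, 1.0], [1.0, 3.0]])
```

    A uniform indicator has maximal entropy and therefore receives (almost) no weight, so in this toy example nearly all of the weight falls on the second indicator.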

  9. WordCluster: detecting clusters of DNA words and genomic elements

    PubMed Central

    2011-01-01

    Background: Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well established examples are the genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds. Results: We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method into a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and the outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome. Conclusions: WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method into a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php including additional features like the detection of co-localization with gene regions or the annotation enrichment tool for functional analysis of overlapped genes. PMID:21261981

  10. Application of Fuzzy c-Means and Joint-Feature-Clustering to Detect Redundancies of Image-Features in Drug Combinations Studies of Breast Cancer

    NASA Astrophysics Data System (ADS)

    Brandl, Miriam B.; Beck, Dominik; Pham, Tuan D.

    2011-06-01

    The high dimensionality of image-based datasets can be a drawback for classification accuracy. In this study, we propose the application of fuzzy c-means clustering, cluster validity indices and the notion of a joint-feature-clustering matrix to find redundancies of image-features. The introduced matrix indicates how frequently features are grouped in a mutual cluster. The resulting information can be used to find data-derived feature prototypes with a common biological meaning, reduce data storage as well as computation times and improve the classification accuracy.

  11. Random phase approximation and cluster mean field studies of hard core Bose Hubbard model

    NASA Astrophysics Data System (ADS)

    Alavani, Bhargav K.; Gaude, Pallavi P.; Pai, Ramesh V.

    2018-04-01

    We investigate zero temperature and finite temperature properties of the Bose Hubbard Model in the hard core limit using Random Phase Approximation (RPA) and Cluster Mean Field Theory (CMFT). We show that our RPA calculations are able to capture quantum and thermal fluctuations significantly better than CMFT.

  12. [Applying the clustering technique for characterising maintenance outsourcing].

    PubMed

    Cruz, Antonio M; Usaquén-Perilla, Sandra P; Vanegas-Pabón, Nidia N; Lopera, Carolina

    2010-06-01

    Clustering techniques were used to characterise companies providing health institutions with maintenance services. The study analysed seven pilot areas' equipment inventory (264 medical devices). Clustering techniques were applied using 26 variables. Response time (RT), operation duration (OD), availability and turnaround time (TAT) were amongst the most significant ones. Average biomedical equipment obsolescence value was 0.78. Four service provider clusters were identified: clusters 1 and 3 had better performance, with lower TAT, RT and DR values (56 % of the providers, coded O, L, C, B, I, S, H, F and G, had TAT values of 1 to 4 days). Cluster 0 had medium performance (38 % of providers, coded V, M, K, Z, T and Y, with an average TAT value of 9.79). Cluster 2 (6 %, provider J) had low performance, with a very high TAT level (101 days on average). The methodology allowed medical equipment inventory and maintenance service suppliers to be characterised. The cluster technique was effective in identifying the most competitive suppliers.

  13. Nearest neighbor-density-based clustering methods for large hyperspectral images

    NASA Astrophysics Data System (ADS)

    Cariou, Claude; Chehdi, Kacem

    2017-10-01

    We address the problem of hyperspectral image (HSI) pixel partitioning using nearest neighbor - density-based (NN-DB) clustering methods. NN-DB methods are able to cluster objects without specifying the number of clusters to be found. Within the NN-DB approach, we focus on deterministic methods, e.g. ModeSeek, knnClust, and GWENN (standing for Graph WatershEd using Nearest Neighbors). These methods only require the availability of a k-nearest neighbor (kNN) graph based on a given distance metric. Recently, a new DB clustering method, called Density Peak Clustering (DPC), has received much attention, and kNN versions of it have quickly followed and shown their efficiency. However, NN-DB methods still suffer from the difficulty of obtaining the kNN graph due to the quadratic complexity with respect to the number of pixels. This is why GWENN was embedded into a multiresolution (MR) scheme to bypass the computation of the full kNN graph over the image pixels. In this communication, we propose to extend the MR-GWENN scheme in three respects. Firstly, similarly to knnClust, the original labeling rule of GWENN is modified to account for local density values, in addition to the labels of previously processed objects. Secondly, we set up a modified NN search procedure within the MR scheme, in order to stabilize the number of clusters found from the coarsest to the finest spatial resolution. Finally, we show that these extensions can be easily adapted to the three other NN-DB methods (ModeSeek, knnClust, knnDPC) for pixel clustering in large HSIs. Experiments are conducted to compare the four NN-DB methods for pixel clustering in HSIs. We show that NN-DB methods can outperform a classical clustering method such as fuzzy c-means (FCM), in terms of classification accuracy, relevance of found clusters, and clustering speed. Finally, we demonstrate the feasibility and evaluate the performances of NN-DB methods on a very large image acquired by our AISA Eagle hyperspectral

  14. Review of methods for handling confounding by cluster and informative cluster size in clustered data

    PubMed Central

    Seaman, Shaun; Pavlou, Menelaos; Copas, Andrew

    2014-01-01

    Clustered data are common in medical research. Typically, one is interested in a regression model for the association between an outcome and covariates. Two complications that can arise when analysing clustered data are informative cluster size (ICS) and confounding by cluster (CBC). ICS and CBC mean that the outcome of a member given its covariates is associated with, respectively, the number of members in the cluster and the covariate values of other members in the cluster. Standard generalised linear mixed models for cluster-specific inference and standard generalised estimating equations for population-average inference assume, in general, the absence of ICS and CBC. Modifications of these approaches have been proposed to account for CBC or ICS. This article is a review of these methods. We express their assumptions in a common format, thus providing greater clarity about the assumptions that methods proposed for handling CBC make about ICS and vice versa, and about when different methods can be used in practice. We report relative efficiencies of methods where available, describe how methods are related, identify a previously unreported equivalence between two key methods, and propose some simple additional methods. Unnecessarily using a method that allows for ICS/CBC has an efficiency cost when ICS and CBC are absent. We review tools for identifying ICS/CBC. A strategy for analysis when CBC and ICS are suspected is demonstrated by examining the association between socio-economic deprivation and preterm neonatal death in Scotland. PMID:25087978

  15. Towards Development of Clustering Applications for Large-Scale Comparative Genotyping and Kinship Analysis Using Y-Short Tandem Repeats.

    PubMed

    Seman, Ali; Sapawi, Azizian Mohd; Salleh, Mohd Zaki

    2015-06-01

    Y-chromosome short tandem repeats (Y-STRs) are genetic markers with practical applications in human identification. However, where mass identification is required (e.g., in the aftermath of disasters with significant fatalities), the efficiency of the process could be improved with new statistical approaches. Clustering applications are relatively new tools for large-scale comparative genotyping, and the k-Approximate Modal Haplotype (k-AMH), an efficient algorithm for clustering large-scale Y-STR data, represents a promising method for developing these tools. In this study we improved the k-AMH and produced three new algorithms: the Nk-AMH I (including a new initial cluster center selection), the Nk-AMH II (including a new dominant weighting value), and the Nk-AMH III (combining I and II). The Nk-AMH III was the superior algorithm, with mean clustering accuracy that increased in four out of six datasets and remained at 100% in the other two. Additionally, the Nk-AMH III achieved a 2% higher overall mean clustering accuracy score than the k-AMH, as well as optimal accuracy for all datasets (0.84-1.00). With inclusion of the two new methods, the Nk-AMH III produced an optimal solution for clustering Y-STR data; thus, the algorithm has potential for further development towards fully automatic clustering of any large-scale genotypic data.

  16. Cataloging the Praesepe Cluster: Identifying Interlopers and Binary Systems

    NASA Astrophysics Data System (ADS)

    Lucey, Madeline R.; Gosnell, Natalie M.; Mann, Andrew; Douglas, Stephanie

    2018-01-01

    We present radial velocity measurements from an ongoing survey of the Praesepe open cluster using the WIYN 3.5m Telescope. Our target stars include 229 early-K to mid-M dwarfs with proper motion memberships that have been observed by the repurposed Kepler mission, K2. With this survey, we will provide a well-constrained membership list of the cluster. By removing interloping stars and determining the cluster binary frequency we can avoid systematic errors in our analysis of the K2 findings and more accurately determine exoplanet properties in the Praesepe cluster. Obtaining accurate exoplanet parameters in open clusters allows us to study the temporal dimension of exoplanet parameter space. We find Praesepe to have a mean radial velocity of 34.09 km/s and a velocity dispersion of 1.13 km/s, which is consistent with previous studies. We derive radial velocity membership probabilities for stars with ≥3 radial velocity measurements and compare against published membership probabilities. We also identify radial velocity variables and potential double-lined spectroscopic binaries. We plan to obtain more observations to determine the radial velocity membership of all the stars in our sample, as well as follow up on radial velocity variables to determine binary orbital solutions.

  17. Application of fuzzy c-means clustering to PRTR chemicals uncovering their release and toxicity characteristics.

    PubMed

    Xue, Mianqiang; Zhou, Liang; Kojima, Naoya; Dos Muchangos, Leticia Sarmento; Machimura, Takashi; Tokai, Akihiro

    2018-05-01

    Increasing manufacture and usage of chemicals have not been matched by the increase in our understanding of their risks. Pollutant release and transfer register (PRTR) is becoming a popular measure for collecting chemical data and enhancing the public right to know. However, these data are usually of high dimensionality, which restricts their wider use. The present study partitions Japanese PRTR chemicals into five fuzzy clusters by fuzzy c-means clustering (FCM) to explore the implicit information. Each chemical belongs to every cluster with a membership degree. Cluster I features high releases from non-listed industries and the household sector and high environmental toxicity. Cluster II is characterized by high reported releases and transfers from 24 listed industries above the threshold, mutagenicity, and high environmental toxicity. Chemicals in cluster III have characteristics of high releases from non-listed industries and low toxicity. Cluster IV is characterized by high reported releases and transfers from 24 listed industries above the threshold and extremely high environmental toxicity. Cluster V is characterized by low releases yet mutagenicity and high carcinogenicity. Chemicals with the highest membership degree were identified as representatives for each cluster. For the highest membership degree, half of the chemicals have a value higher than 0.74. If we look at both the highest and the second highest membership degrees simultaneously, about 94% of the chemicals have a value higher than 0.5. FCM can serve as an approach to uncover the implicit information of highly complex chemical datasets, which subsequently supports the strategy development for efficient and effective chemical management. Copyright © 2017 Elsevier B.V. All rights reserved.
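
    The membership degrees discussed in this abstract come from the standard fuzzy c-means iteration, which alternates two updates: memberships u_ki proportional to d_ki^(-2/(m-1)) and normalised to sum to 1 for each point, and centres computed as means weighted by u_ki^m. The sketch below is a minimal pure-Python version of that textbook iteration, not the study's implementation. The function name `fuzzy_c_means`, the fixed iteration count, the fuzzifier m = 2 and the toy data are all assumptions made for illustration.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iterations=50, eps=1e-12):
    """Alternate the standard FCM membership and centre updates.

    points:  list of coordinate tuples
    centers: initial cluster centres (list of coordinate tuples)
    m:       fuzzifier (> 1)
    Returns (centers, memberships); memberships[k][i] is the degree of
    point k in cluster i, as computed in the last membership step.
    """
    centers = [list(c) for c in centers]
    power = 2.0 / (m - 1.0)
    dims = len(points[0])
    memberships = []
    for _ in range(iterations):
        # Membership update: inverse-distance weights, normalised per point.
        memberships = []
        for x in points:
            inv = [1.0 / (math.dist(x, c) + eps) ** power for c in centers]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Centre update: mean of the points weighted by membership ** m.
        for i in range(len(centers)):
            weights = [u[i] ** m for u in memberships]
            wsum = sum(weights)
            centers[i] = [sum(w * x[d] for w, x in zip(weights, points)) / wsum
                          for d in range(dims)]
    return centers, memberships


data = [(0.0,), (1.0,), (9.0,), (10.0,)]
final_centers, degrees = fuzzy_c_means(data, centers=[(0.0,), (10.0,)])
```

    Unlike k-means, every point keeps a nonzero degree in every cluster, so a representative item for a cluster can be chosen as the one with the highest membership degree.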

  18. Dynamical transitions in large systems of mean field-coupled Landau-Stuart oscillators: Extensive chaos and cluster states.

    PubMed

    Ku, Wai Lim; Girvan, Michelle; Ott, Edward

    2015-12-01

    In this paper, we study dynamical systems in which a large number N of identical Landau-Stuart oscillators are globally coupled via a mean-field. Previously, it has been observed that this type of system can exhibit a variety of different dynamical behaviors. These behaviors include time periodic cluster states in which each oscillator is in one of a small number of groups for which all oscillators in each group have the same state which is different from group to group, as well as a behavior in which all oscillators have different states and the macroscopic dynamics of the mean field is chaotic. We argue that this second type of behavior is "extensive" in the sense that the chaotic attractor in the full phase space of the system has a fractal dimension that scales linearly with N and that the number of positive Lyapunov exponents of the attractor also scales linearly with N. An important focus of this paper is the transition between cluster states and extensive chaos as the system is subjected to slow adiabatic parameter change. We observe discontinuous transitions between the cluster states (which correspond to low dimensional dynamics) and the extensively chaotic states. Furthermore, examining the cluster state, as the system approaches the discontinuous transition to extensive chaos, we find that the oscillator population distribution between the clusters continually evolves so that the cluster state is always marginally stable. This behavior is used to reveal the mechanism of the discontinuous transition. We also apply the Kaplan-Yorke formula to study the fractal structure of the extensively chaotic attractors.
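
    The Kaplan-Yorke formula referred to in this abstract estimates the fractal dimension of an attractor from its Lyapunov exponents sorted in decreasing order: D_KY = j + (λ_1 + ... + λ_j) / |λ_{j+1}|, where j is the largest index for which the partial sum is non-negative. The sketch below evaluates that formula for a given spectrum. It does not compute the exponents themselves, and the function name `kaplan_yorke_dimension` and the example spectrum are assumptions for illustration.

```python
def kaplan_yorke_dimension(exponents):
    """Kaplan-Yorke (Lyapunov) dimension from a Lyapunov spectrum."""
    lams = sorted(exponents, reverse=True)
    partial = 0.0
    for j, lam in enumerate(lams):
        if partial + lam < 0:
            # The first j exponents have a non-negative sum;
            # interpolate linearly into the next exponent.
            return j + partial / abs(lam)
        partial += lam
    # The sum never turns negative: the estimate saturates at the
    # number of exponents (the phase-space dimension).
    return float(len(lams))


# Hypothetical spectrum: one positive, one zero and one negative exponent.
dimension = kaplan_yorke_dimension([0.5, 0.0, -1.0])
```

    For an extensively chaotic system, the number of positive exponents grows in proportion to N, so this estimate grows linearly with N as well.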

  19. Dynamical transitions in large systems of mean field-coupled Landau-Stuart oscillators: Extensive chaos and cluster states

    NASA Astrophysics Data System (ADS)

    Ku, Wai Lim; Girvan, Michelle; Ott, Edward

    2015-12-01

    In this paper, we study dynamical systems in which a large number N of identical Landau-Stuart oscillators are globally coupled via a mean-field. Previously, it has been observed that this type of system can exhibit a variety of different dynamical behaviors. These behaviors include time periodic cluster states in which each oscillator is in one of a small number of groups for which all oscillators in each group have the same state which is different from group to group, as well as a behavior in which all oscillators have different states and the macroscopic dynamics of the mean field is chaotic. We argue that this second type of behavior is "extensive" in the sense that the chaotic attractor in the full phase space of the system has a fractal dimension that scales linearly with N and that the number of positive Lyapunov exponents of the attractor also scales linearly with N. An important focus of this paper is the transition between cluster states and extensive chaos as the system is subjected to slow adiabatic parameter change. We observe discontinuous transitions between the cluster states (which correspond to low dimensional dynamics) and the extensively chaotic states. Furthermore, examining the cluster state, as the system approaches the discontinuous transition to extensive chaos, we find that the oscillator population distribution between the clusters continually evolves so that the cluster state is always marginally stable. This behavior is used to reveal the mechanism of the discontinuous transition. We also apply the Kaplan-Yorke formula to study the fractal structure of the extensively chaotic attractors.

  20. A roadmap of clustering algorithms: finding a match for a biomedical application.

    PubMed

    Andreopoulos, Bill; An, Aijun; Wang, Xiaogang; Schroeder, Michael

    2009-05-01

    Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.

  1. Phenetic Comparison of Prokaryotic Genomes Using k-mers

    PubMed Central

    Déraspe, Maxime; Raymond, Frédéric; Boisvert, Sébastien; Culley, Alexander; Roy, Paul H.; Laviolette, François; Corbeil, Jacques

    2017-01-01

    Bacterial genomics studies are getting more extensive and complex, requiring new ways to envision analyses. Using the Ray Surveyor software, we demonstrate that comparison of genomes based on their k-mer content allows reconstruction of phenetic trees without the need for prior data curation, such as core genome alignment of a species. We validated the methodology using simulated genomes and previously published phylogenomic studies of Streptococcus pneumoniae and Pseudomonas aeruginosa. We also investigated the relationship of specific genetic determinants with bacterial population structures. By comparing clusters from the complete genomic content of a genome population with clusters from specific functional categories of genes, we can determine how the population structures are correlated. Indeed, the strain clustering based on a subset of k-mers allows determination of its similarity with the whole genome clusters. We also applied this methodology to 42 species of bacteria to determine the correlational significance of five important bacterial genomic characteristics. For example, intrinsic resistance is more important in P. aeruginosa than in S. pneumoniae, and the former has increased correlation of its population structure with antibiotic resistance genes. The global view of the pangenome of bacteria also demonstrated the taxa-dependent interaction of population structure with antibiotic resistance, bacteriophage, plasmid, and mobile element k-mer data sets. PMID:28957508

  2. Photometric light curves for seven rapidly-rotating K dwarfs in the Pleiades and Alpha Persei clusters

    NASA Technical Reports Server (NTRS)

    Stauffer, John R.; Schild, Rudolph A.; Baliunas, Sallie L.; Africano, John L.

    1987-01-01

    Light curves and period estimates were obtained for several Pleiades and Alpha Persei cluster K dwarfs which were identified as rapid rotators in earlier spectroscopic studies. A few of the stars have previously published light curves, making it possible to study the long-term variability of the light-curve shapes. The general cause of the photometric variability observed for these stars is an asymmetric distribution of photospheric inhomogeneities (starspots). The presence of these inhomogeneities, combined with the rotation of the star, leads to the light curves observed. The photometric periods derived are thus identified with the rotation period of the star, making it possible to estimate equatorial rotational velocities for these K dwarfs. These data are of particular importance because the clusters are sufficiently young that stars of this mass should have just arrived on the main sequence. These data could be used to estimate the temperatures and sizes of the spot groups necessary to produce the observed light curves for these stars.

  3. A robust fuzzy local Information c-means clustering algorithm with noise detection

    NASA Astrophysics Data System (ADS)

    Shang, Jiayu; Li, Shiren; Huang, Junwei

    2018-04-01

    Fuzzy c-means clustering (FCM), especially with spatial constraints (FCM_S), is an effective algorithm suitable for image segmentation. Its reliability stems not only from representing the fuzziness of each pixel's membership but also from exploiting spatial contextual information. However, these algorithms still have problems when processing noisy images: they are sensitive to parameters that have to be tuned according to prior knowledge of the noise. In this paper, we propose a new FCM algorithm that combines gray-level constraints and spatial constraints, called the spatial and gray-level denoised fuzzy c-means (SGDFCM) algorithm. This new algorithm overcomes the parameter disadvantages mentioned above by considering the possibility that each pixel is noise, which aims to improve robustness and retain more detailed information. Furthermore, the possibility of noise can be calculated in advance, which means the algorithm is effective and efficient.

  4. Insights into quasar UV spectra using unsupervised clustering analysis

    NASA Astrophysics Data System (ADS)

    Tammour, A.; Gallagher, S. C.; Daley, M.; Richards, G. T.

    2016-06-01

    Machine learning techniques can provide powerful tools to detect patterns in multidimensional parameter space. We use K-means - a simple yet powerful unsupervised clustering algorithm which picks out structure in unlabelled data - to study a sample of quasar UV spectra from the Quasar Catalog of the 10th Data Release of the Sloan Digital Sky Survey (SDSS-DR10) of Paris et al. Detecting patterns in large data sets helps us gain insights into the physical conditions and processes giving rise to the observed properties of quasars. We use K-means to find clusters in the parameter space of the equivalent width (EW), the blue- and red-half-width at half-maximum (HWHM) of the Mg II 2800 Å line, the C IV 1549 Å line, and the C III] 1908 Å blend in samples of broad absorption line (BAL) and non-BAL quasars at redshift 1.6-2.1. Using this method, we successfully recover correlations well-known in the UV regime such as the anti-correlation between the EW and blueshift of the C IV emission line and the shape of the ionizing spectral energy distribution (SED) probed by the strength of He II and the Si III]/C III] ratio. We find this to be particularly evident when the properties of C III] are used to find the clusters, while those of Mg II proved to be less strongly correlated with the properties of the other lines in the spectra such as the width of C IV or the Si III]/C III] ratio. We conclude that unsupervised clustering methods (such as K-means) are powerful methods for finding 'natural' binning boundaries in multidimensional data sets and discuss caveats and future work.
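
    For readers unfamiliar with the method, a minimal K-means sketch (Lloyd's algorithm) is given here. It is a generic illustration rather than the pipeline used in the study; the function name, the random initialisation and the toy two-feature points (standing in for measurements such as EW and HWHM) are assumptions.

```python
import math
import random


def k_means(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


# Hypothetical two-feature measurements forming two groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

    In practice the features would normally be standardised first, because K-means relies on Euclidean distance and is sensitive to the scale of each axis.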

  5. Improving estimation of kinetic parameters in dynamic force spectroscopy using cluster analysis

    NASA Astrophysics Data System (ADS)

    Yen, Chi-Fu; Sivasankar, Sanjeevi

    2018-03-01

    Dynamic Force Spectroscopy (DFS) is a widely used technique to characterize the dissociation kinetics and interaction energy landscape of receptor-ligand complexes with single-molecule resolution. In an Atomic Force Microscope (AFM)-based DFS experiment, receptor-ligand complexes, sandwiched between an AFM tip and substrate, are ruptured at different stress rates by varying the speed at which the AFM tip and substrate are pulled away from each other. The rupture events are grouped according to their pulling speeds, and the mean force and loading rate of each group are calculated. These data are subsequently fit to established models, and energy landscape parameters such as the intrinsic off-rate (koff) and the width of the potential energy barrier (xβ) are extracted. However, due to large uncertainties in determining mean forces and loading rates of the groups, errors in the estimated koff and xβ can be substantial. Here, we demonstrate that the accuracy of fitted parameters in a DFS experiment can be dramatically improved by sorting rupture events into groups using cluster analysis instead of sorting them according to their pulling speeds. We test different clustering algorithms including Gaussian mixture, logistic regression, and K-means clustering, under conditions that closely mimic DFS experiments. Using Monte Carlo simulations, we benchmark the performance of these clustering algorithms over a wide range of koff and xβ, under different levels of thermal noise, and as a function of both the number of unbinding events and the number of pulling speeds. Our results demonstrate that cluster analysis, particularly K-means clustering, is very effective in improving the accuracy of parameter estimation, particularly when the number of unbinding events is limited and the events are not well separated into distinct groups. Cluster analysis is easy to implement, and our performance benchmarks serve as a guide in choosing an appropriate method for DFS data analysis.

  6. Structural motifs of pre-nucleation clusters.

    PubMed

    Zhang, Y; Türkmen, I R; Wassermann, B; Erko, A; Rühl, E

    2013-10-07

    Structural motifs of pre-nucleation clusters prepared in single, optically levitated supersaturated aqueous aerosol microparticles containing CaBr2 as a model system are reported. Cluster formation is identified by means of X-ray absorption in the Br K-edge regime. The salt concentration beyond the saturation point is varied by controlling the humidity in the ambient atmosphere surrounding the 15-30 μm microdroplets. This leads to the formation of metastable supersaturated liquid particles. Distinct spectral shifts in near-edge spectra as a function of salt concentration are observed, in which the energy position of the Br K-edge is red-shifted by up to 7.1 ± 0.4 eV if the dilute solution is compared to the solid. The K-edge positions of supersaturated solutions are found between these limits. The changes in electronic structure are rationalized in terms of the formation of pre-nucleation clusters. This assumption is verified by spectral simulations using first-principle density functional theory and molecular dynamics calculations, in which structural motifs are considered, explaining the experimental results. These consist of solvated CaBr2 moieties, rather than building blocks forming calcium bromide hexahydrates, the crystal system that is formed by drying aqueous CaBr2 solutions.

  7. Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis.

    PubMed

    Liao, Minlei; Li, Yunfeng; Kianifard, Farid; Obi, Engels; Arcona, Stephen

    2016-03-02

    Cluster analysis (CA) is a frequently used applied statistical technique that helps to reveal hidden structures and "clusters" found in large data sets. However, this method has not been widely used in large healthcare claims databases where the distribution of expenditure data is commonly severely skewed. The purpose of this study was to identify cost change patterns of patients with end-stage renal disease (ESRD) who initiated hemodialysis (HD) by applying different clustering methods. A retrospective, cross-sectional, observational study was conducted using the Truven Health MarketScan® Research Databases. Patients aged ≥18 years with ≥2 ESRD diagnoses who initiated HD between 2008 and 2010 were included. The K-means CA method and hierarchical CA with various linkage methods were applied to all-cause costs within baseline (12-months pre-HD) and follow-up periods (12-months post-HD) to identify clusters. Demographic, clinical, and cost information was extracted from both periods, and then examined by cluster. A total of 18,380 patients were identified. Meaningful all-cause cost clusters were generated using K-means CA and hierarchical CA with either flexible beta or Ward's methods. Based on cluster sample sizes and change of cost patterns, the K-means CA method and 4 clusters were selected: Cluster 1: Average to High (n = 113); Cluster 2: Very High to High (n = 89); Cluster 3: Average to Average (n = 16,624); or Cluster 4: Increasing Costs, High at Both Points (n = 1554). Median cost changes in the 12-month pre-HD and post-HD periods increased from $185,070 to $884,605 for Cluster 1 (Average to High), decreased from $910,930 to $157,997 for Cluster 2 (Very High to High), were relatively stable and remained low from $15,168 to $13,026 for Cluster 3 (Average to Average), and increased from $57,909 to $193,140 for Cluster 4 (Increasing Costs, High at Both Points). Relatively stable costs after starting HD were associated with more stable scores

  8. An Effective Approach for Clustering InhA Molecular Dynamics Trajectory Using Substrate-Binding Cavity Features.

    PubMed

    De Paris, Renata; Quevedo, Christian V; Ruiz, Duncan D A; Norberto de Souza, Osmar

    2015-01-01

    Protein receptor conformations, obtained from molecular dynamics (MD) simulations, have become a promising way to treat receptor flexibility explicitly in molecular docking experiments applied to drug discovery and development. However, incorporating the entire ensemble of MD conformations in docking experiments to screen large candidate compound libraries is currently an unfeasible task. Clustering algorithms have been widely used as a means to reduce such ensembles to a manageable size. Most studies investigate different algorithms using pairwise Root-Mean Square Deviation (RMSD) values for all, or part of, the MD conformations. Nevertheless, RMSD alone may not be the most appropriate gauge to cluster conformations when the target receptor has a plastic active site, since it is influenced by changes that occur in other parts of the structure. Hence, we have applied two partitioning methods (k-means and k-medoids) and four agglomerative hierarchical methods (Complete linkage, Ward's, Unweighted Pair Group Method and Weighted Pair Group Method) to analyze and compare the quality of partitions between a data set composed of properties from an enzyme receptor substrate-binding cavity and two data sets created using different RMSD approaches. Ensembles of representative MD conformations were generated by selecting a medoid of each group from all partitions analyzed. We investigated the performance of our new method for evaluating the binding conformation of drug candidates to the InhA enzyme through cross-docking experiments between a 20 ns MD trajectory and 20 different ligands. Statistical analyses showed that the novel ensemble, which is represented by only 0.48% of the MD conformations, was able to reproduce 75% of all dynamic behaviors within the binding cavity for the docking experiments performed. Moreover, this new approach not only outperforms the other two RMSD-clustering solutions, but it also shows to be a promising strategy to distill
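
    Selecting a medoid per group, as described in this record, can be sketched as follows. This is an assumed, generic formulation (the member with the smallest summed distance to the other members of its cluster), not the authors' code; the function name and the toy distance matrix are hypothetical.

```python
def cluster_medoids(distances, labels):
    """Map each cluster label to the index of its medoid.

    distances is a symmetric matrix (list of lists) of pairwise distances,
    and labels[i] is the cluster assigned to item i.
    """
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)
    medoids = {}
    for label, indices in members.items():
        # The medoid minimises the summed distance to the members of its cluster.
        medoids[label] = min(
            indices, key=lambda i: sum(distances[i][j] for j in indices)
        )
    return medoids


# Hypothetical 1-D "conformations"; the distance is the absolute difference.
positions = [0.0, 1.0, 2.0, 10.0, 11.0, 15.0]
distances = [[abs(a - b) for b in positions] for a in positions]
print(cluster_medoids(distances, [0, 0, 0, 1, 1, 1]))
```

    Unlike a centroid, a medoid is always one of the original items, so each representative corresponds to an actual MD snapshot that can be passed to a docking run.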

  9. Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering

    PubMed Central

    2010-01-01

    Background: Cluster analysis, and in particular hierarchical clustering, is widely used to extract information from gene expression data. The aim is to discover new classes, or sub-classes, of either individuals or genes. Performing a cluster analysis commonly involves decisions on how to handle missing values, standardize the data and select genes. In addition, pre-processing, involving various types of filtration and normalization procedures, can have an effect on the ability to discover biologically relevant classes. Here we consider cluster analysis in a broad sense and perform a comprehensive evaluation that covers several aspects of cluster analyses, including normalization. Results: We evaluated 2780 cluster analysis methods on seven publicly available 2-channel microarray data sets with common reference designs. Each cluster analysis method differed in data normalization (5 normalizations were considered), missing value imputation (2), standardization of data (2), gene selection (19) or clustering method (11). The cluster analyses are evaluated using known classes, such as cancer types, and the adjusted Rand index. The performances of the different analyses vary between the data sets and it is difficult to give general recommendations. However, normalization, gene selection and clustering method are all variables that have a significant impact on the performance. In particular, gene selection is important and it is generally necessary to include a relatively large number of genes in order to get good performance. Selecting genes with high standard deviation or using principal component analysis are shown to be the preferred gene selection methods. Hierarchical clustering using Ward's method, k-means clustering and Mclust are the clustering methods considered in this paper that achieve the highest adjusted Rand index. Normalization can have a significant positive impact on the ability to cluster individuals, and there are indications that background correction is
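
    The adjusted Rand index used here to score a clustering against known classes can be computed from pair counts in the contingency table. The following is a minimal standard-library sketch of the usual formula, not the evaluation code of the study; the function name is an assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of the same length (at least 2 items)")
    # Pairs of items that are grouped together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs grouped together in each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Cluster names do not matter, only which items are grouped together.
print(adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"]))
```

    A value of 1 means the clustering reproduces the known classes exactly, values near 0 are what random labelings give, and negative values indicate agreement worse than chance.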

  10. [Plaque segmentation of intracoronary optical coherence tomography images based on K-means and improved random walk algorithm].

    PubMed

    Wang, Guanglei; Wang, Pengyu; Han, Yechen; Liu, Xiuling; Li, Yan; Lu, Qian

    2017-06-01

    In recent years, optical coherence tomography (OCT) has developed into a popular coronary imaging technology at home and abroad. The segmentation of plaque regions in coronary OCT images has great significance for vulnerable plaque recognition and research. In this paper, a new algorithm based on K-means clustering and an improved random walk is proposed, and semi-automated segmentation of calcified plaque, fibrotic plaque and lipid pool is achieved. The weight function of the random walk is improved: the distance between the edges of pixels in the image and the seed points is added to the definition of the weight function, which increases the weak edge weights and prevents over-segmentation. Based on the above methods, the OCT images of 9 coronary atherosclerotic patients were selected for plaque segmentation. Comparison of the doctors' manual segmentation results with the results of this method showed that the method has good robustness and accuracy. It is hoped that this method can be helpful for the clinical diagnosis of coronary heart disease.

  11. Photometry and spectroscopy in the open cluster Alpha Persei, 2

    NASA Technical Reports Server (NTRS)

    Prosser, Charles F.

    1993-01-01

    Results from a combination of new spectroscopic and photometric observations in the lower main-sequence and pre-main sequence of the open cluster alpha Persei are presented. New echelle spectroscopy has provided radial and rotational velocity information for thirteen candidate members, three of which are nonmembers based on radial velocity, absence of a Li 6707 Å feature, and absence of H-alpha emission. A set of revised rotational velocity estimates for several slowly rotating candidates identified earlier is given, yielding rotational velocities as low as 7 km/s for two apparent cluster members. VRI photometry for several pre-main sequence members is given; the new (V, V-I_K) photometry yields a more clearly defined pre-main sequence. A list of approximately 43 new faint candidate members based on the (V, V-I_K) CCD photometry is presented in an effort to identify additional cluster members at very low masses. Low-dispersion spectra obtained for several of these candidates provide in some cases supporting evidence for cluster membership. The single brown dwarf candidate in this cluster is for the first time placed in a color-magnitude diagram with other cluster members, providing a better means for establishing its true status. Stars from among the list of new photometric candidates may provide the means for establishing a sequence of cluster members down to very faint magnitudes (V approximately 21) and consequently very low masses. New coordinate determinations for previous candidate members and finding charts for the new photometric candidates are provided in appendices.

  12. Exploratory Item Classification Via Spectral Graph Clustering

    PubMed Central

    Chen, Yunxiao; Li, Xiaoou; Liu, Jingchen; Xu, Gongjun; Ying, Zhiliang

    2017-01-01

    Large-scale assessments are supported by a large item pool. An important task in test development is to assign items into scales that measure different characteristics of individuals, and a popular approach is cluster analysis of items. Classical methods in cluster analysis, such as the hierarchical clustering, K-means method, and latent-class analysis, often induce a high computational overhead and have difficulty handling missing data, especially in the presence of high-dimensional responses. In this article, the authors propose a spectral clustering algorithm for exploratory item cluster analysis. The method is computationally efficient, effective for data with missing or incomplete responses, easy to implement, and often outperforms traditional clustering algorithms in the context of high dimensionality. The spectral clustering algorithm is based on graph theory, a branch of mathematics that studies the properties of graphs. The algorithm first constructs a graph of items, characterizing the similarity structure among items. It then extracts item clusters based on the graphical structure, grouping similar items together. The proposed method is evaluated through simulations and an application to the revised Eysenck Personality Questionnaire. PMID:29033476

  13. Cluster analyses of association of weather, daily factors and emergent medical conditions.

    PubMed

    Malkić, Jasmin; Sarajlić, Nermin; Smrke, Barbara U R; Smrke, Dragica

    2013-03-01

    The goal of this study was to evaluate associations between meteorological conditions and the number of emergency cases for five distinct groups of dispatch causes reported to the SOS dispatch centre in Uppsala, Sweden. The centre's responsibilities include alerting 17 ambulances across Uppsala County, an area of 8,209 km2 with around 320,000 inhabitants, who represent the target patient group. The source of the medical data for this study is the database of dispatch data for the year 2009, while the meteorological data were provided by the yearly weather report of the Uppsala University Department of Earth Sciences. Medical and meteorological data were merged into a unified data space in which each point represents a day with its weather parameters and the cardinality of the dispatch cause group. The DBSCAN data mining algorithm was applied to the five distinct groups of dispatch causes after the data spaces had gone through variance adjustment and principal component analysis. As a result, several point clusters were discovered in each of the examined data spaces, indicating distinctive conditions regarding the weather and the daily cardinality of the dispatch cause, as well as associations between the two. The most interesting finding is that a specific type of winter weather formed a cluster only around the days with a high count of breathing difficulties, while one of the summer weather clusters showed a similar association with the days with a low number of cases. The findings were confirmed by confidence level estimation based on the signal-to-noise ratio for the observed data points.
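
    DBSCAN, the density-based algorithm named in this record, groups points that have enough neighbours within a radius and marks the rest as noise. The sketch below is a minimal textbook-style version and only covers the clustering step (not the variance adjustment or principal component analysis); the function name, the parameter values and the toy points are assumptions.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # Neighbourhoods include the point itself (distance 0 <= eps).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Hypothetical "days" described by two features, plus one isolated outlier.
days = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(days, eps=1.5, min_pts=3))
```

    Unlike K-means, the number of clusters is not fixed in advance; it follows from the density parameters `eps` and `min_pts`.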

  14. Dynamical transitions in large systems of mean field-coupled Landau-Stuart oscillators: Extensive chaos and cluster states

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ku, Wai Lim; Girvan, Michelle; Ott, Edward

    In this paper, we study dynamical systems in which a large number N of identical Landau-Stuart oscillators are globally coupled via a mean-field. Previously, it has been observed that this type of system can exhibit a variety of different dynamical behaviors. These behaviors include time periodic cluster states in which each oscillator is in one of a small number of groups for which all oscillators in each group have the same state which is different from group to group, as well as a behavior in which all oscillators have different states and the macroscopic dynamics of the mean field is chaotic. We argue that this second type of behavior is “extensive” in the sense that the chaotic attractor in the full phase space of the system has a fractal dimension that scales linearly with N and that the number of positive Lyapunov exponents of the attractor also scales linearly with N. An important focus of this paper is the transition between cluster states and extensive chaos as the system is subjected to slow adiabatic parameter change. We observe discontinuous transitions between the cluster states (which correspond to low dimensional dynamics) and the extensively chaotic states. Furthermore, examining the cluster state, as the system approaches the discontinuous transition to extensive chaos, we find that the oscillator population distribution between the clusters continually evolves so that the cluster state is always marginally stable. This behavior is used to reveal the mechanism of the discontinuous transition. We also apply the Kaplan-Yorke formula to study the fractal structure of the extensively chaotic attractors.

  15. Integrative analyses of conserved WNT clusters and their co-operative behaviour in human breast cancer

    PubMed Central

    Qurrat-ul-Ain; Seemab, Umair; Nawaz, Sulaman; Rashid, Sajid

    2011-01-01

    In humans, WNT gene clusters are highly conserved at the species level and associated with carcinogenesis. Among them, the WNT-10A and WNT-6 genes clustered in chromosome 2q35 are homologous to WNT-10B and WNT-1 located in chromosome 12q13, respectively. In an attempt to study co-regulation, the coordinated expression of these genes was monitored in human breast cancer tissues. As compared to normal tissue, both WNT-10A and WNT-10B genes exhibited lower expression while WNT-6 and WNT-1 showed increased expression in breast cancer tissues. The co-expression pattern was elaborated by detailed phylogenetic and syntenic analyses. Moreover, the intergenic and intragenic regions for these gene clusters were analyzed to study the transcriptional regulation. In this context, adequate conserved binding sites for the SOX and TCF families of transcription factors were observed. We propose that SOX9 and TCF4 may compete for binding at the promoters of WNT family genes, thus regulating the disease phenotype. PMID:22355234

  16. Chaotic map clustering algorithm for EEG analysis

    NASA Astrophysics Data System (ADS)

    Bellotti, R.; De Carlo, F.; Stramaglia, S.

    2004-03-01

    The non-parametric chaotic map clustering algorithm has been applied to the analysis of electroencephalographic signals in order to recognize Huntington's disease, one of the most dangerous pathologies of the central nervous system. The performance of the method has been compared with that obtained through parametric algorithms, such as K-means and deterministic annealing, and a supervised multi-layer perceptron. While supervised neural networks need a training phase, performed by means of data tagged by the genetic test, and the parametric methods require a prior choice of the number of classes to find, chaotic map clustering gives natural evidence of the pathological class without any training or supervision, thus providing a new efficient methodology for the recognition of patterns affected by Huntington's disease.

  17. Path integral Monte Carlo study on the structure and absorption spectra of alkali atoms (Li, Na, K) attached to superfluid helium clusters

    NASA Astrophysics Data System (ADS)

    Nakayama, Akira; Yamashita, Koichi

    2001-01-01

    Path integral Monte Carlo calculations have been performed to investigate the microscopic structure and thermodynamic properties of the Ak·He_N (Ak = Li, Na, K; N ≤ 300) clusters at T = 0.5 K. Absorption spectra which correspond to the ²P ← ²S transitions of alkali atoms are also calculated within a pairwise additive model, which employs diatomic Ak-He potential energy curves. The size dependences of the cluster structure and absorption spectra that show the influence of the helium cluster environment are examined in detail. It is found that alkali atoms are trapped in a dimple on the helium cluster's surface and that, from the asymptotic behavior, the Ak·He_300 cluster, at least semiquantitatively, mimics the local structure of experimentally produced large helium clusters in the vicinity of alkali atoms. We have successfully reproduced the overall shapes of the spectra and explained their features from a static and structural point of view. The positions, relative intensities, and line widths of the absorption maxima are calculated to be in moderate agreement with experiments [F. Stienkemeier, J. Higgins, C. Callegari, S. I. Kanorsky, W. E. Ernst, and G. Scoles, Z. Phys. D 38, 253 (1996)].

  18. The composite sequential clustering technique for analysis of multispectral scanner data

    NASA Technical Reports Server (NTRS)

    Su, M. Y.

    1972-01-01

    The clustering technique consists of two parts: (1) a sequential statistical clustering which is essentially a sequential variance analysis, and (2) a generalized K-means clustering. In this composite clustering technique, the output of (1) is a set of initial clusters which are input to (2) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by traditional supervised maximum likelihood classification techniques. The mathematical algorithms for the composite sequential clustering program and a detailed computer program description with job setup are given.
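
    The two-stage idea in this record (a sequential pass that produces initial clusters, followed by iterative K-means refinement) can be sketched as follows. The sequential stage shown here is a simple one-pass "leader" rule with a distance threshold, used purely as a stand-in: it is not the sequential variance analysis of the report, and all names, the threshold and the toy data are assumptions.

```python
import math


def sequential_seeds(points, threshold):
    """One-pass 'leader' rule: start a new cluster when a point is far from all seeds."""
    seeds = []
    for p in points:
        if all(math.dist(p, s) > threshold for s in seeds):
            seeds.append(list(p))
    return seeds


def refine(points, centroids, n_iter=50):
    """K-means-style iterative refinement starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


# Hypothetical two-band "pixels" forming two spectral groups.
pixels = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
seeds = sequential_seeds(pixels, threshold=3.0)   # stage 1: initial clusters
centroids, labels = refine(pixels, seeds)         # stage 2: iterative improvement
print(labels)
```

    Seeding the iterative stage from a data-driven first pass also fixes the number of clusters, which plain K-means would otherwise require as an input.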

  19. Designing an Algorithm for Cancerous Tissue Segmentation Using Adaptive K-means Cluttering and Discrete Wavelet Transform.

    PubMed

    Rezaee, Kh; Haddadnia, J

    2013-09-01

    Breast cancer is currently one of the leading causes of death among women worldwide. The diagnosis and separation of cancerous tumors in mammographic images require accuracy, experience and time, and have always posed a major challenge to radiologists and physicians. This paper proposes a new algorithm which draws on discrete wavelet transform and adaptive K-means techniques to transmute the medical images, implement the tumor estimation, and detect breast cancer tumors in mammograms in early stages. It also allows the rapid processing of the input data. In the first step, after designing a filter, the discrete wavelet transform is applied to the input images and the approximate coefficients of scaling components are constructed. Then, the different parts of the image are classified in continuous spectrum. In the next step, by using the adaptive K-means algorithm for initialization and a smart choice of the number of clusters, the appropriate threshold is selected. Finally, the suspicious cancerous mass is separated by implementing the image processing techniques. We received 120 mammographic images in LJPEG format, which had been scanned in gray scale with 50 microns size, 3% noise and 20% INU, from clinical data taken from two medical databases (mini-MIAS and DDSM). The proposed algorithm detected tumors at an acceptable level with an average accuracy of 92.32% and sensitivity of 90.24%. Also, the Kappa coefficient was approximately 0.85, which proved the suitable reliability of the system performance. The exact positioning of the cancerous tumors allows the radiologist to determine the stage of disease progression and suggest an appropriate treatment in accordance with the tumor growth. The low PPV and high NPV of the system is a warranty of the system and both clinical specialists and patients can trust its output.

  20. A local search for a graph clustering problem

    NASA Astrophysics Data System (ADS)

    Navrotskaya, Anna; Il'ev, Victor

    2016-10-01

    In clustering problems one has to partition a given set of objects (a data set) into subsets (called clusters), taking into consideration only the similarity of the objects. One of the most visual formalizations of clustering is graph clustering, that is, grouping the vertices of a graph into clusters taking into consideration the edge structure of the graph, whose vertices are objects and whose edges represent similarities between the objects. In the graph k-clustering problem the number of clusters does not exceed k and the goal is to minimize the number of edges between clusters and the number of missing edges within clusters. This problem is NP-hard for any k ≥ 2. We propose a polynomial time (2k-1)-approximation algorithm for graph k-clustering. Then we apply a local search procedure to the feasible solution found by this algorithm and conduct an experimental study of the resulting heuristics.
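
    The objective stated above (edges between clusters plus missing edges within clusters) can be written down directly. The sketch below only evaluates that cost for a given labelling; it is not the approximation algorithm or the local search from the paper, and the function name `clustering_cost` is an illustrative assumption.

```python
from itertools import combinations

def clustering_cost(n_vertices, edges, labels):
    """Count edges between clusters plus missing edges inside clusters.

    `edges` is an iterable of vertex pairs (undirected); `labels[v]` is the
    cluster index of vertex v.
    """
    edge_set = {frozenset(e) for e in edges}
    cost = 0
    for u, v in combinations(range(n_vertices), 2):
        same_cluster = labels[u] == labels[v]
        has_edge = frozenset((u, v)) in edge_set
        # A pair is penalised when the graph and the partition disagree:
        # an edge across clusters, or a non-edge inside one cluster.
        if same_cluster != has_edge:
            cost += 1
    return cost

# Path graph 0-1-2-3 split in the middle: only edge (1, 2) is cut.
path_edges = [(0, 1), (1, 2), (2, 3)]
print(clustering_cost(4, path_edges, [0, 0, 1, 1]))
```

    A local search of the kind mentioned in the abstract would repeatedly move single vertices between clusters while this cost decreases.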

  1. Fully convolutional network with cluster for semantic segmentation

    NASA Astrophysics Data System (ADS)

    Ma, Xiao; Chen, Zhongbi; Zhang, Jianlin

    2018-04-01

    At present, image semantic segmentation technology is an active research topic for scientists in the field of computer vision and artificial intelligence. In particular, the extensive research on deep neural networks in image recognition has greatly promoted the development of semantic segmentation. This paper puts forward a method based on a fully convolutional network combined with the k-means clustering algorithm. The clustering algorithm, which uses the image's low-level features and initializes the cluster centers by super-pixel segmentation, is proposed to correct the set of points with low reliability, which are very likely to be misclassified, using the set of points with high reliability in each clustering region. This method refines the segmentation of the target contour and improves the accuracy of the image segmentation.

  2. Cluster Dynamical Mean Field Methods and the Momentum-selective Mott transition

    NASA Astrophysics Data System (ADS)

    Gull, Emanuel

    2011-03-01

    Innovations in methodology and computational power have enabled cluster dynamical mean field calculations of the Hubbard model with interaction strengths and band structures representative of high temperature copper oxide superconductors, for clusters large enough that the thermodynamic limit behavior may be determined. We present the methods and show how extrapolations to the thermodynamic limit work in practice. We show that the Hubbard model with next-nearest neighbor hopping at intermediate interaction strength captures much of the exotic behavior characteristic of the high temperature superconductors. An important feature of the results is a pseudogap for hole doping but not for electron doping. The pseudogap regime is characterized by a gap for momenta near the Brillouin zone face and gapless behavior near the zone diagonal. For dopings outside of the pseudogap regime we find scattering rates which vary around the Fermi surface in a way consistent with recent transport measurements. Using the maximum entropy method we calculate spectra, self-energies, and response functions for Raman spectroscopy and optical conductivities, finding results also in good agreement with experiment. Olivier Parcollet, Philipp Werner, Nan Lin, Michel Ferrero, Antoine Georges, Andrew J. Millis; NSF-DMR-0705847.

  3. CLUSTERING OF INTERICTAL SPIKES BY DYNAMIC TIME WARPING AND AFFINITY PROPAGATION

    PubMed Central

    Thomas, John; Jin, Jing; Dauwels, Justin; Cash, Sydney S.; Westover, M. Brandon

    2018-01-01

    Epilepsy is often associated with the presence of spikes in electroencephalograms (EEGs). The spike waveforms vary vastly among epilepsy patients, and also for the same patient across time. In order to develop semi-automated and automated methods for detecting spikes, it is crucial to obtain a better understanding of the various spike shapes. In this paper, we develop several approaches to extract exemplars of spikes. We generate spike exemplars by applying clustering algorithms to a database of spikes from 12 patients. As similarity measures for clustering, we consider the Euclidean distance and Dynamic Time Warping (DTW). We assess two clustering algorithms, namely, K-means clustering and affinity propagation. The clustering methods are compared based on the mean squared error, and the similarity measures are assessed based on the number of generated spike clusters. Affinity propagation with DTW is shown to be the best combination for clustering epileptic spikes, since it generates fewer spike templates and does not require pre-specifying the number of spike templates. PMID:29527130
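
    For readers unfamiliar with the DTW similarity measure named above, a minimal sketch of the textbook dynamic-programming recurrence follows. It assumes one-dimensional waveforms and an absolute-difference local cost; the paper's exact DTW variant (windowing, normalisation) is not stated in the abstract, and `dtw_distance` is an illustrative name.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Local cost is |a[i] - b[j]|; no warping window or normalisation.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

    Unlike the Euclidean distance, DTW tolerates local stretching in time, so two spikes of the same shape but slightly different duration can still be judged similar.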

  4. Characterizing Heterogeneity within Head and Neck Lesions Using Cluster Analysis of Multi-Parametric MRI Data.

    PubMed

    Borri, Marco; Schmidt, Maria A; Powell, Ceri; Koh, Dow-Mu; Riddell, Angela M; Partridge, Mike; Bhide, Shreerang A; Nutting, Christopher M; Harrington, Kevin J; Newbold, Katie L; Leach, Martin O

    2015-01-01

    To describe a methodology, based on cluster analysis, to partition multi-parametric functional imaging data into groups (or clusters) of similar functional characteristics, with the aim of characterizing functional heterogeneity within head and neck tumour volumes. To evaluate the performance of the proposed approach on a set of longitudinal MRI data, analysing the evolution of the obtained sub-sets with treatment. The cluster analysis workflow was applied to a combination of dynamic contrast-enhanced and diffusion-weighted imaging MRI data from a cohort of squamous cell carcinoma of the head and neck patients. Cumulative distributions of voxels, containing pre and post-treatment data and including both primary tumours and lymph nodes, were partitioned into k clusters (k = 2, 3 or 4). Principal component analysis and cluster validation were employed to investigate data composition and to independently determine the optimal number of clusters. The evolution of the resulting sub-regions with induction chemotherapy treatment was assessed relative to the number of clusters. The clustering algorithm was able to separate clusters which significantly reduced in voxel number following induction chemotherapy from clusters with a non-significant reduction. Partitioning with the optimal number of clusters (k = 4), determined with cluster validation, produced the best separation between reducing and non-reducing clusters. The proposed methodology was able to identify tumour sub-regions with distinct functional properties, independently separating clusters which were affected differently by treatment. This work demonstrates that unsupervised cluster analysis, with no prior knowledge of the data, can be employed to provide a multi-parametric characterization of functional heterogeneity within tumour volumes.

  5. A heuristic approach to handle capacitated facility location problem evaluated using clustering internal evaluation

    NASA Astrophysics Data System (ADS)

    Sutanto, G. R.; Kim, S.; Kim, D.; Sutanto, H.

    2018-03-01

    One of the problems in dealing with the capacitated facility location problem (CFLP) arises from the difference between the capacities of the facilities and the number of customers that need to be served. A facility with small capacity may result in uncovered customers. These customers need to be re-allocated to another facility that still has available capacity. Therefore, an approach is proposed to handle CFLP by using the k-means clustering algorithm to handle customers’ allocation. Whether customers’ re-allocation is needed is then decided by the overall average distance between customers and the facilities. This new approach is benchmarked against the existing approach by Liao and Guo, which also uses the k-means clustering algorithm as a base idea to decide the facilities’ location and customers’ allocation. Both of these approaches are benchmarked by using three clustering evaluation methods with connectedness, compactness, and separation factors.
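
    The re-allocation idea above (a customer whose nearest facility is full moves to another facility with spare capacity) can be illustrated with a simple greedy rule. This is only a stand-in under stated assumptions: the abstract does not give the authors' actual decision rule based on overall average distance, and `allocate_with_capacity` and its arguments are hypothetical names.

```python
import math

def allocate_with_capacity(customers, facilities, capacity):
    """Greedy sketch: each customer takes the nearest facility with room.

    Returns one facility index per customer, or None if every facility is
    already full. Customers are processed in the order given.
    """
    load = [0] * len(facilities)
    assignment = []
    for c in customers:
        # Facilities ordered from nearest to farthest for this customer.
        order = sorted(
            range(len(facilities)), key=lambda j: math.dist(c, facilities[j])
        )
        chosen = next((j for j in order if load[j] < capacity[j]), None)
        if chosen is not None:
            load[chosen] += 1
        assignment.append(chosen)
    return assignment
```

    In a k-means based scheme, `facilities` would typically be the cluster centroids, and the greedy pass would run after the unconstrained clustering.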

  6. Image Subtraction Reduction of Open Clusters M35 & NGC 2158 in the K2 Campaign 0 Super Stamps

    NASA Astrophysics Data System (ADS)

    Soares-Furtado, M.; Hartman, J. D.; Bakos, G. Á.; Huang, C. X.; Penev, K.; Bhatti, W.

    2017-04-01

    We observed the open clusters M35 and NGC 2158 during the initial K2 campaign (C0). Reducing these data to high-precision photometric timeseries is challenging due to the wide point-spread function (PSF) and the blending of stellar light in such dense regions. We developed an image-subtraction-based K2 reduction pipeline that is applicable to both crowded and sparse stellar fields. We applied our pipeline to the data-rich C0 K2 super stamp, containing the two open clusters, as well as to the neighboring postage stamps. In this paper, we present our image subtraction reduction pipeline and demonstrate that this technique achieves ultra-high photometric precision for sources in the C0 super stamp. We extract the raw light curves of 3960 stars taken from the UCAC4 and EPIC catalogs and de-trend them for systematic effects. We compare our photometric results with the prior reductions published in the literature. For de-trended TFA-corrected sources in the 12-12.25 K_p magnitude range, we achieve a best 6.5-hour window running rms of 35 ppm, falling to 100 ppm for fainter stars in the 14-14.25 K_p magnitude range. For stars with K_p > 14, our de-trended and 6.5-hour binned light curves achieve the highest photometric precision. Moreover, all our TFA-corrected sources have higher precision on all timescales investigated. This work represents the first published image subtraction analysis of a K2 super stamp. This method will be particularly useful for analyzing the Galactic bulge observations carried out during K2 campaign 9. The raw light curves and the final results of our de-trending processes are publicly available at http://k2.hatsurveys.org/archive/.

  7. Hydrogen bonding interaction of small acetaldehyde clusters studied with core-electron excitation spectroscopy in the oxygen K-edge region

    NASA Astrophysics Data System (ADS)

    Tabayashi, K.; Chohda, M.; Yamanaka, T.; Tsutsumi, Y.; Takahashi, O.; Yoshida, H.; Taniguchi, M.

    2010-06-01

    In order to examine inner-shell electron excitation spectra of molecular clusters with strong multipole interactions, excitation spectra and time-of-flight (TOF) fragment-mass spectra of small acetaldehyde (AA) clusters have been studied under the beam conditions. The TOF spectra at the oxygen K-edge region showed an intense growth of the protonated clusters, M_nH^+ (M = CH3CHO), in the cluster beams. "Cluster-specific" excitation spectra could be generated by monitoring partial-ion-yields of the protonated clusters. The most intense band of O1s→π*CO was found to shift to a higher energy by 0.15 eV relative to the monomer band upon clusterization. X-ray absorption spectra (XAS) were also calculated for the representative dimer configurations using a computer modelling program based on the density functional theory. The XAS prediction for the most stable (non-planar) configuration was found to give a close comparison with the cluster-band shift observed. The band shift was interpreted as being due to the HOMO-LUMO interaction within the complex where a contribution of vibrationally blue-shifting hydrogen bonding could be identified.

  8. CLUSTER STAFF search coils magnetometer calibration - comparisons with FGM

    NASA Astrophysics Data System (ADS)

    Robert, P.; Cornilleau-Wehrlin, N.; Piberne, R.; de Conchy, Y.; Lacombe, C.; Bouzid, V.; Grison, B.; Alison, D.; Canu, P.

    2013-12-01

    The main part of Cluster Spatio Temporal Analysis of Field Fluctuations (STAFF) experiment consists of triaxial search coils allowing the measurements of the three magnetic components of the waves from 0.1 Hz up to 4 kHz. Two sets of data are produced, one by a module to filter and transmit the corresponding waveform up to either 10 or 180 Hz (STAFF-SC) and the second by an onboard Spectrum Analyser (STAFF-SA) to compute the elements of the spectral matrix for five components of the waves, 3 × B and 2 × E (from EFW experiment) in the frequency range 8 Hz to 4 kHz. In order to understand the way the output signal of the search coils are calibrated, the transfer functions of the different parts of the instrument are described as well as the way to transform telemetry data into physical units, across various coordinate systems from the spinning sensors to a fixed and known frame. The instrument sensitivity is discussed. Cross-calibration inside STAFF (SC and SA) is presented. Results of cross-calibration between the STAFF search coils and the Cluster Flux Gate Magnetometer (FGM) data are discussed. It is shown that these cross-calibrations lead to an agreement between both data sets at low frequency within a 2% error. By means of statistics done over 10 yr, it is shown that the functionalities and characteristics of both instruments have not changed during this period.

  9. CLUSTER-STAFF search coil magnetometer calibration - comparisons with FGM

    NASA Astrophysics Data System (ADS)

    Robert, P.; Cornilleau-Wehrlin, N.; Piberne, R.; de Conchy, Y.; Lacombe, C.; Bouzid, V.; Grison, B.; Alison, D.; Canu, P.

    2014-09-01

    The main part of the Cluster Spatio-Temporal Analysis of Field Fluctuations (STAFF) experiment consists of triaxial search coils allowing the measurements of the three magnetic components of the waves from 0.1 Hz up to 4 kHz. Two sets of data are produced, one by a module to filter and transmit the corresponding waveform up to either 10 or 180 Hz (STAFF-SC), and the second by the onboard Spectrum Analyser (STAFF-SA) to compute the elements of the spectral matrix for five components of the waves, 3 × B and 2 × E (from the EFW experiment), in the frequency range 8 Hz to 4 kHz. In order to understand the way the output signals of the search coils are calibrated, the transfer functions of the different parts of the instrument are described as well as the way to transform telemetry data into physical units across various coordinate systems from the spinning sensors to a fixed and known frame. The instrument sensitivity is discussed. Cross-calibration inside STAFF (SC and SA) is presented. Results of cross-calibration between the STAFF search coils and the Cluster Fluxgate Magnetometer (FGM) data are discussed. It is shown that these cross-calibrations lead to an agreement between both data sets at low frequency within a 2% error. By means of statistics done over 10 yr, it is shown that the functionalities and characteristics of both instruments have not changed during this period.

  10. Will farmers intend to cultivate Provitamin A genetically modified (GM) cassava in Nigeria? Evidence from a k-means segmentation analysis of beliefs and attitudes.

    PubMed

    Oparinde, Adewale; Abdoulaye, Tahirou; Mignouna, Djana Babatima; Bamire, Adebayo Simeon

    2017-01-01

    Analysis of market segments within a population remains critical to agricultural systems and policy processes for targeting new innovations. Patterns in attitudes and intentions toward cultivating Provitamin A GM cassava are examined through the use of a combination of behavioural theory and k-means cluster analysis method, investigating the interrelationship among various behavioural antecedents. Using a state-level sample of smallholder cassava farmers in Nigeria, this paper identifies three distinct classes of attitude and intention denoted as low opposition, medium opposition and high opposition farmers. It was estimated that only 25% of the surveyed population of farmers was highly opposed to cultivating Provitamin A GM cassava.

  11. Will farmers intend to cultivate Provitamin A genetically modified (GM) cassava in Nigeria? Evidence from a k-means segmentation analysis of beliefs and attitudes

    PubMed Central

    Abdoulaye, Tahirou; Mignouna, Djana Babatima; Bamire, Adebayo Simeon

    2017-01-01

    Analysis of market segments within a population remains critical to agricultural systems and policy processes for targeting new innovations. Patterns in attitudes and intentions toward cultivating Provitamin A GM cassava are examined through the use of a combination of behavioural theory and k-means cluster analysis method, investigating the interrelationship among various behavioural antecedents. Using a state-level sample of smallholder cassava farmers in Nigeria, this paper identifies three distinct classes of attitude and intention denoted as low opposition, medium opposition and high opposition farmers. It was estimated that only 25% of the surveyed population of farmers was highly opposed to cultivating Provitamin A GM cassava. PMID:28700605

  12. Somatotyping using 3D anthropometry: a cluster analysis.

    PubMed

    Olds, Tim; Daniell, Nathan; Petkov, John; David Stewart, Arthur

    2013-01-01

    Somatotyping is the quantification of human body shape, independent of body size. Hitherto, somatotyping (including the most popular method, the Heath-Carter system) has been based on subjective visual ratings, sometimes supported by surface anthropometry. This study used data derived from three-dimensional (3D) whole-body scans as inputs for cluster analysis to objectively derive clusters of similar body shapes. Twenty-nine dimensions normalised for body size were measured on a purposive sample of 301 adults aged 17-56 years who had been scanned using a Vitus Smart laser scanner. K-means Cluster Analysis with v-fold cross-validation was used to determine shape clusters. Three male and three female clusters emerged, and were visualised using those scans closest to the cluster centroid and a caricature defined by doubling the difference between the average scan and the cluster centroid. The male clusters were decidedly endomorphic (high fatness), ectomorphic (high linearity), and endo-mesomorphic (a mixture of fatness and muscularity). The female clusters were clearly endomorphic, ectomorphic, and the ecto-mesomorphic (a mixture of linearity and muscularity). An objective shape quantification procedure combining 3D scanning and cluster analysis yielded shape clusters strikingly similar to traditional somatotyping.

  13. Cluster Analysis on Longitudinal Data of Patients with Adult-Onset Asthma.

    PubMed

    Ilmarinen, Pinja; Tuomisto, Leena E; Niemelä, Onni; Tommola, Minna; Haanpää, Jussi; Kankaanranta, Hannu

    Previous cluster analyses on asthma are based on cross-sectional data. To identify phenotypes of adult-onset asthma by using data from baseline (diagnostic) and 12-year follow-up visits. The Seinäjoki Adult Asthma Study is a 12-year follow-up study of patients with new-onset adult asthma. K-means cluster analysis was performed by using variables from baseline and follow-up visits on 171 patients to identify phenotypes. Five clusters were identified. Patients in cluster 1 (n = 38) were predominantly nonatopic males with moderate smoking history at baseline. At follow-up, 40% of these patients had developed persistent obstruction but the number of patients with uncontrolled asthma (5%) and rhinitis (10%) was the lowest. Cluster 2 (n = 19) was characterized by older men with heavy smoking history, poor lung function, and persistent obstruction at baseline. At follow-up, these patients were mostly uncontrolled (84%) despite daily use of inhaled corticosteroid (ICS) with add-on therapy. Cluster 3 (n = 50) consisted mostly of nonsmoking females with good lung function at diagnosis/follow-up and well-controlled/partially controlled asthma at follow-up. Cluster 4 (n = 25) had obese and symptomatic patients at baseline/follow-up. At follow-up, these patients had several comorbidities (40% psychiatric disease) and were treated daily with ICS and add-on therapy. Patients in cluster 5 (n = 39) were mostly atopic and had the earliest onset of asthma, the highest blood eosinophils, and FEV1 reversibility at diagnosis. At follow-up, these patients used the lowest ICS dose but 56% were well controlled. Results can be used to predict outcomes of patients with adult-onset asthma and to aid in development of personalized therapy (NCT02733016 at ClinicalTrials.gov). Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.

  14. Developing the fuzzy c-means clustering algorithm based on maximum entropy for multitarget tracking in a cluttered environment

    NASA Astrophysics Data System (ADS)

    Chen, Xiao; Li, Yaan; Yu, Jing; Li, Yuxing

    2018-01-01

    For fast and more effective implementation of tracking multiple targets in a cluttered environment, we propose a multiple targets tracking (MTT) algorithm called maximum entropy fuzzy c-means clustering joint probabilistic data association that combines fuzzy c-means clustering and the joint probabilistic data association (PDA) algorithm. The algorithm uses the membership value to express the probability of the target originating from measurement. The membership value is obtained through fuzzy c-means clustering objective function optimized by the maximum entropy principle. When considering the effect of the public measurement, we use a correction factor to adjust the association probability matrix to estimate the state of the target. As this algorithm avoids confirmation matrix splitting, it can solve the high computational load problem of the joint PDA algorithm. The results of simulations and analysis conducted for tracking neighbor parallel targets and cross targets in a different density cluttered environment show that the proposed algorithm can realize MTT quickly and efficiently in a cluttered environment. Further, the performance of the proposed algorithm remains constant with increasing process noise variance. The proposed algorithm has the advantages of efficiency and low computational load, which can ensure optimum performance when tracking multiple targets in a dense cluttered environment.
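
    The abstract does not give the membership formula, so the following is only a sketch of the form commonly associated with maximum-entropy fuzzy clustering: memberships proportional to exp(-alpha * d^2), normalised over the candidate targets. The function name, the parameter `alpha`, and the use of squared Euclidean distance are assumptions for illustration, not the authors' stated method.

```python
import math

def max_entropy_memberships(measurement, targets, alpha=1.0):
    """Gibbs-style memberships of one measurement across predicted targets.

    u_j is proportional to exp(-alpha * ||measurement - target_j||^2);
    `alpha` (assumed) controls how sharply membership falls with distance.
    """
    sq_dists = [
        sum((m - t) ** 2 for m, t in zip(measurement, target))
        for target in targets
    ]
    # Subtract the smallest distance before exponentiating for stability;
    # the common factor cancels in the normalisation.
    smallest = min(sq_dists)
    weights = [math.exp(-alpha * (d - smallest)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]
```

    The memberships sum to one for each measurement, which is what lets them be read as association probabilities between that measurement and the targets.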

  15. Defining objective clusters for rabies virus sequences using affinity propagation clustering

    PubMed Central

    Fischer, Susanne; Freuling, Conrad M.; Pfaff, Florian; Bodenhofer, Ulrich; Höper, Dirk; Fischer, Mareike; Marston, Denise A.; Fooks, Anthony R.; Mettenleiter, Thomas C.; Conraths, Franz J.; Homeier-Bachmann, Timo

    2018-01-01

    Rabies is caused by lyssaviruses, and is one of the oldest known zoonoses. In recent years, more than 21,000 nucleotide sequences of rabies viruses (RABV), from the prototype species rabies lyssavirus, have been deposited in public databases. Subsequent phylogenetic analyses in combination with metadata suggest geographic distributions of RABV. However, these analyses experience some technical difficulties in defining verifiable criteria for cluster allocations in phylogenetic trees, inviting a more rational approach. Therefore, we applied a relatively new mathematical clustering algorithm named ‘affinity propagation clustering’ (AP) to propose a standardized sub-species classification utilizing full-genome RABV sequences. Because AP has the advantage that it is computationally fast and works for any meaningful measure of similarity between data samples, it has previously been applied successfully in bioinformatics, for analysis of microarray and gene expression data; however, cluster analysis of sequences is still in its infancy. Existing (516) and original (46) full genome RABV sequences were used to demonstrate the application of AP for RABV clustering. On a global scale, AP proposed four clusters, i.e. New World cluster, Arctic/Arctic-like, Cosmopolitan, and Asian as previously assigned by phylogenetic studies. By combining AP with established phylogenetic analyses, it is possible to resolve phylogenetic relationships between verifiably determined clusters and sequences. This workflow will be useful in confirming cluster distributions in a uniform transparent manner, not only for RABV, but also for other comparative sequence analyses. PMID:29357361

  16. The anterior hypothalamus in cluster headache.

    PubMed

    Arkink, Enrico B; Schmitz, Nicole; Schoonman, Guus G; van Vliet, Jorine A; Haan, Joost; van Buchem, Mark A; Ferrari, Michel D; Kruit, Mark C

    2017-10-01

    Objective To evaluate the presence, localization, and specificity of structural hypothalamic and whole brain changes in cluster headache and chronic paroxysmal hemicrania (CPH). Methods We compared T1-weighted magnetic resonance images of subjects with cluster headache (episodic n = 24; chronic n = 23; probable n = 14), CPH (n = 9), migraine (with aura n = 14; without aura n = 19), and no headache (n = 48). We applied whole brain voxel-based morphometry (VBM) using two complementary methods to analyze structural changes in the hypothalamus: region-of-interest analyses in whole brain VBM, and manual segmentation of the hypothalamus to calculate volumes. We used both conservative VBM thresholds, correcting for multiple comparisons, and less conservative thresholds for exploratory purposes. Results Using region-of-interest VBM analyses mirrored to the headache side, we found enlargement (p < 0.05, small volume correction) in the anterior hypothalamic gray matter in subjects with chronic cluster headache compared to controls, and in all participants with episodic or chronic cluster headache taken together compared to migraineurs. After manual segmentation, hypothalamic volume (mean ± SD) was larger (p < 0.05) both in subjects with episodic (1.89 ± 0.18 ml) and chronic (1.87 ± 0.21 ml) cluster headache compared to controls (1.72 ± 0.15 ml) and migraineurs (1.68 ± 0.19 ml). Similar but non-significant trends were observed for participants with probable cluster headache (1.82 ± 0.19 ml; p = 0.07) and CPH (1.79 ± 0.20 ml; p = 0.15). Increased hypothalamic volume was primarily explained by bilateral enlargement of the anterior hypothalamus. Exploratory whole brain VBM analyses showed widespread changes in pain-modulating areas in all subjects with headache. Interpretation The anterior hypothalamus is enlarged in episodic and chronic cluster headache and possibly also in probable cluster

  17. Phase transition temperatures of 405-725 K in superfluid ultra-dense hydrogen clusters on metal surfaces

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Holmlid, Leif; Kotzias, Bernhard

    Ultra-dense hydrogen H(0) with its typical H-H bond distance of 2.3 pm is superfluid at room temperature as expected for quantum fluids. It also shows a Meissner effect at room temperature, which indicates that a transition point to a non-superfluid state should exist above room temperature. This transition point is given by a disappearance of the superfluid long-chain clusters H_2N(0). This transition point is now measured for several metal carrier surfaces at 405 - 725 K, using both ultra-dense protium p(0) and deuterium D(0). Clusters of ordinary Rydberg matter H(l) as well as small symmetric clusters H_4(0) and H_3(0) (which do not give a superfluid or superconductive phase) all still exist on the surface at high temperature. This shows directly that desorption or diffusion processes do not remove the long superfluid H_2N(0) clusters. The two ultra-dense forms p(0) and D(0) have different transition temperatures under otherwise identical conditions. The transition point for p(0) is higher in temperature, which is unexpected.

  18. Characterizing Heterogeneity within Head and Neck Lesions Using Cluster Analysis of Multi-Parametric MRI Data

    PubMed Central

    Borri, Marco; Schmidt, Maria A.; Powell, Ceri; Koh, Dow-Mu; Riddell, Angela M.; Partridge, Mike; Bhide, Shreerang A.; Nutting, Christopher M.; Harrington, Kevin J.; Newbold, Katie L.; Leach, Martin O.

    2015-01-01

    Purpose To describe a methodology, based on cluster analysis, to partition multi-parametric functional imaging data into groups (or clusters) of similar functional characteristics, with the aim of characterizing functional heterogeneity within head and neck tumour volumes. To evaluate the performance of the proposed approach on a set of longitudinal MRI data, analysing the evolution of the obtained sub-sets with treatment. Material and Methods The cluster analysis workflow was applied to a combination of dynamic contrast-enhanced and diffusion-weighted imaging MRI data from a cohort of squamous cell carcinoma of the head and neck patients. Cumulative distributions of voxels, containing pre and post-treatment data and including both primary tumours and lymph nodes, were partitioned into k clusters (k = 2, 3 or 4). Principal component analysis and cluster validation were employed to investigate data composition and to independently determine the optimal number of clusters. The evolution of the resulting sub-regions with induction chemotherapy treatment was assessed relative to the number of clusters. Results The clustering algorithm was able to separate clusters which significantly reduced in voxel number following induction chemotherapy from clusters with a non-significant reduction. Partitioning with the optimal number of clusters (k = 4), determined with cluster validation, produced the best separation between reducing and non-reducing clusters. Conclusion The proposed methodology was able to identify tumour sub-regions with distinct functional properties, independently separating clusters which were affected differently by treatment. This work demonstrates that unsupervised cluster analysis, with no prior knowledge of the data, can be employed to provide a multi-parametric characterization of functional heterogeneity within tumour volumes. PMID:26398888

  19. Optimization-Based Model Fitting for Latent Class and Latent Profile Analyses

    ERIC Educational Resources Information Center

    Huang, Guan-Hua; Wang, Su-Mei; Hsu, Chung-Chu

    2011-01-01

    Statisticians typically estimate the parameters of latent class and latent profile models using the Expectation-Maximization algorithm. This paper proposes an alternative two-stage approach to model fitting. The first stage uses the modified k-means and hierarchical clustering algorithms to identify the latent classes that best satisfy the…

  20. Optimizing Energy Consumption in Vehicular Sensor Networks by Clustering Using Fuzzy C-Means and Fuzzy Subtractive Algorithms

    NASA Astrophysics Data System (ADS)

    Ebrahimi, A.; Pahlavani, P.; Masoumi, Z.

    2017-09-01

    Traffic monitoring and management in urban intelligent transportation systems (ITS) can be carried out using vehicular sensor networks. In a vehicular sensor network, vehicles equipped with sensors such as GPS can act as mobile sensors that sense urban traffic and send reports to a traffic monitoring center (TMC) for traffic estimation. The energy consumption of the sensor nodes is a main problem in wireless sensor networks (WSNs); moreover, it is the most important factor in designing these networks. Clustering the sensor nodes is considered an effective way to reduce the energy consumption of WSNs. Each cluster should have a Cluster Head (CH) and a number of nodes located within its supervision area. The cluster heads are responsible for gathering and aggregating the information of their clusters, which they then transmit to the data collection center. Hence, clustering decreases the volume of transmitted information and, consequently, reduces the energy consumption of the network. In this paper, Fuzzy C-Means (FCM) and Fuzzy Subtractive algorithms are employed to cluster sensors, and their effect on the energy consumption of the sensors is investigated. The FCM and Fuzzy Subtractive algorithms reduced the energy consumption of vehicle sensors by up to 90.68% and 92.18%, respectively. Comparing the two algorithms indicates an improvement of 1.5 percentage points for the Fuzzy Subtractive algorithm over FCM.
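
    For readers unfamiliar with Fuzzy C-Means, the standard alternating update of memberships and centers can be sketched as below. This is a generic textbook form, not the authors' implementation: the sensor coordinates, the initial centers and the fuzzifier m = 2 are all assumptions made for illustration, and the cluster-head selection and energy model of the paper are not reproduced.

```python
import math

def memberships(points, centers, m=2.0):
    """Membership of every point in every cluster; each row sums to 1."""
    rows = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        if 0.0 in d:
            # Point coincides with a centre: share full membership among those centres.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            rows.append([h / sum(hits) for h in hits])
        else:
            inv = [x ** (-2.0 / (m - 1.0)) for x in d]
            total = sum(inv)
            rows.append([v / total for v in inv])
    return rows

def fuzzy_c_means(points, centers, m=2.0, iters=50):
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        u = memberships(points, centers, m)
        # Each centre is the mean of all points weighted by membership ** m.
        for i in range(len(centers)):
            w = [row[i] ** m for row in u]
            centers[i] = tuple(
                sum(wj * p[k] for wj, p in zip(w, points)) / sum(w)
                for k in range(len(points[0]))
            )
    return centers, memberships(points, centers, m)

# Hypothetical sensor positions forming two groups.
sensors = [(0, 0), (1, 0), (0, 1), (10, 10), (9, 10), (10, 9)]
centers, u = fuzzy_c_means(sensors, centers=[(2, 2), (8, 8)])
```

    Unlike k-means, every sensor keeps a graded membership in each cluster, which is what makes a soft assignment to cluster heads possible.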

  1. Clustering Categorical Data Using Community Detection Techniques

    PubMed Central

    2017-01-01

    With the advent of the k-modes algorithm, the toolbox for clustering categorical data has an efficient tool that scales linearly in the number of data items. However, random initialization of cluster centers in k-modes makes it hard to reach a good clustering without resorting to many trials. Recently proposed methods for better initialization are deterministic and reduce the clustering cost considerably. These initialization methods differ in how their heuristics choose the set of initial centers. In this paper, we address the clustering problem for categorical data from the perspective of community detection. Instead of initializing k modes and running several iterations, our scheme, CD-Clustering, builds an unweighted graph and detects highly cohesive groups of nodes using a fast community detection technique. The k largest detected communities define the k modes. Evaluation on ten real categorical datasets shows that our method outperforms the existing initialization methods for k-modes in terms of accuracy, precision, and recall in most of the cases. PMID:29430249
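
    The two building blocks that k-modes relies on, the per-attribute mode of a group and the simple matching (Hamming) dissimilarity, can be sketched as follows. The records and attribute values are invented for illustration, and the graph construction and community detection steps of CD-Clustering are not shown because the abstract does not specify them.

```python
from collections import Counter

def mode_of(records):
    """Most frequent value of each attribute across a group of records."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical group of categorical records (for example, one detected community).
group = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = mode_of(group)
distance = mismatch(("blue", "large"), mode)
```

    Given k such groups, their modes would serve as the k initial centers, and each remaining record would be assigned to the mode with the smallest mismatch count.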

  2. Query by example video based on fuzzy c-means initialized by fixed clustering center

    NASA Astrophysics Data System (ADS)

    Hou, Sujuan; Zhou, Shangbo; Siddique, Muhammad Abubakar

    2012-04-01

    Currently, the high complexity of video contents poses the following major challenges for fast retrieval: (1) efficient similarity measurements, and (2) efficient indexing on the compact representations. A video-retrieval strategy based on fuzzy c-means (FCM) is presented for querying by example. First, the query video is segmented into a set of shots, each represented by a key frame, and video processing techniques are used to find visual cues that represent the key frame. Next, because the FCM algorithm is sensitive to initialization, we initialized the cluster centers with the shots of the query video so that appropriate convergence could be achieved. After an FCM cluster was initialized by the query video, each shot of the query video was considered a benchmark point in that cluster, and each shot in the database was given a class label. The similarity between a benchmark point and the database shots with the same class label can be transformed into the distance between them. Finally, the similarity between the query video and a video in the database was transformed into the number of similar shots. Our experimental results demonstrated the performance of the proposed approach.

  3. A Cluster-Analytical Approach towards Physical Activity and Eating Habits among 10-Year-Old Children

    ERIC Educational Resources Information Center

    Sabbe, Dieter; De Bourdeaudhuij, I.; Legiest, E.; Maes, L.

    2008-01-01

    The purpose was to investigate whether clusters--based on physical activity (PA) and eating habits--can be found among children, and to explore subgroups' characteristics. A total of 1725 10-year olds completed a self-administered questionnaire. K-means cluster analysis was based on the weekly quantity of vigorous and moderate PA, the excess index…

  4. Cluster Analysis of Atmospheric Dynamics and Pollution Transport in a Coastal Area

    NASA Astrophysics Data System (ADS)

    Sokolov, Anton; Dmitriev, Egor; Maksimovich, Elena; Delbarre, Hervé; Augustin, Patrick; Gengembre, Cyril; Fourmentin, Marc; Locoge, Nadine

    2016-11-01

    Summertime atmospheric dynamics in the coastal zone of the industrialized Dunkerque agglomeration in northern France was characterized by a cluster analysis of back trajectories in the context of pollution transport. The MESO-NH atmospheric model was used to simulate the local dynamics at multiple scales with horizontal resolution down to 500 m, and for the online calculation of the Lagrangian backward trajectories with 30-min temporal resolution. Airmass transport was performed along six principal pathways obtained by the weighted k-means clustering technique. Four of these centroids corresponded to a range of wind speeds over the English Channel: two for wind directions from the north-east and two from the south-west. Another pathway corresponded to a south-westerly continental transport. The backward trajectories of the largest and most dispersed sixth cluster contained low wind speeds, including sea-breeze circulations. Based on analyses of meteorological data and pollution measurements, the principal atmospheric pathways were related to local air-contamination events. Continuous air quality and meteorological data were collected during the Benzene-Toluene-Ethylbenzene-Xylene 2006 campaign. The sites of the pollution measurements served as the endpoints for the backward trajectories. Pollutant transport pathways corresponding to the highest air contamination were defined.

  5. Mutation Clusters from Cancer Exome.

    PubMed

    Kakushadze, Zura; Yu, Willie

    2017-08-15

    We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit a mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1389 published genome samples across 14 cancer types. In contrast, we find in- and out-of-sample instabilities in cancer signatures extracted from exome samples via nonnegative matrix factorization (NMF), a computationally-costly and non-deterministic method. Extracting stable mutation structures from exome data could have important implications for speed and cost, which are critical for early-stage cancer diagnostics, such as novel blood-test methods currently in development.

  6. Mutation Clusters from Cancer Exome

    PubMed Central

    Kakushadze, Zura; Yu, Willie

    2017-01-01

    We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit a mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1389 published genome samples across 14 cancer types. In contrast, we find in- and out-of-sample instabilities in cancer signatures extracted from exome samples via nonnegative matrix factorization (NMF), a computationally-costly and non-deterministic method. Extracting stable mutation structures from exome data could have important implications for speed and cost, which are critical for early-stage cancer diagnostics, such as novel blood-test methods currently in development. PMID:28809811

  7. Chemical analyses and K-Ar ages of samples from 13 drill holes, Medicine Lake volcano, California

    USGS Publications Warehouse

    Donnelly-Nolan, Julie M.

    2006-01-01

    Chemical analyses and K-Ar ages are presented for rocks sampled from drill holes at Medicine Lake volcano, northern California. A location map and a cross-section are included, as are separate tables for drill hole information, major and trace element data, and for K-Ar dates.

  8. Hebbian self-organizing integrate-and-fire networks for data clustering.

    PubMed

    Landis, Florian; Ott, Thomas; Stoop, Ruedi

    2010-01-01

    We propose a Hebbian learning-based data clustering algorithm using spiking neurons. The algorithm is capable of distinguishing between clusters and noisy background data and finds an arbitrary number of clusters of arbitrary shape. These properties render the approach particularly useful for visual scene segmentation into arbitrarily shaped homogeneous regions. We present several application examples, and in order to highlight the advantages and the weaknesses of our method, we systematically compare the results with those from standard methods such as k-means and Ward's linkage clustering. The analysis demonstrates that not only is the clustering ability of the proposed algorithm more powerful than that of the two competing methods, but the time complexity of the method is also more modest than that of its generally used strongest competitor.

  9. Brain vascular image segmentation based on fuzzy local information C-means clustering

    NASA Astrophysics Data System (ADS)

    Hu, Chaoen; Liu, Xia; Liang, Xiao; Hui, Hui; Yang, Xin; Tian, Jie

    2017-02-01

    Light sheet fluorescence microscopy (LSFM) is a powerful optical-resolution fluorescence microscopy technique that enables observation of the mouse brain vascular network at cellular resolution. However, micro-vessel structures show intensity inhomogeneity in LSFM images, which makes extracting line structures difficult. In this work, we developed a vascular image segmentation method that enhances vessel details and should be useful for estimating statistics such as micro-vessel density. Since the eigenvalues of the Hessian matrix and their signs describe different geometric structures in images, which makes it possible to construct a vascular similarity function and enhance line signals, the main idea of our method is to cluster the pixel values of the enhanced image. Our method comprises three steps: 1) calculate the multiscale gradients and the differences between eigenvalues of the Hessian matrix; 2) to generate the enhanced micro-vessel structures, train a feed-forward neural network on 2.26 million pixels to handle the correlations between multiscale gradients and the differences between eigenvalues; 3) use fuzzy local information c-means clustering (FLICM) to cluster the pixel values of the enhanced line signals. To verify the feasibility and effectiveness of this method, mouse brain vascular images were acquired with a commercial light-sheet microscope in our lab. The segmentation experiment showed that the Dice similarity coefficient can reach up to 85%. The results illustrate that our approach to extracting line structures of blood vessels dramatically improves the vascular image and enables accurate extraction of blood vessels in LSFM images.

  10. Modeling and clustering water demand patterns from real-world smart meter data

    NASA Astrophysics Data System (ADS)

    Cheifetz, Nicolas; Noumir, Zineb; Samé, Allou; Sandraz, Anne-Claire; Féliers, Cédric; Heim, Véronique

    2017-08-01

    Nowadays, drinking water utilities need an acute comprehension of the water demand on their distribution network, in order to efficiently operate the optimization of resources, manage billing and propose new customer services. With the emergence of smart grids, based on automated meter reading (AMR), a better understanding of the consumption modes is now accessible for smart cities with more granularities. In this context, this paper evaluates a novel methodology for identifying relevant usage profiles from the water consumption data produced by smart meters. The methodology is fully data-driven using the consumption time series which are seen as functions or curves observed with an hourly time step. First, a Fourier-based additive time series decomposition model is introduced to extract seasonal patterns from time series. These patterns are intended to represent the customer habits in terms of water consumption. Two functional clustering approaches are then used to classify the extracted seasonal patterns: the functional version of K-means, and the Fourier REgression Mixture (FReMix) model. The K-means approach produces a hard segmentation and K representative prototypes. On the other hand, the FReMix is a generative model and also produces K profiles as well as a soft segmentation based on the posterior probabilities. The proposed approach is applied to a smart grid deployed on the largest water distribution network (WDN) in France. The two clustering strategies are evaluated and compared. Finally, a realistic interpretation of the consumption habits is given for each cluster. The extensive experiments and the qualitative interpretation of the resulting clusters allow one to highlight the effectiveness of the proposed methodology.

  11. A comparative study of spatially clustered distribution of jumbo flying squid ( Dosidicus gigas) offshore Peru

    NASA Astrophysics Data System (ADS)

    Feng, Yongjiu; Cui, Li; Chen, Xinjun; Liu, Yu

    2017-06-01

    We examined spatially clustered distribution of jumbo flying squid ( Dosidicus gigas) in the offshore waters of Peru bounded by 78°-86°W and 8°-20°S under 0.5°×0.5° fishing grid. The study is based on the catch-per-unit-effort (CPUE) and fishing effort from Chinese mainland squid jigging fleet in 2003-2004 and 2006-2013. The data for all years as well as the eight years (excluding El Niño events) were studied to examine the effect of climate variation on the spatial distribution of D. gigas. Five spatial clusters reflecting the spatial distribution were computed using K-means and Getis-Ord Gi* for a detailed comparative study. Our results showed that clusters identified by the two methods were quite different in terms of their spatial patterns, and K-means was not as accurate as Getis-Ord Gi*, as inferred from the agreement degree and receiver operating characteristic. There were more areas of hot and cold spots in years without the impact of El Niño, suggesting that such large-scale climate variations could reduce the clustering level of D. gigas. The catches also showed that warm El Niño conditions and high water temperature were less favorable for D. gigas offshore Peru. The results suggested that the use of K-means is preferable if the aim is to discover the spatial distribution of each sub-region (cluster) of the study area, while Getis-Ord Gi* is preferable if the aim is to identify statistically significant hot spots that may indicate the central fishing ground.

  12. Constrained clusters of gene expression profiles with pathological features.

    PubMed

    Sese, Jun; Kurokawa, Yukinori; Monden, Morito; Kato, Kikuya; Morishita, Shinichi

    2004-11-22

    Gene expression profiles should be useful in distinguishing variations in disease, since they reflect accurately the status of cells. The primary clustering of gene expression reveals the genotypes that are responsible for the proximity of members within each cluster, while further clustering elucidates the pathological features of the individual members of each cluster. However, since the first clustering process and the second classification step, in which the features are associated with clusters, are performed independently, the initial set of clusters may omit genes that are associated with pathologically meaningful features. Therefore, it is important to devise a way of identifying gene expression clusters that are associated with pathological features. We present the novel technique of 'itemset constrained clustering' (IC-Clustering), which computes the optimal cluster that maximizes the interclass variance of gene expression between groups, which are divided according to the restriction that only divisions that can be expressed using common features are allowed. This constraint automatically labels each cluster with a set of pathological features which characterize that cluster. When applied to liver cancer datasets, IC-Clustering revealed informative gene expression clusters, which could be annotated with various pathological features, such as 'tumor' and 'man', or 'except tumor' and 'normal liver function'. In contrast, the k-means method overlooked these clusters.

  13. Density-based clustering analyses to identify heterogeneous cellular sub-populations

    NASA Astrophysics Data System (ADS)

    Heaster, Tiffany M.; Walsh, Alex J.; Landman, Bennett A.; Skala, Melissa C.

    2017-02-01

    Autofluorescence microscopy of NAD(P)H and FAD provides functional metabolic measurements at the single-cell level. Here, density-based clustering algorithms were applied to metabolic autofluorescence measurements to identify cell-level heterogeneity in tumor cell cultures. The performance of the density-based clustering algorithm, DENCLUE, was tested in samples with known heterogeneity (co-cultures of breast carcinoma lines). DENCLUE was found to better represent the distribution of cell clusters compared to Gaussian mixture modeling. Overall, DENCLUE is a promising approach to quantify cell-level heterogeneity, and could be used to understand single cell population dynamics in cancer progression and treatment.

  14. k-filtering applied to Cluster density measurements in the Solar Wind: Early findings

    NASA Astrophysics Data System (ADS)

    Jeska, Lauren; Roberts, Owen; Li, Xing

    2014-05-01

    Studies of solar wind turbulence indicate that a large proportion of the energy is Alfvénic (incompressible) at inertial scales. The properties of the turbulence found in the dissipation range are still under debate; while it is widely believed that kinetic Alfvén waves form the dominant component, the constituents of the remaining compressible turbulence are disputed. Using k-filtering, the power can be measured without assuming the validity of Taylor's hypothesis, and its distribution in (ω, k)-space can be determined to assist the identification of weak turbulence components. This technique is applied to Cluster electron density measurements and compared to the power in |B(t)|. As the direct electron density measurements from the WHISPER instrument have a low cadence of only 2.2 s, proxy data derived from the spacecraft potential, measured every 0.2 s by the EFW instrument, are used to extend this study to ion scales.

  15. An Improved Clustering Algorithm of Tunnel Monitoring Data for Cloud Computing

    PubMed Central

    Zhong, Luo; Tang, KunHao; Li, Lin; Yang, Guang; Ye, JingJing

    2014-01-01

    With the rapid development of urban construction, the number of urban tunnels is increasing and the data they produce are becoming more and more complex. As a result, traditional clustering algorithms cannot handle the massive volume of tunnel data. To solve this problem, an improved parallel clustering algorithm based on k-means is proposed. The algorithm processes the data using MapReduce within a cloud computing environment. It can handle massive data and is also more efficient. Moreover, it computes the average dissimilarity degree of each cluster in order to clean abnormal data. PMID:24982971
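
    How one k-means iteration maps onto a MapReduce pattern can be sketched in plain Python, with no cluster framework involved. This is a generic decomposition offered as an assumption about the overall shape of such an algorithm, not the authors' improved method: the function names, the toy points and the starting centers are hypothetical, and the dissimilarity-based data cleaning is not included.

```python
import math
from collections import defaultdict

def map_phase(points, centers):
    """Map: emit (index of nearest centre, (point, 1)) for every point."""
    for p in points:
        j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield j, (p, 1)

def reduce_phase(pairs, old_centers):
    """Reduce: sum points and counts per key, then average into new centres."""
    sums, counts = {}, defaultdict(int)
    for j, (p, n) in pairs:
        acc = sums.get(j, (0.0,) * len(p))
        sums[j] = tuple(a + x for a, x in zip(acc, p))
        counts[j] += n
    # A centre that received no points keeps its previous position.
    return [tuple(s / counts[j] for s in sums[j]) if j in sums else c
            for j, c in enumerate(old_centers)]

# Hypothetical monitoring readings and starting centres.
readings = [(0, 0), (0, 2), (10, 10), (10, 12)]
centers = [(1, 1), (9, 9)]
new_centers = reduce_phase(map_phase(readings, centers), centers)
```

    Because the reduce step only needs per-cluster sums and counts, the map output from different data partitions can be combined in any order, which is what makes the iteration easy to parallelise.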

  16. Model-based clustering for RNA-seq data.

    PubMed

    Si, Yaqing; Liu, Peng; Li, Pinghua; Brutnell, Thomas P

    2014-01-15

    RNA-seq technology has been widely adopted as an attractive alternative to microarray-based methods to study global gene expression. However, robust statistical tools to analyze these complex datasets are still lacking. By grouping genes with similar expression profiles across treatments, cluster analysis provides insight into gene functions and networks, and hence is an important technique for RNA-seq data analysis. In this manuscript, we derive clustering algorithms based on appropriate probability models for RNA-seq data. An expectation-maximization algorithm and another two stochastic versions of expectation-maximization algorithms are described. In addition, a strategy for initialization based on likelihood is proposed to improve the clustering algorithms. Moreover, we present a model-based hybrid-hierarchical clustering method to generate a tree structure that allows visualization of relationships among clusters as well as flexibility of choosing the number of clusters. Results from both simulation studies and analysis of a maize RNA-seq dataset show that our proposed methods provide better clustering results than alternative methods such as the K-means algorithm and hierarchical clustering methods that are not based on probability models. An R package, MBCluster.Seq, has been developed to implement our proposed algorithms. This R package provides fast computation and is publicly available at http://www.r-project.org

  17. An Effective Approach for Clustering InhA Molecular Dynamics Trajectory Using Substrate-Binding Cavity Features

    PubMed Central

    Ruiz, Duncan D. A.; Norberto de Souza, Osmar

    2015-01-01

    Protein receptor conformations, obtained from molecular dynamics (MD) simulations, have become a promising way to treat receptor flexibility explicitly in molecular docking experiments applied to drug discovery and development. However, incorporating the entire ensemble of MD conformations in docking experiments to screen large candidate compound libraries is currently an unfeasible task. Clustering algorithms have been widely used as a means to reduce such ensembles to a manageable size. Most studies investigate different algorithms using pairwise Root-Mean Square Deviation (RMSD) values for all or part of the MD conformations. Nevertheless, RMSD alone may not be the most appropriate gauge to cluster conformations when the target receptor has a plastic active site, since RMSD values are influenced by changes that occur on other parts of the structure. Hence, we have applied two partitioning methods (k-means and k-medoids) and four agglomerative hierarchical methods (Complete linkage, Ward’s, Unweighted Pair Group Method and Weighted Pair Group Method) to analyze and compare the quality of partitions between a data set composed of properties from an enzyme receptor substrate-binding cavity and two data sets created using different RMSD approaches. Ensembles of representative MD conformations were generated by selecting a medoid of each group from all partitions analyzed. We investigated the performance of our new method for evaluating binding conformation of drug candidates to the InhA enzyme, using cross-docking experiments between a 20 ns MD trajectory and 20 different ligands. Statistical analyses showed that the novel ensemble, which is represented by only 0.48% of the MD conformations, was able to reproduce 75% of all dynamic behaviors within the binding cavity for the docking experiments performed. Moreover, this new approach not only outperforms the other two RMSD-clustering solutions, but it also proves to be a promising strategy to distill
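
    Selecting one medoid per group from an existing partition can be sketched as below. The distance matrix and cluster labels are invented for illustration; in the study they would come from the cavity-property or RMSD data sets and from the partitioning or hierarchical methods listed in the abstract.

```python
def medoid(members, dist):
    """Member with the smallest total distance to the other members of its group."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))

def medoids_per_cluster(labels, dist):
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    return {label: medoid(members, dist) for label, members in groups.items()}

# Hypothetical symmetric pairwise distances between four conformations.
dist = [
    [0, 1, 2, 9],
    [1, 0, 1, 9],
    [2, 1, 0, 9],
    [9, 9, 9, 0],
]
labels = [0, 0, 0, 1]  # assumed output of some clustering method
representatives = medoids_per_cluster(labels, dist)
```

    A medoid is always an actual member of the data set, which matters here because a docking experiment needs a real conformation rather than an averaged one.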

  18. BP network identification technology of infrared polarization based on fuzzy c-means clustering

    NASA Astrophysics Data System (ADS)

    Zeng, Haifang; Gu, Guohua; He, Weiji; Chen, Qian; Yang, Wei

    2011-08-01

    Infrared detection systems are frequently employed in surveillance operations and reconnaissance missions to detect particular targets of interest in both civilian and military communities. By incorporating the polarization of light as supplementary information, target discrimination performance can be enhanced. This paper therefore proposes an infrared target identification method based on fuzzy theory and a neural network that uses the polarization properties of targets. The paper uses polarization degree and light intensity to advance the unsupervised KFCM (kernel fuzzy C-means) clustering method and establishes a database of the polarization properties of different materials. In the built network, the system can output the probability distribution over material types for any input polarization degree, such as 10°, 15°, 20°, 25°, 30°. KFCM, which has stronger robustness and accuracy than FCM, introduces the kernel idea and gives noise points and invalid values different but intuitively reasonable weights. Because of differences in the characterization of material properties, there will be some conflicts in classification results, and D-S evidence theory was used to combine the polarization and intensity information. Related results show that the clustering precision and operation rate of KFCM are higher than those of the FCM clustering method. The artificial neural network method realizes material identification, which reasonably solves the problems of complexity in the environmental information of infrared polarization and of inadequate background knowledge and inference rules. This method of polarization identification is fast, self-adaptive and high in resolution.

  19. Organization of the Escherichia coli K-12 gene cluster responsible for production of the extracellular polysaccharide colanic acid.

    PubMed Central

    Stevenson, G; Andrianopoulos, K; Hobbs, M; Reeves, P R

    1996-01-01

    Colanic acid (CA) is an extracellular polysaccharide produced by most Escherichia coli strains as well as by other species of the family Enterobacteriaceae. We have determined the sequence of a 23-kb segment of the E. coli K-12 chromosome which includes the cluster of genes necessary for production of CA. The CA cluster comprises 19 genes. Two other sequenced genes (orf1.3 and galF), which are situated between the CA cluster and the O-antigen cluster, were shown to be unnecessary for CA production. The CA cluster includes genes for synthesis of GDP-L-fucose, one of the precursors of CA, and the gene for one of the enzymes in this pathway (GDP-D-mannose 4,6-dehydratase) was identified by biochemical assay. Six of the inferred proteins show sequence similarity to glycosyl transferases, and two others have sequence similarity to acetyl transferases. Another gene (wzx) is predicted to encode a protein with multiple transmembrane segments and may function in export of the CA repeat unit from the cytoplasm into the periplasm in a process analogous to O-unit export. The first three genes of the cluster are predicted to encode an outer membrane lipoprotein, a phosphatase, and an inner membrane protein with an ATP-binding domain. Since homologs of these genes are found in other extracellular polysaccharide gene clusters, they may have a common function, such as export of polysaccharide from the cell. PMID:8759852

  20. Analyses of Crime Patterns in NIBRS Data Based on a Novel Graph Theory Clustering Method: Virginia as a Case Study

    PubMed Central

    Nolan, Jim

    2014-01-01

    This paper suggests a novel clustering method for analyzing the National Incident-Based Reporting System (NIBRS) data, which include the determination of correlation of different crime types, the development of a likelihood index for crimes to occur in a jurisdiction, and the clustering of jurisdictions based on crime type. The method was tested by using the 2005 assault data from 121 jurisdictions in Virginia as a test case. The analyses of these data show that some different crime types are correlated and some different crime parameters are correlated with different crime types. The analyses also show that certain jurisdictions within Virginia share certain crime patterns. This information assists with constructing a pattern for a specific crime type and can be used to determine whether a jurisdiction may be more likely to see this type of crime occur in their area. PMID:24778585

  1. ClusCo: clustering and comparison of protein models.

    PubMed

    Jamroz, Michal; Kolinski, Andrzej

    2013-02-22

    The development, optimization and validation of protein modeling methods require efficient tools for structural comparison. Frequently, a large number of models need to be compared with the target native structure. The main reason for the development of the ClusCo software was to create a high-throughput tool for all-versus-all comparison, because calculating the similarity matrix is one of the bottlenecks in the protein modeling pipeline. ClusCo is fast and easy-to-use software for high-throughput comparison of protein models with different similarity measures (cRMSD, dRMSD, GDT_TS, TM-Score, MaxSub, Contact Map Overlap) and clustering of the comparison results with standard methods: K-means Clustering or Hierarchical Agglomerative Clustering. The application was highly optimized and written in C/C++, including the code for parallel execution on CPU and GPU, which resulted in a significant speedup over similar clustering and scoring computation programs.

  2. Non-negative Matrix Factorization and Co-clustering: A Promising Tool for Multi-tasks Bearing Fault Diagnosis

    NASA Astrophysics Data System (ADS)

    Shen, Fei; Chen, Chao; Yan, Ruqiang

    2017-05-01

    Classical bearing fault diagnosis methods, being designed for one specific task, always pay attention to the effectiveness of extracted features and the final diagnostic performance. However, most of these approaches suffer from inefficiency when multiple tasks exist, especially in a real-time diagnostic scenario. A fault diagnosis method based on Non-negative Matrix Factorization (NMF) and a Co-clustering strategy is proposed to overcome this limitation. Firstly, some high-dimensional matrices are constructed using Short-Time Fourier Transform (STFT) features, where the dimension of each matrix equals the number of target tasks. Then, the NMF algorithm is carried out to obtain different components in each dimension direction through optimized matching, such as Euclidean distance and divergence distance. Finally, a Co-clustering technique based on information entropy is utilized to classify each component. To verify the effectiveness of the proposed approach, a series of bearing data sets were analysed in this research. The tests indicated that although the diagnostic performance for a single task is comparable to that of traditional clustering methods such as the K-means algorithm and the Gaussian Mixture Model, the accuracy and computational efficiency in multi-task fault diagnosis are improved.

  3. Investigation of dislocation cluster evolution during directional solidification of multicrystalline silicon

    NASA Astrophysics Data System (ADS)

    Oriwol, Daniel; Trempa, Matthias; Sylla, Lamine; Leipner, Hartmut S.

    2017-04-01

    Dislocation clusters are the main crystal defects in multicrystalline silicon and are detrimental to solar cell efficiency. They were formed during the silicon ingot casting due to the relaxation of strain energy. The evolution of the dislocation clusters was studied by means of automated analysing tools of the standard wafer and cell production giving information about the cluster development as a function of the ingot height. Due to the observation of the whole wafer surface the point of view is of macroscopic nature. It was found that the dislocations tend to build clusters of high density which usually expand in diameter as a function of ingot height. According to their structure the dislocation clusters can be divided into light and dense clusters. The appearance of both types shows a clear dependence on the orientation of the grain growth direction. Additionally, a process of annihilation of dislocation clusters during the crystallization has been observed. To complement the macroscopic description, the dislocation clusters were also investigated by TEM. It is shown that the dislocations within the subgrain boundaries are closely arranged. Distances of 40-30 nm were found. These results lead to the conclusion that the dislocation density within the cluster structure is impossible to quantify by means of etch pit counting.

  4. Self consistency grouping: a stringent clustering method

    PubMed Central

    2012-01-01

    Background Numerous types of clustering like single linkage and K-means have been widely studied and applied to a variety of scientific problems. However, the existing methods are not readily applicable for the problems that demand high stringency. Methods Our method, self consistency grouping, i.e. SCG, yields clusters whose members are closer in rank to each other than to any member outside the cluster. We do not define a distance metric; we use the best known distance metric and presume that it measures the correct distance. SCG does not impose any restriction on the size or the number of the clusters that it finds. The boundaries of clusters are determined by the inconsistencies in the ranks. In addition to the direct implementation that finds the complete structure of the (sub)clusters we implemented two faster versions. The fastest version is guaranteed to find only the clusters that are not subclusters of any other clusters and the other version yields the same output as the direct implementation but does so more efficiently. Results Our tests have demonstrated that SCG yields very few false positives. This was accomplished by introducing errors in the distance measurement. Clustering of protein domain representatives by structural similarity showed that SCG could recover homologous groups with high precision. Conclusions SCG has potential for finding biological relationships under stringent conditions. PMID:23320864

  5. α-cluster versus non-α-cluster decay of the excited compound nucleus Ce124* using the dynamical cluster-decay model

    NASA Astrophysics Data System (ADS)

    Kaur, Arshdeep; Chopra, Sahila; Gupta, Raj K.

    2014-03-01

    The dynamical cluster-decay model (DCM), an extended version of the preformed cluster model (PCM) for ground-state (T =0) decays, is applied to study the decay of the proton-rich compound nucleus Ce124* formed in the S32 + Mo92 reaction at an above-barrier beam energy of 150 MeV. Application of the statistical code pace4 to experimental data shows large deviations in all cases of proton clusters' (2p, 3p, and 4p) evaporation residue (ER) and the non-α nucleus Be6 intermediate mass fragment (IMF). Furthermore, the α-nucleus Be8 decay is not observed in this experiment (not even the upper limit is given). Using the DCM, with effects of deformations up to hexadecapole and "compact" orientations included, for the best-fitted cross sections of 2p and 3p ERs and of Li5 and Be6 IMFs, the relative cross section of Be8 is found to be more than that of Be6, possibly due to the α-nucleus structure of Be8. The same is shown to be true for C12 versus C10, i.e., α-nuclei clusters are populated strongly relative to non-α clusters, similar to what was predicted by one of us (R.K.G.) et al. [S. Kumar, D. Bir, and R. K. Gupta, Phys. Rev. C 51, 1762 (1995), 10.1103/PhysRevC.51.1762] for ground-state decays of such nuclei and the decay of Ba116* formed in the Ni58 + Ni58 reaction at various compound nucleus excitation energies [R. K. Gupta et al., J. Phys. G: Nucl. Part. Phys. 32, 345 (2006), 10.1088/0954-3899/32/3/009]. The only parameter of the DCM is the neck-length ΔR, related to the "barrier-lowering" parameter. The compound nucleus formation probability and "barrier-lowering/-modification" effects are analyzed, and the role of varying the deformations of Be6 and/or Be8 nuclei on relative cross sections is studied, since the measured deformations are not available. The ones used here are from relativistic mean-field calculations [β2(6Be )=-0.087 and β2(8Be)=-0.094]. Calculations are also presented for a beam energy of 140 MeV, supporting the above result.

  6. Large Scale Hierarchical K-Means Based Image Retrieval With MapReduce

    DTIC Science & Technology

    2014-03-27

    hadoop distributed file system: Architecture and design, 2007. [10] G. Bradski. Dr. Dobb’s Journal of Software Tools, 2000. [11] Terry Costlow. Big data ...million images running on 20 virtual machines are shown. 15. SUBJECT TERMS Image Retrieval, MapReduce, Hierarchical K-Means, Big Data , Hadoop U U U UU 87...13 2.1.1.2 HDFS Data Representation . . . . . . . . . . . . . . . . 14 2.1.1.3 Hadoop Engine

  7. Principal component and clustering analysis on molecular dynamics data of the ribosomal L11·23S subdomain.

    PubMed

    Wolf, Antje; Kirschner, Karl N

    2013-02-01

    With improvements in computer speed and algorithm efficiency, MD simulations are sampling larger amounts of molecular and biomolecular conformations. Being able to qualitatively and quantitatively sift these conformations into meaningful groups is a difficult and important task, especially when considering the structure-activity paradigm. Here we present a study that combines two popular techniques, principal component (PC) analysis and clustering, for revealing major conformational changes that occur in molecular dynamics (MD) simulations. Specifically, we explored how clustering different PC subspaces affects the resulting clusters versus clustering the complete trajectory data. As a case example, we used the trajectory data from an explicitly solvated simulation of a bacterium's L11·23S ribosomal subdomain, which is a target of thiopeptide antibiotics. Clustering was performed, using K-means and average-linkage algorithms, on data involving the first two to the first five PC subspace dimensions. For the average-linkage algorithm we found that data-point membership, cluster shape, and cluster size depended on the selected PC subspace data. In contrast, K-means provided very consistent results regardless of the selected subspace. Since we present results on a single model system, generalization concerning the clustering of different PC subspaces of other molecular systems is currently premature. However, our hope is that this study illustrates a) the complexities in selecting the appropriate clustering algorithm, b) the complexities in interpreting and validating their results, and c) that by combining PC analysis with subsequent clustering, valuable dynamic and conformational information can be obtained.

  8. Self-similarity Clustering Event Detection Based on Triggers Guidance

    NASA Astrophysics Data System (ADS)

    Zhang, Xianfei; Li, Bicheng; Tian, Yuxuan

    The traditional approach to Event Detection and Characterization (EDC) treats event detection as a classification problem. It uses words as samples to train a classifier, which can lead to an imbalance between positive and negative samples. It also suffers from data sparseness when the corpus is small. Rather than classifying events with words as samples, this paper clusters events to judge event types. It uses self-similarity, guided by event triggers, to converge on the value of K in the K-means algorithm, and optimizes the clustering algorithm. Then, combining named entities and their relative position information, the new method further pinpoints the type of each event. The new method avoids the dependence on event templates found in traditional methods, and its event detection results can be used in automatic text summarization, text retrieval, and topic detection and tracking.

  9. Spatial cluster for clustering the influence factor of birth and death child in Bogor Regency, West Java

    NASA Astrophysics Data System (ADS)

    Bekti, Rokhana Dwi; Rachmawati, Ro'fah

    2014-03-01

    The numbers of child births and deaths are benchmarks for determining and monitoring health and welfare in Indonesia. They can be used to identify groups of people who have a high mortality risk. Identifying such groups is important for comparing the characteristics of people at high and low risk. These characteristics can be seen from the factors that influence them. Several factors influence child births and deaths, such as economic conditions, health facilities, education, and others. The influencing factors differ between individuals, but there are similarities among individuals who live close together or in nearby locations, which indicates a spatial effect. To identify groups in this research, clustering is done by a spatial cluster method, which takes into account the influence of location or the relationship between locations. One such method is Spatial 'K'luster Analysis by Tree Edge Removal (SKATER). The research was conducted in Bogor Regency, West Java. The goal was to obtain clusters of districts based on the factors that influence child births and deaths. SKATER built four clusters, consisting of 26, 7, 2, and 5 districts respectively. SKATER performs well for clustering that includes a spatial effect. Compared with other cluster methods, K-means performs well according to a MANOVA test.

  10. Performance analysis of clustering techniques over microarray data: A case study

    NASA Astrophysics Data System (ADS)

    Dash, Rasmita; Misra, Bijan Bihari

    2018-03-01

    Handling big data is one of the major issues in the field of statistical data analysis. In such investigations, cluster analysis plays a vital role in dealing with large-scale data. There are many clustering techniques with different approaches to cluster analysis, but it is difficult to predict which approach suits a particular dataset. To deal with this problem, a grading approach is introduced over many clustering techniques to identify a stable technique. Because the grading approach depends on the characteristics of the dataset as well as on the validity indices, a two-stage grading approach is implemented. In this study the grading approach is implemented over five clustering techniques: hybrid swarm based clustering (HSC), k-means, partitioning around medoids (PAM), vector quantization (VQ) and agglomerative nesting (AGNES). The experimentation is conducted over five microarray datasets with seven validity indices. The finding of the grading approach that a clustering technique is significant is also established by the Nemenyi post-hoc hypothesis test.

  11. A cluster analysis on road traffic accidents using genetic algorithms

    NASA Astrophysics Data System (ADS)

    Saharan, Sabariah; Baragona, Roberto

    2017-04-01

    The analysis of road traffic accidents is increasingly important because of the cost of accidents and public road safety. The availability of large data sets makes the study of factors that affect the frequency and severity of accidents viable. However, the data are often highly unbalanced and overlapped. We deal with the data set of the road traffic accidents recorded in Christchurch, New Zealand, from 2000-2009 with a total of 26440 accidents. The data are binary, and there are 50 road traffic accident factors with four levels of severity. We used a genetic algorithm for the analysis because we are in the presence of a large unbalanced data set and standard clustering such as the k-means algorithm may not be suitable for the task. The genetic algorithm based on clustering for unknown K (GCUK) has been used to identify the factors associated with accidents of different levels of severity. The results provided us with an interesting insight into the relationship between factors and accident severity level and suggest that the two main factors that contribute to fatal accidents are "Speed greater than 60 km h" and "Did not see other people until it was too late". A comparison with the k-means algorithm and the independent component analysis is performed to validate the results.

  12. GammaM23K, gammaM232K, and gammaL77K single substitutions in the TF1-ATPase lower ATPase activity by disrupting a cluster of hydrophobic side chains.

    PubMed

    Bandyopadhyay, Sanjay; Allison, William S

    2004-07-27

    In crystal structures of the bovine F(1)-ATPase (MF(1)), the side chains of gammaMet(23), gammaMet(232), and gammaLeu(77) interact in a cluster. Substitution of the corresponding residues in the alpha(3)beta(3)gamma subcomplex of TF(1) with lysine lowers the ATPase activity to 2.3, 11, and 15%, respectively, of that displayed by wild-type. In contrast, TF(1) subcomplexes containing the gammaM(23)C, gammaM(232)C, and gammaL(77)C substitutions display 36, 36, and 130%, respectively, of the wild-type ATPase activity. The ATPase activity of the gammaM(23)C/gammaM(232)C double mutant subcomplex is 36% that of the wild-type subcomplex before and after cross-linking the introduced cysteines, whereas the ATPase activity of the gammaM(23)C/L(77)C double mutant increased from 50 to 85% that of wild-type after cross-linking the introduced cysteines. Only beta-beta cross-links formed when the alpha(3)(betaE(395)C)(3)gammaM(23)C double mutant was inactivated with CuCl(2). The overall results suggest that the attenuated ATPase of the mutant subcomplexes containing the gammaM(23)K, gammaL(77)K, and gammaM(232)K substitutions is caused by disruption of the cluster of hydrophobic amino acid side chains and that the midregion of the coiled-coil comprised of the amino- and carboxyl-terminal alpha helices of the gamma subunit does not undergo unwinding or major displacement from the side chain of gammaLeu(77) during ATP-driven rotation of the gamma subunit.

  13. Clustering approaches to identifying gene expression patterns from DNA microarray data.

    PubMed

    Do, Jin Hwan; Choi, Dong-Kug

    2008-04-30

    The analysis of microarray data is essential for large amounts of gene expression data. In this review we focus on clustering techniques. The biological rationale for this approach is the fact that many co-expressed genes are co-regulated, and identifying co-expressed genes could aid in functional annotation of novel genes, de novo identification of transcription factor binding sites and elucidation of complex biological pathways. Co-expressed genes are usually identified in microarray experiments by clustering techniques. There are many such methods, and the results obtained even for the same datasets may vary considerably depending on the algorithms and metrics for dissimilarity measures used, as well as on user-selectable parameters such as the desired number of clusters and initial values. Therefore, biologists who want to interpret microarray data should be aware of the weaknesses and strengths of the clustering methods used. In this review, we survey the basic principles of clustering of DNA microarray data from crisp clustering algorithms such as hierarchical clustering, K-means and self-organizing maps, to complex clustering algorithms like fuzzy clustering.

  14. Galaxy cluster luminosities and colours, and their dependence on cluster mass and merger state

    NASA Astrophysics Data System (ADS)

    Mulroy, Sarah L.; McGee, Sean L.; Gillman, Steven; Smith, Graham P.; Haines, Chris P.; Démoclès, Jessica; Okabe, Nobuhiro; Egami, Eiichi

    2017-12-01

    We study a sample of 19 galaxy clusters in the redshift range 0.15 < z < 0.30 with highly complete spectroscopic membership catalogues (to K < K*(z) + 1.5) from the Arizona Cluster Redshift Survey, individual weak-lensing masses and near-infrared data from the Local Cluster Substructure Survey, and optical photometry from the Sloan Digital Sky Survey. We fit the scaling relations between total cluster luminosity in each of six bandpasses (grizJK) and cluster mass, finding cluster luminosity to be a promising mass proxy with low intrinsic scatter σln L|M of only ∼10-20 per cent for all relations. At fixed overdensity radius, the intercept increases with wavelength, consistent with an old stellar population. The scatter and slope are consistent across all wavelengths, suggesting that cluster colour is not a function of mass. Comparing colour with indicators of the level of disturbance in the cluster, we find a narrower variety in the cluster colours of 'disturbed' clusters than of 'undisturbed' clusters. This trend is more pronounced with indicators sensitive to the initial stages of a cluster merger, e.g. the Dressler Schectman statistic. We interpret this as possible evidence that the total cluster star formation rate is 'standardized' in mergers, perhaps through a process such as a system-wide shock in the intracluster medium.

  15. Block clustering based on difference of convex functions (DC) programming and DC algorithms.

    PubMed

    Le, Hoai Minh; Le Thi, Hoai An; Dinh, Tao Pham; Huynh, Van Ngai

    2013-10-01

    We investigate difference of convex functions (DC) programming and the DC algorithm (DCA) to solve the block clustering problem in the continuous framework, which traditionally requires solving a hard combinatorial optimization problem. DC reformulation techniques and exact penalty in DC programming are developed to build an appropriate equivalent DC program of the block clustering problem. They lead to an elegant and explicit DCA scheme for the resulting DC program. Computational experiments show the robustness and efficiency of the proposed algorithm and its superiority over standard algorithms such as two-mode K-means, two-mode fuzzy clustering, and block classification EM.

  16. Clustering Module in OLAP for Horticultural Crops using SpagoBI

    NASA Astrophysics Data System (ADS)

    Putri, D.; Sitanggang, I. S.

    2017-03-01

    Horticultural crops data are organized by the Ministry of Agriculture, Republic of Indonesia. The data are presented annually in tabular form and result in a large data set. This situation makes it difficult for users to obtain summaries of horticultural crops data. This study aims to develop a clustering module in the SOLAP system for the distribution of horticultural crops in Indonesia and to visualize the results of clustering in a map using SpagoBI. The algorithm used for clustering is K-Means. Horticultural crops data include vegetables, ornamental plants, medicinal plants, and fruits from 2000 to 2013. The clustering module displays clustering results of horticultural crops in the form of text and tables on SpagoBI. This module can also visualize the distribution of horticultural crops in the form of a map on the HTML page. The application is expected to be useful for users in order to easily obtain summaries of the horticultural crops distribution data and its clusters. The summaries and clusters can be beneficial for the stakeholders to determine potential areas in Indonesia for horticultural crops.

  17. WebStruct and VisualStruct: Web interfaces and visualization for Structure software implemented in a cluster environment.

    PubMed

    Jayashree, B; Rajgopal, S; Hoisington, D; Prasanth, V P; Chandra, S

    2008-09-24

    Structure is a widely used software tool to investigate population genetic structure with multi-locus genotyping data. The software uses an iterative algorithm to group individuals into "K" clusters, representing possibly K genetically distinct subpopulations. The serial implementation of this programme is processor-intensive even with small datasets. We describe an implementation of the program within a parallel framework. Speedup was achieved by running different replicates and values of K on each node of the cluster. A web-based user-oriented GUI has been implemented in PHP, through which the user can specify input parameters for the programme. The number of processors to be used can be specified in the background command. A web-based visualization tool "Visualstruct", written in PHP (HTML and JavaScript embedded), allows for the graphical display of population clusters output from Structure, where each individual may be visualized as a line segment with K colors defining its possible genomic composition with respect to the K genetic sub-populations. The advantage over available programs is in the increased number of individuals that can be visualized. The analyses of real datasets indicate a speedup of up to four, when comparing the speed of execution on clusters of eight processors with the speed of execution on one desktop. The software package is freely available to interested users upon request.

  18. PCA based clustering for brain tumor segmentation of T1w MRI images.

    PubMed

    Kaya, Irem Ersöz; Pehlivanlı, Ayça Çakmak; Sekizkardeş, Emine Gezmez; Ibrikci, Turgay

    2017-03-01

    Medical images are huge collections of information that are difficult to store and process, consuming extensive computing time. Therefore, reduction techniques are commonly used as a data pre-processing step to make the image data less complex so that high-dimensional data can be identified by an appropriate low-dimensional representation. PCA is one of the most popular multivariate methods for data reduction. This paper is focused on T1-weighted MRI image clustering for brain tumor segmentation with dimension reduction by different common Principal Component Analysis (PCA) algorithms. Our primary aim is to present a comparison between different variations of PCA algorithms on MRIs for two cluster methods. The five most common PCA algorithms, namely the conventional PCA, Probabilistic Principal Component Analysis (PPCA), Expectation Maximization Based Principal Component Analysis (EM-PCA), Generalized Hebbian Algorithm (GHA), and Adaptive Principal Component Extraction (APEX), were applied to reduce dimensionality in advance of two clustering algorithms, K-Means and Fuzzy C-Means. In the study, the T1-weighted MRI images of the human brain with brain tumor were used for clustering. In addition to the original size of 512 lines and 512 pixels per line, three more different sizes, 256 × 256, 128 × 128 and 64 × 64, were included in the study to examine their effect on the methods. The obtained results were compared in terms of both the reconstruction errors and the Euclidean distance errors among the clustered images containing the same number of principal components. According to the findings, the PPCA obtained the best results among all others. Furthermore, the EM-PCA and the PPCA assisted the K-Means algorithm to accomplish the best clustering performance in the majority as well as achieving significant results with both clustering algorithms for all sizes of T1w MRI images. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  19. {beta}-K{sub 4}La{sub 6}I{sub 14}Os: A new structure type for rare-earth-metal cluster compounds that contains discrete tetrahedral K{sub 4}I{sup 3+} units

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Uma, S.; Corbett, J.D.

    1999-08-23

    Suitable reactions of KI, La, LaI{sub 3}, and Os in niobium tubes at 800--850 C result in black, air- and moisture-sensitive crystals of the quaternary title phase. Isostructural K{sub 4}Pr{sub 6}I{sub 14}Z also exist for Z = Fe, Ru. The title phase was characterized by single-crystal X-ray diffraction (tetragonal, P4/ncc (No. 130), Z = 4; a = 13.117(3), c = 25.17(1) {angstrom} at 23 C). The important structural feature is the constitution (K{sub 4}I){sup 3+}(La{sub 6}I{sub 13}Os{sup 3{minus}}) with a new type of 3D anion network of [(La{sub 6}Os)I{sub 8}{sup i}I{sub 4/2}{sup i{minus}a}I{sub 4/2}{sup a{minus}i}I{sub 2/2}{sup a{minus}a}] clusters that are connected into puckered layers through I{sup i{minus}a} and I{sup a{minus}i} atom pairs that bridge diagonally in the a-b plane. These cluster layers are further interlinked along {rvec c} at trans-vertexes through simple bridging I{sup a{minus}a}. The 14th iodine atom occurs in the unique K{sub 4}I{sup 3+} ions which lie in columns that interpenetrate the La{sub 6}OsI{sub 13} network along c. The present 16-e{sup {minus}} clusters, in contrast with the optimal 18-e{sup {minus}} octahedral cluster configuration, exhibit an uncommon tetragonal elongation and evidently become closed shell, with only a small temperature-independent (van Vleck-like) paramagnetism, {approximately}4 x 10{sup {minus}4} emu mol{sup {minus}1}.

  20. A scalable and practical one-pass clustering algorithm for recommender system

    NASA Astrophysics Data System (ADS)

    Khalid, Asra; Ghazanfar, Mustansar Ali; Azam, Awais; Alahmari, Saad Ali

    2015-12-01

    KMeans clustering-based recommendation algorithms have been proposed claiming to increase the scalability of recommender systems. One potential drawback of these algorithms is that they perform training offline and hence cannot accommodate incremental updates with the arrival of new data, making them unsuitable for dynamic environments. From this line of research, a new clustering algorithm called One-Pass is proposed, which is simple, fast, and accurate. We show empirically that the proposed algorithm outperforms K-Means in terms of recommendation and training time while maintaining a good level of accuracy.

  1. A clustering-based graph Laplacian framework for value function approximation in reinforcement learning.

    PubMed

    Xu, Xin; Huang, Zhenhua; Graves, Daniel; Pedrycz, Witold

    2014-12-01

    In order to deal with the sequential decision problems with large or continuous state spaces, feature representation and function approximation have been a major research topic in reinforcement learning (RL). In this paper, a clustering-based graph Laplacian framework is presented for feature representation and value function approximation (VFA) in RL. By making use of clustering-based techniques, that is, K-means clustering or fuzzy C-means clustering, a graph Laplacian is constructed by subsampling in Markov decision processes (MDPs) with continuous state spaces. The basis functions for VFA can be automatically generated from spectral analysis of the graph Laplacian. The clustering-based graph Laplacian is integrated with a class of approximation policy iteration algorithms called representation policy iteration (RPI) for RL in MDPs with continuous state spaces. Simulation and experimental results show that, compared with previous RPI methods, the proposed approach needs fewer sample points to compute an efficient set of basis functions and the learning control performance can be improved for a variety of parameter settings.

  2. Subtypes of female juvenile offenders: a cluster analysis of the Millon Adolescent Clinical Inventory.

    PubMed

    Stefurak, Tres; Calhoun, Georgia B

    2007-01-01

    The current study sought to explore subtypes of adolescents within a sample of female juvenile offenders. Using the Millon Adolescent Clinical Inventory with 101 female juvenile offenders, a two-step cluster analysis was performed beginning with a Ward's method hierarchical cluster analysis followed by a K-Means iterative partitioning cluster analysis. The results suggest an optimal three-cluster solution, with cluster profiles leading to the following group labels: Externalizing Problems, Depressed/Interpersonally Ambivalent, and Anxious Prosocial. Analyses along the factors of age, race, offense typology and offense chronicity were conducted to further understand the nature of the clusters found. Only the effect for race was significant, with the Anxious Prosocial and Depressed/Interpersonally Ambivalent clusters appearing disproportionately comprised of African American girls. To establish external validity, clusters were compared across scales of the Behavioral Assessment System for Children - Self Report of Personality, and corroborative distinctions between clusters were found here.
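
    The second stage of a two-step analysis like the one in this record can be sketched as follows. The abstract does not say how the hierarchical solution was passed to K-Means; this sketch assumes the usual approach of using the centroids of the first-stage partition as K-Means starting points. The initial labels here are a hypothetical stand-in for the output of Ward's method, and the function names and toy data are illustrative only.

```python
import math

def cluster_means(points, labels, fallback):
    """Mean of each cluster; an empty cluster keeps its fallback centroid."""
    k = len(fallback)
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point, label in zip(points, labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += point[d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(fallback[c])
        for c in range(k)
    ]

def kmeans_from_partition(points, initial_labels, k, max_iter=100):
    """Lloyd-style K-means started from the centroids of a given partition."""
    if set(initial_labels) != set(range(k)):
        raise ValueError("initial partition must use every label 0..k-1")
    labels = list(initial_labels)
    # The fallback is unused on this first call: every cluster is non-empty.
    centroids = cluster_means(points, labels, [points[0]] * k)
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        centroids = cluster_means(points, labels, centroids)
    return labels, centroids

# Toy data: two obvious groups; the third point is "mislabeled" by stage one.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
stage_one_labels = [0, 0, 1, 1, 1, 1]  # hypothetical hierarchical result
labels, centroids = kmeans_from_partition(points, stage_one_labels, k=2)
```

    Seeding from a hierarchical solution removes the dependence on random starting points, and the iterative step can then reassign cases that the hierarchical stage, which never revisits a merge, placed poorly.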

  3. Study on Adaptive Parameter Determination of Cluster Analysis in Urban Management Cases

    NASA Astrophysics Data System (ADS)

    Fu, J. Y.; Jing, C. F.; Du, M. Y.; Fu, Y. L.; Dai, P. P.

    2017-09-01

    Fine-grained management of cities is an important way to realize the smart city. Data mining that applies spatial clustering analysis to urban management cases can be used to evaluate the deployment of urban public facilities, support policy decisions, and provide technical support for fine-grained city management. To address the problem that the density-based DBSCAN algorithm cannot determine its parameters adaptively, this paper proposes an optimization method for adaptive parameter determination based on spatial analysis. First, Ripley's K function is analysed for the data set to realize adaptive determination of the global parameter MinPts, which means setting the maximum aggregation scale as the range of data clustering. Next, for every point object, the highest-frequency K value within the range Eps is calculated using a K-D tree and set as the clustering density value, to realize the adaptive determination of the global parameter MinPts. The R language was then used to optimize the above process to accomplish the precise clustering of typical urban management cases. Experimental results based on typical urban management cases in the XiCheng district of Beijing show that the new DBSCAN clustering algorithm presented in this paper takes full account of the spatial and statistical characteristics of data with an obvious clustering feature, and has better applicability and high quality. The results of the study are helpful not only for the formulation of urban management policies and the allocation of urban management supervisors in XiCheng District of Beijing, but also for other cities and related fields.
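
    The Ripley's K step mentioned in this record can be illustrated with a naive estimator. This is a sketch under stated assumptions, not the authors' R implementation: it uses the uncorrected estimator K(r) = A / (n(n-1)) times the number of ordered point pairs within distance r, applies the standard transform L(r) = sqrt(K(r) / pi), and reads "maximum aggregation scale" as the radius where L(r) - r peaks. That reading, the absence of edge correction, and all names and data below are assumptions.

```python
import math
from itertools import combinations

def ripley_k(points, radii, area):
    """Naive Ripley's K (no edge correction) at each radius; needs n >= 2."""
    n = len(points)
    dists = [math.dist(a, b) for a, b in combinations(points, 2)]
    # Each unordered pair counts twice (i -> j and j -> i).
    return [area * 2 * sum(d <= r for d in dists) / (n * (n - 1)) for r in radii]

def peak_aggregation_scale(points, radii, area):
    """Radius maximizing L(r) - r, where L(r) = sqrt(K(r) / pi)."""
    k_values = ripley_k(points, radii, area)
    excess = [math.sqrt(k / math.pi) - r for k, r in zip(k_values, radii)]
    best = max(range(len(radii)), key=lambda i: excess[i])
    return radii[best], excess[best]

# Two tight hypothetical clumps inside a 10 x 10 study window.
clump = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
points = [(2 + dx, 2 + dy) for dx, dy in clump] + [(8 + dx, 8 + dy) for dx, dy in clump]
radii = [0.2, 0.5, 1.0, 2.0, 5.0]
scale, excess = peak_aggregation_scale(points, radii, area=100.0)
```

    Under complete spatial randomness K(r) is approximately pi r^2, so L(r) - r is near zero; positive values indicate clustering at scale r. A production analysis would add edge correction and a simulation envelope before trusting the peak.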

  4. Clustering Of Left Ventricular Wall Motion Patterns

    NASA Astrophysics Data System (ADS)

    Bjelogrlic, Z.; Jakopin, J.; Gyergyek, L.

    1982-11-01

    A method for detection of wall regions with similar motion was presented. A model based on local direction information was used to measure the left ventricular wall motion from a cineangiographic sequence. Three time functions were used to define segmental motion patterns: the distance of a ventricular contour segment from the mean contour, the velocity of a segment and its acceleration. Motion patterns were clustered by the UPGMA algorithm and by an algorithm based on the K-nearest neighbor classification rule.

  5. Combining Mixture Components for Clustering*

    PubMed Central

    Baudry, Jean-Patrick; Raftery, Adrian E.; Celeux, Gilles; Lo, Kenneth; Gottardo, Raphaël

    2010-01-01

    Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K. These clusterings can be compared on substantive grounds, and we also describe an automatic way of selecting the number of clusters via a piecewise linear regression fit to the rescaled entropy plot. We illustrate the method with simulated data and a flow cytometry dataset. Supplemental Materials are available on the journal Web site and described at the end of the paper. PMID:20953302
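
    The hierarchical combining step described in this record can be sketched in a few lines. The sketch assumes the entropy of a soft clustering is ENT = -sum_i sum_k t_ik log t_ik, where t_ik is the posterior probability that observation i belongs to component k, and that at each step the two components whose merger (adding their posterior columns) gives the largest entropy decrease are combined. Fitting the mixture and choosing K by BIC are not shown; the posterior matrix below is made up for illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations

def xlogx(t):
    """t * log(t), with the convention 0 * log(0) = 0."""
    return t * math.log(t) if t > 0 else 0.0

def entropy(post):
    """Entropy of a soft clustering given an n x K posterior matrix."""
    return -sum(xlogx(t) for row in post for t in row)

def merge_once(post):
    """Merge the pair of components that most reduces the entropy."""
    k = len(post[0])

    def gain(pair):
        a, b = pair
        return sum(
            xlogx(row[a] + row[b]) - xlogx(row[a]) - xlogx(row[b]) for row in post
        )

    a, b = max(combinations(range(k), 2), key=gain)
    merged = [
        [t for j, t in enumerate(row) if j not in (a, b)] + [row[a] + row[b]]
        for row in post
    ]
    return merged, (a, b)

def merge_hierarchy(post):
    """(number of clusters, entropy) from K clusters down to one."""
    steps = [(len(post[0]), entropy(post))]
    while len(post[0]) > 1:
        post, _ = merge_once(post)
        steps.append((len(post[0]), entropy(post)))
    return steps

# Hypothetical posteriors: components 0 and 1 overlap, component 2 is separate.
post = [
    [0.5, 0.5, 0.0],
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.05, 0.95],
]
first_merge = merge_once(post)[1]
steps = merge_hierarchy(post)
```

    The list of (cluster count, entropy) pairs is the raw material for an entropy plot of the kind the abstract mentions; a sharp drop followed by a flat tail suggests where merging stops being worthwhile. The piecewise linear regression on the rescaled plot is not reproduced here.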

  6. Galaxy properties in clusters. II. Backsplash galaxies

    NASA Astrophysics Data System (ADS)

    Muriel, H.; Coenda, V.

    2014-04-01

    Aims: We explore the properties of galaxies on the outskirts of clusters and their dependence on recent dynamical history in order to understand the real impact that the cluster core has on the evolution of galaxies. Methods: We analyse the properties of more than 1000 galaxies brighter than M0.1r = - 19.6 on the outskirts of 90 clusters (1 < r/rvir < 2) in the redshift range 0.05 < z < 0.10. Using the line of sight velocity of galaxies relative to the cluster's mean, we selected low and high velocity subsamples. Theoretical predictions indicate that a significant fraction of the first subsample should be backsplash galaxies, that is, objects that have already orbited near the cluster centre. A significant proportion of the sample of high relative velocity (HV) galaxies seems to be composed of infalling objects. Results: Our results suggest that, at fixed stellar mass, late-type galaxies in the low-velocity (LV) sample are systematically older, redder, and have formed fewer stars during the last 3 Gyrs than galaxies in the HV sample. This result is consistent with models that assume that the central regions of clusters are effective in quenching the star formation by means of processes such as ram pressure stripping or strangulation. At fixed stellar mass, LV galaxies show some evidence of having higher surface brightness and smaller size than HV galaxies. These results are consistent with the scenario where galaxies that have orbited the central regions of clusters are more likely to suffer tidal effects, producing loss of mass as well as a re-distribution of matter towards more compact configurations. Finally, we found a higher fraction of ET galaxies in the LV sample, supporting the idea that the central region of clusters of galaxies may contribute to the transformation of morphological types towards earlier types.

  7. Relative efficiency and sample size for cluster randomized trials with variable cluster sizes.

    PubMed

    You, Zhiying; Williams, O Dale; Aban, Inmaculada; Kabagambe, Edmond Kato; Tiwari, Hemant K; Cutter, Gary

    2011-02-01

    The statistical power of cluster randomized trials depends on two sample size components, the number of clusters per group and the numbers of individuals within clusters (cluster size). Variable cluster sizes are common and this variation alone may have significant impact on study power. Previous approaches have taken this into account by either adjusting total sample size using a designated design effect or adjusting the number of clusters according to an assessment of the relative efficiency of unequal versus equal cluster sizes. This article defines a relative efficiency of unequal versus equal cluster sizes using noncentrality parameters, investigates properties of this measure, and proposes an approach for adjusting the required sample size accordingly. We focus on comparing two groups with normally distributed outcomes using t-test, and use the noncentrality parameter to define the relative efficiency of unequal versus equal cluster sizes and show that statistical power depends only on this parameter for a given number of clusters. We calculate the sample size required for an unequal cluster sizes trial to have the same power as one with equal cluster sizes. Relative efficiency based on the noncentrality parameter is straightforward to calculate and easy to interpret. It connects the required mean cluster size directly to the required sample size with equal cluster sizes. Consequently, our approach first determines the sample size requirements with equal cluster sizes for a pre-specified study power and then calculates the required mean cluster size while keeping the number of clusters unchanged. Our approach allows adjustment in mean cluster size alone or simultaneous adjustment in mean cluster size and number of clusters, and is a flexible alternative to and a useful complement to existing methods. Comparison indicated that we have defined a relative efficiency that is greater than the relative efficiency in the literature under some conditions. 
Our measure

  8. Modulated Modularity Clustering as an Exploratory Tool for Functional Genomic Inference

    PubMed Central

    Stone, Eric A.; Ayroles, Julien F.

    2009-01-01

    In recent years, the advent of high-throughput assays, coupled with their diminishing cost, has facilitated a systems approach to biology. As a consequence, massive amounts of data are currently being generated, requiring efficient methodology aimed at the reduction of scale. Whole-genome transcriptional profiling is a standard component of systems-level analyses, and to reduce scale and improve inference clustering genes is common. Since clustering is often the first step toward generating hypotheses, cluster quality is critical. Conversely, because the validation of cluster-driven hypotheses is indirect, it is critical that quality clusters not be obtained by subjective means. In this paper, we present a new objective-based clustering method and demonstrate that it yields high-quality results. Our method, modulated modularity clustering (MMC), seeks community structure in graphical data. MMC modulates the connection strengths of edges in a weighted graph to maximize an objective function (called modularity) that quantifies community structure. The result of this maximization is a clustering through which tightly-connected groups of vertices emerge. Our application is to systems genetics, and we quantitatively compare MMC both to the hierarchical clustering method most commonly employed and to three popular spectral clustering approaches. We further validate MMC through analyses of human and Drosophila melanogaster expression data, demonstrating that the clusters we obtain are biologically meaningful. We show MMC to be effective and suitable to applications of large scale. In light of these features, we advocate MMC as a standard tool for exploration and hypothesis generation. PMID:19424432
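
    As a rough illustration of the objective mentioned above, the sketch below computes the standard Newman modularity of a partition of a weighted graph. It is only the textbook quantity; how MMC modulates the edge weights before maximizing it is not shown, and the function name and toy graph are hypothetical.

```python
def modularity(weights, labels):
    """Newman modularity Q for a symmetric weighted adjacency matrix."""
    n = len(weights)
    strength = [sum(row) for row in weights]  # weighted degree of each vertex
    total = sum(strength)                     # equals twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += weights[i][j] - strength[i] * strength[j] / total
    return q / total

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
W = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    W[a][b] = W[b][a] = 1.0

q_split = modularity(W, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_single = modularity(W, [0, 0, 0, 0, 0, 0])  # everything in one community
print(q_split, q_single)
```

    A partition that follows the tightly connected groups scores higher than one that ignores them, which is why maximizing this quantity yields a clustering.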

  9. Large hydrogen-bonded pre-nucleation (HSO4-)(H2SO4)m(H2O)k and (HSO4-)(NH3)(H2SO4)m(H2O)k clusters in the earth's atmosphere.

    PubMed

    Herb, Jason; Xu, Yisheng; Yu, Fangqun; Nadykto, A B

    2013-01-10

    The importance of pre-nucleation cluster stability as the key parameter controlling nucleation of atmospheric airborne ions is well-established. In this Article, large ternary ionic (HSO(4)(-))(H(2)SO(4))(m)(NH(3))(H(2)O)(n) clusters have been studied using Density Functional Theory (DFT) and composite ab initio methods. Twenty classes of clusters have been investigated, and thermochemical properties of common atmospheric (HSO(4)(-))(H(2)SO(4))(m)(NH(3))(0)(H(2)O)(k) and (HSO(4)(-))(H(2)SO(4))(m)(NH(3))(1)(H(2)O)(n) clusters (with m, k, and n up to 3) have been obtained. A large amount of new thermochemical and structural data ready-to-use for constraining kinetic nucleation models has been reported. We have performed a comprehensive thermochemical analysis of the obtained data and have investigated the impacts of ammonia and negatively charged bisulfate ion on stability of binary clusters in some detail. The comparison of theoretical predictions and experiments shows that the PW91PW91/6-311++G(3df,3pd) results are in very good agreement with both experimental data and high level ab initio CCSD(T)/CBS values and suggest that the PW91PW91/6-311++G(3df,3pd) method is a viable alternative to higher level ab initio methods in studying large pre-nucleation clusters, for which the higher level computations are prohibitively expensive. The uncertainties in both theory and experiments have been investigated, and possible ways of their reduction have been proposed.

  10. Missing continuous outcomes under covariate dependent missingness in cluster randomised trials

    PubMed Central

    Diaz-Ordaz, Karla; Bartlett, Jonathan W

    2016-01-01

    Attrition is a common occurrence in cluster randomised trials which leads to missing outcome data. Two approaches for analysing such trials are cluster-level analysis and individual-level analysis. This paper compares the performance of unadjusted cluster-level analysis, baseline covariate adjusted cluster-level analysis and linear mixed model analysis, under baseline covariate dependent missingness in continuous outcomes, in terms of bias, average estimated standard error and coverage probability. The methods of complete records analysis and multiple imputation are used to handle the missing outcome data. We considered four scenarios, with the missingness mechanism and baseline covariate effect on outcome either the same or different between intervention groups. We show that both unadjusted cluster-level analysis and baseline covariate adjusted cluster-level analysis give unbiased estimates of the intervention effect only if both intervention groups have the same missingness mechanisms and there is no interaction between baseline covariate and intervention group. Linear mixed model and multiple imputation give unbiased estimates under all four considered scenarios, provided that an interaction of intervention and baseline covariate is included in the model when appropriate. Cluster mean imputation has been proposed as a valid approach for handling missing outcomes in cluster randomised trials. We show that cluster mean imputation only gives unbiased estimates when missingness mechanism is the same between the intervention groups and there is no interaction between baseline covariate and intervention group. Multiple imputation shows overcoverage for small number of clusters in each intervention group. PMID:27177885

  11. Missing continuous outcomes under covariate dependent missingness in cluster randomised trials.

    PubMed

    Hossain, Anower; Diaz-Ordaz, Karla; Bartlett, Jonathan W

    2017-06-01

    Attrition is a common occurrence in cluster randomised trials which leads to missing outcome data. Two approaches for analysing such trials are cluster-level analysis and individual-level analysis. This paper compares the performance of unadjusted cluster-level analysis, baseline covariate adjusted cluster-level analysis and linear mixed model analysis, under baseline covariate dependent missingness in continuous outcomes, in terms of bias, average estimated standard error and coverage probability. The methods of complete records analysis and multiple imputation are used to handle the missing outcome data. We considered four scenarios, with the missingness mechanism and baseline covariate effect on outcome either the same or different between intervention groups. We show that both unadjusted cluster-level analysis and baseline covariate adjusted cluster-level analysis give unbiased estimates of the intervention effect only if both intervention groups have the same missingness mechanisms and there is no interaction between baseline covariate and intervention group. Linear mixed model and multiple imputation give unbiased estimates under all four considered scenarios, provided that an interaction of intervention and baseline covariate is included in the model when appropriate. Cluster mean imputation has been proposed as a valid approach for handling missing outcomes in cluster randomised trials. We show that cluster mean imputation only gives unbiased estimates when missingness mechanism is the same between the intervention groups and there is no interaction between baseline covariate and intervention group. Multiple imputation shows overcoverage for small number of clusters in each intervention group.

  12. Dating star clusters in the Small Magellanic Cloud by means of integrated spectra

    NASA Astrophysics Data System (ADS)

    Ahumada, A. V.; Clariá, J. J.; Bica, E.; Dutra, C. M.

    2002-10-01

    In this study flux-calibrated integrated spectra in the range (3600-6800) Å are presented for 16 concentrated star clusters in the Small Magellanic Cloud (SMC), approximately half of which constitute unstudied objects. We have estimated ages and foreground interstellar reddening values from the comparison of the line strengths and continuum distribution of the cluster spectra with those of template cluster spectra with known parameters. Most of the sample clusters are young blue clusters (6-50 Myr), while L 28, NGC 643 and L 114 are found to be intermediate-age clusters (1-6 Gyr). One well known SMC cluster (NGC 416) was observed for comparison purposes. The sample includes clusters in the surroundings and main body of the SMC, and the derived foreground reddening values are in the range 0.00 <= E(B-V) <= 0.15. The present data also make up a cluster spectral library at SMC metallicity. Based on observations made at Complejo Astronómico El Leoncito, which is operated under agreement between the Consejo Nacional de Investigaciones Científicas y Técnicas de la República Argentina and the National Universities of La Plata, Córdoba and San Juan, Argentina.

  13. Optimal Partitioning of a Data Set Based on the "p"-Median Model

    ERIC Educational Resources Information Center

    Brusco, Michael J.; Kohn, Hans-Friedrich

    2008-01-01

    Although the "K"-means algorithm for minimizing the within-cluster sums of squared deviations from cluster centroids is perhaps the most common method for applied cluster analyses, a variety of other criteria are available. The "p"-median model is an especially well-studied clustering problem that requires the selection of "p" objects to serve as…
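
    The two criteria named above can be compared with a small sketch. The K-means objective follows the wording of the abstract. The p-median objective uses the common textbook formulation (sum of distances from each object to the nearest of p selected objects), which is an assumption here because the abstract is truncated. All names and data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_objective(points, labels):
    """Within-cluster sum of squared deviations from cluster centroids."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(sq_dist(p, centroid) for p in members)
    return total

def p_median_objective(points, median_idx):
    """Sum of distances from each point to its nearest selected object."""
    return sum(min(sq_dist(p, points[m]) ** 0.5 for m in median_idx)
               for p in points)

# Hypothetical data: two pairs of points far apart from each other.
points = [(0.0, 0.0), (0.0, 4.0), (10.0, 0.0), (10.0, 4.0)]
wcss = kmeans_objective(points, [0, 0, 1, 1])
pmed = p_median_objective(points, [0, 2])  # objects 0 and 2 chosen as medians
print(wcss, pmed)
```

    The first criterion measures spread around computed centroids, while the second measures distance to actual data objects, so the two can rank candidate partitions differently.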

  14. A segmentation and classification scheme for single tooth in MicroCT images based on 3D level set and k-means++.

    PubMed

    Wang, Liansheng; Li, Shusheng; Chen, Rongzhen; Liu, Sze-Yu; Chen, Jyh-Cheng

    2017-04-01

    Accurate classification of different anatomical structures of teeth from medical images provides crucial information for stress analysis in dentistry. Usually, the anatomical structures of teeth are manually labeled by experienced clinical doctors, which is time consuming. However, automatic segmentation and classification is a challenging task because the anatomical structures and surroundings of the tooth in medical images are rather complex. Therefore, in this paper, we propose an effective framework designed to segment the tooth with a Selective Binary and Gaussian Filtering Regularized Level Set (GFRLS) method, improved by fully utilizing three-dimensional (3D) information, and to classify the tooth by unsupervised learning, i.e., the k-means++ method. To evaluate the proposed method, experiments are conducted on extensive datasets of mandibular molars. The experimental results show that our method achieves higher accuracy and robustness than three other clustering methods. Copyright © 2016 Elsevier Ltd. All rights reserved.
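
    For readers unfamiliar with k-means++, the sketch below shows its seeding step only: each new seed is drawn with probability proportional to its squared distance from the nearest seed already chosen. It does not reproduce the paper's pipeline, and the function name and sample points are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seeds(points, k, rng):
    """Choose k initial centres with k-means++ (D^2-weighted) sampling.

    Assumes the points are distinct and k does not exceed their number.
    """
    seeds = [rng.choice(points)]  # first seed: uniform at random
    while len(seeds) < k:
        # Squared distance of every point to its nearest existing seed.
        d2 = [min(sq_dist(p, s) for s in seeds) for p in points]
        # Already-chosen points have weight 0 and cannot be drawn again.
        seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds

# Hypothetical 2D feature vectors forming three loose groups.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1), (10.0, 0.0)]
seeds = kmeans_pp_seeds(points, 3, random.Random(0))
print(seeds)
```

    The seeds would then be passed to ordinary k-means iterations; spreading them out this way tends to reduce the risk of poor local minima compared with uniform random initialization.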

  15. Photodissociation of nitromethane cluster anions.

    PubMed

    Goebbert, Daniel J; Khuseynov, Dmitry; Sanov, Andrei

    2010-08-28

    Three types of anionic fragments are observed in the photodissociation of nitromethane cluster anions, (CH(3)NO(2))(n)(-), n=1-6, at 355 nm: NO(2)(-)(CH(3)NO(2))(k), (CH(3)NO(2))(k)(-), and OH(-) (kclusters containing a monomer-anion core, CH(3)NO(2)(-), solvated by n-1 neutral nitromethane molecules. The NO(2)(-)(CH(3)NO(2))(k) and OH(-) fragments formed from these clusters are described as core-dissociation products, while the (CH(3)NO(2))(k)(-) fragments are attributed to energy transfer from excited CH(3)NO(2)(-) into the solvent network or a core-dissociation-recombination (caging) mechanism. As with other cluster families, the fraction of caged photofragments shows an overall increase with increasing cluster size. The low-lying A(2)A' and/or B(2)A' electronic states of CH(3)NO(2)(-) are believed responsible for photoabsorption leading to dissociation to NO(2)(-) based fragments, while the C(2)A" state is a candidate for the OH(-) pathway. Compared to neutral nitromethane, the photodissociation of CH(3)NO(2)(-) requires lower energy photons because the photochemically active electron occupies a high energy pi* orbital (which is vacant in the neutral). Although the electronic states in the photodissociation of CH(3)NO(2) and CH(3)NO(2)(-) are different, the major fragments, CH(3)+NO(2) and CH(3)+NO(2)(-), respectively, both form via C-N bond cleavage.

  16. Study of parameters of the nearest neighbour shared algorithm on clustering documents

    NASA Astrophysics Data System (ADS)

    Mustika Rukmi, Alvida; Budi Utomo, Daryono; Imro’atus Sholikhah, Neni

    2018-03-01

    Document clustering is one way of automatically managing documents, extracting document topics, and quickly filtering information. Text-mining preprocessing for document clustering consists of keyword extraction using Rapid Automatic Keyphrase Extraction (RAKE) and representing each document as a concept vector using Latent Semantic Analysis (LSA). Clustering is then performed on the preprocessed documents so that documents with similar topics fall in the same cluster. The Shared Nearest Neighbour (SNN) algorithm is a clustering method based on the number of "nearest neighbors" shared. The parameters of the SNN algorithm are: k, the number of nearest neighbor documents; ε, the number of shared nearest neighbor documents; and MinT, the minimum number of similar documents that can form a cluster. The SNN algorithm is characterized by its use of shared 'neighbor' properties. Each cluster is formed by keywords that are shared by the documents. The SNN algorithm allows a cluster to be built on more than one keyword if the frequency of those keywords in the documents is also high. The choice of parameter values affects the document clustering results. A higher value of k increases the number of neighbor documents of each document, which lowers the similarity among neighboring documents; the accuracy of each cluster is then also low. A higher value of ε causes each document to keep only neighbor documents that have a high similarity when building a cluster, and also produces more unclassified documents (noise). A higher value of MinT decreases the number of clusters, since groups with fewer than MinT similar documents cannot form clusters. The parameters of the SNN algorithm determine the performance of the clustering result and the amount of noise (unclustered documents). The Silhouette coefficient shows almost the same result in many experiments, above 0.9, which means that the SNN algorithm works well.
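
    The roles of k and ε can be made concrete with a minimal sketch of shared-nearest-neighbour similarity. It assumes a precomputed document similarity matrix and counts each document as a member of its own neighbour set, which is one common convention and not necessarily the one used in the study. The MinT filter is not shown, and the names and toy matrix are hypothetical.

```python
def knn_sets(sim, k):
    """For each item, the set holding itself and its k most similar items."""
    n = len(sim)
    sets = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: -sim[i][j])
        sets.append({i, *others[:k]})
    return sets

def snn_similarity(sim, k):
    """Number of shared nearest neighbours for every pair of items."""
    nn = knn_sets(sim, k)
    n = len(sim)
    return [[len(nn[i] & nn[j]) if i != j else 0 for j in range(n)]
            for i in range(n)]

def strong_links(sim, k, eps):
    """Pairs whose shared-neighbour count reaches the threshold eps."""
    snn = snn_similarity(sim, k)
    n = len(sim)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if snn[i][j] >= eps]

# Hypothetical similarities: documents 0-2 share a topic, as do 3 and 4.
sim = [
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.20, 0.10],
    [0.80, 0.85, 1.00, 0.15, 0.10],
    [0.10, 0.20, 0.15, 1.00, 0.70],
    [0.20, 0.10, 0.10, 0.70, 1.00],
]
links = strong_links(sim, k=2, eps=2)
print(links)
```

    Raising eps removes the weaker links first, which matches the observation above that a higher ε leaves more documents unclustered.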

  17. Clustering Molecular Dynamics Trajectories for Optimizing Docking Experiments

    PubMed Central

    De Paris, Renata; Quevedo, Christian V.; Ruiz, Duncan D.; Norberto de Souza, Osmar; Barros, Rodrigo C.

    2015-01-01

    Molecular dynamics simulations of protein receptors have become an attractive tool for rational drug discovery. However, the high computational cost of employing molecular dynamics trajectories in virtual screening of large repositories threatens the feasibility of this task. Computational intelligence techniques have been applied in this context, with the ultimate goal of reducing the overall computational cost so the task can become feasible. Particularly, clustering algorithms have been widely used as a means to reduce the dimensionality of molecular dynamics trajectories. In this paper, we develop a novel methodology for clustering entire trajectories using structural features from the substrate-binding cavity of the receptor in order to optimize docking experiments on a cloud-based environment. The resulting partition was selected based on three clustering validity criteria, and it was further validated by analyzing the interactions between 20 ligands and a fully flexible receptor (FFR) model containing a 20 ns molecular dynamics simulation trajectory. Our proposed methodology shows that taking into account features of the substrate-binding cavity as input for the k-means algorithm is a promising technique for accurately selecting ensembles of representative structures tailored to a specific ligand. PMID:25873944

  18. Clustering molecular dynamics trajectories for optimizing docking experiments.

    PubMed

    De Paris, Renata; Quevedo, Christian V; Ruiz, Duncan D; Norberto de Souza, Osmar; Barros, Rodrigo C

    2015-01-01

    Molecular dynamics simulations of protein receptors have become an attractive tool for rational drug discovery. However, the high computational cost of employing molecular dynamics trajectories in virtual screening of large repositories threatens the feasibility of this task. Computational intelligence techniques have been applied in this context, with the ultimate goal of reducing the overall computational cost so the task can become feasible. Particularly, clustering algorithms have been widely used as a means to reduce the dimensionality of molecular dynamics trajectories. In this paper, we develop a novel methodology for clustering entire trajectories using structural features from the substrate-binding cavity of the receptor in order to optimize docking experiments on a cloud-based environment. The resulting partition was selected based on three clustering validity criteria, and it was further validated by analyzing the interactions between 20 ligands and a fully flexible receptor (FFR) model containing a 20 ns molecular dynamics simulation trajectory. Our proposed methodology shows that taking into account features of the substrate-binding cavity as input for the k-means algorithm is a promising technique for accurately selecting ensembles of representative structures tailored to a specific ligand.

  19. Sun Protection Belief Clusters: Analysis of Amazon Mechanical Turk Data.

    PubMed

    Santiago-Rivas, Marimer; Schnur, Julie B; Jandorf, Lina

    2016-12-01

    This study aimed (i) to determine whether people could be differentiated on the basis of their sun protection belief profiles and individual characteristics and (ii) explore the use of a crowdsourcing web service for the assessment of sun protection beliefs. A sample of 500 adults completed an online survey of sun protection belief items using Amazon Mechanical Turk. A two-phased cluster analysis (i.e., hierarchical and non-hierarchical K-means) was utilized to determine clusters of sun protection barriers and facilitators. Results yielded three distinct clusters of sun protection barriers and three distinct clusters of sun protection facilitators. Significant associations between gender, age, sun sensitivity, and cluster membership were identified. Results also showed an association between barrier and facilitator cluster membership. The results of this study provided a potential alternative approach to developing future sun protection promotion initiatives in the population. Findings add to our knowledge regarding individuals who support, oppose, or are ambivalent toward sun protection and inform intervention research by identifying distinct subtypes that may best benefit from (or have a higher need for) skin cancer prevention efforts.

  20. Mean Occupation Function of High-redshift Quasars from the Planck Cluster Catalog

    NASA Astrophysics Data System (ADS)

    Chakraborty, Priyanka; Chatterjee, Suchetana; Dutta, Alankar; Myers, Adam D.

    2018-06-01

    We characterize the distribution of quasars within dark matter halos using a direct measurement technique for the first time at redshifts as high as z ∼ 1. Using the Planck Sunyaev-Zeldovich (SZ) catalog for galaxy groups and the Sloan Digital Sky Survey (SDSS) DR12 quasar data set, we assign host clusters/groups to the quasars and make a measurement of the mean number of quasars within dark matter halos as a function of halo mass. We find that a simple power-law fit of $\log\langle N\rangle = (2.11 \pm 0.01)\,\log(M) - (32.77 \pm 0.11)$ can be used to model the quasar fraction in dark matter halos. This suggests that the quasar fraction increases monotonically as a function of halo mass even to redshifts as high as z ∼ 1.
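
    The quoted fit can be evaluated directly, as in the sketch below. It assumes base-10 logarithms and takes the halo mass in whatever units the fit was derived in; neither is stated in the abstract. Uncertainties on the coefficients are ignored, and the function name is hypothetical.

```python
import math

SLOPE = 2.11        # central value of the fitted slope
INTERCEPT = -32.77  # central value of the fitted intercept

def mean_quasar_number(halo_mass):
    """Mean occupation <N> from the fit log<N> = SLOPE * log(M) + INTERCEPT.

    Assumes base-10 logarithms (not stated in the abstract).
    """
    return 10 ** (SLOPE * math.log10(halo_mass) + INTERCEPT)

# Illustrative halo masses one decade apart.
for mass in (1e14, 1e15):
    print(f"M = {mass:.0e}: <N> = {mean_quasar_number(mass):.3g}")
```

    Because the slope is positive, the predicted mean occupation rises monotonically with halo mass, by a factor of 10^2.11 for each decade in mass under these assumptions.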

  1. Integrated K-band spectra of old and intermediate-age globular clusters in the Large Magellanic Cloud

    NASA Astrophysics Data System (ADS)

    Lyubenova, M.; Kuntschner, H.; Rejkuba, M.; Silva, D. R.; Kissler-Patig, M.; Tacconi-Garman, L. E.; Larsen, S. S.

    2010-02-01

    Current stellar population models have arguably the largest uncertainties in the near-IR wavelength range, partly due to a lack of large and well calibrated empirical spectral libraries. In this paper we present a project whose aim it is to provide the first library of luminosity weighted integrated near-IR spectra of globular clusters to be used to test the current stellar population models and serve as calibrators for future ones. Our pilot study presents spatially integrated K-band spectra of three old (≥10 Gyr) and metal poor ([Fe/H] ~ -1.4), and three intermediate age (1-2 Gyr) and more metal rich ([Fe/H] ~ -0.4) globular clusters in the LMC. We measured the line strengths of the Na I, Ca I and 12CO (2-0) absorption features. The Na I index decreases with increasing age and decreasing metallicity of the clusters. The DCO index, used to measure the 12CO (2-0) line strength, is significantly reduced by the presence of carbon-rich TP-AGB stars in the globular clusters with age ~1 Gyr. This is in contradiction to the predictions of the stellar population models of Maraston (2005, MNRAS, 362, 799). We find that this disagreement is due to the different CO absorption strength of carbon-rich Milky Way TP-AGB stars used in the models and the LMC carbon stars in our sample. For globular clusters with age ≥ 2 Gyr we find DCO index measurements consistent with the model predictions. Based on observation collected at the ESO Paranal La Silla Observatory, Chile, Prog. ID 078.B-0205. Spectra in FITS format are only available in electronic form at the CDS via anonymous ftp to cdsarc.u-strasbg.fr (130.79.128.5) or via http://cdsweb.u-strasbg.fr/cgi-bin/qcat?J/A+A/510/A19

  2. Assessment of cluster yield components by image analysis.

    PubMed

    Diago, Maria P; Tardaguila, Javier; Aleixos, Nuria; Millan, Borja; Prats-Montalban, Jose M; Cubero, Sergio; Blasco, Jose

    2015-04-01

    Berry weight, berry number and cluster weight are key parameters for yield estimation in the wine and table grape industry. Current yield prediction methods are destructive, labour-demanding and time-consuming. In this work, a new methodology based on image analysis was developed to determine cluster yield components in a fast and inexpensive way. Clusters of seven different red varieties of grapevine (Vitis vinifera L.) were photographed under laboratory conditions and their cluster yield components manually determined after image acquisition. Two algorithms based on the Canny and the logarithmic image processing approaches were tested to find the contours of the berries in the images prior to berry detection performed by means of the Hough Transform. Results were obtained in two ways: by analysing either a single image of the cluster or four images per cluster from different orientations. The best results (R² between 69% and 95% in berry detection and between 65% and 97% in cluster weight estimation) were achieved using four images and the Canny algorithm. The capability of the image-analysis-based model to predict berry weight was 84%. The new and low-cost methodology presented here enabled the assessment of cluster yield components, saving time and providing inexpensive information in comparison with current manual methods. © 2014 Society of Chemical Industry.

  3. Pressure of the hot gas in simulations of galaxy clusters

    NASA Astrophysics Data System (ADS)

    Planelles, S.; Fabjan, D.; Borgani, S.; Murante, G.; Rasia, E.; Biffi, V.; Truong, N.; Ragone-Figueroa, C.; Granato, G. L.; Dolag, K.; Pierpaoli, E.; Beck, A. M.; Steinborn, Lisa K.; Gaspari, M.

    2017-06-01

    We analyse the radial pressure profiles, the intracluster medium (ICM) clumping factor and the Sunyaev-Zel'dovich (SZ) scaling relations of a sample of simulated galaxy clusters and groups identified in a set of hydrodynamical simulations based on an updated version of the treepm-SPH GADGET-3 code. Three different sets of simulations are performed: the first assumes non-radiative physics, the others include, among other processes, active galactic nucleus (AGN) and/or stellar feedback. Our results are analysed as a function of redshift, ICM physics, cluster mass and cluster cool-coreness or dynamical state. In general, the mean pressure profiles obtained for our sample of groups and clusters show a good agreement with X-ray and SZ observations. Simulated cool-core (CC) and non-cool-core (NCC) clusters also show a good match with real data. We obtain in all cases a small (if any) redshift evolution of the pressure profiles of massive clusters, at least back to z = 1. We find that the clumpiness of gas density and pressure increases with the distance from the cluster centre and with the dynamical activity. The inclusion of AGN feedback in our simulations generates values for the gas clumping ($\sqrt{C_\rho} \sim 1.2$ at $R_{200}$) in good agreement with recent observational estimates. The simulated $Y_{\rm SZ}$-$M$ scaling relations are in good accordance with several observed samples, especially for massive clusters. As for the scatter of these relations, we obtain a clear dependence on the cluster dynamical state, whereas this distinction is not so evident when looking at the subsamples of CC and NCC clusters.

  4. Predicting item popularity: Analysing local clustering behaviour of users

    NASA Astrophysics Data System (ADS)

    Liebig, Jessica; Rao, Asha

    2016-01-01

    Predicting the popularity of items in rating networks is an interesting but challenging problem. This is especially so when an item has first appeared and has received very few ratings. In this paper, we propose a novel approach to predicting the future popularity of new items in rating networks, defining a new bipartite clustering coefficient to predict the popularity of movies and stories in the MovieLens and Digg networks respectively. We show that the clustering behaviour of the first user who rates a new item gives insight into the future popularity of that item. Our method predicts, with a success rate of over 65% for the MovieLens network and over 50% for the Digg network, the future popularity of an item. This is a major improvement on current results.

  5. Micron-size hydrogen cluster target for laser-driven proton acceleration

    NASA Astrophysics Data System (ADS)

    Jinno, S.; Kanasaki, M.; Uno, M.; Matsui, R.; Uesaka, M.; Kishimoto, Y.; Fukuda, Y.

    2018-04-01

    As a new laser-driven ion acceleration technique, we proposed a way to produce impurity-free, highly reproducible, and robust proton beams exceeding 100 MeV using a Coulomb explosion of micron-size hydrogen clusters. In this study, micron-size hydrogen clusters were generated by expanding the cooled high-pressure hydrogen gas into a vacuum via a conical nozzle connected to a solenoid valve cooled by a mechanical cryostat. The size distributions of the hydrogen clusters were evaluated by measuring the angular distribution of laser light scattered from the clusters. The data were analyzed mathematically based on the Mie scattering theory combined with the Tikhonov regularization method. The maximum size of the hydrogen cluster at 25 K and 6 MPa in the stagnation state was recognized to be 2.15 ± 0.10 μm. The mean cluster size decreased with increasing temperature, and was found to be much larger than that given by Hagena’s formula. This discrepancy suggests that the micron-size hydrogen clusters were formed by the atomization (spallation) of the liquid or supercritical fluid phase of hydrogen. In addition, the density profiles of the gas phase were evaluated for 25 to 80 K at 6 MPa using a Nomarski interferometer. Based on the measurement results and the equation of state for hydrogen, the cluster mass fraction was obtained. 3D particles-in-cell (PIC) simulations concerning the interaction processes of micron-size hydrogen clusters with high power laser pulses predicted the generation of protons exceeding 100 MeV and accelerating in a laser propagation direction via an anisotropic Coulomb explosion mechanism, thus demonstrating a future candidate in laser-driven proton sources for upcoming multi-petawatt lasers.

  6. ADHD latent class clusters: DSM-IV subtypes and comorbidity

    PubMed Central

    Elia, Josephine; Arcos-Burgos, Mauricio; Bolton, Kelly L.; Ambrosini, Paul J.; Berrettini, Wade; Muenke, Maximilian

    2014-01-01

    ADHD (Attention Deficit Hyperactivity Disorder) has a complex, heterogeneous phenotype only partially captured by Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) criteria. In this report, latent class analyses (LCA) are used to identify ADHD phenotypes using K-SADS-IVR (Schedule for Affective Disorders & Schizophrenia for School Age Children-IV-Revised) symptoms and symptom severity data from a clinical sample of 500 ADHD subjects, ages 6–18, participating in an ADHD genetic study. Results show that LCA identified six separate ADHD clusters, some corresponding to specific DSM-IV subtypes while others included several subtypes. DSM-IV comorbid anxiety and mood disorders were generally similar across all clusters, and subjects without comorbidity did not aggregate within any one cluster. Age and gender composition also varied. These results support findings from population-based LCA studies. The six clusters provide additional homogenous groups that can be used to define ADHD phenotypes in genetic association studies. The limited age ranges aggregating in the different clusters may prove to be a particular advantage in genetic studies where candidate gene expression may vary during developmental phases. DSM-IV comorbid mood and anxiety disorders also do not appear to increase cluster heterogeneity; however, longitudinal studies that cover period of risk are needed to support this finding. PMID:19900717

  7. From virtual clustering analysis to self-consistent clustering analysis: a mathematical study

    NASA Astrophysics Data System (ADS)

    Tang, Shaoqiang; Zhang, Lei; Liu, Wing Kam

    2018-03-01

    In this paper, we propose a new homogenization algorithm, virtual clustering analysis (VCA), as well as provide a mathematical framework for the recently proposed self-consistent clustering analysis (SCA) (Liu et al. in Comput Methods Appl Mech Eng 306:319-341, 2016). In the mathematical theory, we clarify the key assumptions and ideas of VCA and SCA, and derive the continuous and discrete Lippmann-Schwinger equations. Based on a key postulation of "once response similarly, always response similarly", clustering is performed in an offline stage by machine learning techniques (k-means and SOM), and facilitates substantial reduction of computational complexity in an online predictive stage. The clear mathematical setup allows for the first time a convergence study of clustering refinement in one space dimension. Convergence is proved rigorously, and found to be of second order from numerical investigations. Furthermore, we propose to suitably enlarge the domain in VCA, such that the boundary terms may be neglected in the Lippmann-Schwinger equation, by virtue of the Saint-Venant's principle. In contrast, they were not obtained in the original SCA paper, and we discover these terms may well be responsible for the numerical dependency on the choice of reference material property. Since VCA enhances the accuracy by overcoming the modeling error, and reduces the numerical cost by avoiding an outer loop iteration for attaining the material property consistency in SCA, its efficiency is expected to be even higher than that of the recently proposed SCA algorithm.

  8. Gaussian mixture clustering and imputation of microarray data.

    PubMed

    Ouyang, Ming; Welsh, William J; Georgopoulos, Panos

    2004-04-12

    In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.

  9. Low-temperature cluster catalysis.

    PubMed

    Judai, Ken; Abbet, Stéphane; Wörz, Anke S; Heiz, Ulrich; Henry, Claude R

    2004-03-10

    Free and supported metal clusters reveal unique chemical and physical properties, which vary as a function of size as each cluster possesses a characteristic electron confinement. Several previous experimental results showed that the outcome of a given chemical reaction can be controlled by tuning the cluster size. However, none of the examples indicate that clusters prepared in the gas phase and then deposited on a support material are indeed catalytically active over several reaction cycles nor that their catalytic properties remain constant during such a catalytic process. In this work we report turn-over frequencies (TOF) for Pd_n (n = 4, 8, 30) clusters using pulsed molecular beam experiments. The obtained results illustrate that the catalytic reactivity for the NO reduction by CO (CO + NO → 1/2 N2 + CO2) is indeed a function of cluster size and that the measured TOF remain constant at a given temperature. More interestingly, the temperature of maximal reactivity is at least 100 K lower than observed for palladium nanoparticles or single crystals. One reason for this surprising observation is the character of the binding sites of these small clusters: N2 forms already at relatively low temperatures (400 and 450 K) and therefore poisoning by adsorbed nitrogen adatoms is prevented. Thus, small clusters not only open the possibility of tuning a catalytic process by changing cluster size, but also of catalyzing chemical reactions at low temperatures.

  10. Sample size determination for GEE analyses of stepped wedge cluster randomized trials.

    PubMed

    Li, Fan; Turner, Elizabeth L; Preisser, John S

    2018-06-19

    In stepped wedge cluster randomized trials, intact clusters of individuals switch from control to intervention from a randomly assigned period onwards. Such trials are becoming increasingly popular in health services research. When a closed cohort is recruited from each cluster for longitudinal follow-up, proper sample size calculation should account for three distinct types of intraclass correlations: the within-period, the inter-period, and the within-individual correlations. Setting the latter two correlation parameters to be equal accommodates cross-sectional designs. We propose sample size procedures for continuous and binary responses within the framework of generalized estimating equations that employ a block exchangeable within-cluster correlation structure defined from the distinct correlation types. For continuous responses, we show that the intraclass correlations affect power only through two eigenvalues of the correlation matrix. We demonstrate that analytical power agrees well with simulated power for as few as eight clusters, when data are analyzed using bias-corrected estimating equations for the correlation parameters concurrently with a bias-corrected sandwich variance estimator.

  11. The miRNA-17∼92 cluster mediates chemoresistance and enhances tumor growth in mantle cell lymphoma via PI3K/AKT pathway activation.

    PubMed

    Rao, E; Jiang, C; Ji, M; Huang, X; Iqbal, J; Lenz, G; Wright, G; Staudt, L M; Zhao, Y; McKeithan, T W; Chan, W C; Fu, K

    2012-05-01

    The median survival of patients with mantle cell lymphoma (MCL) ranges from 3 to 5 years with current chemotherapeutic regimens. A common secondary genomic alteration detected in MCL is chromosome 13q31-q32 gain/amplification, which targets a microRNA (miRNA) cluster, miR-17∼92. On the basis of gene expression profiling, we found that high level expression of C13orf25, the primary transcript from which these miRNAs are processed, was associated with poorer survival in patients with MCL (P=0.021). We demonstrated that the protein phosphatase PHLPP2, an important negative regulator of the PI3K/AKT pathway, was a direct target of miR-17∼92 miRNAs, in addition to PTEN and BIM. These proteins were down-modulated in MCL cells with overexpression of the miR-17∼92 cluster. Overexpression of miR-17∼92 activated the PI3K/AKT pathway and inhibited chemotherapy-induced apoptosis in MCL cell lines. Conversely, inhibition of miR-17∼92 expression suppressed the PI3K/AKT pathway and inhibited tumor growth in a xenograft MCL mouse model. Targeting the miR-17∼92 cluster may therefore provide a novel therapeutic approach for patients with MCL.

  12. Detecting Anomalies from End-to-End Internet Performance Measurements (PingER) Using Cluster Based Local Outlier Factor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ali, Saqib; Wang, Guojun; Cottrell, Roger Leslie

    PingER (Ping End-to-End Reporting) is a worldwide end-to-end Internet performance measurement framework. It was developed by the SLAC National Accelerator Laboratory, Stanford, USA and has been running for the last 20 years. It has more than 700 monitoring agents and remote sites which monitor the performance of Internet links in around 170 countries of the world. At present, the size of the compressed PingER data set is about 60 GB, comprising 100,000 flat files. The data is publicly available for valuable Internet performance analyses. However, the data sets suffer from missing values and anomalies due to congestion, bottleneck links, queuing overflow, network software misconfiguration, hardware failure, cable cuts, and social upheavals. Therefore, the objective of this paper is to detect such performance drops or spikes labeled as anomalies or outliers for the PingER data set. In the proposed approach, the raw text files of the data set are transformed into a PingER dimensional model. The missing values are imputed using the k-NN algorithm. The data is partitioned into similar instances using the k-means clustering algorithm. Afterward, clustering is integrated with the Local Outlier Factor (LOF) using the Cluster Based Local Outlier Factor (CBLOF) algorithm to detect the anomalies or outliers from the PingER data. Lastly, anomalies are further analyzed to identify the time frame and location of the hosts generating the major percentage of the anomalies in the PingER data set ranging from 1998 to 2016.

  13. Detecting Anomalies from End-to-End Internet Performance Measurements (PingER) Using Cluster Based Local Outlier Factor

    DOE PAGES

    Ali, Saqib; Wang, Guojun; Cottrell, Roger Leslie; ...

    2018-05-28

    PingER (Ping End-to-End Reporting) is a worldwide end-to-end Internet performance measurement framework. It was developed by the SLAC National Accelerator Laboratory, Stanford, USA and has been running for the last 20 years. It has more than 700 monitoring agents and remote sites which monitor the performance of Internet links in around 170 countries of the world. At present, the size of the compressed PingER data set is about 60 GB, comprising 100,000 flat files. The data is publicly available for valuable Internet performance analyses. However, the data sets suffer from missing values and anomalies due to congestion, bottleneck links, queuing overflow, network software misconfiguration, hardware failure, cable cuts, and social upheavals. Therefore, the objective of this paper is to detect such performance drops or spikes labeled as anomalies or outliers for the PingER data set. In the proposed approach, the raw text files of the data set are transformed into a PingER dimensional model. The missing values are imputed using the k-NN algorithm. The data is partitioned into similar instances using the k-means clustering algorithm. Afterward, clustering is integrated with the Local Outlier Factor (LOF) using the Cluster Based Local Outlier Factor (CBLOF) algorithm to detect the anomalies or outliers from the PingER data. Lastly, anomalies are further analyzed to identify the time frame and location of the hosts generating the major percentage of the anomalies in the PingER data set ranging from 1998 to 2016.

  14. Strong evidences for a nonextensive behavior of the rotation period in open clusters

    NASA Astrophysics Data System (ADS)

    de Freitas, D. B.; Nepomuceno, M. M. F.; Soares, B. B.; Silva, J. R. P.

    2014-11-01

    Time-dependent nonextensivity in a stellar astrophysical scenario combines nonextensive entropic indices qK derived from the modified Kawaler's parametrization, and q, obtained from rotational velocity distribution. These q's are related through a heuristic single relation given by q ≈ q0(1 - Δt/qK), where t is the cluster age. In a nonextensive scenario, these indices are quantities that measure the degree of nonextensivity present in the system. Recent studies reveal that the index q is correlated to the formation rate of high-energy tails present in the distribution of rotation velocity. On the other hand, the index qK is determined by the stellar rotation-age relationship. This depends on the magnetic-field configuration through the expression qK = 1 + 4aN/3, where a and N denote the saturation level of the star magnetic field and its topology, respectively. In the present study, we show that the connection q-qK is also consistent with 548 rotation period data for single main-sequence stars in 11 open clusters aged less than 1 Gyr. The value of qK ~ 2.5 from our unsaturated model shows that the mean magnetic-field topology of these stars is slightly more complex than a purely radial field. Our results also suggest that stellar rotational braking behavior affects the degree of anti-correlation between q and cluster age t. Finally, we suggest that stellar magnetic braking can be scaled by the entropic index q.

  15. The application of k-Nearest Neighbour in the identification of high potential archers based on relative psychological coping skills variables

    NASA Astrophysics Data System (ADS)

    Taha, Zahari; Muazu Musa, Rabiu; Majeed, Anwar P. P. Abdul; Razali Abdullah, Mohamad; Muaz Alim, Muhammad; Nasir, Ahmad Fakhri Ab

    2018-04-01

    The present study aims at classifying and predicting high and low potential archers from a collection of psychological coping skills variables trained on different k-Nearest Neighbour (k-NN) kernels. 50 youth archers with the average age and standard deviation of (17.0 ± .056) gathered from various archery programmes completed a one end shooting score test. A psychological coping skills inventory, which evaluates the archers' level of related coping skills, was filled out by the archers prior to their shooting tests. k-means cluster analysis was applied to cluster the archers based on their scores on the variables assessed. k-NN models, i.e. fine, medium, coarse, cosine, cubic and weighted kernel functions, were trained on the psychological variables. The k-means clustered the archers into high psychologically prepared archers (HPPA) and low psychologically prepared archers (LPPA), respectively. It was demonstrated that the cosine k-NN model exhibited good accuracy and precision throughout the exercise, with an accuracy of 94% and a considerably lower error rate for the prediction of the HPPA and the LPPA as compared to the rest of the models. The findings of this investigation can be valuable to coaches and sports managers to recognise high potential athletes from the selected psychological coping skills variables examined, which would consequently save time and energy during talent identification and development programmes.

  16. Cluster-distinguishing genotypic and phenotypic diversity of carbapenem-resistant Gram-negative bacteria in solid-organ transplantation patients: a comparative study.

    PubMed

    Karampatakis, Theodoros; Geladari, Anastasia; Politi, Lida; Antachopoulos, Charalampos; Iosifidis, Elias; Tsiatsiou, Olga; Karyoti, Aggeliki; Papanikolaou, Vasileios; Tsakris, Athanassios; Roilides, Emmanuel

    2017-07-31

    Solid-organ transplant recipients may display high rates of colonization and/or infection by multidrug-resistant bacteria. We analysed and compared the phenotypic and genotypic diversity of carbapenem-resistant (CR) strains of Klebsiella pneumoniae, Pseudomonas aeruginosa and Acinetobacter baumannii isolated from patients in the Solid Organ Transplantation department of our hospital. Between March 2012 and August 2013, 56 CR strains from various biological fluids underwent antimicrobial susceptibility testing with VITEK 2, molecular analysis by PCR amplification and genotypic analysis with pulsed-field gel electrophoresis (PFGE). They were clustered according to antimicrobial drug susceptibility and genotypic profiles. Diversity analyses were performed by calculating Simpson's diversity index and applying computed rarefaction curves. Results/Key findings: Among K. pneumoniae, KPC-producers predominated (57.1 %). VIM and OXA-23 carbapenemases prevailed among P. aeruginosa and A. baumannii (89.4 and 88.9 %, respectively). KPC-producing K. pneumoniae and OXA-23 A. baumannii were assigned in single PFGE pulsotypes. VIM-producing P. aeruginosa generated multiple pulsotypes. CR K. pneumoniae strains displayed phenotypic diversity in tigecycline, colistin (CS), amikacin (AMK), gentamicin (GEN) and co-trimoxazole (SXT) (16 clusters); P. aeruginosa displayed phenotypic diversity in cefepime (FEP), ceftazidime, aztreonam, piperacillin, piperacillin-tazobactam, AMK, GEN and CS (9 clusters); and A. baumannii displayed phenotypic diversity in AMK, GEN, SXT, FEP, tobramycin and rifampicin (8 clusters). The Simpson diversity indices for the interpretative phenotype and PFGE analysis were 0.89 and 0.6, respectively, for K. pneumoniae strains (P<0.001); 0.77 and 0.6 for P. aeruginosa (P=0.22); and 0.86 and 0.19 for A. baumannii (P=0.004). The presence of different antimicrobial susceptibility profiles does not preclude the possibility that two CR K. pneumoniae or A. baumannii
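
    As an illustration of the diversity measure named in this abstract, the following is a minimal sketch of Simpson's diversity index computed from cluster (type) assignments. It assumes the common finite-sample form D = 1 - sum(n_i(n_i - 1)) / (N(N - 1)); the record does not say which variant the authors used, and the function name and example labels are hypothetical.

```python
from collections import Counter


def simpson_diversity(labels):
    """Simpson's diversity index for a list of type/cluster labels.

    Assumed form (not confirmed by the abstract):
    D = 1 - sum(n_i * (n_i - 1)) / (N * (N - 1)),
    i.e. the probability that two isolates drawn without replacement
    belong to different types.
    """
    counts = list(Counter(labels).values())
    total = sum(counts)
    if total < 2:
        raise ValueError("need at least two isolates")
    same_type_pairs = sum(c * (c - 1) for c in counts)
    return 1.0 - same_type_pairs / (total * (total - 1))


# Hypothetical pulsotype labels for six isolates.
example = ["A", "A", "A", "B", "B", "C"]
print(round(simpson_diversity(example), 3))
```

    A value near 0 means almost all isolates share one type (as with a single dominant pulsotype), while a value near 1 means nearly every isolate is its own type.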

  17. Spatial and temporal structure of typhoid outbreaks in Washington, D.C., 1906-1909: evaluating local clustering with the Gi* statistic.

    PubMed

    Hinman, Sarah E; Blackburn, Jason K; Curtis, Andrew

    2006-03-27

    To better understand the distribution of typhoid outbreaks in Washington, D.C., the U.S. Public Health Service (PHS) conducted four investigations of typhoid fever. These studies included maps of cases reported between 1 May - 31 October 1906 - 1909. These data were entered into a GIS database and analyzed using Ripley's K-function followed by the Gi* statistic in yearly intervals to evaluate spatial clustering, the scale of clustering, and the temporal stability of these clusters. The Ripley's K-function indicated no global spatial autocorrelation. The Gi* statistic indicated clustering of typhoid at multiple scales across the four year time period, refuting the conclusions drawn in all four PHS reports concerning the distribution of cases. While the PHS reports suggested an even distribution of the disease, this study quantified both areas of localized disease clustering, as well as mobile larger regions of clustering, thus indicating both highly localized and periodic generalized sources of infection within the city. The methodology applied in this study was useful for evaluating the spatial distribution and annual-level temporal patterns of typhoid outbreaks in Washington, D.C. from 1906 to 1909. While advanced spatial analyses of historical data sets must be interpreted with caution, this study does suggest that there is utility in these types of analyses and that they provide new insights into the urban patterns of typhoid outbreaks during the early part of the twentieth century.
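
    For readers unfamiliar with the Gi* statistic used in this record, the commonly cited standardized Getis-Ord form is sketched below. The symbols are assumptions introduced here rather than notation from the abstract: x_j is the case count at location j, w_ij is the spatial weight between locations i and j (including j = i), and n is the number of locations.

```latex
G_i^{*} =
  \frac{\sum_{j=1}^{n} w_{ij} x_j \;-\; \bar{X} \sum_{j=1}^{n} w_{ij}}
       {S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^{2} - \left(\sum_{j=1}^{n} w_{ij}\right)^{2}}{n - 1}}},
\qquad
\bar{X} = \frac{1}{n} \sum_{j=1}^{n} x_j,
\qquad
S = \sqrt{\frac{1}{n} \sum_{j=1}^{n} x_j^{2} - \bar{X}^{2}}
```

    The result is interpreted as a z-score: large positive values flag local clusters of high counts, which is how the statistic can reveal localized clustering even when a global measure such as Ripley's K does not.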

  18. A highly efficient multi-core algorithm for clustering extremely large datasets

    PubMed Central

    2010-01-01

    Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922

  19. Ab initio metadynamics simulations of oxygen/ligand interactions in organoaluminum clusters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Alnemrat, Sufian; Hooper, Joseph P., E-mail: jphooper@nps.edu

    2014-10-14

    Car-Parrinello molecular dynamics combined with a metadynamics algorithm is used to study the initial interaction of O2 with the low-valence organoaluminum clusters Al4Cp4 (Cp = C5H5) and Al4Cp4* (Cp* = C5[CH3]5). Prior to reaction with the aluminum core, simulations suggest that the oxygen undergoes a hindered crossing of the steric barrier presented by the outer ligand monolayer. A combination of two collective variables based on aluminum/oxygen distance and lateral oxygen displacement was found to produce distinct reactant, product, and transition states for this process. In the methylated cluster with Cp* ligands, a broad transition state of 45 kJ/mol was observed due to direct steric interactions with the ligand groups and considerable oxygen reorientation. In the non-methylated cluster the ligands distort away from the oxidizer, resulting in a barrier of roughly 34 kJ/mol with minimal O2 reorientation. A study of the oxygen/cluster system fixed in a triplet multiplicity suggests that the spin state does not affect the initial steric interaction with the ligands. The metadynamics approach appears to be a promising means of analyzing the initial steps of such oxidation reactions for ligand-protected clusters.

  20. Developing appropriate methods for cost-effectiveness analysis of cluster randomized trials.

    PubMed

    Gomes, Manuel; Ng, Edmond S-W; Grieve, Richard; Nixon, Richard; Carpenter, James; Thompson, Simon G

    2012-01-01

    Cost-effectiveness analyses (CEAs) may use data from cluster randomized trials (CRTs), where the unit of randomization is the cluster, not the individual. However, most studies use analytical methods that ignore clustering. This article compares alternative statistical methods for accommodating clustering in CEAs of CRTs. Our simulation study compared the performance of statistical methods for CEAs of CRTs with 2 treatment arms. The study considered a method that ignored clustering--seemingly unrelated regression (SUR) without a robust standard error (SE)--and 4 methods that recognized clustering--SUR and generalized estimating equations (GEEs), both with robust SE, a "2-stage" nonparametric bootstrap (TSB) with shrinkage correction, and a multilevel model (MLM). The base case assumed CRTs with moderate numbers of balanced clusters (20 per arm) and normally distributed costs. Other scenarios included CRTs with few clusters, imbalanced cluster sizes, and skewed costs. Performance was reported as bias, root mean squared error (rMSE), and confidence interval (CI) coverage for estimating incremental net benefits (INBs). We also compared the methods in a case study. Each method reported low levels of bias. Without the robust SE, SUR gave poor CI coverage (base case: 0.89 v. nominal level: 0.95). The MLM and TSB performed well in each scenario (CI coverage, 0.92-0.95). With few clusters, the GEE and SUR (with robust SE) had coverage below 0.90. In the case study, the mean INBs were similar across all methods, but ignoring clustering underestimated statistical uncertainty and the value of further research. MLMs and the TSB are appropriate analytical methods for CEAs of CRTs with the characteristics described. SUR and GEE are not recommended for studies with few clusters.
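
    The quantity estimated in this record, the incremental net benefit (INB), has a standard definition: INB = lambda * (difference in mean effects) - (difference in mean costs), where lambda is the willingness-to-pay threshold. A minimal sketch of the point estimate follows; the function name, the per-patient input layout, and the example numbers are hypothetical, and the sketch deliberately ignores clustering, so it says nothing about the standard errors that the compared methods (SUR, GEE, TSB, MLM) are designed to estimate.

```python
from statistics import mean


def incremental_net_benefit(costs_trt, effects_trt, costs_ctl, effects_ctl, wtp):
    """Point estimate of INB = wtp * delta_effect - delta_cost.

    Inputs are per-patient costs and effects for the treatment and
    control arms; wtp is the willingness-to-pay per unit of effect.
    Clustering is ignored here (assumption for illustration only).
    """
    delta_cost = mean(costs_trt) - mean(costs_ctl)
    delta_effect = mean(effects_trt) - mean(effects_ctl)
    return wtp * delta_effect - delta_cost


# Hypothetical data: costs in currency units, effects in QALYs.
inb = incremental_net_benefit(
    costs_trt=[1200.0, 1000.0], effects_trt=[0.7, 0.8],
    costs_ctl=[900.0, 700.0], effects_ctl=[0.6, 0.7],
    wtp=20000.0,
)
print(round(inb, 2))
```

    A positive INB favours the intervention at the chosen threshold; as the abstract notes, the point estimate is similar across methods, and it is the uncertainty around it that depends on handling clustering correctly.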

  1. Cross-entropy clustering framework for catchment classification

    NASA Astrophysics Data System (ADS)

    Tongal, Hakan; Sivakumar, Bellie

    2017-09-01

    There is an increasing interest in catchment classification and regionalization in hydrology, as they are useful for identification of appropriate model complexity and transfer of information from gauged catchments to ungauged ones, among others. This study introduces a nonlinear cross-entropy clustering (CEC) method for classification of catchments. The method specifically considers embedding dimension (m), sample entropy (SampEn), and coefficient of variation (CV) to represent dimensionality, complexity, and variability of the time series, respectively. The method is applied to daily streamflow time series from 217 gauging stations across Australia. The results suggest that a combination of linear and nonlinear parameters (i.e. m, SampEn, and CV), representing different aspects of the underlying dynamics of streamflows, could be useful for determining distinct patterns of flow generation mechanisms within a nonlinear clustering framework. For the 217 streamflow time series, nine hydrologically homogeneous clusters that have distinct patterns of flow regime characteristics and specific dominant hydrological attributes with different climatic features are obtained. Comparison of the results with those obtained using the widely employed k-means clustering method (which results in five clusters, with the loss of some information about the features of the clusters) suggests the superiority of the cross-entropy clustering method. The outcomes from this study provide a useful guideline for employing the nonlinear dynamic approaches based on hydrologic signatures and for gaining an improved understanding of streamflow variability at a large scale.

  2. ctsGE-clustering subgroups of expression data.

    PubMed

    Sharabi-Schwager, Michal; Or, Etti; Ophir, Ron

    2017-07-01

    A pre-requisite to clustering noisy data, such as gene-expression data, is the filtering step. As an alternative to this step, the ctsGE R-package applies a sorting step in which all of the data are divided into small groups. The groups are divided according to how the time points are related to the time-series median. Then clustering is performed separately on each group. Thus, the clustering is done in two steps. First, an expression index (i.e. a sequence of 1, -1 and 0) is defined and genes with the same index are grouped together, and then each group of genes is clustered by k-means to create subgroups. The ctsGE package also provides an interactive tool to visualize and explore the gene-expression patterns and their subclusters. ctsGE proposes a way of organizing and exploring expression data without eliminating valuable information. Freely available as part of the Bioconductor project at https://bioconductor.org/packages/ctsGE/.
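
    The first of the two steps described in this abstract (building an expression index of 1, -1 and 0 relative to the time-series median, then grouping genes that share an index) can be sketched as follows. This is an illustration in Python, not the ctsGE R code: the scaling by median absolute deviation, the cutoff value, and all function and gene names are assumptions made here.

```python
from collections import defaultdict
from statistics import median


def expression_index(series, cutoff=0.5):
    """Map a gene's time series to a tuple of 1, -1 and 0.

    Each time point is centred on the series median and scaled by the
    median absolute deviation (assumed scaling). Points above +cutoff
    become 1, below -cutoff become -1, and the rest 0.
    """
    med = median(series)
    mad = median([abs(x - med) for x in series])
    scale = mad if mad > 0 else 1.0  # avoid division by zero for flat series
    index = []
    for x in series:
        z = (x - med) / scale
        index.append(1 if z > cutoff else -1 if z < -cutoff else 0)
    return tuple(index)


def group_by_index(genes, cutoff=0.5):
    """Group gene names by their expression index."""
    groups = defaultdict(list)
    for name, series in genes.items():
        groups[expression_index(series, cutoff)].append(name)
    return dict(groups)


# Hypothetical expression values at five time points.
genes = {
    "geneA": [1, 2, 3, 10, 3],
    "geneB": [10, 20, 30, 100, 30],
    "geneC": [5, 1, 3, 3, 3],
}
print(group_by_index(genes))
```

    In the workflow the abstract describes, each resulting group would then be clustered separately by k-means to obtain subgroups.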

  3. Cluster structure of light nuclei

    NASA Astrophysics Data System (ADS)

    Iachello, Francesco

    2018-02-01

    Matter and charge densities of kα structures with k=2 (8Be), k=3 (12C) and k=4 (16O) calculated within the framework of the algebraic cluster model (ACM) are briefly reviewed and explicitly displayed. Their parameters are determined from a comparison with electron scattering data.

  4. Analysis of Rainfall and PM2.5 Data Using Clustered Trajectory Analysis for National Park Sites in the Western U.S.

    NASA Astrophysics Data System (ADS)

    Solorzano, N. N.; Hafner, W.; Jaffe, D.

    2005-12-01

    We calculated daily kinematic back-trajectories using the NOAA-HYSPLIT model to analyze 7 years of PM2.5 data from National Park sites in the Western U.S. (Glacier N.P., Mount Rainier N.P., Sequoia N.P., Rocky Mountain N.P. and Denali N.P.) The back-trajectories were clustered using a k-means clustering algorithm to segregate the trajectories into 6 main transport patterns. We calculated trajectory clusters for 1, 5 and 10 days to represent short, medium and long-range flow patterns. Some trajectory types and clusters show marked seasonality. Generally faster flow patterns are more prevalent in winter and slower/stagnant patterns are more prevalent in summer. In addition, we found significant inter-annual variability that may be important for explaining variations in rainfall and/or pollutant concentrations. The 5 and 10-day analyses revealed that, for the 4 non-Alaskan sites, trajectories from Asia tend to be less frequent in the summer, compared to the rest of the year. The clusters of different duration show very different predictive power for rainfall and PM2.5. We found that the 1-day clusters are a better predictor for precipitation and PM2.5 concentrations, as compared to the 5 and 10-day clusters. At each of the sites, there is at least one cluster with an average PM2.5 concentration that is different than the average for the site, indicating distinctive transport patterns. The same is true for 5 and 10-day clusters. Interestingly, only one site, Mount Rainier N.P., shows seasonal differences in PM2.5 concentrations between the clusters that differ from the average.
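
    Since k-means is the common thread of these records, a minimal sketch of the standard Lloyd iteration may help make the clustering step concrete. It assumes each back-trajectory has already been flattened into a fixed-length numeric vector (for example, latitude and longitude at each time step); that encoding, the function names, and the toy points are assumptions made here, and the abstract does not specify the implementation the authors used.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids).

    init: optional list of k starting centroids. If omitted, k data
    points are sampled at random using the given seed.
    """
    rng = random.Random(seed)
    if init is not None:
        centroids = [tuple(c) for c in init]
    else:
        centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy 2-D points standing in for flattened trajectories.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2, init=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

    In practice the result depends on the starting centroids, so several random restarts are usually compared; the study's choice of 6 clusters would correspond to k=6 here.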

  5. Cluster Physics with Merging Galaxy Clusters

    NASA Astrophysics Data System (ADS)

    Molnar, Sandor

    Collisions between galaxy clusters provide a unique opportunity to study matter in a parameter space which cannot be explored in our laboratories on Earth. In the standard ΛCDM model, where the total density is dominated by the cosmological constant (Λ) and the matter density by cold dark matter (CDM), structure formation is hierarchical, and clusters grow mostly by merging. Mergers of two massive clusters are the most energetic events in the universe after the Big Bang, hence they provide a unique laboratory to study cluster physics. The two main mass components in clusters behave differently during collisions: the dark matter is nearly collisionless, responding only to gravity, while the gas is subject to pressure forces and dissipation, and shocks and turbulence are developed during collisions. In the present contribution we review the different methods used to derive the physical properties of merging clusters. Different physical processes leave their signatures on different wavelengths, thus our review is based on a multifrequency analysis. In principle, the best way to analyze multifrequency observations of merging clusters is to model them using N-body/HYDRO numerical simulations. We discuss the results of such detailed analyses. New high spatial and spectral resolution ground and space based telescopes will come online in the near future. Motivated by these new opportunities, we briefly discuss methods which will be feasible in the near future in studying merging clusters.

  6. Spatial and temporal structure of typhoid outbreaks in Washington, D.C., 1906–1909: evaluating local clustering with the Gi* statistic

    PubMed Central

    Hinman, Sarah E; Blackburn, Jason K; Curtis, Andrew

    2006-01-01

    Background: To better understand the distribution of typhoid outbreaks in Washington, D.C., the U.S. Public Health Service (PHS) conducted four investigations of typhoid fever. These studies included maps of cases reported between 1 May and 31 October in each year from 1906 to 1909. These data were entered into a GIS database and analyzed using Ripley's K-function followed by the Gi* statistic in yearly intervals to evaluate spatial clustering, the scale of clustering, and the temporal stability of these clusters. Results: The Ripley's K-function indicated no global spatial autocorrelation. The Gi* statistic indicated clustering of typhoid at multiple scales across the four-year time period, refuting the conclusions drawn in all four PHS reports concerning the distribution of cases. While the PHS reports suggested an even distribution of the disease, this study quantified both areas of localized disease clustering and mobile larger regions of clustering, thus indicating both highly localized and periodic generalized sources of infection within the city. Conclusion: The methodology applied in this study was useful for evaluating the spatial distribution and annual-level temporal patterns of typhoid outbreaks in Washington, D.C. from 1906 to 1909. While advanced spatial analyses of historical data sets must be interpreted with caution, this study does suggest that there is utility in these types of analyses and that they provide new insights into the urban patterns of typhoid outbreaks during the early part of the twentieth century. PMID:16566830
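
    For readers unfamiliar with the Gi* statistic, the sketch below computes the standard Getis-Ord Gi* z-score for each location from a list of values and a spatial weights matrix. It is an illustration only: the case counts, the one-dimensional row of cells, and the binary "self plus adjacent cell" weights are hypothetical and are not taken from the study, whose weighting scheme the abstract does not describe.

```python
import math


def getis_ord_gi_star(values, weights):
    """Getis-Ord Gi* z-scores.

    values  : list of n observations x_j (must not all be equal)
    weights : n x n matrix w[i][j]; for Gi* the diagonal w[i][i] is included
    """
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation of the observations.
    s = math.sqrt(sum(x * x for x in values) / n - mean ** 2)
    z_scores = []
    for i in range(n):
        w = weights[i]
        sum_w = sum(w)
        sum_w2 = sum(wij * wij for wij in w)
        # Observed local sum minus its expectation under no clustering.
        numerator = sum(wij * x for wij, x in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        z_scores.append(numerator / denominator)
    return z_scores


# Hypothetical case counts in seven adjacent cells along a street.
counts = [1, 1, 1, 9, 10, 1, 1]
n = len(counts)
# Binary weights: each cell is a neighbour of itself and of adjacent cells.
weights = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]

z = getis_ord_gi_star(counts, weights)
for cell, score in enumerate(z):
    print(cell, round(score, 3))
```

    Large positive scores mark cells surrounded by high values (candidate hot spots) and large negative scores mark cold spots; changing the neighbourhood radius in the weights is one way to probe clustering at different scales.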

  7. Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Data Analysis and Visualization; International Research Training Group "Visualization of Large and Unstructured Data Sets," University of Kaiserslautern, Germany; Computational Research Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA

    2008-05-12

    The recent development of methods for extracting precise measurements of spatial gene expression patterns from three-dimensional (3D) image data opens the way for new analyses of the complex gene regulatory networks controlling animal development. We present an integrated visualization and analysis framework that supports user-guided data clustering to aid exploration of these new complex datasets. The interplay of data visualization and clustering-based data classification leads to improved visualization and enables a more detailed analysis than previously possible. We discuss (i) integration of data clustering and visualization into one framework; (ii) application of data clustering to 3D gene expression data; (iii) evaluation of the number of clusters k in the context of 3D gene expression clustering; and (iv) improvement of overall analysis quality via dedicated post-processing of clustering results based on visualization. We discuss the use of this framework to objectively define spatial pattern boundaries and temporal profiles of genes and to analyze how mRNA patterns are controlled by their regulatory transcription factors.

  8. A PSF-based approach to Kepler/K2 data - II. Exoplanet candidates in Praesepe (M 44)

    NASA Astrophysics Data System (ADS)

    Libralato, M.; Nardiello, D.; Bedin, L. R.; Borsato, L.; Granata, V.; Malavolta, L.; Piotto, G.; Ochner, P.; Cunial, A.; Nascimbeni, V.

    2016-12-01

    In this work, we keep pushing K2 data to a high photometric precision, close to that of the Kepler main mission, using a point-spread function (PSF)-based, neighbour-subtraction technique, which also overcomes the dilution effects in crowded environments. We analyse the open cluster M 44 (NGC 2632), observed during the K2 Campaign 5, and extract light curves of stars imaged on module 14, where most of the cluster lies. We present two candidate exoplanets hosted by cluster members and five by field stars. As a by-product of our investigation, we find 1680 eclipsing binaries and variable stars, 1071 of which are new discoveries. Among them, we report the presence of a heartbeat binary star. Together with this work, we release to the community a catalogue with the variable stars and the candidate exoplanets found, as well as all our raw and detrended light curves.

  9. Mean-cluster approach indicates cell sorting time scales are determined by collective dynamics

    NASA Astrophysics Data System (ADS)

    Beatrici, Carine P.; de Almeida, Rita M. C.; Brunnet, Leonardo G.

    2017-03-01

    Cell migration is essential to cell segregation, playing a central role in tissue formation, wound healing, and tumor evolution. Considering random mixtures of two cell types, it is still not clear which cell characteristics define clustering time scales. The mass of diffusing clusters merging with one another is expected to grow as $t^{d/(d+2)}$ when the diffusion constant scales with the inverse of the cluster mass. Cell segregation experiments deviate from that behavior. Explanations for that could arise from specific microscopic mechanisms or from collective effects, typical of active matter. Here we consider a power law connecting diffusion constant and cluster mass to propose an analytic approach to model cell segregation where we explicitly take into account finite-size corrections. The results are compared with active matter model simulations and experiments available in the literature. To investigate the role played by different mechanisms we considered different hypotheses describing cell-cell interaction: differential adhesion hypothesis and different velocities hypothesis. We find that the simulations yield normal diffusion for long time intervals. Analytic and simulation results show that (i) cluster evolution clearly tends to a scaling regime, disrupted only at finite-size limits; (ii) cluster diffusion is greatly enhanced by cell collective behavior, such that for high enough tendency to follow the neighbors, cluster diffusion may become independent of cluster size; (iii) the scaling exponent for cluster growth depends only on the mass-diffusion relation, not on the detailed local segregation mechanism. These results apply for active matter systems in general and, in particular, the mechanisms found underlying the increase in cell sorting speed certainly have deep implications in biological evolution as a selection mechanism.
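
    The growth exponent quoted in this abstract for diffusion-limited merging can be recovered from a short scaling argument. The sketch below is a standard heuristic, not the authors' derivation, and rests on three assumptions: clusters are compact in d dimensions (mass M scales as R^d), the overall density is fixed so the typical distance between clusters scales with their radius R, and the diffusion constant D scales as 1/M. The time for a cluster to diffuse far enough to meet a neighbour is then of order R^2/D:

```latex
t \;\sim\; \frac{R^{2}}{D} \;\sim\; R^{2} M \;\sim\; M^{2/d}\, M \;=\; M^{(d+2)/d}
\quad\Longrightarrow\quad
M(t) \;\sim\; t^{\,d/(d+2)} .
```

    In two dimensions this gives M growing as t^{1/2}. Replacing D ~ 1/M with a general power law changes the exponent, consistent with the abstract's statement that the exponent depends only on the mass-diffusion relation.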

  10. A Multicriteria Decision Making Approach for Estimating the Number of Clusters in a Data Set

    PubMed Central

    Peng, Yi; Zhang, Yong; Kou, Gang; Shi, Yong

    2012-01-01

    Determining the number of clusters in a data set is an essential yet difficult step in cluster analysis. Since this task involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem. This paper proposes a multiple criteria decision making (MCDM)-based approach to estimate the number of clusters for a given data set. In this approach, MCDM methods consider different numbers of clusters as alternatives and the outputs of any clustering algorithm on validity measures as criteria. The proposed method is examined by an experimental study using three MCDM methods, the well-known clustering algorithm–k-means, ten relative measures, and fifteen public-domain UCI machine learning data sets. The results show that MCDM methods work fairly well in estimating the number of clusters in the data and outperform the ten relative measures considered in the study. PMID:22870181
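
    The idea of treating candidate cluster counts as alternatives and validity measures as criteria can be illustrated with simple additive weighting (SAW), one of the most basic MCDM methods. The abstract does not name the three MCDM methods or the ten validity measures it uses, so everything below is an assumption made for illustration: the choice of SAW, the two criteria, the equal weights, and the scores, which are invented numbers rather than results from the paper.

```python
def saw_rank(scores, weights, benefit):
    """Rank alternatives by simple additive weighting.

    scores  : {alternative: [value for each criterion]}
    weights : one weight per criterion (should sum to 1)
    benefit : one bool per criterion; True if larger values are better
    """
    alternatives = list(scores)
    n_criteria = len(weights)
    columns = [[scores[a][j] for a in alternatives] for j in range(n_criteria)]
    totals = {}
    for a in alternatives:
        total = 0.0
        for j in range(n_criteria):
            lo, hi = min(columns[j]), max(columns[j])
            # Min-max normalise each criterion to the range [0, 1].
            norm = 0.5 if hi == lo else (scores[a][j] - lo) / (hi - lo)
            if not benefit[j]:
                norm = 1.0 - norm  # cost criterion: smaller is better
            total += weights[j] * norm
        totals[a] = total
    ranking = sorted(alternatives, key=lambda a: totals[a], reverse=True)
    return ranking, totals


# Hypothetical validity scores for k = 2, 3, 4 from some clustering runs.
# Criterion 0: a "larger is better" index; criterion 1: a "smaller is better" index.
validity = {
    2: [0.55, 0.90],
    3: [0.71, 0.60],
    4: [0.62, 0.75],
}

ranking, totals = saw_rank(validity, weights=[0.5, 0.5], benefit=[True, False])
print("ranking of k:", ranking)
print("scores:", totals)
```

    Here the top-ranked alternative is the estimated number of clusters; other MCDM methods would replace the weighted sum with a different aggregation rule while keeping the same alternatives-by-criteria table.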

  11. Graph-Theoretic Analysis of Monomethyl Phosphate Clustering in Ionic Solutions.

    PubMed

    Han, Kyungreem; Venable, Richard M; Bryant, Anne-Marie; Legacy, Christopher J; Shen, Rong; Li, Hui; Roux, Benoît; Gericke, Arne; Pastor, Richard W

    2018-02-01

    All-atom molecular dynamics simulations combined with graph-theoretic analysis reveal that clustering of monomethyl phosphate dianion (MMP²⁻) is strongly influenced by the types and combinations of cations in the aqueous solution. Although Ca²⁺ promotes the formation of stable and large MMP²⁻ clusters, K⁺ alone does not. Nonetheless, clusters are larger and their link lifetimes are longer in mixtures of K⁺ and Ca²⁺. This "synergistic" effect depends sensitively on the Lennard-Jones interaction parameters between Ca²⁺ and the phosphorus oxygen and correlates with the hydration of the clusters. The pronounced MMP²⁻ clustering effect of Ca²⁺ in the presence of K⁺ is confirmed by Fourier transform infrared spectroscopy. The characterization of the cation-dependent clustering of MMP²⁻ provides a starting point for understanding cation-dependent clustering of phosphoinositides in cell membranes.

  12. Deep and wide photometry of two open clusters NGC 1245 and NGC 2506: dynamical evolution and halo

    NASA Astrophysics Data System (ADS)

    Lee, S. H.; Kang, Y.-W.; Ann, H. B.

    2013-06-01

    We studied the structure of two old open clusters, NGC 1245 and NGC 2506, using wide and deep VI photometry acquired with the CFH12K CCD camera at the Canada-France-Hawaii Telescope. We devised a new method for assigning cluster membership probability to individual stars using both spatial positions and positions in the colour-magnitude diagram. From analyses of the luminosity functions at several cluster-centric radii and the radial surface density profiles derived from stars with different luminosity ranges, we found that the two clusters are dynamically relaxed enough to drive significant mass segregation and evaporation of some fraction of low-mass stars. There seems to be a signature of a tidal tail in NGC 1245, but the signal is too low to be confirmed.

  13. Spots and the Activity of Stars in the Hyades Cluster from Observations with the Kepler Space Telescope (K2)

    NASA Astrophysics Data System (ADS)

    Savanov, I. S.; Dmitrienko, E. S.

    2018-03-01

    Observations of the K2 mission (continuing the program of the Kepler Space Telescope) are used to estimate the spot coverage S (the fractional area of spots on the surface of an active star) for stars of the Hyades cluster. The analysis is based on data on the photometric variations of 47 confirmed single cluster members, together with their atmospheric parameters, masses, and rotation periods. The resulting values of S for these Hyades objects are lower than those for stars of the Pleiades cluster (on average, by ΔS ≈ 0.05-0.06). A comparison of the results of studies of cool, low-mass dwarfs in the Hyades and Pleiades clusters, as well as the results of a study of 1570 M stars from the main field observed in the Kepler Space Mission, indicates that the Hyades stars are more evolved than the Pleiades stars and demonstrate lower activity. The activity of seven solar-type Hyades stars (S = 0.013 ± 0.006) almost approaches the activity level of the present-day Sun, and is lower than the activity of solar-mass stars in the Pleiades (S = 0.031 ± 0.003). Solar-type stars in the Hyades rotate faster than the Sun (⟨P⟩ = 8.6 d), but slower than similar Pleiades stars.

  14. Iterative Stable Alignment and Clustering of 2D Transmission Electron Microscope Images

    PubMed Central

    Yang, Zhengfan; Fang, Jia; Chittuluru, Johnathan; Asturias, Francisco J.; Penczek, Pawel A.

    2012-01-01

    SUMMARY Identification of homogeneous subsets of images in a macromolecular electron microscopy (EM) image data set is a critical step in single-particle analysis. The task is handled by iterative algorithms, whose performance is compromised by the compounded limitations of image alignment and K-means clustering. Here we describe an approach, iterative stable alignment and clustering (ISAC) that, relying on a new clustering method and on the concepts of stability and reproducibility, can extract validated, homogeneous subsets of images. ISAC requires only a small number of simple parameters and, with minimal human intervention, can eliminate bias from two-dimensional image clustering and maximize the quality of group averages that can be used for ab initio three-dimensional structural determination and analysis of macromolecular conformational variability. Repeated testing of the stability and reproducibility of a solution within ISAC eliminates heterogeneous or incorrect classes and introduces critical validation to the process of EM image clustering. PMID:22325773

  15. Hot and turbulent gas in clusters

    DOE PAGES

    Schmidt, W.; Engels, J. F.; Niemeyer, J. C.; ...

    2016-03-20

    The gas in galaxy clusters is heated by shock compression through accretion (outer shocks) and mergers (inner shocks). These processes also produce turbulence. To analyse the relation between the thermal and turbulent energies of the gas under the influence of non-adiabatic processes, we performed numerical simulations of cosmic structure formation in a box of 152 Mpc comoving size with radiative cooling, UV background, and a subgrid scale model for numerically unresolved turbulence. By smoothing the gas velocities with an adaptive Kalman filter, we are able to estimate bulk flows towards cluster cores. This enables us to infer the velocity dispersion associated with the turbulent fluctuation relative to the bulk flow. For haloes with masses above 10¹³ M⊙, we find that the turbulent velocity dispersions averaged over the warm-hot intergalactic medium (WHIM) and the intracluster medium (ICM) are approximately given by powers of the mean gas temperatures with exponents around 0.5, corresponding to a roughly linear relation between turbulent and thermal energies and transonic Mach numbers. However, turbulence is only weakly correlated with the halo mass. Since the power-law relation is stiffer for the WHIM, the turbulent Mach number tends to increase with the mean temperature of the WHIM. This can be attributed to enhanced turbulence production relative to dissipation in particularly hot and turbulent clusters.

  16. Hot Subdwarfs in Globular Clusters

    NASA Astrophysics Data System (ADS)

    Moehler, S.; Heber, U.; Saffer, R.; Thejll, P.

    1995-12-01

    We will present data on sdB stars in the globular clusters M 15, M 22, and NGC 6752. While NGC 6752 has been known to harbour sdBs for quite some time already (Heber et al., 1986), it has also been the only globular cluster known to do so. Only recently, sdB candidates in M 15 (Durrell & Harris, 1993) and in M 22 (Thejll, priv. comm.) have been discovered. An analysis of one of the sdBs in M 15 was presented recently (Moehler, in press), while the data on the ones in M 22 will be shown at this meeting for the first time. The physical parameters of these stars (Teff and log g) are derived from optical and IUE spectrophotometric data, intermediate resolution spectroscopy and Stromgren photometry. Knowing the distances of the clusters we can also determine masses. We want to compare the physical parameters of these stars for the different clusters to see what their evolutionary status is and how (or whether at all) it is affected by metallicity. We will also compare our findings to sdB stars found in the field of the Milky Way. In addition we want to see whether the problems encountered with the analyses of blue HB stars (Moehler et al., 1995) apply also to the sdB stars. These analyses showed the BHB stars to have significantly lower surface gravities and masses than predicted by theory. It turned out that this effect did not extend to the sdBs in NGC 6752 studied by Heber et al. (1986), which however constituted a sample too small to draw any meaningful conclusions. Durrell P.R., Harris W.E., 1993, AJ, 105, 1420. Heber U., Kudritzki R.P., Caloi V., Castellani V., Danziger J., Gilmozzi R., 1986, A&A, 162, 171-179. Moehler S., Heber U., de Boer K.S., 1995, A&A, 294, 65. Moehler S., 1995, to appear in The Formation of the Galactic Halo - Inside and Out, Proceedings of the meeting at Tucson, Oct. 9-11, 1995, ASP Conf. Ser.

  17. Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter.

    PubMed

    Mohamed Hashim, Ezzeddin Kamil; Abdullah, Rosni

    2015-12-21

    Empirical analysis on k-mer DNA has been proven as an effective tool in finding unique patterns in DNA sequences which can lead to the discovery of potential sequence motifs. In an extensive study of empirical k-mer DNA on hundreds of organisms, the researchers found unique multi-modal k-mer spectra occur in the genomes of organisms from the tetrapod clade only which includes all mammals. The multi-modality is caused by the formation of the two lowest modes where k-mers under them are referred as the rare k-mers. The suppression of the two lowest modes (or the rare k-mers) can be attributed to the CG dinucleotide inclusions in them. Apart from that, the rare k-mers are selectively distributed in certain genomic features of CpG Island (CGI), promoter, 5' UTR, and exon. We correlated the rare k-mers with hundreds of annotated features using several bioinformatic tools, performed further intrinsic rare k-mer analyses within the correlated features, and modeled the elucidated rare k-mer clustering feature into a classifier to predict the correlated CGI and promoter features. Our correlation results show that rare k-mers are highly associated with several annotated features of CGI, promoter, 5' UTR, and open chromatin regions. Our intrinsic results show that rare k-mers have several unique topological, compositional, and clustering properties in CGI and promoter features. Finally, the performances of our RWC (rare-word clustering) method in predicting the CGI and promoter features are ranked among the top three, in eight of the CGI and promoter evaluations, among eight of the benchmarked datasets. Crown Copyright © 2015. Published by Elsevier Ltd. All rights reserved.
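
    The basic objects in this abstract, k-mer counts and the k-mer spectrum, are easy to compute for a small sequence. The sketch below is a minimal illustration under stated assumptions: the toy sequence is invented, and "rare" is defined here by a simple count threshold, whereas the paper defines rare k-mers as those lying under the two lowest modes of a genome-wide multi-modal spectrum.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_spectrum(counts):
    """Map an occurrence count to the number of distinct k-mers with that count."""
    return Counter(counts.values())


def rare_kmers(counts, max_count):
    """k-mers seen at most max_count times (a simplified notion of 'rare')."""
    return sorted(kmer for kmer, c in counts.items() if c <= max_count)


toy_sequence = "ATATATCGATAT"  # hypothetical example sequence
counts = kmer_counts(toy_sequence, k=2)
spectrum = kmer_spectrum(counts)
rare = rare_kmers(counts, max_count=1)

print(counts.most_common())
print(dict(spectrum))
print(rare)
```

    On real genomes the spectrum would be plotted as a histogram over counts, and the modes the abstract refers to would show up as separate peaks in that histogram.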

  18. Visualization and unsupervised predictive clustering of high-dimensional multimodal neuroimaging data.

    PubMed

    Mwangi, Benson; Soares, Jair C; Hasan, Khader M

    2014-10-30

    Neuroimaging machine learning studies have largely utilized supervised algorithms - meaning they require both neuroimaging scan data and corresponding target variables (e.g. healthy vs. diseased) to be successfully 'trained' for a prediction task. Noticeably, this approach may not be optimal or possible when the global structure of the data is not well known and the researcher does not have an a priori model to fit the data. We set out to investigate the utility of an unsupervised machine learning technique; t-distributed stochastic neighbour embedding (t-SNE) in identifying 'unseen' sample population patterns that may exist in high-dimensional neuroimaging data. Multimodal neuroimaging scans from 92 healthy subjects were pre-processed using atlas-based methods, integrated and input into the t-SNE algorithm. Patterns and clusters discovered by the algorithm were visualized using a 2D scatter plot and further analyzed using the K-means clustering algorithm. t-SNE was evaluated against classical principal component analysis. Remarkably, based on unlabelled multimodal scan data, t-SNE separated study subjects into two very distinct clusters which corresponded to subjects' gender labels (cluster silhouette index value=0.79). The resulting clusters were used to develop an unsupervised minimum distance clustering model which identified 93.5% of subjects' gender. Notably, from a neuropsychiatric perspective this method may allow discovery of data-driven disease phenotypes or sub-types of treatment responders. Copyright © 2014 Elsevier B.V. All rights reserved.

  19. Estimation of Comfort/Disconfort Based on EEG in Massage by Use of Clustering according to Correration and Incremental Learning type NN

    NASA Astrophysics Data System (ADS)

    Teramae, Tatsuya; Kushida, Daisuke; Takemori, Fumiaki; Kitamura, Akira

    The authors previously proposed an estimation method that combines the k-means algorithm with a neural network (NN) to evaluate massage. However, the discrimination ratio of this method drops for new users. The problem has two causes: the NN generalizes poorly, and the clusters produced by the k-means algorithm do not have a high within-class correlation coefficient. This research therefore proposes a k-means algorithm based on the correlation coefficient, together with incremental learning for the NN. The proposed k-means algorithm includes an evaluation function based on the correlation coefficient. In incremental learning, the NN is trained on new data, with weights initialized from the existing data. The effect of the proposed methods is verified by estimation results using EEG data recorded while a subject is given a massage.

  20. THE DISCOVERY OF A MASSIVE CLUSTER OF RED SUPERGIANTS WITH GLIMPSE

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Alexander, Michael J.; Kobulnicky, Henry A.; Clemens, Dan P.

    We report the discovery of a previously unknown massive Galactic star cluster at l = 29.22°, b = -0.20°. Identified visually in mid-IR images from the Spitzer GLIMPSE survey, the cluster contains at least eight late-type supergiants, based on follow-up near-IR spectroscopy, and an additional 3-6 candidate supergiant members having IR photometry consistent with a similar distance and reddening. The cluster lies at a local minimum in the ¹³CO column density and 8 μm emission. We interpret this feature as a hole carved by the energetic winds of the evolving massive stars. The ¹³CO hole seen in molecular maps at V_LSR ≈ 95 km s⁻¹ corresponds to near/far kinematic distances of 6.1/8.7 ± 1 kpc. We calculate a mean spectrophotometric distance of $7.0^{+3.7}_{-2.4}$ kpc, broadly consistent with the kinematic distances inferred. This location places it near the northern end of the Galactic bar. For the mean extinction of A_V = 12.6 ± 0.5 mag (A_K = 1.5 ± 0.1 mag), the color-magnitude diagram of probable cluster members is well fit by isochrones in the age range 18-24 Myr. The estimated cluster mass is ≈20,000 M⊙. With the most massive original cluster stars likely deceased, no strong radio emission is detected in this vicinity. As such, this red supergiant (RSG) cluster is representative of adolescent massive Galactic clusters that lie hidden behind many magnitudes of dust obscuration. This cluster joins two similar RSG clusters as residents of the volatile region where the end of our Galaxy's bar joins the base of the Scutum-Crux spiral arm, suggesting a recent episode of widespread massive star formation there.