Research on retailer data clustering algorithm based on Spark
NASA Astrophysics Data System (ADS)
Huang, Qiuman; Zhou, Feng
2017-03-01
Big data analysis is a hot topic in the IT field now. Spark is a high-reliability and high-performance distributed parallel computing framework for big data sets. K-means algorithm is one of the classical partition methods in clustering algorithm. In this paper, we study the k-means clustering algorithm on Spark. Firstly, the principle of the algorithm is analyzed, and then the clustering analysis is carried out on the supermarket customers through the experiment to find out the different shopping patterns. At the same time, this paper proposes the parallelization of k-means algorithm and the distributed computing framework of Spark, and gives the concrete design scheme and implementation scheme. This paper uses the two-year sales data of a supermarket to validate the proposed clustering algorithm and achieve the goal of subdividing customers, and then analyze the clustering results to help enterprises to take different marketing strategies for different customer groups to improve sales performance.
Multimodal Estimation of Distribution Algorithms.
Yang, Qiang; Chen, Wei-Neng; Li, Yun; Chen, C L Philip; Xu, Xiang-Min; Zhang, Jun
2016-02-15
Taking the advantage of estimation of distribution algorithms (EDAs) in preserving high diversity, this paper proposes a multimodal EDA. Integrated with clustering strategies for crowding and speciation, two versions of this algorithm are developed, which operate at the niche level. Then these two algorithms are equipped with three distinctive techniques: 1) a dynamic cluster sizing strategy; 2) an alternative utilization of Gaussian and Cauchy distributions to generate offspring; and 3) an adaptive local search. The dynamic cluster sizing affords a potential balance between exploration and exploitation and reduces the sensitivity to the cluster size in the niching methods. Taking advantages of Gaussian and Cauchy distributions, we generate the offspring at the niche level through alternatively using these two distributions. Such utilization can also potentially offer a balance between exploration and exploitation. Further, solution accuracy is enhanced through a new local search scheme probabilistically conducted around seeds of niches with probabilities determined self-adaptively according to fitness values of these seeds. Extensive experiments conducted on 20 benchmark multimodal problems confirm that both algorithms can achieve competitive performance compared with several state-of-the-art multimodal algorithms, which is supported by nonparametric tests. Especially, the proposed algorithms are very promising for complex problems with many local optima.
Lukashin, A V; Fuchs, R
2001-05-01
Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm guarantees to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search of the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. The employment of this statistically rigorous test has shown that our algorithm places greater than 90% genes into correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.
An agglomerative hierarchical clustering approach to visualisation in Bayesian clustering problems
Dawson, Kevin J.; Belkhir, Khalid
2009-01-01
Clustering problems (including the clustering of individuals into outcrossing populations, hybrid generations, full-sib families and selfing lines) have recently received much attention in population genetics. In these clustering problems, the parameter of interest is a partition of the set of sampled individuals, - the sample partition. In a fully Bayesian approach to clustering problems of this type, our knowledge about the sample partition is represented by a probability distribution on the space of possible sample partitions. Since the number of possible partitions grows very rapidly with the sample size, we can not visualise this probability distribution in its entirety, unless the sample is very small. As a solution to this visualisation problem, we recommend using an agglomerative hierarchical clustering algorithm, which we call the exact linkage algorithm. This algorithm is a special case of the maximin clustering algorithm that we introduced previously. The exact linkage algorithm is now implemented in our software package Partition View. The exact linkage algorithm takes the posterior co-assignment probabilities as input, and yields as output a rooted binary tree, - or more generally, a forest of such trees. Each node of this forest defines a set of individuals, and the node height is the posterior co-assignment probability of this set. This provides a useful visual representation of the uncertainty associated with the assignment of individuals to categories. It is also a useful starting point for a more detailed exploration of the posterior distribution in terms of the co-assignment probabilities. PMID:19337306
Energy Aware Clustering Algorithms for Wireless Sensor Networks
NASA Astrophysics Data System (ADS)
Rakhshan, Noushin; Rafsanjani, Marjan Kuchaki; Liu, Chenglian
2011-09-01
The sensor nodes deployed in wireless sensor networks (WSNs) are extremely power constrained, so maximizing the lifetime of the entire networks is mainly considered in the design. In wireless sensor networks, hierarchical network structures have the advantage of providing scalable and energy efficient solutions. In this paper, we investigate different clustering algorithms for WSNs and also compare these clustering algorithms based on metrics such as clustering distribution, cluster's load balancing, Cluster Head's (CH) selection strategy, CH's role rotation, node mobility, clusters overlapping, intra-cluster communications, reliability, security and location awareness.
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation.
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most of popular clustering methods typically have some strong assumptions of the dataset. For example, the k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problem caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method to separate naturally isolated clusters but also can identify the clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behaviors information. The results show that LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it.
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most of popular clustering methods typically have some strong assumptions of the dataset. For example, the k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problem caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method to separate naturally isolated clusters but also can identify the clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behaviors information. The results show that LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it. PMID:26221133
Load Balancing in Distributed Web Caching: A Novel Clustering Approach
NASA Astrophysics Data System (ADS)
Tiwari, R.; Kumar, K.; Khan, G.
2010-11-01
The World Wide Web suffers from scaling and reliability problems due to overloaded and congested proxy servers. Caching at local proxy servers helps, but cannot satisfy more than a third to half of requests; more requests are still sent to original remote origin servers. In this paper we have developed an algorithm for Distributed Web Cache, which incorporates cooperation among proxy servers of one cluster. This algorithm uses Distributed Web Cache concepts along with static hierarchies with geographical based clusters of level one proxy server with dynamic mechanism of proxy server during the congestion of one cluster. Congestion and scalability problems are being dealt by clustering concept used in our approach. This results in higher hit ratio of caches, with lesser latency delay for requested pages. This algorithm also guarantees data consistency between the original server objects and the proxy cache objects.
NASA Astrophysics Data System (ADS)
Zhang, Tianzhen; Wang, Xiumei; Gao, Xinbo
2018-04-01
Nowadays, several datasets are demonstrated by multi-view, which usually include shared and complementary information. Multi-view clustering methods integrate the information of multi-view to obtain better clustering results. Nonnegative matrix factorization has become an essential and popular tool in clustering methods because of its interpretation. However, existing nonnegative matrix factorization based multi-view clustering algorithms do not consider the disagreement between views and neglects the fact that different views will have different contributions to the data distribution. In this paper, we propose a new multi-view clustering method, named adaptive multi-view clustering based on nonnegative matrix factorization and pairwise co-regularization. The proposed algorithm can obtain the parts-based representation of multi-view data by nonnegative matrix factorization. Then, pairwise co-regularization is used to measure the disagreement between views. There is only one parameter to auto learning the weight values according to the contribution of each view to data distribution. Experimental results show that the proposed algorithm outperforms several state-of-the-arts algorithms for multi-view clustering.
Spatial cluster detection using dynamic programming.
Sverchkov, Yuriy; Jiang, Xia; Cooper, Gregory F
2012-03-25
The task of spatial cluster detection involves finding spatial regions where some property deviates from the norm or the expected value. In a probabilistic setting this task can be expressed as finding a region where some event is significantly more likely than usual. Spatial cluster detection is of interest in fields such as biosurveillance, mining of astronomical data, military surveillance, and analysis of fMRI images. In almost all such applications we are interested both in the question of whether a cluster exists in the data, and if it exists, we are interested in finding the most accurate characterization of the cluster. We present a general dynamic programming algorithm for grid-based spatial cluster detection. The algorithm can be used for both Bayesian maximum a-posteriori (MAP) estimation of the most likely spatial distribution of clusters and Bayesian model averaging over a large space of spatial cluster distributions to compute the posterior probability of an unusual spatial clustering. The algorithm is explained and evaluated in the context of a biosurveillance application, specifically the detection and identification of Influenza outbreaks based on emergency department visits. A relatively simple underlying model is constructed for the purpose of evaluating the algorithm, and the algorithm is evaluated using the model and semi-synthetic test data. When compared to baseline methods, tests indicate that the new algorithm can improve MAP estimates under certain conditions: the greedy algorithm we compared our method to was found to be more sensitive to smaller outbreaks, while as the size of the outbreaks increases, in terms of area affected and proportion of individuals affected, our method overtakes the greedy algorithm in spatial precision and recall. The new algorithm performs on-par with baseline methods in the task of Bayesian model averaging. We conclude that the dynamic programming algorithm performs on-par with other available methods for spatial cluster detection and point to its low computational cost and extendability as advantages in favor of further research and use of the algorithm.
Spatial cluster detection using dynamic programming
2012-01-01
Background The task of spatial cluster detection involves finding spatial regions where some property deviates from the norm or the expected value. In a probabilistic setting this task can be expressed as finding a region where some event is significantly more likely than usual. Spatial cluster detection is of interest in fields such as biosurveillance, mining of astronomical data, military surveillance, and analysis of fMRI images. In almost all such applications we are interested both in the question of whether a cluster exists in the data, and if it exists, we are interested in finding the most accurate characterization of the cluster. Methods We present a general dynamic programming algorithm for grid-based spatial cluster detection. The algorithm can be used for both Bayesian maximum a-posteriori (MAP) estimation of the most likely spatial distribution of clusters and Bayesian model averaging over a large space of spatial cluster distributions to compute the posterior probability of an unusual spatial clustering. The algorithm is explained and evaluated in the context of a biosurveillance application, specifically the detection and identification of Influenza outbreaks based on emergency department visits. A relatively simple underlying model is constructed for the purpose of evaluating the algorithm, and the algorithm is evaluated using the model and semi-synthetic test data. Results When compared to baseline methods, tests indicate that the new algorithm can improve MAP estimates under certain conditions: the greedy algorithm we compared our method to was found to be more sensitive to smaller outbreaks, while as the size of the outbreaks increases, in terms of area affected and proportion of individuals affected, our method overtakes the greedy algorithm in spatial precision and recall. The new algorithm performs on-par with baseline methods in the task of Bayesian model averaging. Conclusions We conclude that the dynamic programming algorithm performs on-par with other available methods for spatial cluster detection and point to its low computational cost and extendability as advantages in favor of further research and use of the algorithm. PMID:22443103
Fuzzy-Logic Based Distributed Energy-Efficient Clustering Algorithm for Wireless Sensor Networks.
Zhang, Ying; Wang, Jun; Han, Dezhi; Wu, Huafeng; Zhou, Rundong
2017-07-03
Due to the high-energy efficiency and scalability, the clustering routing algorithm has been widely used in wireless sensor networks (WSNs). In order to gather information more efficiently, each sensor node transmits data to its Cluster Head (CH) to which it belongs, by multi-hop communication. However, the multi-hop communication in the cluster brings the problem of excessive energy consumption of the relay nodes which are closer to the CH. These nodes' energy will be consumed more quickly than the farther nodes, which brings the negative influence on load balance for the whole networks. Therefore, we propose an energy-efficient distributed clustering algorithm based on fuzzy approach with non-uniform distribution (EEDCF). During CHs' election, we take nodes' energies, nodes' degree and neighbor nodes' residual energies into consideration as the input parameters. In addition, we take advantage of Takagi, Sugeno and Kang (TSK) fuzzy model instead of traditional method as our inference system to guarantee the quantitative analysis more reasonable. In our scheme, each sensor node calculates the probability of being as CH with the help of fuzzy inference system in a distributed way. The experimental results indicate EEDCF algorithm is better than some current representative methods in aspects of data transmission, energy consumption and lifetime of networks.
Swarm Intelligence in Text Document Clustering
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cui, Xiaohui; Potok, Thomas E
2008-01-01
Social animals or insects in nature often exhibit a form of emergent collective behavior. The research field that attempts to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies is called Swarm Intelligence. Compared to the traditional algorithms, the swarm algorithms are usually flexible, robust, decentralized and self-organized. These characters make the swarm algorithms suitable for solving complex problems, such as document collection clustering. The major challenge of today's information society is being overwhelmed with information on any topic they are searching for. Fast and high-quality document clustering algorithms play an important role inmore » helping users to effectively navigate, summarize, and organize the overwhelmed information. In this chapter, we introduce three nature inspired swarm intelligence clustering approaches for document clustering analysis. These clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools and ant food forage.« less
Dynamical Mass Measurements of Contaminated Galaxy Clusters Using Support Distribution Machines
NASA Astrophysics Data System (ADS)
Ntampaka, Michelle; Trac, Hy; Sutherland, Dougal; Fromenteau, Sebastien; Poczos, Barnabas; Schneider, Jeff
2018-01-01
We study dynamical mass measurements of galaxy clusters contaminated by interlopers and show that a modern machine learning (ML) algorithm can predict masses by better than a factor of two compared to a standard scaling relation approach. We create two mock catalogs from Multidark’s publicly available N-body MDPL1 simulation, one with perfect galaxy cluster membership infor- mation and the other where a simple cylindrical cut around the cluster center allows interlopers to contaminate the clusters. In the standard approach, we use a power-law scaling relation to infer cluster mass from galaxy line-of-sight (LOS) velocity dispersion. Assuming perfect membership knowledge, this unrealistic case produces a wide fractional mass error distribution, with a width E=0.87. Interlopers introduce additional scatter, significantly widening the error distribution further (E=2.13). We employ the support distribution machine (SDM) class of algorithms to learn from distributions of data to predict single values. Applied to distributions of galaxy observables such as LOS velocity and projected distance from the cluster center, SDM yields better than a factor-of-two improvement (E=0.67) for the contaminated case. Remarkably, SDM applied to contaminated clusters is better able to recover masses than even the scaling relation approach applied to uncon- taminated clusters. We show that the SDM method more accurately reproduces the cluster mass function, making it a valuable tool for employing cluster observations to evaluate cosmological models.
Abelian non-global logarithms from soft gluon clustering
NASA Astrophysics Data System (ADS)
Kelley, Randall; Walsh, Jonathan R.; Zuberi, Saba
2012-09-01
Most recombination-style jet algorithms cluster soft gluons in a complex way. This leads to previously identified correlations in the soft gluon phase space and introduces logarithmic corrections to jet cross sections, which are known as clustering logarithms. The leading Abelian clustering logarithms occur at least at next-to leading logarithm (NLL) in the exponent of the distribution. Using the framework of Soft Collinear Effective Theory (SCET), we show that new clustering effects contributing at NLL arise at each order. While numerical resummation of clustering logs is possible, it is unlikely that they can be analytically resummed to NLL. Clustering logarithms make the anti-kT algorithm theoretically preferred, for which they are power suppressed. They can arise in Abelian and non-Abelian terms, and we calculate the Abelian clustering logarithms at O ( {α_s^2} ) for the jet mass distribution using the Cambridge/Aachen and kT algorithms, including jet radius dependence, which extends previous results. We find that clustering logarithms can be naturally thought of as a class of non-global logarithms, which have traditionally been tied to non-Abelian correlations in soft gluon emission.
NASA Technical Reports Server (NTRS)
Eigen, D. J.; Fromm, F. R.; Northouse, R. A.
1974-01-01
A new clustering algorithm is presented that is based on dimensional information. The algorithm includes an inherent feature selection criterion, which is discussed. Further, a heuristic method for choosing the proper number of intervals for a frequency distribution histogram, a feature necessary for the algorithm, is presented. The algorithm, although usable as a stand-alone clustering technique, is then utilized as a global approximator. Local clustering techniques and configuration of a global-local scheme are discussed, and finally the complete global-local and feature selector configuration is shown in application to a real-time adaptive classification scheme for the analysis of remote sensed multispectral scanner data.
Data-driven process decomposition and robust online distributed modelling for large-scale processes
NASA Astrophysics Data System (ADS)
Shu, Zhang; Lijuan, Li; Lijuan, Yao; Shipin, Yang; Tao, Zou
2018-02-01
With the increasing attention of networked control, system decomposition and distributed models show significant importance in the implementation of model-based control strategy. In this paper, a data-driven system decomposition and online distributed subsystem modelling algorithm was proposed for large-scale chemical processes. The key controlled variables are first partitioned by affinity propagation clustering algorithm into several clusters. Each cluster can be regarded as a subsystem. Then the inputs of each subsystem are selected by offline canonical correlation analysis between all process variables and its controlled variables. Process decomposition is then realised after the screening of input and output variables. When the system decomposition is finished, the online subsystem modelling can be carried out by recursively block-wise renewing the samples. The proposed algorithm was applied in the Tennessee Eastman process and the validity was verified.
Jiang, Peng; Xu, Yiming; Wu, Feng
2016-01-01
Existing move-restricted node self-deployment algorithms are based on a fixed node communication radius, evaluate the performance based on network coverage or the connectivity rate and do not consider the number of nodes near the sink node and the energy consumption distribution of the network topology, thereby degrading network reliability and the energy consumption balance. Therefore, we propose a distributed underwater node self-deployment algorithm. First, each node begins the uneven clustering based on the distance on the water surface. Each cluster head node selects its next-hop node to synchronously construct a connected path to the sink node. Second, the cluster head node adjusts its depth while maintaining the layout formed by the uneven clustering and then adjusts the positions of in-cluster nodes. The algorithm originally considers the network reliability and energy consumption balance during node deployment and considers the coverage redundancy rate of all positions that a node may reach during the node position adjustment. Simulation results show, compared to the connected dominating set (CDS) based depth computation algorithm, that the proposed algorithm can increase the number of the nodes near the sink node and improve network reliability while guaranteeing the network connectivity rate. Moreover, it can balance energy consumption during network operation, further improve network coverage rate and reduce energy consumption. PMID:26784193
Iterative Track Fitting Using Cluster Classification in Multi Wire Proportional Chamber
NASA Astrophysics Data System (ADS)
Primor, David; Mikenberg, Giora; Etzion, Erez; Messer, Hagit
2007-10-01
This paper addresses the problem of track fitting of a charged particle in a multi wire proportional chamber (MWPC) using cathode readout strips. When a charged particle crosses a MWPC, a positive charge is induced on a cluster of adjacent strips. In the presence of high radiation background, the cluster charge measurements may be contaminated due to background particles, leading to less accurate hit position estimation. The least squares method for track fitting assumes the same position error distribution for all hits and thus loses its optimal properties on contaminated data. For this reason, a new robust algorithm is proposed. The algorithm first uses the known spatial charge distribution caused by a single charged particle over the strips, and classifies the clusters into ldquocleanrdquo and ldquodirtyrdquo clusters. Then, using the classification results, it performs an iterative weighted least squares fitting procedure, updating its optimal weights each iteration. The performance of the suggested algorithm is compared to other track fitting techniques using a simulation of tracks with radiation background. It is shown that the algorithm improves the track fitting performance significantly. A practical implementation of the algorithm is presented for muon track fitting in the cathode strip chamber (CSC) of the ATLAS experiment.
An extended affinity propagation clustering method based on different data density types.
Zhao, XiuLi; Xu, WeiXiang
2015-01-01
Affinity propagation (AP) algorithm, as a novel clustering method, does not require the users to specify the initial cluster centers in advance, which regards all data points as potential exemplars (cluster centers) equally and groups the clusters totally by the similar degree among the data points. But in many cases there exist some different intensive areas within the same data set, which means that the data set does not distribute homogeneously. In such situation the AP algorithm cannot group the data points into ideal clusters. In this paper, we proposed an extended AP clustering algorithm to deal with such a problem. There are two steps in our method: firstly the data set is partitioned into several data density types according to the nearest distances of each data point; and then the AP clustering method is, respectively, used to group the data points into clusters in each data density type. Two experiments are carried out to evaluate the performance of our algorithm: one utilizes an artificial data set and the other uses a real seismic data set. The experiment results show that groups are obtained more accurately by our algorithm than OPTICS and AP clustering algorithm itself.
Scalable Parallel Density-based Clustering and Applications
NASA Astrophysics Data System (ADS)
Patwary, Mostofa Ali
2014-04-01
Recently, density-based clustering algorithms (DBSCAN and OPTICS) have gotten significant attention of the scientific community due to their unique capability of discovering arbitrary shaped clusters and eliminating noise data. These algorithms have several applications, which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms are extremely challenging as they exhibit inherent sequential data access order, unbalanced workload resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
Reducing the Volume of NASA Earth-Science Data
NASA Technical Reports Server (NTRS)
Lee, Seungwon; Braverman, Amy J.; Guillaume, Alexandre
2010-01-01
A computer program reduces data generated by NASA Earth-science missions into representative clusters characterized by centroids and membership information, thereby reducing the large volume of data to a level more amenable to analysis. The program effects an autonomous data-reduction/clustering process to produce a representative distribution and joint relationships of the data, without assuming a specific type of distribution and relationship and without resorting to domain-specific knowledge about the data. The program implements a combination of a data-reduction algorithm known as the entropy-constrained vector quantization (ECVQ) and an optimization algorithm known as the differential evolution (DE). The combination of algorithms generates the Pareto front of clustering solutions that presents the compromise between the quality of the reduced data and the degree of reduction. Similar prior data-reduction computer programs utilize only a clustering algorithm, the parameters of which are tuned manually by users. In the present program, autonomous optimization of the parameters by means of the DE supplants the manual tuning of the parameters. Thus, the program determines the best set of clustering solutions without human intervention.
Reconstruction of a digital core containing clay minerals based on a clustering algorithm.
He, Yanlong; Pu, Chunsheng; Jing, Cheng; Gu, Xiaoyu; Chen, Qingdong; Liu, Hongzhi; Khan, Nasir; Dong, Qiaoling
2017-10-01
It is difficult to obtain a core sample and information for digital core reconstruction of mature sandstone reservoirs around the world, especially for an unconsolidated sandstone reservoir. Meanwhile, reconstruction and division of clay minerals play a vital role in the reconstruction of the digital cores, although the two-dimensional data-based reconstruction methods are specifically applicable as the microstructure reservoir simulation methods for the sandstone reservoir. However, reconstruction of clay minerals is still challenging from a research viewpoint for the better reconstruction of various clay minerals in the digital cores. In the present work, the content of clay minerals was considered on the basis of two-dimensional information about the reservoir. After application of the hybrid method, and compared with the model reconstructed by the process-based method, the digital core containing clay clusters without the labels of the clusters' number, size, and texture were the output. The statistics and geometry of the reconstruction model were similar to the reference model. In addition, the Hoshen-Kopelman algorithm was used to label various connected unclassified clay clusters in the initial model and then the number and size of clay clusters were recorded. At the same time, the K-means clustering algorithm was applied to divide the labeled, large connecting clusters into smaller clusters on the basis of difference in the clusters' characteristics. According to the clay minerals' characteristics, such as types, textures, and distributions, the digital core containing clay minerals was reconstructed by means of the clustering algorithm and the clay clusters' structure judgment. The distributions and textures of the clay minerals of the digital core were reasonable. The clustering algorithm improved the digital core reconstruction and provided an alternative method for the simulation of different clay minerals in the digital cores.
Gao, Ying; Wkram, Chris Hadri; Duan, Jiajie; Chou, Jarong
2015-01-01
In order to prolong the network lifetime, energy-efficient protocols adapted to the features of wireless sensor networks should be used. This paper explores in depth the nature of heterogeneous wireless sensor networks, and finally proposes an algorithm to address the problem of finding an effective pathway for heterogeneous clustering energy. The proposed algorithm implements cluster head selection according to the degree of energy attenuation during the network’s running and the degree of candidate nodes’ effective coverage on the whole network, so as to obtain an even energy consumption over the whole network for the situation with high degree of coverage. Simulation results show that the proposed clustering protocol has better adaptability to heterogeneous environments than existing clustering algorithms in prolonging the network lifetime. PMID:26690440
Dynamical Mass Measurements of Contaminated Galaxy Clusters Using Machine Learning
NASA Astrophysics Data System (ADS)
Ntampaka, M.; Trac, H.; Sutherland, D. J.; Fromenteau, S.; Póczos, B.; Schneider, J.
2016-11-01
We study dynamical mass measurements of galaxy clusters contaminated by interlopers and show that a modern machine learning algorithm can predict masses by better than a factor of two compared to a standard scaling relation approach. We create two mock catalogs from Multidark’s publicly available N-body MDPL1 simulation, one with perfect galaxy cluster membership information and the other where a simple cylindrical cut around the cluster center allows interlopers to contaminate the clusters. In the standard approach, we use a power-law scaling relation to infer cluster mass from galaxy line-of-sight (LOS) velocity dispersion. Assuming perfect membership knowledge, this unrealistic case produces a wide fractional mass error distribution, with a width of {{Δ }}ε ≈ 0.87. Interlopers introduce additional scatter, significantly widening the error distribution further ({{Δ }}ε ≈ 2.13). We employ the support distribution machine (SDM) class of algorithms to learn from distributions of data to predict single values. Applied to distributions of galaxy observables such as LOS velocity and projected distance from the cluster center, SDM yields better than a factor-of-two improvement ({{Δ }}ε ≈ 0.67) for the contaminated case. Remarkably, SDM applied to contaminated clusters is better able to recover masses than even the scaling relation approach applied to uncontaminated clusters. We show that the SDM method more accurately reproduces the cluster mass function, making it a valuable tool for employing cluster observations to evaluate cosmological models.
NASA Astrophysics Data System (ADS)
Kumar, Rohit; Puri, Rajeev K.
2018-03-01
Employing the quantum molecular dynamics (QMD) approach for nucleus-nucleus collisions, we test the predictive power of the energy-based clusterization algorithm, i.e., the simulating annealing clusterization algorithm (SACA), to describe the experimental data of charge distribution and various event-by-event correlations among fragments. The calculations are constrained into the Fermi-energy domain and/or mildly excited nuclear matter. Our detailed study spans over different system masses, and system-mass asymmetries of colliding partners show the importance of the energy-based clusterization algorithm for understanding multifragmentation. The present calculations are also compared with the other available calculations, which use one-body models, statistical models, and/or hybrid models.
Adaptive density trajectory cluster based on time and space distance
NASA Astrophysics Data System (ADS)
Liu, Fagui; Zhang, Zhijie
2017-10-01
There are some hotspot problems remaining in trajectory cluster for discovering mobile behavior regularity, such as the computation of distance between sub trajectories, the setting of parameter values in cluster algorithm and the uncertainty/boundary problem of data set. As a result, based on the time and space, this paper tries to define the calculation method of distance between sub trajectories. The significance of distance calculation for sub trajectories is to clearly reveal the differences in moving trajectories and to promote the accuracy of cluster algorithm. Besides, a novel adaptive density trajectory cluster algorithm is proposed, in which cluster radius is computed through using the density of data distribution. In addition, cluster centers and number are selected by a certain strategy automatically, and uncertainty/boundary problem of data set is solved by designed weighted rough c-means. Experimental results demonstrate that the proposed algorithm can perform the fuzzy trajectory cluster effectively on the basis of the time and space distance, and obtain the optimal cluster centers and rich cluster results information adaptably for excavating the features of mobile behavior in mobile and sociology network.
Two generalizations of Kohonen clustering
NASA Technical Reports Server (NTRS)
Bezdek, James C.; Pal, Nikhil R.; Tsao, Eric C. K.
1993-01-01
The relationship between the sequential hard c-means (SHCM), learning vector quantization (LVQ), and fuzzy c-means (FCM) clustering algorithms is discussed. LVQ and SHCM suffer from several major problems. For example, they depend heavily on initialization. If the initial values of the cluster centers are outside the convex hull of the input data, such algorithms, even if they terminate, may not produce meaningful results in terms of prototypes for cluster representation. This is due in part to the fact that they update only the winning prototype for every input vector. The impact and interaction of these two families with Kohonen's self-organizing feature mapping (SOFM), which is not a clustering method, but which often leads ideas to clustering algorithms is discussed. Then two generalizations of LVQ that are explicitly designed as clustering algorithms are presented; these algorithms are referred to as generalized LVQ = GLVQ; and fuzzy LVQ = FLVQ. Learning rules are derived to optimize an objective function whose goal is to produce 'good clusters'. GLVQ/FLVQ (may) update every node in the clustering net for each input vector. Neither GLVQ nor FLVQ depends upon a choice for the update neighborhood or learning rate distribution - these are taken care of automatically. Segmentation of a gray tone image is used as a typical application of these algorithms to illustrate the performance of GLVQ/FLVQ.
A gossip based information fusion protocol for distributed frequent itemset mining
NASA Astrophysics Data System (ADS)
Sohrabi, Mohammad Karim
2018-07-01
The computational complexity, huge memory space requirement, and time-consuming nature of frequent pattern mining process are the most important motivations for distribution and parallelization of this mining process. On the other hand, the emergence of distributed computational and operational environments, which causes the production and maintenance of data on different distributed data sources, makes the parallelization and distribution of the knowledge discovery process inevitable. In this paper, a gossip based distributed itemset mining (GDIM) algorithm is proposed to extract frequent itemsets, which are special types of frequent patterns, in a wireless sensor network environment. In this algorithm, local frequent itemsets of each sensor are extracted using a bit-wise horizontal approach (LHPM) from the nodes which are clustered using a leach-based protocol. Heads of clusters exploit a gossip based protocol in order to communicate each other to find the patterns which their global support is equal to or more than the specified support threshold. Experimental results show that the proposed algorithm outperforms the best existing gossip based algorithm in term of execution time.
Li, Xiaofang; Xu, Lizhong; Wang, Huibin; Song, Jie; Yang, Simon X.
2010-01-01
The traditional Low Energy Adaptive Cluster Hierarchy (LEACH) routing protocol is a clustering-based protocol. The uneven selection of cluster heads results in premature death of cluster heads and premature blind nodes inside the clusters, thus reducing the overall lifetime of the network. With a full consideration of information on energy and distance distribution of neighboring nodes inside the clusters, this paper proposes a new routing algorithm based on differential evolution (DE) to improve the LEACH routing protocol. To meet the requirements of monitoring applications in outdoor environments such as the meteorological, hydrological and wetland ecological environments, the proposed algorithm uses the simple and fast search features of DE to optimize the multi-objective selection of cluster heads and prevent blind nodes for improved energy efficiency and system stability. Simulation results show that the proposed new LEACH routing algorithm has better performance, effectively extends the working lifetime of the system, and improves the quality of the wireless sensor networks. PMID:22219670
Poole, William; Leinonen, Kalle; Shmulevich, Ilya
2017-01-01
Cancer researchers have long recognized that somatic mutations are not uniformly distributed within genes. However, most approaches for identifying cancer mutations focus on either the entire-gene or single amino-acid level. We have bridged these two methodologies with a multiscale mutation clustering algorithm that identifies variable length mutation clusters in cancer genes. We ran our algorithm on 539 genes using the combined mutation data in 23 cancer types from The Cancer Genome Atlas (TCGA) and identified 1295 mutation clusters. The resulting mutation clusters cover a wide range of scales and often overlap with many kinds of protein features including structured domains, phosphorylation sites, and known single nucleotide variants. We statistically associated these multiscale clusters with gene expression and drug response data to illuminate the functional and clinical consequences of mutations in our clusters. Interestingly, we find multiple clusters within individual genes that have differential functional associations: these include PTEN, FUBP1, and CDH1. This methodology has potential implications in identifying protein regions for drug targets, understanding the biological underpinnings of cancer, and personalizing cancer treatments. Toward this end, we have made the mutation clusters and the clustering algorithm available to the public. Clusters and pathway associations can be interactively browsed at m2c.systemsbiology.net. The multiscale mutation clustering algorithm is available at https://github.com/IlyaLab/M2C. PMID:28170390
Poole, William; Leinonen, Kalle; Shmulevich, Ilya; Knijnenburg, Theo A; Bernard, Brady
2017-02-01
Cancer researchers have long recognized that somatic mutations are not uniformly distributed within genes. However, most approaches for identifying cancer mutations focus on either the entire-gene or single amino-acid level. We have bridged these two methodologies with a multiscale mutation clustering algorithm that identifies variable length mutation clusters in cancer genes. We ran our algorithm on 539 genes using the combined mutation data in 23 cancer types from The Cancer Genome Atlas (TCGA) and identified 1295 mutation clusters. The resulting mutation clusters cover a wide range of scales and often overlap with many kinds of protein features including structured domains, phosphorylation sites, and known single nucleotide variants. We statistically associated these multiscale clusters with gene expression and drug response data to illuminate the functional and clinical consequences of mutations in our clusters. Interestingly, we find multiple clusters within individual genes that have differential functional associations: these include PTEN, FUBP1, and CDH1. This methodology has potential implications in identifying protein regions for drug targets, understanding the biological underpinnings of cancer, and personalizing cancer treatments. Toward this end, we have made the mutation clusters and the clustering algorithm available to the public. Clusters and pathway associations can be interactively browsed at m2c.systemsbiology.net. The multiscale mutation clustering algorithm is available at https://github.com/IlyaLab/M2C.
Azad, Ariful; Ouzounis, Christos A; Kyrpides, Nikos C; Buluç, Aydin
2018-01-01
Abstract Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. Here, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ∼70 million nodes with ∼68 billion edges in ∼2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license. PMID:29315405
Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.; ...
2018-01-05
Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times andmore » memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azad, Ariful; Pavlopoulos, Georgios A.; Ouzounis, Christos A.
Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein–protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL’s scalability to cluster large datasets still remains a bottleneck due to high running times andmore » memory demands. In this paper, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. Finally, HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.« less
VizieR Online Data Catalog: Proper motions of PM2000 open clusters (Krone-Martins+, 2010)
NASA Astrophysics Data System (ADS)
Krone-Martins, A.; Soubiran, C.; Ducourant, C.; Teixeira, R.; Le Campion, J. F.
2010-04-01
We present lists of proper-motions and kinematic membership probabilities in the region of 49 open clusters or possible open clusters. The stellar proper motions were taken from the Bordeaux PM2000 catalogue. The segregation between cluster and field stars and the assignment of membership probabilities was accomplished by applying a fully automated method based on parametrisations for the probability distribution functions and genetic algorithm optimisation heuristics associated with a derivative-based hill climbing algorithm for the likelihood optimization. (3 data files).
A highly efficient multi-core algorithm for clustering extremely large datasets
2010-01-01
Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
Parallel Clustering Algorithm for Large-Scale Biological Data Sets
Wang, Minchao; Zhang, Wu; Ding, Wang; Dai, Dongbo; Zhang, Huiran; Xie, Hao; Chen, Luonan; Guo, Yike; Xie, Jiang
2014-01-01
Backgrounds Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Methods Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. Result A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. PMID:24705246
Graphical Evaluation of Hierarchical Clustering Schemes. Technical Report No. 1.
ERIC Educational Resources Information Center
Halff, Henry M.
Graphical methods for evaluating the fit of Johnson's hierarchical clustering schemes are presented together with an example. These evaluation methods examine the extent to which the clustering algorithm can minimize the overlap of the distributions of intracluster and intercluster distances. (Author)
Multivariate Spatial Condition Mapping Using Subtractive Fuzzy Cluster Means
Sabit, Hakilo; Al-Anbuky, Adnan
2014-01-01
Wireless sensor networks are usually deployed for monitoring given physical phenomena taking place in a specific space and over a specific duration of time. The spatio-temporal distribution of these phenomena often correlates to certain physical events. To appropriately characterise these events-phenomena relationships over a given space for a given time frame, we require continuous monitoring of the conditions. WSNs are perfectly suited for these tasks, due to their inherent robustness. This paper presents a subtractive fuzzy cluster means algorithm and its application in data stream mining for wireless sensor systems over a cloud-computing-like architecture, which we call sensor cloud data stream mining. Benchmarking on standard mining algorithms, the k-means and the FCM algorithms, we have demonstrated that the subtractive fuzzy cluster means model can perform high quality distributed data stream mining tasks comparable to centralised data stream mining. PMID:25313495
Qin, Jiahu; Fu, Weiming; Gao, Huijun; Zheng, Wei Xing
2016-03-03
This paper is concerned with developing a distributed k-means algorithm and a distributed fuzzy c-means algorithm for wireless sensor networks (WSNs) where each node is equipped with sensors. The underlying topology of the WSN is supposed to be strongly connected. The consensus algorithm in multiagent consensus theory is utilized to exchange the measurement information of the sensors in WSN. To obtain a faster convergence speed as well as a higher possibility of having the global optimum, a distributed k-means++ algorithm is first proposed to find the initial centroids before executing the distributed k-means algorithm and the distributed fuzzy c-means algorithm. The proposed distributed k-means algorithm is capable of partitioning the data observed by the nodes into measure-dependent groups which have small in-group and large out-group distances, while the proposed distributed fuzzy c-means algorithm is capable of partitioning the data observed by the nodes into different measure-dependent groups with degrees of membership values ranging from 0 to 1. Simulation results show that the proposed distributed algorithms can achieve almost the same results as that given by the centralized clustering algorithms.
Density-based clustering analyses to identify heterogeneous cellular sub-populations
NASA Astrophysics Data System (ADS)
Heaster, Tiffany M.; Walsh, Alex J.; Landman, Bennett A.; Skala, Melissa C.
2017-02-01
Autofluorescence microscopy of NAD(P)H and FAD provides functional metabolic measurements at the single-cell level. Here, density-based clustering algorithms were applied to metabolic autofluorescence measurements to identify cell-level heterogeneity in tumor cell cultures. The performance of the density-based clustering algorithm, DENCLUE, was tested in samples with known heterogeneity (co-cultures of breast carcinoma lines). DENCLUE was found to better represent the distribution of cell clusters compared to Gaussian mixture modeling. Overall, DENCLUE is a promising approach to quantify cell-level heterogeneity, and could be used to understand single cell population dynamics in cancer progression and treatment.
Density-Aware Clustering Based on Aggregated Heat Kernel and Its Transformation
Huang, Hao; Yoo, Shinjae; Yu, Dantong; ...
2015-06-01
Current spectral clustering algorithms suffer from the sensitivity to existing noise, and parameter scaling, and may not be aware of different density distributions across clusters. If these problems are left untreated, the consequent clustering results cannot accurately represent true data patterns, in particular, for complex real world datasets with heterogeneous densities. This paper aims to solve these problems by proposing a diffusion-based Aggregated Heat Kernel (AHK) to improve the clustering stability, and a Local Density Affinity Transformation (LDAT) to correct the bias originating from different cluster densities. AHK statistically\\ models the heat diffusion traces along the entire time scale, somore » it ensures robustness during clustering process, while LDAT probabilistically reveals local density of each instance and suppresses the local density bias in the affinity matrix. Our proposed framework integrates these two techniques systematically. As a result, not only does it provide an advanced noise-resisting and density-aware spectral mapping to the original dataset, but also demonstrates the stability during the processing of tuning the scaling parameter (which usually controls the range of neighborhood). Furthermore, our framework works well with the majority of similarity kernels, which ensures its applicability to many types of data and problem domains. The systematic experiments on different applications show that our proposed algorithms outperform state-of-the-art clustering algorithms for the data with heterogeneous density distributions, and achieve robust clustering performance with respect to tuning the scaling parameter and handling various levels and types of noise.« less
Autonomous distributed self-organization for mobile wireless sensor networks.
Wen, Chih-Yu; Tang, Hung-Kai
2009-01-01
This paper presents an adaptive combined-metrics-based clustering scheme for mobile wireless sensor networks, which manages the mobile sensors by utilizing the hierarchical network structure and allocates network resources efficiently A local criteria is used to help mobile sensors form a new cluster or join a current cluster. The messages transmitted during hierarchical clustering are applied to choose distributed gateways such that communication for adjacent clusters and distributed topology control can be achieved. In order to balance the load among clusters and govern the topology change, a cluster reformation scheme using localized criterions is implemented. The proposed scheme is simulated and analyzed to abstract the network behaviors in a number of settings. The experimental results show that the proposed algorithm provides efficient network topology management and achieves high scalability in mobile sensor networks.
NASA Astrophysics Data System (ADS)
Lu, Siqi; Wang, Xiaorong; Wu, Junyong
2018-01-01
The paper presents a method to generate the planning scenarios, which is based on K-means clustering analysis algorithm driven by data, for the location and size planning of distributed photovoltaic (PV) units in the network. Taken the power losses of the network, the installation and maintenance costs of distributed PV, the profit of distributed PV and the voltage offset as objectives and the locations and sizes of distributed PV as decision variables, Pareto optimal front is obtained through the self-adaptive genetic algorithm (GA) and solutions are ranked by a method called technique for order preference by similarity to an ideal solution (TOPSIS). Finally, select the planning schemes at the top of the ranking list based on different planning emphasis after the analysis in detail. The proposed method is applied to a 10-kV distribution network in Gansu Province, China and the results are discussed.
NASA Astrophysics Data System (ADS)
Zhang, Jiangjiang; Lin, Guang; Li, Weixuan; Wu, Laosheng; Zeng, Lingzao
2018-03-01
Ensemble smoother (ES) has been widely used in inverse modeling of hydrologic systems. However, for problems where the distribution of model parameters is multimodal, using ES directly would be problematic. One popular solution is to use a clustering algorithm to identify each mode and update the clusters with ES separately. However, this strategy may not be very efficient when the dimension of parameter space is high or the number of modes is large. Alternatively, we propose in this paper a very simple and efficient algorithm, i.e., the iterative local updating ensemble smoother (ILUES), to explore multimodal distributions of model parameters in nonlinear hydrologic systems. The ILUES algorithm works by updating local ensembles of each sample with ES to explore possible multimodal distributions. To achieve satisfactory data matches in nonlinear problems, we adopt an iterative form of ES to assimilate the measurements multiple times. Numerical cases involving nonlinearity and multimodality are tested to illustrate the performance of the proposed method. It is shown that overall the ILUES algorithm can well quantify the parametric uncertainties of complex hydrologic models, no matter whether the multimodal distribution exists.
Chang, Yuchao; Tang, Hongying; Cheng, Yongbo; Zhao, Qin; Yuan, Baoqing Li andXiaobing
2017-07-19
Routing protocols based on topology control are significantly important for improving network longevity in wireless sensor networks (WSNs). Traditionally, some WSN routing protocols distribute uneven network traffic load to sensor nodes, which is not optimal for improving network longevity. Differently to conventional WSN routing protocols, we propose a dynamic hierarchical protocol based on combinatorial optimization (DHCO) to balance energy consumption of sensor nodes and to improve WSN longevity. For each sensor node, the DHCO algorithm obtains the optimal route by establishing a feasible routing set instead of selecting the cluster head or the next hop node. The process of obtaining the optimal route can be formulated as a combinatorial optimization problem. Specifically, the DHCO algorithm is carried out by the following procedures. It employs a hierarchy-based connection mechanism to construct a hierarchical network structure in which each sensor node is assigned to a special hierarchical subset; it utilizes the combinatorial optimization theory to establish the feasible routing set for each sensor node, and takes advantage of the maximum-minimum criterion to obtain their optimal routes to the base station. Various results of simulation experiments show effectiveness and superiority of the DHCO algorithm in comparison with state-of-the-art WSN routing algorithms, including low-energy adaptive clustering hierarchy (LEACH), hybrid energy-efficient distributed clustering (HEED), genetic protocol-based self-organizing network clustering (GASONeC), and double cost function-based routing (DCFR) algorithms.
Optimizing Cluster Heads for Energy Efficiency in Large-Scale Heterogeneous Wireless Sensor Networks
Gu, Yi; Wu, Qishi; Rao, Nageswara S. V.
2010-01-01
Many complex sensor network applications require deploying a large number of inexpensive and small sensors in a vast geographical region to achieve quality through quantity. Hierarchical clustering is generally considered as an efficient and scalable way to facilitate the management and operation of such large-scale networks and minimize the total energy consumption for prolonged lifetime. Judicious selection of cluster heads for data integration and communication is critical to the success of applications based on hierarchical sensor networks organized as layered clusters. We investigate the problem of selecting sensor nodes in a predeployed sensor network to be the cluster heads tomore » minimize the total energy needed for data gathering. We rigorously derive an analytical formula to optimize the number of cluster heads in sensor networks under uniform node distribution, and propose a Distance-based Crowdedness Clustering algorithm to determine the cluster heads in sensor networks under general node distribution. The results from an extensive set of experiments on a large number of simulated sensor networks illustrate the performance superiority of the proposed solution over the clustering schemes based on k -means algorithm.« less
Nuclear Potential Clustering As a New Tool to Detect Patterns in High Dimensional Datasets
NASA Astrophysics Data System (ADS)
Tonkova, V.; Paulus, D.; Neeb, H.
2013-02-01
We present a new approach for the clustering of high dimensional data without prior assumptions about the structure of the underlying distribution. The proposed algorithm is based on a concept adapted from nuclear physics. To partition the data, we model the dynamic behaviour of nucleons interacting in an N-dimensional space. An adaptive nuclear potential, comprised of a short-range attractive (strong interaction) and a long-range repulsive term (Coulomb force) is assigned to each data point. By modelling the dynamics, nucleons that are densely distributed in space fuse to build nuclei (clusters) whereas single point clusters repel each other. The formation of clusters is completed when the system reaches the state of minimal potential energy. The data are then grouped according to the particles' final effective potential energy level. The performance of the algorithm is tested with several synthetic datasets showing that the proposed method can robustly identify clusters even when complex configurations are present. Furthermore, quantitative MRI data from 43 multiple sclerosis patients were analyzed, showing a reasonable splitting into subgroups according to the individual patients' disease grade. The good performance of the algorithm on such highly correlated non-spherical datasets, which are typical for MRI derived image features, shows that Nuclear Potential Clustering is a valuable tool for automated data analysis, not only in the MRI domain.
Kamali, Tahereh; Stashuk, Daniel
2016-10-01
Robust and accurate segmentation of brain white matter (WM) fiber bundles assists in diagnosing and assessing progression or remission of neuropsychiatric diseases such as schizophrenia, autism and depression. Supervised segmentation methods are infeasible in most applications since generating gold standards is too costly. Hence, there is a growing interest in designing unsupervised methods. However, most conventional unsupervised methods require the number of clusters be known in advance which is not possible in most applications. The purpose of this study is to design an unsupervised segmentation algorithm for brain white matter fiber bundles which can automatically segment fiber bundles using intrinsic diffusion tensor imaging data information without considering any prior information or assumption about data distributions. Here, a new density based clustering algorithm called neighborhood distance entropy consistency (NDEC), is proposed which discovers natural clusters within data by simultaneously utilizing both local and global density information. The performance of NDEC is compared with other state of the art clustering algorithms including chameleon, spectral clustering, DBSCAN and k-means using Johns Hopkins University publicly available diffusion tensor imaging data. The performance of NDEC and other employed clustering algorithms were evaluated using dice ratio as an external evaluation criteria and density based clustering validation (DBCV) index as an internal evaluation metric. Across all employed clustering algorithms, NDEC obtained the highest average dice ratio (0.94) and DBCV value (0.71). NDEC can find clusters with arbitrary shapes and densities and consequently can be used for WM fiber bundle segmentation where there is no distinct boundary between various bundles. NDEC may also be used as an effective tool in other pattern recognition and medical diagnostic systems in which discovering natural clusters within data is a necessity. Copyright © 2016 Elsevier B.V. All rights reserved.
Decentralized cooperative TOA/AOA target tracking for hierarchical wireless sensor networks.
Chen, Ying-Chih; Wen, Chih-Yu
2012-11-08
This paper proposes a distributed method for cooperative target tracking in hierarchical wireless sensor networks. The concept of leader-based information processing is conducted to achieve object positioning, considering a cluster-based network topology. Random timers and local information are applied to adaptively select a sub-cluster for the localization task. The proposed energy-efficient tracking algorithm allows each sub-cluster member to locally estimate the target position with a Bayesian filtering framework and a neural networking model, and further performs estimation fusion in the leader node with the covariance intersection algorithm. This paper evaluates the merits and trade-offs of the protocol design towards developing more efficient and practical algorithms for object position estimation.
A Parametric k-Means Algorithm
Tarpey, Thaddeus
2007-01-01
Summary The k points that optimally represent a distribution (usually in terms of a squared error loss) are called the k principal points. This paper presents a computationally intensive method that automatically determines the principal points of a parametric distribution. Cluster means from the k-means algorithm are nonparametric estimators of principal points. A parametric k-means approach is introduced for estimating principal points by running the k-means algorithm on a very large simulated data set from a distribution whose parameters are estimated using maximum likelihood. Theoretical and simulation results are presented comparing the parametric k-means algorithm to the usual k-means algorithm and an example on determining sizes of gas masks is used to illustrate the parametric k-means algorithm. PMID:17917692
Molecular Isotopic Distribution Analysis (MIDAs) with Adjustable Mass Accuracy
NASA Astrophysics Data System (ADS)
Alves, Gelio; Ogurtsov, Aleksey Y.; Yu, Yi-Kuo
2014-01-01
In this paper, we present Molecular Isotopic Distribution Analysis (MIDAs), a new software tool designed to compute molecular isotopic distributions with adjustable accuracies. MIDAs offers two algorithms, one polynomial-based and one Fourier-transform-based, both of which compute molecular isotopic distributions accurately and efficiently. The polynomial-based algorithm contains few novel aspects, whereas the Fourier-transform-based algorithm consists mainly of improvements to other existing Fourier-transform-based algorithms. We have benchmarked the performance of the two algorithms implemented in MIDAs with that of eight software packages (BRAIN, Emass, Mercury, Mercury5, NeutronCluster, Qmass, JFC, IC) using a consensus set of benchmark molecules. Under the proposed evaluation criteria, MIDAs's algorithms, JFC, and Emass compute with comparable accuracy the coarse-grained (low-resolution) isotopic distributions and are more accurate than the other software packages. For fine-grained isotopic distributions, we compared IC, MIDAs's polynomial algorithm, and MIDAs's Fourier transform algorithm. Among the three, IC and MIDAs's polynomial algorithm compute isotopic distributions that better resemble their corresponding exact fine-grained (high-resolution) isotopic distributions. MIDAs can be accessed freely through a user-friendly web-interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html.
Molecular Isotopic Distribution Analysis (MIDAs) with adjustable mass accuracy.
Alves, Gelio; Ogurtsov, Aleksey Y; Yu, Yi-Kuo
2014-01-01
In this paper, we present Molecular Isotopic Distribution Analysis (MIDAs), a new software tool designed to compute molecular isotopic distributions with adjustable accuracies. MIDAs offers two algorithms, one polynomial-based and one Fourier-transform-based, both of which compute molecular isotopic distributions accurately and efficiently. The polynomial-based algorithm contains few novel aspects, whereas the Fourier-transform-based algorithm consists mainly of improvements to other existing Fourier-transform-based algorithms. We have benchmarked the performance of the two algorithms implemented in MIDAs with that of eight software packages (BRAIN, Emass, Mercury, Mercury5, NeutronCluster, Qmass, JFC, IC) using a consensus set of benchmark molecules. Under the proposed evaluation criteria, MIDAs's algorithms, JFC, and Emass compute with comparable accuracy the coarse-grained (low-resolution) isotopic distributions and are more accurate than the other software packages. For fine-grained isotopic distributions, we compared IC, MIDAs's polynomial algorithm, and MIDAs's Fourier transform algorithm. Among the three, IC and MIDAs's polynomial algorithm compute isotopic distributions that better resemble their corresponding exact fine-grained (high-resolution) isotopic distributions. MIDAs can be accessed freely through a user-friendly web-interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html.
Buried landmine detection using multivariate normal clustering
NASA Astrophysics Data System (ADS)
Duston, Brian M.
2001-10-01
A Bayesian classification algorithm is presented for discriminating buried land mines from buried and surface clutter in Ground Penetrating Radar (GPR) signals. This algorithm is based on multivariate normal (MVN) clustering, where feature vectors are used to identify populations (clusters) of mines and clutter objects. The features are extracted from two-dimensional images created from ground penetrating radar scans. MVN clustering is used to determine the number of clusters in the data and to create probability density models for target and clutter populations, producing the MVN clustering classifier (MVNCC). The Bayesian Information Criteria (BIC) is used to evaluate each model to determine the number of clusters in the data. An extension of the MVNCC allows the model to adapt to local clutter distributions by treating each of the MVN cluster components as a Poisson process and adaptively estimating the intensity parameters. The algorithm is developed using data collected by the Mine Hunter/Killer Close-In Detector (MH/K CID) at prepared mine lanes. The Mine Hunter/Killer is a prototype mine detecting and neutralizing vehicle developed for the U.S. Army to clear roads of anti-tank mines.
Machine-learned cluster identification in high-dimensional data.
Ultsch, Alfred; Lötsch, Jörn
2017-02-01
High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the used cluster algorithm works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogenously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM). Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM the distance structure in the high dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms including single linkage, Ward and k-means. Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. However, ESOM/U-matrix was correct in identifying clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high dimensional biomedical data. The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Ward, W. O. C.; Wilkinson, P. B.; Chambers, J. E.; Oxby, L. S.; Bai, L.
2014-04-01
A novel method for the effective identification of bedrock subsurface elevation from electrical resistivity tomography images is described. Identifying subsurface boundaries in the topographic data can be difficult due to smoothness constraints used in inversion, so a statistical population-based approach is used that extends previous work in calculating isoresistivity surfaces. The analysis framework involves a procedure for guiding a clustering approach based on the fuzzy c-means algorithm. An approximation of resistivity distributions, found using kernel density estimation, was utilized as a means of guiding the cluster centroids used to classify data. A fuzzy method was chosen over hard clustering due to uncertainty in hard edges in the topography data, and a measure of clustering uncertainty was identified based on the reciprocal of cluster membership. The algorithm was validated using a direct comparison of known observed bedrock depths at two 3-D survey sites, using real-time GPS information of exposed bedrock by quarrying on one site, and borehole logs at the other. Results show similarly accurate detection as a leading isosurface estimation method, and the proposed algorithm requires significantly less user input and prior site knowledge. Furthermore, the method is effectively dimension-independent and will scale to data of increased spatial dimensions without a significant effect on the runtime. A discussion on the results by automated versus supervised analysis is also presented.
Wang, Shichen; Wong, Debbie; Forrest, Kerrie; Allen, Alexandra; Chao, Shiaoman; Huang, Bevan E; Maccaferri, Marco; Salvi, Silvio; Milner, Sara G; Cattivelli, Luigi; Mastrangelo, Anna M; Whan, Alex; Stephen, Stuart; Barker, Gary; Wieseke, Ralf; Plieske, Joerg; International Wheat Genome Sequencing Consortium; Lillemo, Morten; Mather, Diane; Appels, Rudi; Dolferus, Rudy; Brown-Guedira, Gina; Korol, Abraham; Akhunova, Alina R; Feuillet, Catherine; Salse, Jerome; Morgante, Michele; Pozniak, Curtis; Luo, Ming-Cheng; Dvorak, Jan; Morell, Matthew; Dubcovsky, Jorge; Ganal, Martin; Tuberosa, Roberto; Lawley, Cindy; Mikoulitch, Ivan; Cavanagh, Colin; Edwards, Keith J; Hayden, Matthew; Akhunov, Eduard
2014-01-01
High-density single nucleotide polymorphism (SNP) genotyping arrays are a powerful tool for studying genomic patterns of diversity, inferring ancestral relationships between individuals in populations and studying marker–trait associations in mapping experiments. We developed a genotyping array including about 90 000 gene-associated SNPs and used it to characterize genetic variation in allohexaploid and allotetraploid wheat populations. The array includes a significant fraction of common genome-wide distributed SNPs that are represented in populations of diverse geographical origin. We used density-based spatial clustering algorithms to enable high-throughput genotype calling in complex data sets obtained for polyploid wheat. We show that these model-free clustering algorithms provide accurate genotype calling in the presence of multiple clusters including clusters with low signal intensity resulting from significant sequence divergence at the target SNP site or gene deletions. Assays that detect low-intensity clusters can provide insight into the distribution of presence–absence variation (PAV) in wheat populations. A total of 46 977 SNPs from the wheat 90K array were genetically mapped using a combination of eight mapping populations. The developed array and cluster identification algorithms provide an opportunity to infer detailed haplotype structure in polyploid wheat and will serve as an invaluable resource for diversity studies and investigating the genetic basis of trait variation in wheat. PMID:24646323
Weighted community detection and data clustering using message passing
NASA Astrophysics Data System (ADS)
Shi, Cheng; Liu, Yanchen; Zhang, Pan
2018-03-01
Grouping objects into clusters based on the similarities or weights between them is one of the most important problems in science and engineering. In this work, by extending message-passing algorithms and spectral algorithms proposed for an unweighted community detection problem, we develop a non-parametric method based on statistical physics, by mapping the problem to the Potts model at the critical temperature of spin-glass transition and applying belief propagation to solve the marginals corresponding to the Boltzmann distribution. Our algorithm is robust to over-fitting and gives a principled way to determine whether there are significant clusters in the data and how many clusters there are. We apply our method to different clustering tasks. In the community detection problem in weighted and directed networks, we show that our algorithm significantly outperforms existing algorithms. In the clustering problem, where the data were generated by mixture models in the sparse regime, we show that our method works all the way down to the theoretical limit of detectability and gives accuracy very close to that of the optimal Bayesian inference. In the semi-supervised clustering problem, our method only needs several labels to work perfectly in classic datasets. Finally, we further develop Thouless-Anderson-Palmer equations which heavily reduce the computation complexity in dense networks but give almost the same performance as belief propagation.
Tang, Hongying; Cheng, Yongbo; Zhao, Qin; Li, Baoqing; Yuan, Xiaobing
2017-01-01
Routing protocols based on topology control are significantly important for improving network longevity in wireless sensor networks (WSNs). Traditionally, some WSN routing protocols distribute uneven network traffic load to sensor nodes, which is not optimal for improving network longevity. Differently to conventional WSN routing protocols, we propose a dynamic hierarchical protocol based on combinatorial optimization (DHCO) to balance energy consumption of sensor nodes and to improve WSN longevity. For each sensor node, the DHCO algorithm obtains the optimal route by establishing a feasible routing set instead of selecting the cluster head or the next hop node. The process of obtaining the optimal route can be formulated as a combinatorial optimization problem. Specifically, the DHCO algorithm is carried out by the following procedures. It employs a hierarchy-based connection mechanism to construct a hierarchical network structure in which each sensor node is assigned to a special hierarchical subset; it utilizes the combinatorial optimization theory to establish the feasible routing set for each sensor node, and takes advantage of the maximum–minimum criterion to obtain their optimal routes to the base station. Various results of simulation experiments show effectiveness and superiority of the DHCO algorithm in comparison with state-of-the-art WSN routing algorithms, including low-energy adaptive clustering hierarchy (LEACH), hybrid energy-efficient distributed clustering (HEED), genetic protocol-based self-organizing network clustering (GASONeC), and double cost function-based routing (DCFR) algorithms. PMID:28753962
Textural defect detect using a revised ant colony clustering algorithm
NASA Astrophysics Data System (ADS)
Zou, Chao; Xiao, Li; Wang, Bingwen
2007-11-01
We propose a totally novel method based on a revised ant colony clustering algorithm (ACCA) to explore the topic of textural defect detection. In this algorithm, our efforts are mainly made on the definition of local irregularity measurement and the implementation of the revised ACCA. The local irregular measurement defined evaluates the local textural inconsistency of each pixel against their mini-environment. In our revised ACCA, the behaviors of each ant are divided into two steps: release pheromone and act. The quantity of pheromone released is proportional to the irregularity measurement; the actions of the ants to act next are chosen independently of each other in a stochastic way according to some evaluated heuristic knowledge. The independency of ants implies the inherent parallel computation architecture of this algorithm. We apply the proposed method in some typical textural images with defects. From the series of pheromone distribution map (PDM), it can be clearly seen that the pheromone distribution approaches the textual defects gradually. By some post-processing, the final distribution of pheromone can demonstrate the shape and area of the defects well.
Unsupervised spike sorting based on discriminative subspace learning.
Keshtkaran, Mohammad Reza; Yang, Zhi
2014-01-01
Spike sorting is a fundamental preprocessing step for many neuroscience studies which rely on the analysis of spike trains. In this paper, we present two unsupervised spike sorting algorithms based on discriminative subspace learning. The first algorithm simultaneously learns the discriminative feature subspace and performs clustering. It uses histogram of features in the most discriminative projection to detect the number of neurons. The second algorithm performs hierarchical divisive clustering that learns a discriminative 1-dimensional subspace for clustering in each level of the hierarchy until achieving almost unimodal distribution in the subspace. The algorithms are tested on synthetic and in-vivo data, and are compared against two widely used spike sorting methods. The comparative results demonstrate that our spike sorting methods can achieve substantially higher accuracy in lower dimensional feature space, and they are highly robust to noise. Moreover, they provide significantly better cluster separability in the learned subspace than in the subspace obtained by principal component analysis or wavelet transform.
Eyler, Lauren; Hubbard, Alan; Juillard, Catherine
2016-10-01
Low and middle-income countries (LMICs) and the world's poor bear a disproportionate share of the global burden of injury. Data regarding disparities in injury are vital to inform injury prevention and trauma systems strengthening interventions targeted towards vulnerable populations, but are limited in LMICs. We aim to facilitate injury disparities research by generating a standardized methodology for assessing economic status in resource-limited country trauma registries where complex metrics such as income, expenditures, and wealth index are infeasible to assess. To address this need, we developed a cluster analysis-based algorithm for generating simple population-specific metrics of economic status using nationally representative Demographic and Health Surveys (DHS) household assets data. For a limited number of variables, g, our algorithm performs weighted k-medoids clustering of the population using all combinations of g asset variables and selects the combination of variables and number of clusters that maximize average silhouette width (ASW). In simulated datasets containing both randomly distributed variables and "true" population clusters defined by correlated categorical variables, the algorithm selected the correct variable combination and appropriate cluster numbers unless variable correlation was very weak. When used with 2011 Cameroonian DHS data, our algorithm identified twenty economic clusters with ASW 0.80, indicating well-defined population clusters. This economic model for assessing health disparities will be used in the new Cameroonian six-hospital centralized trauma registry. By describing our standardized methodology and algorithm for generating economic clustering models, we aim to facilitate measurement of health disparities in other trauma registries in resource-limited countries. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
NASA Technical Reports Server (NTRS)
Lennington, R. K.; Malek, H.
1978-01-01
A clustering method, CLASSY, was developed, which alternates maximum likelihood iteration with a procedure for splitting, combining, and eliminating the resulting statistics. The method maximizes the fit of a mixture of normal distributions to the observed first through fourth central moments of the data and produces an estimate of the proportions, means, and covariances in this mixture. The mathematical model which is the basic for CLASSY and the actual operation of the algorithm is described. Data comparing the performances of CLASSY and ISOCLS on simulated and actual LACIE data are presented.
Cooperative network clustering and task allocation for heterogeneous small satellite network
NASA Astrophysics Data System (ADS)
Qin, Jing
The research of small satellite has emerged as a hot topic in recent years because of its economical prospects and convenience in launching and design. Due to the size and energy constraints of small satellites, forming a small satellite network(SSN) in which all the satellites cooperate with each other to finish tasks is an efficient and effective way to utilize them. In this dissertation, I designed and evaluated a weight based dominating set clustering algorithm, which efficiently organizes the satellites into stable clusters. The traditional clustering algorithms of large monolithic satellite networks, such as formation flying and satellite swarm, are often limited on automatic formation of clusters. Therefore, a novel Distributed Weight based Dominating Set(DWDS) clustering algorithm is designed to address the clustering problems in the stochastically deployed SSNs. Considering the unique features of small satellites, this algorithm is able to form the clusters efficiently and stably. In this algorithm, satellites are separated into different groups according to their spatial characteristics. A minimum dominating set is chosen as the candidate cluster head set based on their weights, which is a weighted combination of residual energy and connection degree. Then the cluster heads admit new neighbors that accept their invitations into the cluster, until the maximum cluster size is reached. Evaluated by the simulation results, in a SSN with 200 to 800 nodes, the algorithm is able to efficiently cluster more than 90% of nodes in 3 seconds. The Deadline Based Resource Balancing (DBRB) task allocation algorithm is designed for efficient task allocations in heterogeneous LEO small satellite networks. In the task allocation process, the dispatcher needs to consider the deadlines of the tasks as well as the residue energy of different resources for best energy utilization. We assume the tasks adopt a Map-Reduce framework, in which a task can consist of multiple subtasks. The DBRB algorithm is deployed on the head node of a cluster. It gathers the status from each cluster member and calculates their Node Importance Factors (NIFs) from the carried resources, residue power and compute capacity. The algorithm calculates the number of concurrent subtasks based on the deadlines, and allocates the subtasks to the nodes according to their NIF values. The simulation results show that when cluster members carry multiple resources, resource are more balanced and rare resources serve longer in DBRB than in the Earliest Deadline First algorithm. We also show that the algorithm performs well in service isolation by serving multiple tasks with different deadlines. Moreover, the average task response time with various cluster size settings is well controlled within deadlines as well. Except non-realtime tasks, small satellites may execute realtime tasks as well. The location-dependent tasks, such as image capturing, data transmission and remote sensing tasks are realtime tasks that are required to be started / finished on specific time. The resource energy balancing algorithm for realtime and non-realtime mixed workload is developed to efficiently schedule the tasks for best system performance. It calculates the residue energy for each resource type and tries to preserve resources and node availability when distributing tasks. Non-realtime tasks can be preempted by realtime tasks to provide better QoS to realtime tasks. I compared the performance of proposed algorithm with a random-priority scheduling algorithm, with only realtime tasks, non-realtime tasks and mixed tasks. It shows the resource energy reservation algorithm outperforms the latter one with both balanced and imbalanced workloads. Although the resource energy balancing task allocation algorithm for mixed workload provides preemption mechanism for realtime tasks, realtime tasks can still fail due to resource exhaustion. For LEO small satellite flies around the earth on stable orbits, the location-dependent realtime tasks can be considered as periodical tasks. Therefore, it is possible to reserve energy for these realtime tasks. The resource energy reservation algorithm preserves energy for the realtime tasks when the execution routine of periodical realtime tasks is known. In order to reserve energy for tasks starting very early in each period that the node does not have enough energy charged, an energy wrapping mechanism is also designed to calculate the residue energy from the previous period. The simulation results show that without energy reservation, realtime task failure rate can reach more than 60% when the workload is highly imbalanced. In contrast, the resource energy reservation produces zero RT task failures and leads to equal or better aggregate system throughput than the non-reservation algorithm. The proposed algorithm also preserves more energy because it avoids task preemption. (Abstract shortened by ProQuest.).
Open source clustering software.
de Hoon, M J L; Imoto, S; Nolan, J; Miyano, S
2004-06-12
We have implemented k-means clustering, hierarchical clustering and self-organizing maps in a single multipurpose open-source library of C routines, callable from other C and C++ programs. Using this library, we have created an improved version of Michael Eisen's well-known Cluster program for Windows, Mac OS X and Linux/Unix. In addition, we generated a Python and a Perl interface to the C Clustering Library, thereby combining the flexibility of a scripting language with the speed of C. The C Clustering Library and the corresponding Python C extension module Pycluster were released under the Python License, while the Perl module Algorithm::Cluster was released under the Artistic License. The GUI code Cluster 3.0 for Windows, Macintosh and Linux/Unix, as well as the corresponding command-line program, were released under the same license as the original Cluster code. The complete source code is available at http://bonsai.ims.u-tokyo.ac.jp/mdehoon/software/cluster. Alternatively, Algorithm::Cluster can be downloaded from CPAN, while Pycluster is also available as part of the Biopython distribution.
Banerjee, Arindam; Ghosh, Joydeep
2004-05-01
Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered as a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produced high-quality and well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all the three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques. Index Terms-Balanced clustering, expectation maximization (EM), frequency-sensitive competitive learning (FSCL), high-dimensional clustering, kmeans, normalized data, scalable clustering, streaming data, text clustering.
A system for learning statistical motion patterns.
Hu, Weiming; Xiao, Xuejuan; Fu, Zhouyu; Xie, Dan; Tan, Tieniu; Maybank, Steve
2006-09-01
Analysis of motion patterns is an effective approach for anomaly detection and behavior prediction. Current approaches for the analysis of motion patterns depend on known scenes, where objects move in predefined ways. It is highly desirable to automatically construct object motion patterns which reflect the knowledge of the scene. In this paper, we present a system for automatically learning motion patterns for anomaly detection and behavior prediction based on a proposed algorithm for robustly tracking multiple objects. In the tracking algorithm, foreground pixels are clustered using a fast accurate fuzzy K-means algorithm. Growing and prediction of the cluster centroids of foreground pixels ensure that each cluster centroid is associated with a moving object in the scene. In the algorithm for learning motion patterns, trajectories are clustered hierarchically using spatial and temporal information and then each motion pattern is represented with a chain of Gaussian distributions. Based on the learned statistical motion patterns, statistical methods are used to detect anomalies and predict behaviors. Our system is tested using image sequences acquired, respectively, from a crowded real traffic scene and a model traffic scene. Experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning motion patterns, and the encouraging performance of algorithms for anomaly detection and behavior prediction.
Effects of Combined Stellar Feedback on Star Formation in Stellar Clusters
NASA Astrophysics Data System (ADS)
Wall, Joshua Edward; McMillan, Stephen; Pellegrino, Andrew; Mac Low, Mordecai; Klessen, Ralf; Portegies Zwart, Simon
2018-01-01
We present results of hybrid MHD+N-body simulations of star cluster formation and evolution including self consistent feedback from the stars in the form of radiation, winds, and supernovae from all stars more massive than 7 solar masses. The MHD is modeled with the adaptive mesh refinement code FLASH, while the N-body computations are done with a direct algorithm. Radiation is modeled using ray tracing along long characteristics in directions distributed using the HEALPIX algorithm, and causes ionization and momentum deposition, while winds and supernova conserve momentum and energy during injection. Stellar evolution is followed using power-law fits to evolution models in SeBa. We use a gravity bridge within the AMUSE framework to couple the N-body dynamics of the stars to the gas dynamics in FLASH. Feedback from the massive stars alters the structure of young clusters as gas ejection occurs. We diagnose this behavior by distinguishing between fractal distribution and central clustering using a Q parameter computed from the minimum spanning tree of each model cluster. Global effects of feedback in our simulations will also be discussed.
An Eccentricity Based Data Routing Protocol with Uniform Node Distribution in 3D WSN.
Hosen, A S M Sanwar; Cho, Gi Hwan; Ra, In-Ho
2017-09-16
Due to nonuniform node distribution, the energy consumption of nodes are imbalanced in clustering-based wireless sensor networks (WSNs). It might have more impact when nodes are deployed in a three-dimensional (3D) environment. In this regard, we propose the eccentricity based data routing (EDR) protocol in a 3D WSN with uniform node distribution. It includes network partitions called 3D subspaces/clusters of equal member nodes, an energy-efficient routing centroid (RC) nodes election and data routing algorithm. The RC nodes election conducts in a quasi-static nature until a certain period unlike the periodic cluster heads election of typical clustering-based routing. It not only reduces the energy consumption of nodes during the election phase, but also in intra-communication. At the same time, the routing algorithm selects a forwarding node in such a way that balances the energy consumption among RC nodes and reduces the number of hops towards the sink. The simulation results validate and ensure the performance supremacy of the EDR protocol compared to existing protocols in terms of various metrics such as steady state and network lifetime in particular. Meanwhile, the results show the EDR is more robust in uniform node distribution compared to nonuniform.
An Eccentricity Based Data Routing Protocol with Uniform Node Distribution in 3D WSN
Hosen, A. S. M. Sanwar; Cho, Gi Hwan; Ra, In-Ho
2017-01-01
Due to nonuniform node distribution, the energy consumption of nodes are imbalanced in clustering-based wireless sensor networks (WSNs). It might have more impact when nodes are deployed in a three-dimensional (3D) environment. In this regard, we propose the eccentricity based data routing (EDR) protocol in a 3D WSN with uniform node distribution. It includes network partitions called 3D subspaces/clusters of equal member nodes, an energy-efficient routing centroid (RC) nodes election and data routing algorithm. The RC nodes election conducts in a quasi-static nature until a certain period unlike the periodic cluster heads election of typical clustering-based routing. It not only reduces the energy consumption of nodes during the election phase, but also in intra-communication. At the same time, the routing algorithm selects a forwarding node in such a way that balances the energy consumption among RC nodes and reduces the number of hops towards the sink. The simulation results validate and ensure the performance supremacy of the EDR protocol compared to existing protocols in terms of various metrics such as steady state and network lifetime in particular. Meanwhile, the results show the EDR is more robust in uniform node distribution compared to nonuniform. PMID:28926958
NASA Astrophysics Data System (ADS)
Dayananda, Karanam Ravichandran; Straub, Jeremy
2017-05-01
This paper proposes a new hybrid algorithm for security, which incorporates both distributed and hierarchal approaches. It uses a mobile data collector (MDC) to collect information in order to save energy of sensor nodes in a wireless sensor network (WSN) as, in most networks, these sensor nodes have limited energy. Wireless sensor networks are prone to security problems because, among other things, it is possible to use a rogue sensor node to eavesdrop on or alter the information being transmitted. To prevent this, this paper introduces a security algorithm for MDC-based WSNs. A key use of this algorithm is to protect the confidentiality of the information sent by the sensor nodes. The sensor nodes are deployed in a random fashion and form group structures called clusters. Each cluster has a cluster head. The cluster head collects data from the other nodes using the time-division multiple access protocol. The sensor nodes send their data to the cluster head for transmission to the base station node for further processing. The MDC acts as an intermediate node between the cluster head and base station. The MDC, using its dynamic acyclic graph path, collects the data from the cluster head and sends it to base station. This approach is useful for applications including warfighting, intelligent building and medicine. To assess the proposed system, the paper presents a comparison of its performance with other approaches and algorithms that can be used for similar purposes.
Cluster compression algorithm: A joint clustering/data compression concept
NASA Technical Reports Server (NTRS)
Hilbert, E. E.
1977-01-01
The Cluster Compression Algorithm (CCA), which was developed to reduce costs associated with transmitting, storing, distributing, and interpreting LANDSAT multispectral image data is described. The CCA is a preprocessing algorithm that uses feature extraction and data compression to more efficiently represent the information in the image data. The format of the preprocessed data enables simply a look-up table decoding and direct use of the extracted features to reduce user computation for either image reconstruction, or computer interpretation of the image data. Basically, the CCA uses spatially local clustering to extract features from the image data to describe spectral characteristics of the data set. In addition, the features may be used to form a sequence of scalar numbers that define each picture element in terms of the cluster features. This sequence, called the feature map, is then efficiently represented by using source encoding concepts. Various forms of the CCA are defined and experimental results are presented to show trade-offs and characteristics of the various implementations. Examples are provided that demonstrate the application of the cluster compression concept to multi-spectral images from LANDSAT and other sources.
Privacy Preserving Nearest Neighbor Search
NASA Astrophysics Data System (ADS)
Shaneck, Mark; Kim, Yongdae; Kumar, Vipin
Data mining is frequently obstructed by privacy concerns. In many cases data is distributed, and bringing the data together in one place for analysis is not possible due to privacy laws (e.g. HIPAA) or policies. Privacy preserving data mining techniques have been developed to address this issue by providing mechanisms to mine the data while giving certain privacy guarantees. In this chapter we address the issue of privacy preserving nearest neighbor search, which forms the kernel of many data mining applications. To this end, we present a novel algorithm based on secure multiparty computation primitives to compute the nearest neighbors of records in horizontally distributed data. We show how this algorithm can be used in three important data mining algorithms, namely LOF outlier detection, SNN clustering, and kNN classification. We prove the security of these algorithms under the semi-honest adversarial model, and describe methods that can be used to optimize their performance. Keywords: Privacy Preserving Data Mining, Nearest Neighbor Search, Outlier Detection, Clustering, Classification, Secure Multiparty Computation
The Detection and Statistics of Giant Arcs behind CLASH Clusters
NASA Astrophysics Data System (ADS)
Xu, Bingxiao; Postman, Marc; Meneghetti, Massimo; Seitz, Stella; Zitrin, Adi; Merten, Julian; Maoz, Dani; Frye, Brenda; Umetsu, Keiichi; Zheng, Wei; Bradley, Larry; Vega, Jesus; Koekemoer, Anton
2016-02-01
We developed an algorithm to find and characterize gravitationally lensed galaxies (arcs) to perform a comparison of the observed and simulated arc abundance. Observations are from the Cluster Lensing And Supernova survey with Hubble (CLASH). Simulated CLASH images are created using the MOKA package and also clusters selected from the high-resolution, hydrodynamical simulations, MUSIC, over the same mass and redshift range as the CLASH sample. The algorithm's arc elongation accuracy, completeness, and false positive rate are determined and used to compute an estimate of the true arc abundance. We derive a lensing efficiency of 4 ± 1 arcs (with length ≥6″ and length-to-width ratio ≥7) per cluster for the X-ray-selected CLASH sample, 4 ± 1 arcs per cluster for the MOKA-simulated sample, and 3 ± 1 arcs per cluster for the MUSIC-simulated sample. The observed and simulated arc statistics are in full agreement. We measure the photometric redshifts of all detected arcs and find a median redshift zs = 1.9 with 33% of the detected arcs having zs > 3. We find that the arc abundance does not depend strongly on the source redshift distribution but is sensitive to the mass distribution of the dark matter halos (e.g., the c-M relation). Our results show that consistency between the observed and simulated distributions of lensed arc sizes and axial ratios can be achieved by using cluster-lensing simulations that are carefully matched to the selection criteria used in the observations.
Distributed Multihop Clustering Approach for Wireless Sensor Networks
NASA Astrophysics Data System (ADS)
Israr, Nauman; Awan, Irfan
Prolonging the life time of Wireless Sensor Networks (WSNs) has been the focus of current research. One of the issues that needs to be addressed along with prolonging the network life time is to ensure uniform energy consumption across the network in WSNs especially in case of random network deployment. Cluster based routing algorithms are believed to be the best choice for WSNs because they work on the principle of divide and conquer and also improve the network life time considerably compared to flat based routing schemes. In this paper we propose a new routing strategy based on two layers clustering which exploits the redundancy property of the network in order to minimise duplicate data transmission and also make the intercluster and intracluster communication multihop. The proposed algorithm makes use of the nodes in a network whose area coverage is covered by the neighbouring nodes. These nodes are marked as temporary cluster heads and later use these temporary cluster heads randomly for multihop intercluster communication. Performance studies indicate that the proposed algorithm solves effectively the problem of load balancing across the network and is more energy efficient compared to the enhanced version of widely used Leach algorithm.
Are judgments a form of data clustering? Reexamining contrast effects with the k-means algorithm.
Boillaud, Eric; Molina, Guylaine
2015-04-01
A number of theories have been proposed to explain in precise mathematical terms how statistical parameters and sequential properties of stimulus distributions affect category ratings. Various contextual factors such as the mean, the midrange, and the median of the stimuli; the stimulus range; the percentile rank of each stimulus; and the order of appearance have been assumed to influence judgmental contrast. A data clustering reinterpretation of judgmental relativity is offered wherein the influence of the initial choice of centroids on judgmental contrast involves 2 combined frequency and consistency tendencies. Accounts of the k-means algorithm are provided, showing good agreement with effects observed on multiple distribution shapes and with a variety of interaction effects relating to the number of stimuli, the number of response categories, and the method of skewing. Experiment 1 demonstrates that centroid initialization accounts for contrast effects obtained with stretched distributions. Experiment 2 demonstrates that the iterative convergence inherent to the k-means algorithm accounts for the contrast reduction observed across repeated blocks of trials. The concept of within-cluster variance minimization is discussed, as is the applicability of a backward k-means calculation method for inferring, from empirical data, the values of the centroids that would serve as a representation of the judgmental context. (c) 2015 APA, all rights reserved.
Design and implementation of streaming media server cluster based on FFMpeg.
Zhao, Hong; Zhou, Chun-long; Jin, Bao-zhao
2015-01-01
Poor performance and network congestion are commonly observed in the streaming media single server system. This paper proposes a scheme to construct a streaming media server cluster system based on FFMpeg. In this scheme, different users are distributed to different servers according to their locations and the balance among servers is maintained by the dynamic load-balancing algorithm based on active feedback. Furthermore, a service redirection algorithm is proposed to improve the transmission efficiency of streaming media data. The experiment results show that the server cluster system has significantly alleviated the network congestion and improved the performance in comparison with the single server system.
Design and Implementation of Streaming Media Server Cluster Based on FFMpeg
Zhao, Hong; Zhou, Chun-long; Jin, Bao-zhao
2015-01-01
Poor performance and network congestion are commonly observed in the streaming media single server system. This paper proposes a scheme to construct a streaming media server cluster system based on FFMpeg. In this scheme, different users are distributed to different servers according to their locations and the balance among servers is maintained by the dynamic load-balancing algorithm based on active feedback. Furthermore, a service redirection algorithm is proposed to improve the transmission efficiency of streaming media data. The experiment results show that the server cluster system has significantly alleviated the network congestion and improved the performance in comparison with the single server system. PMID:25734187
Mwangi, Benson; Soares, Jair C; Hasan, Khader M
2014-10-30
Neuroimaging machine learning studies have largely utilized supervised algorithms - meaning they require both neuroimaging scan data and corresponding target variables (e.g. healthy vs. diseased) to be successfully 'trained' for a prediction task. Noticeably, this approach may not be optimal or possible when the global structure of the data is not well known and the researcher does not have an a priori model to fit the data. We set out to investigate the utility of an unsupervised machine learning technique; t-distributed stochastic neighbour embedding (t-SNE) in identifying 'unseen' sample population patterns that may exist in high-dimensional neuroimaging data. Multimodal neuroimaging scans from 92 healthy subjects were pre-processed using atlas-based methods, integrated and input into the t-SNE algorithm. Patterns and clusters discovered by the algorithm were visualized using a 2D scatter plot and further analyzed using the K-means clustering algorithm. t-SNE was evaluated against classical principal component analysis. Remarkably, based on unlabelled multimodal scan data, t-SNE separated study subjects into two very distinct clusters which corresponded to subjects' gender labels (cluster silhouette index value=0.79). The resulting clusters were used to develop an unsupervised minimum distance clustering model which identified 93.5% of subjects' gender. Notably, from a neuropsychiatric perspective this method may allow discovery of data-driven disease phenotypes or sub-types of treatment responders. Copyright © 2014 Elsevier B.V. All rights reserved.
The distribution of early- and late-type galaxies in the Coma cluster
NASA Technical Reports Server (NTRS)
Doi, M.; Fukugita, M.; Okamura, S.; Turner, E. L.
1995-01-01
The spatial distribution and the morohology-density relation of Coma cluster galaxies are studied using a new homogeneous photmetric sample of 450 galaxies down to B = 16.0 mag with quantitative morphology classification. The sample covers a wide area (10 deg X 10 deg), extending well beyond the Coma cluster. Morphological classifications into early- (E+SO) and late-(S) type galaxies are made by an automated algorithm using simple photometric parameters, with which the misclassification rate is expected to be approximately 10% with respect to early and late types given in the Third Reference Catalogue of Bright Galaxies. The flattened distribution of Coma cluster galaxies, as noted in previous studies, is most conspicuously seen if the early-type galaxies are selected. Early-type galaxies are distributed in a thick filament extended from the NE to the WSW direction that delineates a part of large-scale structure. Spiral galaxies show a distribution with a modest density gradient toward the cluster center; at least bright spiral galaxies are present close to the center of the Coma cluster. We also examine the morphology-density relation for the Coma cluster including its surrounding regions.
Robust spike sorting of retinal ganglion cells tuned to spot stimuli.
Ghahari, Alireza; Badea, Tudor C
2016-08-01
We propose an automatic spike sorting approach for the data recorded from a microelectrode array during visual stimulation of wild type retinas with tiled spot stimuli. The approach first detects individual spikes per electrode by their signature local minima. With the mixture probability distribution of the local minima estimated afterwards, it applies a minimum-squared-error clustering algorithm to sort the spikes into different clusters. A template waveform for each cluster per electrode is defined, and a number of reliability tests are performed on it and its corresponding spikes. Finally, a divisive hierarchical clustering algorithm is used to deal with the correlated templates per cluster type across all the electrodes. According to the measures of performance of the spike sorting approach, it is robust even in the cases of recordings with low signal-to-noise ratio.
Consensus-Based Sorting of Neuronal Spike Waveforms
Fournier, Julien; Mueller, Christian M.; Shein-Idelson, Mark; Hemberger, Mike
2016-01-01
Optimizing spike-sorting algorithms is difficult because sorted clusters can rarely be checked against independently obtained “ground truth” data. In most spike-sorting algorithms in use today, the optimality of a clustering solution is assessed relative to some assumption on the distribution of the spike shapes associated with a particular single unit (e.g., Gaussianity) and by visual inspection of the clustering solution followed by manual validation. When the spatiotemporal waveforms of spikes from different cells overlap, the decision as to whether two spikes should be assigned to the same source can be quite subjective, if it is not based on reliable quantitative measures. We propose a new approach, whereby spike clusters are identified from the most consensual partition across an ensemble of clustering solutions. Using the variability of the clustering solutions across successive iterations of the same clustering algorithm (template matching based on K-means clusters), we estimate the probability of spikes being clustered together and identify groups of spikes that are not statistically distinguishable from one another. Thus, we identify spikes that are most likely to be clustered together and therefore correspond to consistent spike clusters. This method has the potential advantage that it does not rely on any model of the spike shapes. It also provides estimates of the proportion of misclassified spikes for each of the identified clusters. We tested our algorithm on several datasets for which there exists a ground truth (simultaneous intracellular data), and show that it performs close to the optimum reached by a support vector machine trained on the ground truth. We also show that the estimated rate of misclassification matches the proportion of misclassified spikes measured from the ground truth data. PMID:27536990
Consensus-Based Sorting of Neuronal Spike Waveforms.
Fournier, Julien; Mueller, Christian M; Shein-Idelson, Mark; Hemberger, Mike; Laurent, Gilles
2016-01-01
Optimizing spike-sorting algorithms is difficult because sorted clusters can rarely be checked against independently obtained "ground truth" data. In most spike-sorting algorithms in use today, the optimality of a clustering solution is assessed relative to some assumption on the distribution of the spike shapes associated with a particular single unit (e.g., Gaussianity) and by visual inspection of the clustering solution followed by manual validation. When the spatiotemporal waveforms of spikes from different cells overlap, the decision as to whether two spikes should be assigned to the same source can be quite subjective, if it is not based on reliable quantitative measures. We propose a new approach, whereby spike clusters are identified from the most consensual partition across an ensemble of clustering solutions. Using the variability of the clustering solutions across successive iterations of the same clustering algorithm (template matching based on K-means clusters), we estimate the probability of spikes being clustered together and identify groups of spikes that are not statistically distinguishable from one another. Thus, we identify spikes that are most likely to be clustered together and therefore correspond to consistent spike clusters. This method has the potential advantage that it does not rely on any model of the spike shapes. It also provides estimates of the proportion of misclassified spikes for each of the identified clusters. We tested our algorithm on several datasets for which there exists a ground truth (simultaneous intracellular data), and show that it performs close to the optimum reached by a support vector machine trained on the ground truth. We also show that the estimated rate of misclassification matches the proportion of misclassified spikes measured from the ground truth data.
NASA Astrophysics Data System (ADS)
Franke, R.
2016-11-01
In many networks discovered in biology, medicine, neuroscience and other disciplines special properties like a certain degree distribution and hierarchical cluster structure (also called communities) can be observed as general organizing principles. Detecting the cluster structure of an unknown network promises to identify functional subdivisions, hierarchy and interactions on a mesoscale. It is not trivial choosing an appropriate detection algorithm because there are multiple network, cluster and algorithmic properties to be considered. Edges can be weighted and/or directed, clusters overlap or build a hierarchy in several ways. Algorithms differ not only in runtime, memory requirements but also in allowed network and cluster properties. They are based on a specific definition of what a cluster is, too. On the one hand, a comprehensive network creation model is needed to build a large variety of benchmark networks with different reasonable structures to compare algorithms. On the other hand, if a cluster structure is already known, it is desirable to separate effects of this structure from other network properties. This can be done with null model networks that mimic an observed cluster structure to improve statistics on other network features. A third important application is the general study of properties in networks with different cluster structures, possibly evolving over time. Currently there are good benchmark and creation models available. But what is left is a precise sandbox model to build hierarchical, overlapping and directed clusters for undirected or directed, binary or weighted complex random networks on basis of a sophisticated blueprint. This gap shall be closed by the model CHIMERA (Cluster Hierarchy Interconnection Model for Evaluation, Research and Analysis) which will be introduced and described here for the first time.
Anomaly clustering in hyperspectral images
NASA Astrophysics Data System (ADS)
Doster, Timothy J.; Ross, David S.; Messinger, David W.; Basener, William F.
2009-05-01
The topological anomaly detection algorithm (TAD) differs from other anomaly detection algorithms in that it uses a topological/graph-theoretic model for the image background instead of modeling the image with a Gaussian normal distribution. In the construction of the model, TAD produces a hard threshold separating anomalous pixels from background in the image. We build on this feature of TAD by extending the algorithm so that it gives a measure of the number of anomalous objects, rather than the number of anomalous pixels, in a hyperspectral image. This is done by identifying, and integrating, clusters of anomalous pixels via a graph theoretical method combining spatial and spectral information. The method is applied to a cluttered HyMap image and combines small groups of pixels containing like materials, such as those corresponding to rooftops and cars, into individual clusters. This improves visualization and interpretation of objects.
Energy Efficient and Stable Weight Based Clustering for Mobile Ad Hoc Networks
NASA Astrophysics Data System (ADS)
Bouk, Safdar H.; Sasase, Iwao
Recently several weighted clustering algorithms have been proposed, however, to the best of our knowledge; there is none that propagates weights to other nodes without weight message for leader election, normalizes node parameters and considers neighboring node parameters to calculate node weights. In this paper, we propose an Energy Efficient and Stable Weight Based Clustering (EE-SWBC) algorithm that elects cluster heads without sending any additional weight message. It propagates node parameters to its neighbors through neighbor discovery message (HELLO Message) and stores these parameters in neighborhood list. Each node normalizes parameters and efficiently calculates its own weight and the weights of neighboring nodes from that neighborhood table using Grey Decision Method (GDM). GDM finds the ideal solution (best node parameters in neighborhood list) and calculates node weights in comparison to the ideal solution. The node(s) with maximum weight (parameters closer to the ideal solution) are elected as cluster heads. In result, EE-SWBC fairly selects potential nodes with parameters closer to ideal solution with less overhead. Different performance metrics of EE-SWBC and Distributed Weighted Clustering Algorithm (DWCA) are compared through simulations. The simulation results show that EE-SWBC maintains fewer average numbers of stable clusters with minimum overhead, less energy consumption and fewer changes in cluster structure within network compared to DWCA.
Advances in Significance Testing for Cluster Detection
NASA Astrophysics Data System (ADS)
Coleman, Deidra Andrea
Over the past two decades, much attention has been given to data driven project goals such as the Human Genome Project and the development of syndromic surveillance systems. A major component of these types of projects is analyzing the abundance of data. Detecting clusters within the data can be beneficial as it can lead to the identification of specified sequences of DNA nucleotides that are related to important biological functions or the locations of epidemics such as disease outbreaks or bioterrorism attacks. Cluster detection techniques require efficient and accurate hypothesis testing procedures. In this dissertation, we improve upon the hypothesis testing procedures for cluster detection by enhancing distributional theory and providing an alternative method for spatial cluster detection using syndromic surveillance data. In Chapter 2, we provide an efficient method to compute the exact distribution of the number and coverage of h-clumps of a collection of words. This method involves defining a Markov chain using a minimal deterministic automaton to reduce the number of states needed for computation. We allow words of the collection to contain other words of the collection making the method more general. We use our method to compute the distributions of the number and coverage of h-clumps in the Chi motif of H. influenza.. In Chapter 3, we provide an efficient algorithm to compute the exact distribution of multiple window discrete scan statistics for higher-order, multi-state Markovian sequences. This algorithm involves defining a Markov chain to efficiently keep track of probabilities needed to compute p-values of the statistic. We use our algorithm to identify cases where the available approximation does not perform well. We also use our algorithm to detect unusual clusters of made free throw shots by National Basketball Association players during the 2009-2010 regular season. In Chapter 4, we give a procedure to detect outbreaks using syndromic surveillance data while controlling the Bayesian False Discovery Rate (BFDR). The procedure entails choosing an appropriate Bayesian model that captures the spatial dependency inherent in epidemiological data and considers all days of interest, selecting a test statistic based on a chosen measure that provides the magnitude of the maximumal spatial cluster for each day, and identifying a cutoff value that controls the BFDR for rejecting the collective null hypothesis of no outbreak over a collection of days for a specified region.We use our procedure to analyze botulism-like syndrome data collected by the North Carolina Disease Event Tracking and Epidemiologic Collection Tool (NC DETECT).
Large-Scale Cooperative Task Distribution on Peer-to-Peer Networks
2012-01-01
SUBTITLE Large-scale cooperative task distribution on peer-to-peer networks 5a. CONTRACT NUMBER 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 6...of agents, and each agent attempts to form a coalition with its most profitable partner. The second algorithm builds upon the Shapley for- mula [37...ters at the second layer. These Category Layer clusters each represent a single resource, and agents join one or more clusters based on their
Clustering single cells: a review of approaches on high-and low-depth single-cell RNA-seq data.
Menon, Vilas
2017-12-11
Advances in single-cell RNA-sequencing technology have resulted in a wealth of studies aiming to identify transcriptomic cell types in various biological systems. There are multiple experimental approaches to isolate and profile single cells, which provide different levels of cellular and tissue coverage. In addition, multiple computational strategies have been proposed to identify putative cell types from single-cell data. From a data generation perspective, recent single-cell studies can be classified into two groups: those that distribute reads shallowly over large numbers of cells and those that distribute reads more deeply over a smaller cell population. Although there are advantages to both approaches in terms of cellular and tissue coverage, it is unclear whether different computational cell type identification methods are better suited to one or the other experimental paradigm. This study reviews three cell type clustering algorithms, each representing one of three broad approaches, and finds that PCA-based algorithms appear most suited to low read depth data sets, whereas gene clustering-based and biclustering algorithms perform better on high read depth data sets. In addition, highly related cell classes are better distinguished by higher-depth data, given the same total number of reads; however, simultaneous discovery of distinct and similar types is better served by lower-depth, higher cell number data. Overall, this study suggests that the depth of profiling should be determined by initial assumptions about the diversity of cells in the population, and that the selection of clustering algorithm(s) is subsequently based on the depth of profiling will allow for better identification of putative transcriptomic cell types. © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com.
THE DETECTION AND STATISTICS OF GIANT ARCS BEHIND CLASH CLUSTERS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xu, Bingxiao; Zheng, Wei; Postman, Marc
We developed an algorithm to find and characterize gravitationally lensed galaxies (arcs) to perform a comparison of the observed and simulated arc abundance. Observations are from the Cluster Lensing And Supernova survey with Hubble (CLASH). Simulated CLASH images are created using the MOKA package and also clusters selected from the high-resolution, hydrodynamical simulations, MUSIC, over the same mass and redshift range as the CLASH sample. The algorithm's arc elongation accuracy, completeness, and false positive rate are determined and used to compute an estimate of the true arc abundance. We derive a lensing efficiency of 4 ± 1 arcs (with length ≥6″ andmore » length-to-width ratio ≥7) per cluster for the X-ray-selected CLASH sample, 4 ± 1 arcs per cluster for the MOKA-simulated sample, and 3 ± 1 arcs per cluster for the MUSIC-simulated sample. The observed and simulated arc statistics are in full agreement. We measure the photometric redshifts of all detected arcs and find a median redshift z{sub s} = 1.9 with 33% of the detected arcs having z{sub s} > 3. We find that the arc abundance does not depend strongly on the source redshift distribution but is sensitive to the mass distribution of the dark matter halos (e.g., the c–M relation). Our results show that consistency between the observed and simulated distributions of lensed arc sizes and axial ratios can be achieved by using cluster-lensing simulations that are carefully matched to the selection criteria used in the observations.« less
Clustering Module in OLAP for Horticultural Crops using SpagoBI
NASA Astrophysics Data System (ADS)
Putri, D.; Sitanggang, I. S.
2017-03-01
Horticultural crops data are organized by the Ministry of Agriculture, Republic of Indonesia. The data are presented annually in a tabular form and result a large data set. This situation makes users difficult to obtain summaries of horticultural crops data. This study aims to develop a clustering module in the SOLAP system for the distribution of horticultural crops in Indonesia and to visualize the results of clustering in a map using SpagoBI. The algorithm used for clustering is K-Means. Horticultural crops data include vegetables, ornamental plants, medicinal plants, and fruits from 2000 to 2013. The clustering module displays clustering results of horticultural crops in the form of text and table on SpagoBI. This module can also visualize the distribution of horticultural crops in the form of map on the HTML page. The application is expected to be useful for users in order to easily obtain summaries of the horticultural crops distribution data and its clusters. The summaries and clusters can be beneficial for the stakeholders to determine potential areas in Indonesia for horticultural crops.
Overlapping clusters for distributed computation.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mirrokni, Vahab; Andersen, Reid; Gleich, David F.
2010-11-01
Scalable, distributed algorithms must address communication problems. We investigate overlapping clusters, or vertex partitions that intersect, for graph computations. This setup stores more of the graph than required but then affords the ease of implementation of vertex partitioned algorithms. Our hope is that this technique allows us to reduce communication in a computation on a distributed graph. The motivation above draws on recent work in communication avoiding algorithms. Mohiyuddin et al. (SC09) design a matrix-powers kernel that gives rise to an overlapping partition. Fritzsche et al. (CSC2009) develop an overlapping clustering for a Schwarz method. Both techniques extend an initialmore » partitioning with overlap. Our procedure generates overlap directly. Indeed, Schwarz methods are commonly used to capitalize on overlap. Elsewhere, overlapping communities (Ahn et al, Nature 2009; Mishra et al. WAW2007) are now a popular model of structure in social networks. These have long been studied in statistics (Cole and Wishart, CompJ 1970). We present two types of results: (i) an estimated swapping probability {rho}{infinity}; and (ii) the communication volume of a parallel PageRank solution (link-following {alpha} = 0.85) using an additive Schwarz method. The volume ratio is the amount of extra storage for the overlap (2 means we store the graph twice). Below, as the ratio increases, the swapping probability and PageRank communication volume decreases.« less
Scaling Deep Learning on GPU and Knights Landing clusters
You, Yang; Buluc, Aydin; Demmel, James
2017-09-26
The speed of deep neural networks training has become a big bottleneck of deep learning research and development. For example, training GoogleNet by ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training process, the current deep learning systems heavily rely on the hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithm aspect, current distributed machine learningmore » systems are mainly designed for cloud systems. These methods are asynchronous because of the slow network and high fault-tolerance requirement on cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. Original EASGD used round-robin method for communication and updating. The communication is ordered by the machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster \\textcolor{black}{than} their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, resp.) in all the comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use some system-algorithm codesign techniques to scale up the algorithms. By reducing the percentage of communication from 87% to 14%, our Sync EASGD achieves 5.3x speedup over original EASGD on the same platform. We get 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.« less
Scaling Deep Learning on GPU and Knights Landing clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
You, Yang; Buluc, Aydin; Demmel, James
The speed of deep neural networks training has become a big bottleneck of deep learning research and development. For example, training GoogleNet by ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training process, the current deep learning systems heavily rely on the hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithm aspect, current distributed machine learningmore » systems are mainly designed for cloud systems. These methods are asynchronous because of the slow network and high fault-tolerance requirement on cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. Original EASGD used round-robin method for communication and updating. The communication is ordered by the machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster \\textcolor{black}{than} their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, resp.) in all the comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use some system-algorithm codesign techniques to scale up the algorithms. By reducing the percentage of communication from 87% to 14%, our Sync EASGD achieves 5.3x speedup over original EASGD on the same platform. We get 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation.« less
Automatic Regionalization Algorithm for Distributed State Estimation in Power Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Dexin; Yang, Liuqing; Florita, Anthony
The deregulation of the power system and the incorporation of generation from renewable energy sources recessitates faster state estimation in the smart grid. Distributed state estimation (DSE) has become a promising and scalable solution to this urgent demand. In this paper, we investigate the regionalization algorithms for the power system, a necessary step before distributed state estimation can be performed. To the best of the authors' knowledge, this is the first investigation on automatic regionalization (AR). We propose three spectral clustering based AR algorithms. Simulations show that our proposed algorithms outperform the two investigated manual regionalization cases. With the helpmore » of AR algorithms, we also show how the number of regions impacts the accuracy and convergence speed of the DSE and conclude that the number of regions needs to be chosen carefully to improve the convergence speed of DSEs.« less
Automatic Regionalization Algorithm for Distributed State Estimation in Power Systems: Preprint
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, Dexin; Yang, Liuqing; Florita, Anthony
The deregulation of the power system and the incorporation of generation from renewable energy sources recessitates faster state estimation in the smart grid. Distributed state estimation (DSE) has become a promising and scalable solution to this urgent demand. In this paper, we investigate the regionalization algorithms for the power system, a necessary step before distributed state estimation can be performed. To the best of the authors' knowledge, this is the first investigation on automatic regionalization (AR). We propose three spectral clustering based AR algorithms. Simulations show that our proposed algorithms outperform the two investigated manual regionalization cases. With the helpmore » of AR algorithms, we also show how the number of regions impacts the accuracy and convergence speed of the DSE and conclude that the number of regions needs to be chosen carefully to improve the convergence speed of DSEs.« less
A new solution-adaptive grid generation method for transonic airfoil flow calculations
NASA Technical Reports Server (NTRS)
Nakamura, S.; Holst, T. L.
1981-01-01
The clustering algorithm is controlled by a second-order, ordinary differential equation which uses the airfoil surface density gradient as a forcing function. The solution to this differential equation produces a surface grid distribution which is automatically clustered in regions with large gradients. The interior grid points are established from this surface distribution by using an interpolation scheme which is fast and retains the desirable properties of the original grid generated from the standard elliptic equation approach.
Fusion And Inference From Multiple And Massive Disparate Distributed Dynamic Data Sets
2017-07-01
principled methodology for two-sample graph testing; designed a provably almost-surely perfect vertex clustering algorithm for block model graphs; proved...3.7 Semi-Supervised Clustering Methodology ...................................................................... 9 3.8 Robust Hypothesis Testing...dimensional Euclidean space – allows the full arsenal of statistical and machine learning methodology for multivariate Euclidean data to be deployed for
Identifying irregularly shaped crime hot-spots using a multiobjective evolutionary algorithm
NASA Astrophysics Data System (ADS)
Wu, Xiaolan; Grubesic, Tony H.
2010-12-01
Spatial cluster detection techniques are widely used in criminology, geography, epidemiology, and other fields. In particular, spatial scan statistics are popular and efficient techniques for detecting areas of elevated crime or disease events. The majority of spatial scan approaches attempt to delineate geographic zones by evaluating the significance of clusters using likelihood ratio statistics tested with the Poisson distribution. While this can be effective, many scan statistics give preference to circular clusters, diminishing their ability to identify elongated and/or irregular shaped clusters. Although adjusting the shape of the scan window can mitigate some of these problems, both the significance of irregular clusters and their spatial structure must be accounted for in a meaningful way. This paper utilizes a multiobjective evolutionary algorithm to find clusters with maximum significance while quantitatively tracking their geographic structure. Crime data for the city of Cincinnati are utilized to demonstrate the advantages of the new approach and highlight its benefits versus more traditional scan statistics.
Depth data research of GIS based on clustering analysis algorithm
NASA Astrophysics Data System (ADS)
Xiong, Yan; Xu, Wenli
2018-03-01
The data of GIS have spatial distribution. Geographic data has both spatial characteristics and attribute characteristics, and also changes with time. Therefore, the amount of data is very large. Nowadays, many industries and departments in the society are using GIS. However, without proper data analysis and mining scheme, GIS will not exert its maximum effectiveness and will waste a lot of data. In this paper, we use the geographic information demand of a national security department as the experimental object, combining the characteristics of GIS data, taking into account the characteristics of time, space, attributes and so on, and using cluster analysis algorithm. We further study the mining scheme for depth data, and get the algorithm model. This algorithm can automatically classify sample data, and then carry out exploratory analysis. The research shows that the algorithm model and the information mining scheme can quickly find hidden depth information from the surface data of GIS, thus improving the efficiency of the security department. This algorithm can also be extended to other fields.
Archambeau, Cédric; Verleysen, Michel
2007-01-01
A new variational Bayesian learning algorithm for Student-t mixture models is introduced. This algorithm leads to (i) robust density estimation, (ii) robust clustering and (iii) robust automatic model selection. Gaussian mixture models are learning machines which are based on a divide-and-conquer approach. They are commonly used for density estimation and clustering tasks, but are sensitive to outliers. The Student-t distribution has heavier tails than the Gaussian distribution and is therefore less sensitive to any departure of the empirical distribution from Gaussianity. As a consequence, the Student-t distribution is suitable for constructing robust mixture models. In this work, we formalize the Bayesian Student-t mixture model as a latent variable model in a different way from Svensén and Bishop [Svensén, M., & Bishop, C. M. (2005). Robust Bayesian mixture modelling. Neurocomputing, 64, 235-252]. The main difference resides in the fact that it is not necessary to assume a factorized approximation of the posterior distribution on the latent indicator variables and the latent scale variables in order to obtain a tractable solution. Not neglecting the correlations between these unobserved random variables leads to a Bayesian model having an increased robustness. Furthermore, it is expected that the lower bound on the log-evidence is tighter. Based on this bound, the model complexity, i.e. the number of components in the mixture, can be inferred with a higher confidence.
NASA Astrophysics Data System (ADS)
Moazami Goodarzi, Hamed; Kazemi, Mohammad Hosein
2018-05-01
Microgrid (MG) clustering is regarded as an important driver in improving the robustness of MGs. However, little research has been conducted on providing appropriate MG clustering. This article addresses this shortfall. It proposes a novel multi-objective optimization approach for finding optimal clustering of autonomous MGs by focusing on variables such as distributed generation (DG) droop parameters, the location and capacity of DG units, renewable energy sources, capacitors and powerline transmission. Power losses are minimized and voltage stability is improved while virtual cut-set lines with minimum power transmission for clustering MGs are obtained. A novel chaotic grey wolf optimizer (CGWO) algorithm is applied to solve the proposed multi-objective problem. The performance of the approach is evaluated by utilizing a 69-bus MG in several scenarios.
A new neuro-fuzzy training algorithm for identifying dynamic characteristics of smart dampers
NASA Astrophysics Data System (ADS)
Dzung Nguyen, Sy; Choi, Seung-Bok
2012-08-01
This paper proposes a new algorithm, named establishing neuro-fuzzy system (ENFS), to identify dynamic characteristics of smart dampers such as magnetorheological (MR) and electrorheological (ER) dampers. In the ENFS, data clustering is performed based on the proposed algorithm named partitioning data space (PDS). Firstly, the PDS builds data clusters in joint input-output data space with appropriate constraints. The role of these constraints is to create reasonable data distribution in clusters. The ENFS then uses these clusters to perform the following tasks. Firstly, the fuzzy sets expressing characteristics of data clusters are established. The structure of the fuzzy sets is adjusted to be suitable for features of the data set. Secondly, an appropriate structure of neuro-fuzzy (NF) expressed by an optimal number of labeled data clusters and the fuzzy-set groups is determined. After the ENFS is introduced, its effectiveness is evaluated by a prediction-error-comparative work between the proposed method and some other methods in identifying numerical data sets such as ‘daily data of stock A’, or in identifying a function. The ENFS is then applied to identify damping force characteristics of the smart dampers. In order to evaluate the effectiveness of the ENFS in identifying the damping forces of the smart dampers, the prediction errors are presented by comparing with experimental results.
A Local Scalable Distributed Expectation Maximization Algorithm for Large Peer-to-Peer Networks
NASA Technical Reports Server (NTRS)
Bhaduri, Kanishka; Srivastava, Ashok N.
2009-01-01
This paper offers a local distributed algorithm for expectation maximization in large peer-to-peer environments. The algorithm can be used for a variety of well-known data mining tasks in a distributed environment such as clustering, anomaly detection, target tracking to name a few. This technology is crucial for many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Centralizing all or some of the data for building global models is impractical in such peer-to-peer environments because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, and dynamic nature of the data/network. The distributed algorithm we have developed in this paper is provably-correct i.e. it converges to the same result compared to a similar centralized algorithm and can automatically adapt to changes to the data and the network. We show that the communication overhead of the algorithm is very low due to its local nature. This monitoring algorithm is then used as a feedback loop to sample data from the network and rebuild the model when it is outdated. We present thorough experimental results to verify our theoretical claims.
Improved Ant Colony Clustering Algorithm and Its Performance Study
Gao, Wei
2016-01-01
Clustering analysis is used in many disciplines and applications; it is an important tool that descriptively identifies homogeneous groups of objects based on attribute values. The ant colony clustering algorithm is a swarm-intelligent method used for clustering problems that is inspired by the behavior of ant colonies that cluster their corpses and sort their larvae. A new abstraction ant colony clustering algorithm using a data combination mechanism is proposed to improve the computational efficiency and accuracy of the ant colony clustering algorithm. The abstraction ant colony clustering algorithm is used to cluster benchmark problems, and its performance is compared with the ant colony clustering algorithm and other methods used in existing literature. Based on similar computational difficulties and complexities, the results show that the abstraction ant colony clustering algorithm produces results that are not only more accurate but also more efficiently determined than the ant colony clustering algorithm and the other methods. Thus, the abstraction ant colony clustering algorithm can be used for efficient multivariate data clustering. PMID:26839533
Categorization of hyperspectral information (HSI) based on the distribution of spectra in hyperspace
NASA Astrophysics Data System (ADS)
Resmini, Ronald G.
2003-09-01
Hyperspectral information (HSI) data are commonly categorized by a description of the dominant physical geographic background captured in the image cube. In other words, HSI categorization is commonly based on a cursory, visual assessment of whether the data are of desert, forest, urban, littoral, jungle, alpine, etc., terrains. Additionally, often the design of HSI collection experiments is based on the acquisition of data of the various backgrounds or of objects of interest within the various terrain types. These data are for assessing and quantifying algorithm performance as well as for algorithm development activities. Here, results of an investigation into the validity of the backgrounds-driven mode of characterizing the diversity of hyperspectral data are presented. HSI data are described quantitatively, in the space where most algorithms operate: n-dimensional (n-D) hyperspace, where n is the number of bands in an HSI data cube. Nineteen metrics designed to probe hyperspace are applied to 14 HYDICE HSI data cubes that represent nine different backgrounds. Each of the 14 sets (one for each HYDICE cube) of 19 metric values was analyzed for clustering. With the present set of data and metrics, there is no clear, unambiguous break-out of metrics based on the nine different geographic backgrounds. The break-outs clump seemingly unrelated data types together; e.g., littoral and urban/residential. Most metrics are normally distributed and indicate no clustering; one metric is one outlier away from normal (i.e., two clusters); and five are comprised of two distributions (i.e., two clusters). Overall, there are three different break-outs that do not correspond to conventional background categories. Implications of these preliminary results are discussed as are recommendations for future work.
An Efficient Optimization Method for Solving Unsupervised Data Classification Problems.
Shabanzadeh, Parvaneh; Yusof, Rubiyah
2015-01-01
Unsupervised data classification (or clustering) analysis is one of the most useful tools and a descriptive task in data mining that seeks to classify homogeneous groups of objects based on similarity and is used in many medical disciplines and various applications. In general, there is no single algorithm that is suitable for all types of data, conditions, and applications. Each algorithm has its own advantages, limitations, and deficiencies. Hence, research for novel and effective approaches for unsupervised data classification is still active. In this paper a heuristic algorithm, Biogeography-Based Optimization (BBO) algorithm, was adapted for data clustering problems by modifying the main operators of BBO algorithm, which is inspired from the natural biogeography distribution of different species. Similar to other population-based algorithms, BBO algorithm starts with an initial population of candidate solutions to an optimization problem and an objective function that is calculated for them. To evaluate the performance of the proposed algorithm assessment was carried on six medical and real life datasets and was compared with eight well known and recent unsupervised data classification algorithms. Numerical results demonstrate that the proposed evolutionary optimization algorithm is efficient for unsupervised data classification.
Improvement of the SEP protocol based on community structure of node degree
NASA Astrophysics Data System (ADS)
Li, Donglin; Wei, Suyuan
2017-05-01
Analyzing the Stable election protocol (SEP) in wireless sensor networks and aiming at the problem of inhomogeneous cluster-heads distribution and unreasonable cluster-heads selectivity and single hop transmission in the SEP, a SEP Protocol based on community structure of node degree (SEP-CSND) is proposed. In this algorithm, network node deployed by using grid deployment model, and the connection between nodes established by setting up the communication threshold. The community structure constructed by node degree, then cluster head is elected in the community structure. On the basis of SEP, the node's residual energy and node degree is added in cluster-heads election. The information is transmitted with mode of multiple hops between network nodes. The simulation experiments showed that compared to the classical LEACH and SEP, this algorithm balances the energy consumption of the entire network and significantly prolongs network lifetime.
NASA Technical Reports Server (NTRS)
Socolovsky, Eduardo A.; Bushnell, Dennis M. (Technical Monitor)
2002-01-01
The cosine or correlation measures of similarity used to cluster high dimensional data are interpreted as projections, and the orthogonal components are used to define a complementary dissimilarity measure to form a similarity-dissimilarity measure pair. Using a geometrical approach, a number of properties of this pair is established. This approach is also extended to general inner-product spaces of any dimension. These properties include the triangle inequality for the defined dissimilarity measure, error estimates for the triangle inequality and bounds on both measures that can be obtained with a few floating-point operations from previously computed values of the measures. The bounds and error estimates for the similarity and dissimilarity measures can be used to reduce the computational complexity of clustering algorithms and enhance their scalability, and the triangle inequality allows the design of clustering algorithms for high dimensional distributed data.
Probabilistic cosmological mass mapping from weak lensing shear
Schneider, M. D.; Ng, K. Y.; Dawson, W. A.; ...
2017-04-10
Here, we infer gravitational lensing shear and convergence fields from galaxy ellipticity catalogs under a spatial process prior for the lensing potential. We demonstrate the performance of our algorithm with simulated Gaussian-distributed cosmological lensing shear maps and a reconstruction of the mass distribution of the merging galaxy cluster Abell 781 using galaxy ellipticities measured with the Deep Lens Survey. Given interim posterior samples of lensing shear or convergence fields on the sky, we describe an algorithm to infer cosmological parameters via lens field marginalization. In the most general formulation of our algorithm we make no assumptions about weak shear ormore » Gaussian-distributed shape noise or shears. Because we require solutions and matrix determinants of a linear system of dimension that scales with the number of galaxies, we expect our algorithm to require parallel high-performance computing resources for application to ongoing wide field lensing surveys.« less
Focus-based filtering + clustering technique for power-law networks with small world phenomenon
NASA Astrophysics Data System (ADS)
Boutin, François; Thièvre, Jérôme; Hascoët, Mountaz
2006-01-01
Realistic interaction networks usually present two main properties: a power-law degree distribution and a small world behavior. Few nodes are linked to many nodes and adjacent nodes are likely to share common neighbors. Moreover, graph structure usually presents a dense core that is difficult to explore with classical filtering and clustering techniques. In this paper, we propose a new filtering technique accounting for a user-focus. This technique extracts a tree-like graph with also power-law degree distribution and small world behavior. Resulting structure is easily drawn with classical force-directed drawing algorithms. It is also quickly clustered and displayed into a multi-level silhouette tree (MuSi-Tree) from any user-focus. We built a new graph filtering + clustering + drawing API and report a case study.
A novel complex networks clustering algorithm based on the core influence of nodes.
Tong, Chao; Niu, Jianwei; Dai, Bin; Xie, Zhongyu
2014-01-01
In complex networks, cluster structure, identified by the heterogeneity of nodes, has become a common and important topological property. Network clustering methods are thus significant for the study of complex networks. Currently, many typical clustering algorithms have some weakness like inaccuracy and slow convergence. In this paper, we propose a clustering algorithm by calculating the core influence of nodes. The clustering process is a simulation of the process of cluster formation in sociology. The algorithm detects the nodes with core influence through their betweenness centrality, and builds the cluster's core structure by discriminant functions. Next, the algorithm gets the final cluster structure after clustering the rest of the nodes in the network by optimizing method. Experiments on different datasets show that the clustering accuracy of this algorithm is superior to the classical clustering algorithm (Fast-Newman algorithm). It clusters faster and plays a positive role in revealing the real cluster structure of complex networks precisely.
NASA Astrophysics Data System (ADS)
Marinoni, Christian; Davis, Marc; Newman, Jeffrey A.; Coil, Alison L.
2002-11-01
We have developed a new geometrical method for identifying and reconstructing a homogeneous and highly complete set of galaxy groups within flux-limited redshift surveys. Our method combines information from the three-dimensional Voronoi diagram and its dual, the Delaunay triangulation, to obtain group and cluster catalogs that are remarkably robust over wide ranges in redshift and degree of density enhancement. As free by-products, this Voronoi-Delaunay method (VDM) provides a nonparametric measurement of the galaxy density around each object observed and a quantitative measure of the distribution of cosmological voids in the survey volume. In this paper, we describe the VDM algorithm in detail and test its effectiveness using a family of mock catalogs that simulate the Deep Extragalactic Evolutionary Probe (DEEP2) Redshift Survey, which should present at least as much challenge to cluster reconstruction methods as any other near-future survey that is capable of resolving their velocity dispersions. Using these mock DEEP2 catalogs, we demonstrate that the VDM algorithm can be used to identify a homogeneous set of groups in a magnitude-limited sample throughout the survey redshift window 0.7
NASA Astrophysics Data System (ADS)
Niknam, Taher; Kavousifard, Abdollah; Tabatabaei, Sajad; Aghaei, Jamshid
2011-10-01
In this paper a new multiobjective modified honey bee mating optimization (MHBMO) algorithm is presented to investigate the distribution feeder reconfiguration (DFR) problem considering renewable energy sources (RESs) (photovoltaics, fuel cell and wind energy) connected to the distribution network. The objective functions of the problem to be minimized are the electrical active power losses, the voltage deviations, the total electrical energy costs and the total emissions of RESs and substations. During the optimization process, the proposed algorithm finds a set of non-dominated (Pareto) optimal solutions which are stored in an external memory called repository. Since the objective functions investigated are not the same, a fuzzy clustering algorithm is utilized to handle the size of the repository in the specified limits. Moreover, a fuzzy-based decision maker is adopted to select the 'best' compromised solution among the non-dominated optimal solutions of multiobjective optimization problem. In order to see the feasibility and effectiveness of the proposed algorithm, two standard distribution test systems are used as case studies.
Algorithm to determine the percolation largest component in interconnected networks.
Schneider, Christian M; Araújo, Nuno A M; Herrmann, Hans J
2013-04-01
Interconnected networks have been shown to be much more vulnerable to random and targeted failures than isolated ones, raising several interesting questions regarding the identification and mitigation of their risk. The paradigm to address these questions is the percolation model, where the resilience of the system is quantified by the dependence of the size of the largest cluster on the number of failures. Numerically, the major challenge is the identification of this cluster and the calculation of its size. Here, we propose an efficient algorithm to tackle this problem. We show that the algorithm scales as O(NlogN), where N is the number of nodes in the network, a significant improvement compared to O(N(2)) for a greedy algorithm, which permits studying much larger networks. Our new strategy can be applied to any network topology and distribution of interdependencies, as well as any sequence of failures.
Tools for Analyzing Computing Resource Management Strategies and Algorithms for SDR Clouds
NASA Astrophysics Data System (ADS)
Marojevic, Vuk; Gomez-Miguelez, Ismael; Gelonch, Antoni
2012-09-01
Software defined radio (SDR) clouds centralize the computing resources of base stations. The computing resource pool is shared between radio operators and dynamically loads and unloads digital signal processing chains for providing wireless communications services on demand. Each new user session request particularly requires the allocation of computing resources for executing the corresponding SDR transceivers. The huge amount of computing resources of SDR cloud data centers and the numerous session requests at certain hours of a day require an efficient computing resource management. We propose a hierarchical approach, where the data center is divided in clusters that are managed in a distributed way. This paper presents a set of computing resource management tools for analyzing computing resource management strategies and algorithms for SDR clouds. We use the tools for evaluating a different strategies and algorithms. The results show that more sophisticated algorithms can achieve higher resource occupations and that a tradeoff exists between cluster size and algorithm complexity.
Optimal colour quality of LED clusters based on memory colours.
Smet, Kevin; Ryckaert, Wouter R; Pointer, Michael R; Deconinck, Geert; Hanselaer, Peter
2011-03-28
The spectral power distributions of tri- and tetrachromatic clusters of Light-Emitting-Diodes, composed of simulated and commercially available LEDs, were optimized with a genetic algorithm to maximize the luminous efficacy of radiation and the colour quality as assessed by the memory colour quality metric developed by the authors. The trade-off of the colour quality as assessed by the memory colour metric and the luminous efficacy of radiation was investigated by calculating the Pareto optimal front using the NSGA-II genetic algorithm. Optimal peak wavelengths and spectral widths of the LEDs were derived, and over half of them were found to be close to Thornton's prime colours. The Pareto optimal fronts of real LED clusters were always found to be smaller than those of the simulated clusters. The effect of binning on designing a real LED cluster was investigated and was found to be quite large. Finally, a real LED cluster of commercially available AlGaInP, InGaN and phosphor white LEDs was optimized to obtain a higher score on memory colour quality scale than its corresponding CIE reference illuminant.
NASA Astrophysics Data System (ADS)
Wagstaff, Kiri L.
2012-03-01
On obtaining a new data set, the researcher is immediately faced with the challenge of obtaining a high-level understanding from the observations. What does a typical item look like? What are the dominant trends? How many distinct groups are included in the data set, and how is each one characterized? Which observable values are common, and which rarely occur? Which items stand out as anomalies or outliers from the rest of the data? This challenge is exacerbated by the steady growth in data set size [11] as new instruments push into new frontiers of parameter space, via improvements in temporal, spatial, and spectral resolution, or by the desire to "fuse" observations from different modalities and instruments into a larger-picture understanding of the same underlying phenomenon. Data clustering algorithms provide a variety of solutions for this task. They can generate summaries, locate outliers, compress data, identify dense or sparse regions of feature space, and build data models. It is useful to note up front that "clusters" in this context refer to groups of items within some descriptive feature space, not (necessarily) to "galaxy clusters" which are dense regions in physical space. The goal of this chapter is to survey a variety of data clustering methods, with an eye toward their applicability to astronomical data analysis. In addition to improving the individual researcher’s understanding of a given data set, clustering has led directly to scientific advances, such as the discovery of new subclasses of stars [14] and gamma-ray bursts (GRBs) [38]. All clustering algorithms seek to identify groups within a data set that reflect some observed, quantifiable structure. Clustering is traditionally an unsupervised approach to data analysis, in the sense that it operates without any direct guidance about which items should be assigned to which clusters. There has been a recent trend in the clustering literature toward supporting semisupervised or constrained clustering, in which some partial information about item assignments or other components of the resulting output are already known and must be accommodated by the solution. Some algorithms seek a partition of the data set into distinct clusters, while others build a hierarchy of nested clusters that can capture taxonomic relationships. Some produce a single optimal solution, while others construct a probabilistic model of cluster membership. More formally, clustering algorithms operate on a data set X composed of items represented by one or more features (dimensions). These could include physical location, such as right ascension and declination, as well as other properties such as brightness, color, temporal change, size, texture, and so on. Let D be the number of dimensions used to represent each item, xi ∈ RD. The clustering goal is to produce an organization P of the items in X that optimizes an objective function f : P -> R, which quantifies the quality of solution P. Often f is defined so as to maximize similarity within a cluster and minimize similarity between clusters. To that end, many algorithms make use of a measure d : X x X -> R of the distance between two items. A partitioning algorithm produces a set of clusters P = {c1, . . . , ck} such that the clusters are nonoverlapping (c_i intersected with c_j = empty set, i != j) subsets of the data set (Union_i c_i=X). Hierarchical algorithms produce a series of partitions P = {p1, . . . , pn }. For a complete hierarchy, the number of partitions n’= n, the number of items in the data set; the top partition is a single cluster containing all items, and the bottom partition contains n clusters, each containing a single item. For model-based clustering, each cluster c_j is represented by a model m_j , such as the cluster center or a Gaussian distribution. The wide array of available clustering algorithms may seem bewildering, and covering all of them is beyond the scope of this chapter. Choosing among them for a particular application involves considerations of the kind of data being analyzed, algorithm runtime efficiency, and how much prior knowledge is available about the problem domain, which can dictate the nature of clusters sought. Fundamentally, the clustering method and its representations of clusters carries with it a definition of what a cluster is, and it is important that this be aligned with the analysis goals for the problem at hand. In this chapter, I emphasize this point by identifying for each algorithm the cluster representation as a model, m_j , even for algorithms that are not typically thought of as creating a “model.” This chapter surveys a basic collection of clustering methods useful to any practitioner who is interested in applying clustering to a new data set. The algorithms include k-means (Section 25.2), EM (Section 25.3), agglomerative (Section 25.4), and spectral (Section 25.5) clustering, with side mentions of variants such as kernel k-means and divisive clustering. The chapter also discusses each algorithm’s strengths and limitations and provides pointers to additional in-depth reading for each subject. Section 25.6 discusses methods for incorporating domain knowledge into the clustering process. This chapter concludes with a brief survey of interesting applications of clustering methods to astronomy data (Section 25.7). The chapter begins with k-means because it is both generally accessible and so widely used that understanding it can be considered a necessary prerequisite for further work in the field. EM can be viewed as a more sophisticated version of k-means that uses a generative model for each cluster and probabilistic item assignments. Agglomerative clustering is the most basic form of hierarchical clustering and provides a basis for further exploration of algorithms in that vein. Spectral clustering permits a departure from feature-vector-based clustering and can operate on data sets instead represented as affinity, or similarity matrices—cases in which only pairwise information is known. The list of algorithms covered in this chapter is representative of those most commonly in use, but it is by no means comprehensive. There is an extensive collection of existing books on clustering that provide additional background and depth. Three early books that remain useful today are Anderberg’s Cluster Analysis for Applications [3], Hartigan’s Clustering Algorithms [25], and Gordon’s Classification [22]. The latter covers basics on similarity measures, partitioning and hierarchical algorithms, fuzzy clustering, overlapping clustering, conceptual clustering, validations methods, and visualization or data reduction techniques such as principal components analysis (PCA),multidimensional scaling, and self-organizing maps. More recently, Jain et al. provided a useful and informative survey [27] of a variety of different clustering algorithms, including those mentioned here as well as fuzzy, graph-theoretic, and evolutionary clustering. Everitt’s Cluster Analysis [19] provides a modern overview of algorithms, similarity measures, and evaluation methods.
A Genetic Algorithm That Exchanges Neighboring Centers for Fuzzy c-Means Clustering
ERIC Educational Resources Information Center
Chahine, Firas Safwan
2012-01-01
Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited for applications with large datasets than hierarchical clustering algorithms. K-means is among the most popular partitional clustering algorithm, but has a major…
NASA Astrophysics Data System (ADS)
Custódio, Susana; Lima, Vânia; Vales, Dina; Cesca, Simone; Carrilho, Fernando
2016-04-01
The matching between linear trends of hypocentres and fault planes indicated by focal mechanisms (FMs) is frequently used to infer the location and geometry of active faults. This practice works well in regions of fast lithospheric deformation, where earthquake patterns are clear and major structures accommodate the bulk of deformation, but typically fails in regions of slow and distributed deformation. We present a new joint FM and hypocentre cluster algorithm that is able to detect systematically the consistency between hypocentre lineations and FMs, even in regions of distributed deformation. We apply the method to the Azores-western Mediterranean region, with particular emphasis on western Iberia. The analysis relies on a compilation of hypocentres and FMs taken from regional and global earthquake catalogues, academic theses and technical reports, complemented by new FMs for western Iberia. The joint clustering algorithm images both well-known and new seismo-tectonic features. The Azores triple junction is characterised by FMs with vertical pressure (P) axes, in good agreement with the divergent setting, and the Iberian domain is characterised by NW-SE oriented P axes, indicating a response of the lithosphere to the ongoing oblique convergence between Nubia and Eurasia. Several earthquakes remain unclustered in the western Mediterranean domain, which may indicate a response to local stresses. The major regions of consistent faulting that we identify are the mid-Atlantic ridge, the Terceira rift, the Trans-Alboran shear zone and the north coast of Algeria. In addition, other smaller earthquake clusters present a good match between epicentre lineations and FM fault planes. These clusters may signal single active faults or wide zones of distributed but consistent faulting. Mainland Portugal is dominated by strike-slip earthquakes with fault planes coincident with the predominant NNE-SSW and WNW-ESE oriented earthquake lineations. Clusters offshore SW Iberia are predominantly strike-slip or reverse, confirming previous suggestions of slip partitioning.
NASA Astrophysics Data System (ADS)
Herring, Anna L.; Middleton, Jill; Walsh, Rick; Kingston, Andrew; Sheppard, Adrian
2017-09-01
We investigate capillary pressure-saturation (PC-S) relationships for drainage-imbibition experiments conducted with air (nonwetting phase) and brine (wetting phase) in Bentheimer sandstone cores. Three different flow rate conditions, ranging over three orders of magnitude, are investigated. X-ray micro-computed tomographic imaging is used to characterize the distribution and amount of fluids and their interfacial characteristics. Capillary pressure is measured via (1) bulk-phase pressure transducer measurements, and (2) image-based curvature measurements, calculated using a novel 3D curvature algorithm. We distinguish between connected (percolating) and disconnected air clusters: curvatures measured on the connected phase interfaces are used to validate the curvature algorithm and provide an indication of the equilibrium condition of the data; curvature and volume distributions of disconnected clusters provide insight to the snap-off processes occurring during drainage and imbibition under different flow rate conditions.
SOTXTSTREAM: Density-based self-organizing clustering of text streams.
Bryant, Avory C; Cios, Krzysztof J
2017-01-01
A streaming data clustering algorithm is presented building upon the density-based self-organizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets.
Distributed shared memory for roaming large volumes.
Castanié, Laurent; Mion, Christophe; Cavin, Xavier; Lévy, Bruno
2006-01-01
We present a cluster-based volume rendering system for roaming very large volumes. This system allows to move a gigabyte-sized probe inside a total volume of several tens or hundreds of gigabytes in real-time. While the size of the probe is limited by the total amount of texture memory on the cluster, the size of the total data set has no theoretical limit. The cluster is used as a distributed graphics processing unit that both aggregates graphics power and graphics memory. A hardware-accelerated volume renderer runs in parallel on the cluster nodes and the final image compositing is implemented using a pipelined sort-last rendering algorithm. Meanwhile, volume bricking and volume paging allow efficient data caching. On each rendering node, a distributed hierarchical cache system implements a global software-based distributed shared memory on the cluster. In case of a cache miss, this system first checks page residency on the other cluster nodes instead of directly accessing local disks. Using two Gigabit Ethernet network interfaces per node, we accelerate data fetching by a factor of 4 compared to directly accessing local disks. The system also implements asynchronous disk access and texture loading, which makes it possible to overlap data loading, volume slicing and rendering for optimal volume roaming.
Gong, Chunye; Bao, Weimin; Tang, Guojian; Jiang, Yuewen; Liu, Jie
2014-01-01
It is very time consuming to solve fractional differential equations. The computational complexity of two-dimensional fractional differential equation (2D-TFDE) with iterative implicit finite difference method is O(M(x)M(y)N(2)). In this paper, we present a parallel algorithm for 2D-TFDE and give an in-depth discussion about this algorithm. A task distribution model and data layout with virtual boundary are designed for this parallel algorithm. The experimental results show that the parallel algorithm compares well with the exact solution. The parallel algorithm on single Intel Xeon X5540 CPU runs 3.16-4.17 times faster than the serial algorithm on single CPU core. The parallel efficiency of 81 processes is up to 88.24% compared with 9 processes on a distributed memory cluster system. We do think that the parallel computing technology will become a very basic method for the computational intensive fractional applications in the near future.
NASA Astrophysics Data System (ADS)
Lin, Yen-Ting; Hsieh, Bau-Ching; Lin, Sheng-Chieh; Oguri, Masamune; Chen, Kai-Feng; Tanaka, Masayuki; Chiu, I.-non; Huang, Song; Kodama, Tadayuki; Leauthaud, Alexie; More, Surhud; Nishizawa, Atsushi J.; Bundy, Kevin; Lin, Lihwai; Miyazaki, Satoshi; HSC Collaboration
2018-01-01
The unprecedented depth and area surveyed by the Subaru Strategic Program with the Hyper Suprime-Cam (HSC-SSP) have enabled us to construct and publish the largest distant cluster sample out to z~1 to date. In this exploratory study of cluster galaxy evolution from z=1 to z=0.3, we investigate the stellar mass assembly history of brightest cluster galaxies (BCGs), and evolution of stellar mass and luminosity distributions, stellar mass surface density profile, as well as the population of radio galaxies. Our analysis is the first high redshift application of the top N richest cluster selection, which is shown to allow us to trace the cluster galaxy evolution faithfully. Our stellar mass is derived from a machine-learning algorithm, which we show to be unbiased and accurate with respect to the COSMOS data. We find very mild stellar mass growth in BCGs, and no evidence for evolution in both the total stellar mass-cluster mass correlation and the shape of the stellar mass surface density profile. The clusters are found to contain more red galaxies compared to the expectations from the field, even after the differences in density between the two environments have been taken into account. We also present the first measurement of the radio luminosity distribution in clusters out to z~1.
NASA Astrophysics Data System (ADS)
Wang, Audrey; Price, David T.
2007-03-01
A simple integrated algorithm was developed to relate global climatology to distributions of tree plant functional types (PFT). Multivariate cluster analysis was performed to analyze the statistical homogeneity of the climate space occupied by individual tree PFTs. Forested regions identified from the satellite-based GLC2000 classification were separated into tropical, temperate, and boreal sub-PFTs for use in the Canadian Terrestrial Ecosystem Model (CTEM). Global data sets of monthly minimum temperature, growing degree days, an index of climatic moisture, and estimated PFT cover fractions were then used as variables in the cluster analysis. The statistical results for individual PFT clusters were found consistent with other global-scale classifications of dominant vegetation. As an improvement of the quantification of the climatic limitations on PFT distributions, the results also demonstrated overlapping of PFT cluster boundaries that reflected vegetation transitions, for example, between tropical and temperate biomes. The resulting global database should provide a better basis for simulating the interaction of climate change and terrestrial ecosystem dynamics using global vegetation models.
Atom Probe Tomography Analysis of the Distribution of Rhenium in Nickel Alloys
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mottura, A.; Warnken, N; Miller, Michael K
2010-01-01
Atom probe tomography (APT) is used to characterise the distributions of rhenium in a binary Ni-Re alloy and the nickel-based single-crystal CMSX-4 superalloy. A purpose-built algorithm is developed to quantify the size distribution of solute clusters, and applied to the APT datasets to critique the hypothesis that rhenium is prone to the formation of clusters in these systems. No evidence is found to indicate that rhenium forms solute clusters above the level expected from random fluctuations. In CMSX-4, enrichment of Re is detected in the matrix phase close to the matrix/precipitate ({gamma}/{gamma}{prime}) phase boundaries. Phase field modelling indicates that thismore » is due to the migration of the {gamma}/{gamma}{prime} interface during cooling from the temperature of operation. Thus, neither clustering of rhenium nor interface enrichments can be the cause of the enhancement in high temperature mechanical properties conferred by rhenium alloying.« less
Data Analytics for Smart Parking Applications.
Piovesan, Nicola; Turi, Leo; Toigo, Enrico; Martinez, Borja; Rossi, Michele
2016-09-23
We consider real-life smart parking systems where parking lot occupancy data are collected from field sensor devices and sent to backend servers for further processing and usage for applications. Our objective is to make these data useful to end users, such as parking managers, and, ultimately, to citizens. To this end, we concoct and validate an automated classification algorithm having two objectives: (1) outlier detection: to detect sensors with anomalous behavioral patterns, i.e., outliers; and (2) clustering: to group the parking sensors exhibiting similar patterns into distinct clusters. We first analyze the statistics of real parking data, obtaining suitable simulation models for parking traces. We then consider a simple classification algorithm based on the empirical complementary distribution function of occupancy times and show its limitations. Hence, we design a more sophisticated algorithm exploiting unsupervised learning techniques (self-organizing maps). These are tuned following a supervised approach using our trace generator and are compared against other clustering schemes, namely expectation maximization, k-means clustering and DBSCAN, considering six months of data from a real sensor deployment. Our approach is found to be superior in terms of classification accuracy, while also being capable of identifying all of the outliers in the dataset.
Change detection in synthetic aperture radar images based on image fusion and fuzzy clustering.
Gong, Maoguo; Zhou, Zhiqiang; Ma, Jingjing
2012-04-01
This paper presents an unsupervised distribution-free change detection approach for synthetic aperture radar (SAR) images based on an image fusion strategy and a novel fuzzy clustering algorithm. The image fusion technique is introduced to generate a difference image by using complementary information from a mean-ratio image and a log-ratio image. In order to restrain the background information and enhance the information of changed regions in the fused difference image, wavelet fusion rules based on an average operator and minimum local area energy are chosen to fuse the wavelet coefficients for a low-frequency band and a high-frequency band, respectively. A reformulated fuzzy local-information C-means clustering algorithm is proposed for classifying changed and unchanged regions in the fused difference image. It incorporates the information about spatial context in a novel fuzzy way for the purpose of enhancing the changed information and of reducing the effect of speckle noise. Experiments on real SAR images show that the image fusion strategy integrates the advantages of the log-ratio operator and the mean-ratio operator and gains a better performance. The change detection results obtained by the improved fuzzy clustering algorithm exhibited lower error than its preexistences.
Data Analytics for Smart Parking Applications
Piovesan, Nicola; Turi, Leo; Toigo, Enrico; Martinez, Borja; Rossi, Michele
2016-01-01
We consider real-life smart parking systems where parking lot occupancy data are collected from field sensor devices and sent to backend servers for further processing and usage for applications. Our objective is to make these data useful to end users, such as parking managers, and, ultimately, to citizens. To this end, we concoct and validate an automated classification algorithm having two objectives: (1) outlier detection: to detect sensors with anomalous behavioral patterns, i.e., outliers; and (2) clustering: to group the parking sensors exhibiting similar patterns into distinct clusters. We first analyze the statistics of real parking data, obtaining suitable simulation models for parking traces. We then consider a simple classification algorithm based on the empirical complementary distribution function of occupancy times and show its limitations. Hence, we design a more sophisticated algorithm exploiting unsupervised learning techniques (self-organizing maps). These are tuned following a supervised approach using our trace generator and are compared against other clustering schemes, namely expectation maximization, k-means clustering and DBSCAN, considering six months of data from a real sensor deployment. Our approach is found to be superior in terms of classification accuracy, while also being capable of identifying all of the outliers in the dataset. PMID:27669259
Exploring cluster Monte Carlo updates with Boltzmann machines
NASA Astrophysics Data System (ADS)
Wang, Lei
2017-11-01
Boltzmann machines are physics informed generative models with broad applications in machine learning. They model the probability distribution of an input data set with latent variables and generate new samples accordingly. Applying the Boltzmann machines back to physics, they are ideal recommender systems to accelerate the Monte Carlo simulation of physical systems due to their flexibility and effectiveness. More intriguingly, we show that the generative sampling of the Boltzmann machines can even give different cluster Monte Carlo algorithms. The latent representation of the Boltzmann machines can be designed to mediate complex interactions and identify clusters of the physical system. We demonstrate these findings with concrete examples of the classical Ising model with and without four-spin plaquette interactions. In the future, automatic searches in the algorithm space parametrized by Boltzmann machines may discover more innovative Monte Carlo updates.
The global Minmax k-means algorithm.
Wang, Xiaoyan; Bai, Yanping
2016-01-01
The global k -means algorithm is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure from suitable initial positions, and employs k -means to minimize the sum of the intra-cluster variances. However the global k -means algorithm sometimes results singleton clusters and the initial positions sometimes are bad, after a bad initialization, poor local optimal can be easily obtained by k -means algorithm. In this paper, we modified the global k -means algorithm to eliminate the singleton clusters at first, and then we apply MinMax k -means clustering error method to global k -means algorithm to overcome the effect of bad initialization, proposed the global Minmax k -means algorithm. The proposed clustering method is tested on some popular data sets and compared to the k -means algorithm, the global k -means algorithm and the MinMax k -means algorithm. The experiment results show our proposed algorithm outperforms other algorithms mentioned in the paper.
Noise-enhanced clustering and competitive learning algorithms.
Osoba, Osonde; Kosko, Bart
2013-01-01
Noise can provably speed up convergence in many centroid-based clustering algorithms. This includes the popular k-means clustering algorithm. The clustering noise benefit follows from the general noise benefit for the expectation-maximization algorithm because many clustering algorithms are special cases of the expectation-maximization algorithm. Simulations show that noise also speeds up convergence in stochastic unsupervised competitive learning, supervised competitive learning, and differential competitive learning. Copyright © 2012 Elsevier Ltd. All rights reserved.
Parallel implementation of D-Phylo algorithm for maximum likelihood clusters.
Malik, Shamita; Sharma, Dolly; Khatri, Sunil Kumar
2017-03-01
This study explains a newly developed parallel algorithm for phylogenetic analysis of DNA sequences. The newly designed D-Phylo is a more advanced algorithm for phylogenetic analysis using maximum likelihood approach. The D-Phylo while misusing the seeking capacity of k -means keeps away from its real constraint of getting stuck at privately conserved motifs. The authors have tested the behaviour of D-Phylo on Amazon Linux Amazon Machine Image(Hardware Virtual Machine)i2.4xlarge, six central processing unit, 122 GiB memory, 8 × 800 Solid-state drive Elastic Block Store volume, high network performance up to 15 processors for several real-life datasets. Distributing the clusters evenly on all the processors provides us the capacity to accomplish a near direct speed if there should arise an occurrence of huge number of processors.
Dynamics of fragment formation in neutron-rich matter
NASA Astrophysics Data System (ADS)
Alcain, P. N.; Dorso, C. O.
2018-01-01
Background: Neutron stars are astronomical systems with nucleons subjected to extreme conditions. Due to the longer range Coulomb repulsion between protons, the system has structural inhomogeneities. Several interactions tailored to reproduce nuclear matter plus a screened Coulomb term reproduce these inhomogeneities known as nuclear pasta. These structural inhomogeneities, located in the crusts of neutron stars, can also arise in expanding systems depending on the thermodynamic conditions (temperature, proton fraction, etc.) and the expansion velocity. Purpose: We aim to find the dynamics of the fragment formation for expanding systems simulated according to the little big bang model. This expansion resembles the evolution of merging neutron stars. Method: We study the dynamics of the nucleons with semiclassical molecular dynamics models. Starting with an equilibrium configuration, we expand the system homogeneously until we arrive at an asymptotic configuration (i.e., very low final densities). We study, with four different cluster recognition algorithms, the fragment distribution throughout this expansion and the dynamics of the cluster formation. Results: Studying the topology of the equilibrium states, before the expansion, we reproduced the known pasta phases plus a novel phase we called pregnocchi, consisting of proton aggregates embedded in a neutron sea. We have identified different fragmentation regimes, depending on the initial temperature and fragment velocity. In particular, for the already mentioned pregnocchi, a neutron cloud surrounds the clusters during the early stages of the expansion, resulting in systems that give rise to configurations compatible with the emergence of the r process. Conclusions: We showed that a proper identification of the cluster distribution is highly dependent on the cluster recognition algorithm chosen, and found that the early cluster recognition algorithm (ECRA) was the most stable one. This approach allowed us to identify the dynamics of the fragment formation. These calculations pave the way to a comparison between Earth experiments and neutron star studies.
Adaptive Scaling of Cluster Boundaries for Large-Scale Social Media Data Clustering.
Meng, Lei; Tan, Ah-Hwee; Wunsch, Donald C
2016-12-01
The large scale and complex nature of social media data raises the need to scale clustering techniques to big data and make them capable of automatically identifying data clusters with few empirical settings. In this paper, we present our investigation and three algorithms based on the fuzzy adaptive resonance theory (Fuzzy ART) that have linear computational complexity, use a single parameter, i.e., the vigilance parameter to identify data clusters, and are robust to modest parameter settings. The contribution of this paper lies in two aspects. First, we theoretically demonstrate how complement coding, commonly known as a normalization method, changes the clustering mechanism of Fuzzy ART, and discover the vigilance region (VR) that essentially determines how a cluster in the Fuzzy ART system recognizes similar patterns in the feature space. The VR gives an intrinsic interpretation of the clustering mechanism and limitations of Fuzzy ART. Second, we introduce the idea of allowing different clusters in the Fuzzy ART system to have different vigilance levels in order to meet the diverse nature of the pattern distribution of social media data. To this end, we propose three vigilance adaptation methods, namely, the activation maximization (AM) rule, the confliction minimization (CM) rule, and the hybrid integration (HI) rule. With an initial vigilance value, the resulting clustering algorithms, namely, the AM-ART, CM-ART, and HI-ART, can automatically adapt the vigilance values of all clusters during the learning epochs in order to produce better cluster boundaries. Experiments on four social media data sets show that AM-ART, CM-ART, and HI-ART are more robust than Fuzzy ART to the initial vigilance value, and they usually achieve better or comparable performance and much faster speed than the state-of-the-art clustering algorithms that also do not require a predefined number of clusters.
MRPack: Multi-Algorithm Execution Using Compute-Intensive Approach in MapReduce
2015-01-01
Large quantities of data have been generated from multiple sources at exponential rates in the last few years. These data are generated at high velocity as real time and streaming data in variety of formats. These characteristics give rise to challenges in its modeling, computation, and processing. Hadoop MapReduce (MR) is a well known data-intensive distributed processing framework using the distributed file system (DFS) for Big Data. Current implementations of MR only support execution of a single algorithm in the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. It uses the available computing resources by dynamically managing the task assignment and intermediate data. Intermediate data from multiple algorithms are managed using multi-key and skew mitigation strategies. The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximate 50% decrease in I/O cost. Complexity and qualitative results analysis shows significant performance improvement. PMID:26305223
MRPack: Multi-Algorithm Execution Using Compute-Intensive Approach in MapReduce.
Idris, Muhammad; Hussain, Shujaat; Siddiqi, Muhammad Hameed; Hassan, Waseem; Syed Muhammad Bilal, Hafiz; Lee, Sungyoung
2015-01-01
Large quantities of data have been generated from multiple sources at exponential rates in the last few years. These data are generated at high velocity as real time and streaming data in variety of formats. These characteristics give rise to challenges in its modeling, computation, and processing. Hadoop MapReduce (MR) is a well known data-intensive distributed processing framework using the distributed file system (DFS) for Big Data. Current implementations of MR only support execution of a single algorithm in the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. It uses the available computing resources by dynamically managing the task assignment and intermediate data. Intermediate data from multiple algorithms are managed using multi-key and skew mitigation strategies. The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximate 50% decrease in I/O cost. Complexity and qualitative results analysis shows significant performance improvement.
Vascular system modeling in parallel environment - distributed and shared memory approaches
Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne
2011-01-01
The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
NASA Astrophysics Data System (ADS)
Davies, Michael; Ganapathysubramanian, Baskar; Balasubramanian, Ganesh
2017-03-01
We present results from a computational framework integrating genetic algorithm and molecular dynamics simulations to systematically design isotope engineered graphene structures for reduced thermal conductivity. In addition to the effect of mass disorder, our results reveal the importance of atomic distribution on thermal conductivity for the same isotopic concentration. Distinct groups of isotope-substituted graphene sheets are identified based on the atomic composition and distribution. Our results show that in structures with equiatomic compositions, the enhanced scattering by lattice vibrations results in lower thermal conductivities due to the absence of isotopic clusters.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Davis, C.; et al.
We present the calibration of the Dark Energy Survey Year 1 (DES Y1) weak lensing source galaxy redshift distributions from clustering measurements. By cross-correlating the positions of source galaxies with luminous red galaxies selected by the redMaGiC algorithm we measure the redshift distributions of the source galaxies as placed into different tomographic bins. These measurements constrain any such shifts to an accuracy ofmore » $$\\sim0.02$$ and can be computed even when the clustering measurements do not span the full redshift range. The highest-redshift source bin is not constrained by the clustering measurements because of the minimal redshift overlap with the redMaGiC galaxies. We compare our constraints with those obtained from $$\\texttt{COSMOS}$$ 30-band photometry and find that our two very different methods produce consistent constraints.« less
Distributed computing for membrane-based modeling of action potential propagation.
Porras, D; Rogers, J M; Smith, W M; Pollard, A E
2000-08-01
Action potential propagation simulations with physiologic membrane currents and macroscopic tissue dimensions are computationally expensive. We, therefore, analyzed distributed computing schemes to reduce execution time in workstation clusters by parallelizing solutions with message passing. Four schemes were considered in two-dimensional monodomain simulations with the Beeler-Reuter membrane equations. Parallel speedups measured with each scheme were compared to theoretical speedups, recognizing the relationship between speedup and code portions that executed serially. A data decomposition scheme based on total ionic current provided the best performance. Analysis of communication latencies in that scheme led to a load-balancing algorithm in which measured speedups at 89 +/- 2% and 75 +/- 8% of theoretical speedups were achieved in homogeneous and heterogeneous clusters of workstations. Speedups in this scheme with the Luo-Rudy dynamic membrane equations exceeded 3.0 with eight distributed workstations. Cluster speedups were comparable to those measured during parallel execution on a shared memory machine.
Clustering PPI data by combining FA and SHC method.
Lei, Xiujuan; Ying, Chao; Wu, Fang-Xiang; Xu, Jin
2015-01-01
Clustering is one of main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless traditional clustering methods may not be effective for clustering PPI data. In this paper, we proposed a novel method for clustering PPI data by combining firefly algorithm (FA) and synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC) which transforms the high-dimensional similarity matrix into a low dimension matrix. Then the SHC algorithm is used to perform clustering. In SHC algorithm, hierarchical clustering is achieved by enlarging the neighborhood radius of synchronized objects continuously, while the hierarchical search is very difficult to find the optimal neighborhood radius of synchronization and the efficiency is not high. So we adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than the traditional algorithms in precision, recall and f-measure value.
Clustering PPI data by combining FA and SHC method
2015-01-01
Clustering is one of main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless traditional clustering methods may not be effective for clustering PPI data. In this paper, we proposed a novel method for clustering PPI data by combining firefly algorithm (FA) and synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC) which transforms the high-dimensional similarity matrix into a low dimension matrix. Then the SHC algorithm is used to perform clustering. In SHC algorithm, hierarchical clustering is achieved by enlarging the neighborhood radius of synchronized objects continuously, while the hierarchical search is very difficult to find the optimal neighborhood radius of synchronization and the efficiency is not high. So we adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than the traditional algorithms in precision, recall and f-measure value. PMID:25707632
NASA Astrophysics Data System (ADS)
Alfarizy, A. D.; Indahwati; Sartono, B.
2017-03-01
Indonesia is the largest Hollywood movie industry target market in Southeast Asia in 2015. Hollywood movies distributed in Indonesia targeted people in all range of ages including children. Low awareness of guiding children while watching movies make them could watch any rated films even the unsuitable ones for their ages. Even after being translated into Bahasa and passed the censorship phase, words that uncomfortable for children to watch still exist. The purpose of this research is to cluster box office Hollywood movies based on Indonesian subtitle, revenue, IMDb user rating and genres as one of the reference for adults to choose right movies for their children to watch. Text mining is used to extract words from the subtitles and count the frequency for three group of words (bad words, sexual words and terror words), while Partition Around Medoids (PAM) Algorithm with Gower similarity coefficient as proximity matrix is used as clustering method. We clustered 624 movies from 2006 until first half of 2016 from IMDb. Cluster with highest silhouette coefficient value (0.36) is the one with 5 clusters. Animation, Adventure and Comedy movies with high revenue like in cluster 5 is recommended for children to watch, while Comedy movies with high revenue like in cluster 4 should be avoided to watch.
Information Clustering Based on Fuzzy Multisets.
ERIC Educational Resources Information Center
Miyamoto, Sadaaki
2003-01-01
Proposes a fuzzy multiset model for information clustering with application to information retrieval on the World Wide Web. Highlights include search engines; term clustering; document clustering; algorithms for calculating cluster centers; theoretical properties concerning clustering algorithms; and examples to show how the algorithms work.…
NASA Astrophysics Data System (ADS)
Hortos, William S.
2008-04-01
Proposed distributed wavelet-based algorithms are a means to compress sensor data received at the nodes forming a wireless sensor network (WSN) by exchanging information between neighboring sensor nodes. Local collaboration among nodes compacts the measurements, yielding a reduced fused set with equivalent information at far fewer nodes. Nodes may be equipped with multiple sensor types, each capable of sensing distinct phenomena: thermal, humidity, chemical, voltage, or image signals with low or no frequency content as well as audio, seismic or video signals within defined frequency ranges. Compression of the multi-source data through wavelet-based methods, distributed at active nodes, reduces downstream processing and storage requirements along the paths to sink nodes; it also enables noise suppression and more energy-efficient query routing within the WSN. Targets are first detected by the multiple sensors; then wavelet compression and data fusion are applied to the target returns, followed by feature extraction from the reduced data; feature data are input to target recognition/classification routines; targets are tracked during their sojourns through the area monitored by the WSN. Algorithms to perform these tasks are implemented in a distributed manner, based on a partition of the WSN into clusters of nodes. In this work, a scheme of collaborative processing is applied for hierarchical data aggregation and decorrelation, based on the sensor data itself and any redundant information, enabled by a distributed, in-cluster wavelet transform with lifting that allows multiple levels of resolution. The wavelet-based compression algorithm significantly decreases RF bandwidth and other resource use in target processing tasks. Following wavelet compression, features are extracted. The objective of feature extraction is to maximize the probabilities of correct target classification based on multi-source sensor measurements, while minimizing the resource expenditures at participating nodes. Therefore, the feature-extraction method based on the Haar DWT is presented that employs a maximum-entropy measure to determine significant wavelet coefficients. Features are formed by calculating the energy of coefficients grouped around the competing clusters. A DWT-based feature extraction algorithm used for vehicle classification in WSNs can be enhanced by an added rule for selecting the optimal number of resolution levels to improve the correct classification rate and reduce energy consumption expended in local algorithm computations. Published field trial data for vehicular ground targets, measured with multiple sensor types, are used to evaluate the wavelet-assisted algorithms. Extracted features are used in established target recognition routines, e.g., the Bayesian minimum-error-rate classifier, to compare the effects on the classification performance of the wavelet compression. Simulations of feature sets and recognition routines at different resolution levels in target scenarios indicate the impact on classification rates, while formulas are provided to estimate reduction in resource use due to distributed compression.
An improved clustering algorithm based on reverse learning in intelligent transportation
NASA Astrophysics Data System (ADS)
Qiu, Guoqing; Kou, Qianqian; Niu, Ting
2017-05-01
With the development of artificial intelligence and data mining technology, big data has gradually entered people's field of vision. In the process of dealing with large data, clustering is an important processing method. By introducing the reverse learning method in the clustering process of PAM clustering algorithm, to further improve the limitations of one-time clustering in unsupervised clustering learning, and increase the diversity of clustering clusters, so as to improve the quality of clustering. The algorithm analysis and experimental results show that the algorithm is feasible.
A roadmap of clustering algorithms: finding a match for a biomedical application.
Andreopoulos, Bill; An, Aijun; Wang, Xiaogang; Schroeder, Michael
2009-05-01
Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.
Visualizing Time-Varying Distribution Data in EOS Application
NASA Technical Reports Server (NTRS)
Shen, Han-Wei
2004-01-01
In this research, we have developed several novel visualization methods for spatial probability density function data. Our focus has been on 2D spatial datasets, where each pixel is a random variable, and has multiple samples which are the results of experiments on that random variable. We developed novel clustering algorithms as a means to reduce the information contained in these datasets; and investigated different ways of interpreting and clustering the data.
Efficient clustering aggregation based on data fragments.
Wu, Ou; Hu, Weiming; Maybank, Stephen J; Zhu, Mingliang; Li, Bing
2012-06-01
Clustering aggregation, known as clustering ensembles, has emerged as a powerful technique for combining different clustering results to obtain a single better clustering. Existing clustering aggregation algorithms are applied directly to data points, in what is referred to as the point-based approach. The algorithms are inefficient if the number of data points is large. We define an efficient approach for clustering aggregation based on data fragments. In this fragment-based approach, a data fragment is any subset of the data that is not split by any of the clustering results. To establish the theoretical bases of the proposed approach, we prove that clustering aggregation can be performed directly on data fragments under two widely used goodness measures for clustering aggregation taken from the literature. Three new clustering aggregation algorithms are described. The experimental results obtained using several public data sets show that the new algorithms have lower computational complexity than three well-known existing point-based clustering aggregation algorithms (Agglomerative, Furthest, and LocalSearch); nevertheless, the new algorithms do not sacrifice the accuracy.
Lagrangian analysis by clustering. An example in the Nordic Seas.
NASA Astrophysics Data System (ADS)
Koszalka, Inga; Lacasce, Joseph H.
2010-05-01
We propose a new method for obtaining average velocities and eddy diffusivities from Lagrangian data. Rather than grouping the drifter-derived velocities in uniform geographical bins, as is commonly done, we group a specified number of nearest-neighbor velocities. This is done via a clustering algorithm operating on the instantaneous positions of the drifters. Thus it is the data distribution itself which determines the positions of the averages and the areal extent of the clusters. A major advantage is that because the number of members is essentially the same for all clusters, the statistical accuracy is more uniform than with geographical bins. We illustrate the technique using synthetic data from a stochastic model, employing a realistic mean flow. The latter is an accurate representation of the surface currents in the Nordic Seas and is strongly inhomogeneous in space. We use the clustering algorithm to extract the mean velocities and diffusivities (both of which are known from the stochastic model). We also compare the results to those obtained with fixed geographical bins. Clustering is more successful at capturing spatial variability of the mean flow and also improves convergence in the eddy diffusivity estimates. We discuss both the future prospects and shortcomings of the new method.
A clustering method of Chinese medicine prescriptions based on modified firefly algorithm.
Yuan, Feng; Liu, Hong; Chen, Shou-Qiang; Xu, Liang
2016-12-01
This paper is aimed to study the clustering method for Chinese medicine (CM) medical cases. The traditional K-means clustering algorithm had shortcomings such as dependence of results on the selection of initial value, trapping in local optimum when processing prescriptions form CM medical cases. Therefore, a new clustering method based on the collaboration of firefly algorithm and simulated annealing algorithm was proposed. This algorithm dynamically determined the iteration of firefly algorithm and simulates sampling of annealing algorithm by fitness changes, and increased the diversity of swarm through expansion of the scope of the sudden jump, thereby effectively avoiding premature problem. The results from confirmatory experiments for CM medical cases suggested that, comparing with traditional K-means clustering algorithms, this method was greatly improved in the individual diversity and the obtained clustering results, the computing results from this method had a certain reference value for cluster analysis on CM prescriptions.
Evaluation of Low-Voltage Distribution Network Index Based on Improved Principal Component Analysis
NASA Astrophysics Data System (ADS)
Fan, Hanlu; Gao, Suzhou; Fan, Wenjie; Zhong, Yinfeng; Zhu, Lei
2018-01-01
In order to evaluate the development level of the low-voltage distribution network objectively and scientifically, chromatography analysis method is utilized to construct evaluation index model of low-voltage distribution network. Based on the analysis of principal component and the characteristic of logarithmic distribution of the index data, a logarithmic centralization method is adopted to improve the principal component analysis algorithm. The algorithm can decorrelate and reduce the dimensions of the evaluation model and the comprehensive score has a better dispersion degree. The clustering method is adopted to analyse the comprehensive score because the comprehensive score of the courts is concentrated. Then the stratification evaluation of the courts is realized. An example is given to verify the objectivity and scientificity of the evaluation method.
ClusterViz: A Cytoscape APP for Cluster Analysis of Biological Network.
Wang, Jianxin; Zhong, Jiancheng; Chen, Gang; Li, Min; Wu, Fang-xiang; Pan, Yi
2015-01-01
Cluster analysis of biological networks is one of the most important approaches for identifying functional modules and predicting protein functions. Furthermore, visualization of clustering results is crucial to uncover the structure of biological networks. In this paper, ClusterViz, an APP of Cytoscape 3 for cluster analysis and visualization, has been developed. In order to reduce complexity and enable extendibility for ClusterViz, we designed the architecture of ClusterViz based on the framework of Open Services Gateway Initiative. According to the architecture, the implementation of ClusterViz is partitioned into three modules including interface of ClusterViz, clustering algorithms and visualization and export. ClusterViz fascinates the comparison of the results of different algorithms to do further related analysis. Three commonly used clustering algorithms, FAG-EC, EAGLE and MCODE, are included in the current version. Due to adopting the abstract interface of algorithms in module of the clustering algorithms, more clustering algorithms can be included for the future use. To illustrate usability of ClusterViz, we provided three examples with detailed steps from the important scientific articles, which show that our tool has helped several research teams do their research work on the mechanism of the biological networks.
Deterministic annealing for density estimation by multivariate normal mixtures
NASA Astrophysics Data System (ADS)
Kloppenburg, Martin; Tavan, Paul
1997-03-01
An approach to maximum-likelihood density estimation by mixtures of multivariate normal distributions for large high-dimensional data sets is presented. Conventionally that problem is tackled by notoriously unstable expectation-maximization (EM) algorithms. We remove these instabilities by the introduction of soft constraints, enabling deterministic annealing. Our developments are motivated by the proof that algorithmically stable fuzzy clustering methods that are derived from statistical physics analogs are special cases of EM procedures.
Jothi, R; Mohanty, Sraban Kumar; Ojha, Aparajita
2016-04-01
Gene expression data clustering is an important biological process in DNA microarray analysis. Although there have been many clustering algorithms for gene expression analysis, finding a suitable and effective clustering algorithm is always a challenging problem due to the heterogeneous nature of gene profiles. Minimum Spanning Tree (MST) based clustering algorithms have been successfully employed to detect clusters of varying shapes and sizes. This paper proposes a novel clustering algorithm using Eigenanalysis on Minimum Spanning Tree based neighborhood graph (E-MST). As MST of a set of points reflects the similarity of the points with their neighborhood, the proposed algorithm employs a similarity graph obtained from k(') rounds of MST (k(')-MST neighborhood graph). By studying the spectral properties of the similarity matrix obtained from k(')-MST graph, the proposed algorithm achieves improved clustering results. We demonstrate the efficacy of the proposed algorithm on 12 gene expression datasets. Experimental results show that the proposed algorithm performs better than the standard clustering algorithms. Copyright © 2016 Elsevier Ltd. All rights reserved.
Sun, Liping; Luo, Yonglong; Ding, Xintao; Zhang, Ji
2014-01-01
An important component of a spatial clustering algorithm is the distance measure between sample points in object space. In this paper, the traditional Euclidean distance measure is replaced with innovative obstacle distance measure for spatial clustering under obstacle constraints. Firstly, we present a path searching algorithm to approximate the obstacle distance between two points for dealing with obstacles and facilitators. Taking obstacle distance as similarity metric, we subsequently propose the artificial immune clustering with obstacle entity (AICOE) algorithm for clustering spatial point data in the presence of obstacles and facilitators. Finally, the paper presents a comparative analysis of AICOE algorithm and the classical clustering algorithms. Our clustering model based on artificial immune system is also applied to the case of public facility location problem in order to establish the practical applicability of our approach. By using the clone selection principle and updating the cluster centers based on the elite antibodies, the AICOE algorithm is able to achieve the global optimum and better clustering effect.
Fast Constrained Spectral Clustering and Cluster Ensemble with Random Projection
Liu, Wenfen
2017-01-01
Constrained spectral clustering (CSC) method can greatly improve the clustering accuracy with the incorporation of constraint information into spectral clustering and thus has been paid academic attention widely. In this paper, we propose a fast CSC algorithm via encoding landmark-based graph construction into a new CSC model and applying random sampling to decrease the data size after spectral embedding. Compared with the original model, the new algorithm has the similar results with the increase of its model size asymptotically; compared with the most efficient CSC algorithm known, the new algorithm runs faster and has a wider range of suitable data sets. Meanwhile, a scalable semisupervised cluster ensemble algorithm is also proposed via the combination of our fast CSC algorithm and dimensionality reduction with random projection in the process of spectral ensemble clustering. We demonstrate by presenting theoretical analysis and empirical results that the new cluster ensemble algorithm has advantages in terms of efficiency and effectiveness. Furthermore, the approximate preservation of random projection in clustering accuracy proved in the stage of consensus clustering is also suitable for the weighted k-means clustering and thus gives the theoretical guarantee to this special kind of k-means clustering where each point has its corresponding weight. PMID:29312447
Community health assessment using self-organizing maps and geographic information systems
Basara, Heather G; Yuan, May
2008-01-01
Background From a public health perspective, a healthier community environment correlates with fewer occurrences of chronic or infectious diseases. Our premise is that community health is a non-linear function of environmental and socioeconomic effects that are not normally distributed among communities. The objective was to integrate multivariate data sets representing social, economic, and physical environmental factors to evaluate the hypothesis that communities with similar environmental characteristics exhibit similar distributions of disease. Results The SOM algorithm used the intrinsic distributions of 92 environmental variables to classify 511 communities into five clusters. SOM determined clusters were reprojected to geographic space and compared with the distributions of several health outcomes. ANOVA results indicated that the variability between community clusters was significant with respect to the spatial distribution of disease occurrence. Conclusion Our study demonstrated a positive relationship between environmental conditions and health outcomes in communities using the SOM-GIS method to overcome data and methodological challenges traditionally encountered in public health research. Results demonstrated that community health can be classified using environmental variables and that the SOM-GIS method may be applied to multivariate environmental health studies. PMID:19116020
Large Scale Frequent Pattern Mining using MPI One-Sided Model
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vishnu, Abhinav; Agarwal, Khushbu
In this paper, we propose a work-stealing runtime --- Library for Work Stealing LibWS --- using MPI one-sided model for designing scalable FP-Growth --- {\\em de facto} frequent pattern mining algorithm --- on large scale systems. LibWS provides locality efficient and highly scalable work-stealing techniques for load balancing on a variety of data distributions. We also propose a novel communication algorithm for FP-growth data exchange phase, which reduces the communication complexity from state-of-the-art O(p) to O(f + p/f) for p processes and f frequent attributed-ids. FP-Growth is implemented using LibWS and evaluated on several work distributions and support counts. Anmore » experimental evaluation of the FP-Growth on LibWS using 4096 processes on an InfiniBand Cluster demonstrates excellent efficiency for several work distributions (87\\% efficiency for Power-law and 91% for Poisson). The proposed distributed FP-Tree merging algorithm provides 38x communication speedup on 4096 cores.« less
Karayiannis, N B
2000-01-01
This paper presents the development and investigates the properties of ordered weighted learning vector quantization (LVQ) and clustering algorithms. These algorithms are developed by using gradient descent to minimize reformulation functions based on aggregation operators. An axiomatic approach provides conditions for selecting aggregation operators that lead to admissible reformulation functions. Minimization of admissible reformulation functions based on ordered weighted aggregation operators produces a family of soft LVQ and clustering algorithms, which includes fuzzy LVQ and clustering algorithms as special cases. The proposed LVQ and clustering algorithms are used to perform segmentation of magnetic resonance (MR) images of the brain. The diagnostic value of the segmented MR images provides the basis for evaluating a variety of ordered weighted LVQ and clustering algorithms.
Probabilistic Cosmological Mass Mapping from Weak Lensing Shear
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schneider, M. D.; Dawson, W. A.; Ng, K. Y.
2017-04-10
We infer gravitational lensing shear and convergence fields from galaxy ellipticity catalogs under a spatial process prior for the lensing potential. We demonstrate the performance of our algorithm with simulated Gaussian-distributed cosmological lensing shear maps and a reconstruction of the mass distribution of the merging galaxy cluster Abell 781 using galaxy ellipticities measured with the Deep Lens Survey. Given interim posterior samples of lensing shear or convergence fields on the sky, we describe an algorithm to infer cosmological parameters via lens field marginalization. In the most general formulation of our algorithm we make no assumptions about weak shear or Gaussian-distributedmore » shape noise or shears. Because we require solutions and matrix determinants of a linear system of dimension that scales with the number of galaxies, we expect our algorithm to require parallel high-performance computing resources for application to ongoing wide field lensing surveys.« less
Hierarchical Dirichlet process model for gene expression clustering
2013-01-01
Clustering is an important data processing tool for interpreting microarray data and genomic network inference. In this article, we propose a clustering algorithm based on the hierarchical Dirichlet processes (HDP). The HDP clustering introduces a hierarchical structure in the statistical model which captures the hierarchical features prevalent in biological data such as the gene express data. We develop a Gibbs sampling algorithm based on the Chinese restaurant metaphor for the HDP clustering. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces the unnecessary clustering fragments. PMID:23587447
Canonical PSO Based K-Means Clustering Approach for Real Datasets.
Dey, Lopamudra; Chakraborty, Sanjay
2014-01-01
"Clustering" the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms.
Oh, Jeongsu; Choi, Chi-Hwan; Park, Min-Kyu; Kim, Byung Kwon; Hwang, Kyuin; Lee, Sang-Heon; Hong, Soon Gyu; Nasir, Arshan; Cho, Wan-Sup; Kim, Kyung Mo
2016-01-01
High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to different organisms present in the environmental samples. Typically, analysis of microbial diversity in bioinformatics starts from pre-processing followed by clustering 16S rRNA reads into relatively fewer operational taxonomic units (OTUs). The OTUs are reliable indicators of microbial diversity and greatly accelerate the downstream analysis time. However, existing hierarchical clustering algorithms that are generally more accurate than greedy heuristic algorithms struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, which is the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology-a distributed data structure to store all data in the main memory of multiple computing nodes. The IMDG technology helps CLUSTOM-CLOUD to enhance both its capability of handling larger datasets and its computational scalability better than its ancestor, CLUSTOM, while maintaining high accuracy. Clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets using the small laboratory cluster (10 nodes) and under the Amazon EC2 cloud-computing environments. Under the laboratory environment, it required only ~3 hours to process dataset of size 200 K reads regardless of the complexity of the human microbiome data. In turn, one million reads were processed in approximately 20, 14, and 11 hours when utilizing 20, 30, and 40 nodes on the Amazon EC2 cloud-computing environment. The running time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is also a scalable distributed processing system. The comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm. CLUSTOM-CLOUD is written in JAVA and is freely available at http://clustomcloud.kopri.re.kr.
Park, Min-Kyu; Kim, Byung Kwon; Hwang, Kyuin; Lee, Sang-Heon; Hong, Soon Gyu; Nasir, Arshan; Cho, Wan-Sup; Kim, Kyung Mo
2016-01-01
High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to different organisms present in the environmental samples. Typically, analysis of microbial diversity in bioinformatics starts from pre-processing followed by clustering 16S rRNA reads into relatively fewer operational taxonomic units (OTUs). The OTUs are reliable indicators of microbial diversity and greatly accelerate the downstream analysis time. However, existing hierarchical clustering algorithms that are generally more accurate than greedy heuristic algorithms struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, which is the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology–a distributed data structure to store all data in the main memory of multiple computing nodes. The IMDG technology helps CLUSTOM-CLOUD to enhance both its capability of handling larger datasets and its computational scalability better than its ancestor, CLUSTOM, while maintaining high accuracy. Clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets using the small laboratory cluster (10 nodes) and under the Amazon EC2 cloud-computing environments. Under the laboratory environment, it required only ~3 hours to process dataset of size 200 K reads regardless of the complexity of the human microbiome data. In turn, one million reads were processed in approximately 20, 14, and 11 hours when utilizing 20, 30, and 40 nodes on the Amazon EC2 cloud-computing environment. The running time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is also a scalable distributed processing system. The comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm. CLUSTOM-CLOUD is written in JAVA and is freely available at http://clustomcloud.kopri.re.kr. PMID:26954507
Spatial-temporal travel pattern mining using massive taxi trajectory data
NASA Astrophysics Data System (ADS)
Zheng, Linjiang; Xia, Dong; Zhao, Xin; Tan, Longyou; Li, Hang; Chen, Li; Liu, Weining
2018-07-01
Deep understanding of residents' travel patterns would provide helpful insights into the mechanisms of many socioeconomic phenomena. With the rapid development of location-aware computing technologies, researchers have easy access to large quantities of travel data. As an important data source, taxi trajectory data are featured by their high quality, good continuity and wide distribution, making it suitable for travel pattern mining. In this paper, we use taxi trajectory data to study spatial-temporal characterization of urban residents' travel patterns from two aspects: attractive areas and hot paths. Firstly, a framework of trajectory preprocessing, including data cleaning and extracting the taxi passenger pick-up/drop-off points, is presented to reduce the noise and redundancy in raw trajectory data. Then, a grid density based clustering algorithm is proposed to discover travel attractive areas in different periods of a day. On this basis, we put forward a spatial-temporal trajectory clustering method to discover hot paths among travel attractive areas. Compared with previous algorithms, which only consider the spatial constraint between trajectories, temporal constraint is also considered in our method. Through the experiments, we discuss how to determine the optimal parameters of the two clustering algorithms and verify the effectiveness of the algorithms using real data. Furthermore, we analyze spatial-temporal characterization of Chongqing residents' travel pattern.
Halligan, Brian D.; Geiger, Joey F.; Vallejos, Andrew K.; Greene, Andrew S.; Twigger, Simon N.
2009-01-01
One of the major difficulties for many laboratories setting up proteomics programs has been obtaining and maintaining the computational infrastructure required for the analysis of the large flow of proteomics data. We describe a system that combines distributed cloud computing and open source software to allow laboratories to set up scalable virtual proteomics analysis clusters without the investment in computational hardware or software licensing fees. Additionally, the pricing structure of distributed computing providers, such as Amazon Web Services, allows laboratories or even individuals to have large-scale computational resources at their disposal at a very low cost per run. We provide detailed step by step instructions on how to implement the virtual proteomics analysis clusters as well as a list of current available preconfigured Amazon machine images containing the OMSSA and X!Tandem search algorithms and sequence databases on the Medical College of Wisconsin Proteomics Center website (http://proteomics.mcw.edu/vipdac). PMID:19358578
Halligan, Brian D; Geiger, Joey F; Vallejos, Andrew K; Greene, Andrew S; Twigger, Simon N
2009-06-01
One of the major difficulties for many laboratories setting up proteomics programs has been obtaining and maintaining the computational infrastructure required for the analysis of the large flow of proteomics data. We describe a system that combines distributed cloud computing and open source software to allow laboratories to set up scalable virtual proteomics analysis clusters without the investment in computational hardware or software licensing fees. Additionally, the pricing structure of distributed computing providers, such as Amazon Web Services, allows laboratories or even individuals to have large-scale computational resources at their disposal at a very low cost per run. We provide detailed step-by-step instructions on how to implement the virtual proteomics analysis clusters as well as a list of current available preconfigured Amazon machine images containing the OMSSA and X!Tandem search algorithms and sequence databases on the Medical College of Wisconsin Proteomics Center Web site ( http://proteomics.mcw.edu/vipdac ).
A hierarchical model for clustering m(6)A methylation peaks in MeRIP-seq data.
Cui, Xiaodong; Meng, Jia; Zhang, Shaowu; Rao, Manjeet K; Chen, Yidong; Huang, Yufei
2016-08-22
The recent advent of the state-of-art high throughput sequencing technology, known as Methylated RNA Immunoprecipitation combined with RNA sequencing (MeRIP-seq) revolutionizes the area of mRNA epigenetics and enables the biologists and biomedical researchers to have a global view of N (6)-Methyladenosine (m(6)A) on transcriptome. Yet there is a significant need for new computation tools for processing and analysing MeRIP-Seq data to gain a further insight into the function and m(6)A mRNA methylation. We developed a novel algorithm and an open source R package ( http://compgenomics.utsa.edu/metcluster ) for uncovering the potential types of m(6)A methylation by clustering the degree of m(6)A methylation peaks in MeRIP-Seq data. This algorithm utilizes a hierarchical graphical model to model the reads account variance and the underlying clusters of the methylation peaks. Rigorous statistical inference is performed to estimate the model parameter and detect the number of clusters. MeTCluster is evaluated on both simulated and real MeRIP-seq datasets and the results demonstrate its high accuracy in characterizing the clusters of methylation peaks. Our algorithm was applied to two different sets of real MeRIP-seq datasets and reveals a novel pattern that methylation peaks with less peak enrichment tend to clustered in the 5' end of both in both mRNAs and lncRNAs, whereas those with higher peak enrichment are more likely to be distributed in CDS and towards the 3'end of mRNAs and lncRNAs. This result might suggest that m(6)A's functions could be location specific. In this paper, a novel hierarchical graphical model based algorithm was developed for clustering the enrichment of methylation peaks in MeRIP-seq data. MeTCluster is written in R and is publicly available.
NASA Astrophysics Data System (ADS)
Li, Y. S.; Xu, C.; Hui, P. M.
2018-07-01
Multiple stable states, hysteresis, sensitivity to initial distributions, and a control algorithm for promoting cooperation are studied in an evolutionary prisoner's dilemma with agents connected into a regular random network. A system could evolve into states of different cooperative frequencies xc in different runs, even starting with the same initial cooperative frequency xc(in) and payoff parameters. For a large reward R, some values of xc(in) either take the system to a group of low cooperative frequency (LCF) states or to a few high cooperative frequency (HCF) states. These states differ by their network structures, with cooperative players connected into ring-like structure in LCF states and compact clusters in HCF states. Hysteresis in xc is observed when R is swept down and up, when the final state of the previous R is used as the initial state of the next R. The analysis led us to propose a closed pack cluster algorithm that gives HCF states effectively. The algorithm intervenes the system at some point in time by selectively switching some non-cooperative D-agents into cooperative C-agents at the peripheral of an existing cluster of C-agents. It ensures protection of a small C-cluster from which more cooperation can be induced. Practically, a governing body may first allow a society to evolve freely and then derive suitable policy to promote selected pockets of good practices for attaining a higher level of common good.
Collaborative Clustering for Sensor Networks
NASA Technical Reports Server (NTRS)
Wagstaff. Loro :/; Green Jillian; Lane, Terran
2011-01-01
Traditionally, nodes in a sensor network simply collect data and then pass it on to a centralized node that archives, distributes, and possibly analyzes the data. However, analysis at the individual nodes could enable faster detection of anomalies or other interesting events, as well as faster responses such as sending out alerts or increasing the data collection rate. There is an additional opportunity for increased performance if individual nodes can communicate directly with their neighbors. Previously, a method was developed by which machine learning classification algorithms could collaborate to achieve high performance autonomously (without requiring human intervention). This method worked for supervised learning algorithms, in which labeled data is used to train models. The learners collaborated by exchanging labels describing the data. The new advance enables clustering algorithms, which do not use labeled data, to also collaborate. This is achieved by defining a new language for collaboration that uses pair-wise constraints to encode useful information for other learners. These constraints specify that two items must, or cannot, be placed into the same cluster. Previous work has shown that clustering with these constraints (in isolation) already improves performance. In the problem formulation, each learner resides at a different node in the sensor network and makes observations (collects data) independently of the other learners. Each learner clusters its data and then selects a pair of items about which it is uncertain and uses them to query its neighbors. The resulting feedback (a must and cannot constraint from each neighbor) is combined by the learner into a consensus constraint, and it then reclusters its data while incorporating the new constraint. A strategy was also proposed for cleaning the resulting constraint sets, which may contain conflicting constraints; this improves performance significantly. This approach has been applied to collaborative clustering of seismic and infrasonic data collected by the Mount Erebus Volcano Observatory in Antarctica. Previous approaches to distributed clustering cannot readily be applied in a sensor network setting, because they assume that each node has the same view of the data set. A view is the set of features used to represent each object. When a single data set is partitioned across several computational nodes, distributed clustering works; all objects have the same view. But when the data is collected from different locations, using different sensors, a more flexible approach is needed. This approach instead operates in situations where the data collected at each node has a different view (e.g., seismic vs. infrasonic sensors), but they observe the same events. This enables them to exchange information about the likely cluster membership relations between objects, even if they do not use the same features to represent the objects.
Toward the optimization of normalized graph Laplacian.
Xie, Bo; Wang, Meng; Tao, Dacheng
2011-04-01
Normalized graph Laplacian has been widely used in many practical machine learning algorithms, e.g., spectral clustering and semisupervised learning. However, all of them use the Euclidean distance to construct the graph Laplacian, which does not necessarily reflect the inherent distribution of the data. In this brief, we propose a method to directly optimize the normalized graph Laplacian by using pairwise constraints. The learned graph is consistent with equivalence and nonequivalence pairwise relationships, and thus it can better represent similarity between samples. Meanwhile, our approach, unlike metric learning, automatically determines the scale factor during the optimization. The learned normalized Laplacian matrix can be directly applied in spectral clustering and semisupervised learning algorithms. Comprehensive experiments demonstrate the effectiveness of the proposed approach.
Hielscher, Andreas H; Bartel, Sebastian
2004-02-01
Optical tomography (OT) is a fast developing novel imaging modality that uses near-infrared (NIR) light to obtain cross-sectional views of optical properties inside the human body. A major challenge remains the time-consuming, computational-intensive image reconstruction problem that converts NIR transmission measurements into cross-sectional images. To increase the speed of iterative image reconstruction schemes that are commonly applied for OT, we have developed and implemented several parallel algorithms on a cluster of workstations. Static process distribution as well as dynamic load balancing schemes suitable for heterogeneous clusters and varying machine performances are introduced and tested. The resulting algorithms are shown to accelerate the reconstruction process to various degrees, substantially reducing the computation times for clinically relevant problems.
Clustering XML Documents Using Frequent Subtrees
NASA Astrophysics Data System (ADS)
Kutty, Sangeetha; Tran, Tien; Nayak, Richi; Li, Yuefeng
This paper presents an experimental study conducted over the INEX 2008 Document Mining Challenge corpus using both the structure and the content of XML documents for clustering them. The concise common substructures known as the closed frequent subtrees are generated using the structural information of the XML documents. The closed frequent subtrees are then used to extract the constrained content from the documents. A matrix containing the term distribution of the documents in the dataset is developed using the extracted constrained content. The k-way clustering algorithm is applied to the matrix to obtain the required clusters. In spite of the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based clustering approach was successful in clustering the documents. This approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy.
A hybrid monkey search algorithm for clustering analysis.
Chen, Xin; Zhou, Yongquan; Luo, Qifang
2014-01-01
Clustering is a popular data analysis and data mining technique. The k-means clustering algorithm is one of the most commonly used methods. However, it highly depends on the initial solution and is easy to fall into local optimum solution. In view of the disadvantages of the k-means method, this paper proposed a hybrid monkey algorithm based on search operator of artificial bee colony algorithm for clustering analysis and experiment on synthetic and real life datasets to show that the algorithm has a good performance than that of the basic monkey algorithm for clustering analysis.
Clustering for Binary Data Sets by Using Genetic Algorithm-Incremental K-means
NASA Astrophysics Data System (ADS)
Saharan, S.; Baragona, R.; Nor, M. E.; Salleh, R. M.; Asrah, N. M.
2018-04-01
This research was initially driven by the lack of clustering algorithms that specifically focus in binary data. To overcome this gap in knowledge, a promising technique for analysing this type of data became the main subject in this research, namely Genetic Algorithms (GA). For the purpose of this research, GA was combined with the Incremental K-means (IKM) algorithm to cluster the binary data streams. In GAIKM, the objective function was based on a few sufficient statistics that may be easily and quickly calculated on binary numbers. The implementation of IKM will give an advantage in terms of fast convergence. The results show that GAIKM is an efficient and effective new clustering algorithm compared to the clustering algorithms and to the IKM itself. In conclusion, the GAIKM outperformed other clustering algorithms such as GCUK, IKM, Scalable K-means (SKM) and K-means clustering and paves the way for future research involving missing data and outliers.
Canonical PSO Based K-Means Clustering Approach for Real Datasets
Dey, Lopamudra; Chakraborty, Sanjay
2014-01-01
“Clustering” the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms. PMID:27355083
SciSpark's SRDD : A Scientific Resilient Distributed Dataset for Multidimensional Data
NASA Astrophysics Data System (ADS)
Palamuttam, R. S.; Wilson, B. D.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; McGibbney, L. J.; Ramirez, P.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We have developed SciSpark, a robust Big Data framework, that extends ApacheTM Spark for scaling scientific computations. Apache Spark improves the map-reduce implementation in ApacheTM Hadoop for parallel computing on a cluster, by emphasizing in-memory computation, "spilling" to disk only as needed, and relying on lazy evaluation. Central to Spark is the Resilient Distributed Dataset (RDD), an in-memory distributed data structure that extends the functional paradigm provided by the Scala programming language. However, RDDs are ideal for tabular or unstructured data, and not for highly dimensional data. The SciSpark project introduces the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure which supports iterative scientific algorithms for multidimensional data. SciSpark processes data stored in NetCDF and HDF files by partitioning them across time or space and distributing the partitions among a cluster of compute nodes. We show usability and extensibility of SciSpark by implementing distributed algorithms for geospatial operations on large collections of multi-dimensional grids. In particular we address the problem of scaling an automated method for finding Mesoscale Convective Complexes. SciSpark provides a tensor interface to support the pluggability of different matrix libraries. We evaluate performance of the various matrix libraries in distributed pipelines, such as Nd4jTM and BreezeTM. We detail the architecture and design of SciSpark, our efforts to integrate climate science algorithms, parallel ingest and partitioning (sharding) of A-Train satellite observations from model grids. These solutions are encompassed in SciSpark, an open-source software framework for distributed computing on scientific data.
A method of operation scheduling based on video transcoding for cluster equipment
NASA Astrophysics Data System (ADS)
Zhou, Haojie; Yan, Chun
2018-04-01
Because of the cluster technology in real-time video transcoding device, the application of facing the massive growth in the number of video assignments and resolution and bit rate of diversity, task scheduling algorithm, and analyze the current mainstream of cluster for real-time video transcoding equipment characteristics of the cluster, combination with the characteristics of the cluster equipment task delay scheduling algorithm is proposed. This algorithm enables the cluster to get better performance in the generation of the job queue and the lower part of the job queue when receiving the operation instruction. In the end, a small real-time video transcode cluster is constructed to analyze the calculation ability, running time, resource occupation and other aspects of various algorithms in operation scheduling. The experimental results show that compared with traditional clustering task scheduling algorithm, task delay scheduling algorithm has more flexible and efficient characteristics.
[Cluster analysis in biomedical researches].
Akopov, A S; Moskovtsev, A A; Dolenko, S A; Savina, G D
2013-01-01
Cluster analysis is one of the most popular methods for the analysis of multi-parameter data. The cluster analysis reveals the internal structure of the data, group the separate observations on the degree of their similarity. The review provides a definition of the basic concepts of cluster analysis, and discusses the most popular clustering algorithms: k-means, hierarchical algorithms, Kohonen networks algorithms. Examples are the use of these algorithms in biomedical research.
Data depth based clustering analysis
Jeong, Myeong -Hun; Cai, Yaping; Sullivan, Clair J.; ...
2016-01-01
Here, this paper proposes a new algorithm for identifying patterns within data, based on data depth. Such a clustering analysis has an enormous potential to discover previously unknown insights from existing data sets. Many clustering algorithms already exist for this purpose. However, most algorithms are not affine invariant. Therefore, they must operate with different parameters after the data sets are rotated, scaled, or translated. Further, most clustering algorithms, based on Euclidean distance, can be sensitive to noises because they have no global perspective. Parameter selection also significantly affects the clustering results of each algorithm. Unlike many existing clustering algorithms, themore » proposed algorithm, called data depth based clustering analysis (DBCA), is able to detect coherent clusters after the data sets are affine transformed without changing a parameter. It is also robust to noises because using data depth can measure centrality and outlyingness of the underlying data. Further, it can generate relatively stable clusters by varying the parameter. The experimental comparison with the leading state-of-the-art alternatives demonstrates that the proposed algorithm outperforms DBSCAN and HDBSCAN in terms of affine invariance, and exceeds or matches the ro-bustness to noises of DBSCAN or HDBSCAN. The robust-ness to parameter selection is also demonstrated through the case study of clustering twitter data.« less
Clustering analysis of moving target signatures
NASA Astrophysics Data System (ADS)
Martone, Anthony; Ranney, Kenneth; Innocenti, Roberto
2010-04-01
Previously, we developed a moving target indication (MTI) processing approach to detect and track slow-moving targets inside buildings, which successfully detected moving targets (MTs) from data collected by a low-frequency, ultra-wideband radar. Our MTI algorithms include change detection, automatic target detection (ATD), clustering, and tracking. The MTI algorithms can be implemented in a real-time or near-real-time system; however, a person-in-the-loop is needed to select input parameters for the clustering algorithm. Specifically, the number of clusters to input into the cluster algorithm is unknown and requires manual selection. A critical need exists to automate all aspects of the MTI processing formulation. In this paper, we investigate two techniques that automatically determine the number of clusters: the adaptive knee-point (KP) algorithm and the recursive pixel finding (RPF) algorithm. The KP algorithm is based on a well-known heuristic approach for determining the number of clusters. The RPF algorithm is analogous to the image processing, pixel labeling procedure. Both algorithms are used to analyze the false alarm and detection rates of three operational scenarios of personnel walking inside wood and cinderblock buildings.
NASA Astrophysics Data System (ADS)
Li, Jingying; Bai, Lu; Wu, Zhensen; Guo, Lixin; Gong, Yanjun
2017-11-01
In this paper, diffusion limited aggregation (DLA) algorithm is improved to generate the alumina particle cluster with different radius of monomers in the plume. Scattering properties of these alumina clusters are solved by the multiple sphere T matrix method (MSTM). The effect of the number and radius of monomers on the scattering properties of clusters of alumina particles is discussed. The scattering properties of two types of alumina particle clusters are compared, one has different radius of monomers that follows lognormal probability distribution, another has the same radius of monomers that equals the mean of lognormal probability distribution. The result show that the scattering phase functions and linear polarization degrees of these two types of alumina particle clusters are of great differences. For the alumina clusters with different radius of monomers, the forward scatterings are bigger and the linear polarization degree has multiple peaks. Moreover, the vary of their scattering properties do not have strong correlative with the change of number of monomers. For larger booster motors, 25-38% of the plume being condensed alumina. The alumina can scatter radiation from other sources present in the plume and effect on radiation transfer characteristics of plume. In addition, the shape, size distribution and refractive index of the particles in the plume are estimated by linear polarization degree. Therefore, accurate scattering properties calculation is very important to decrease the deviation in the related research.
Application of hybrid clustering using parallel k-means algorithm and DIANA algorithm
NASA Astrophysics Data System (ADS)
Umam, Khoirul; Bustamam, Alhadi; Lestari, Dian
2017-03-01
DNA is one of the carrier of genetic information of living organisms. Encoding, sequencing, and clustering DNA sequences has become the key jobs and routine in the world of molecular biology, in particular on bioinformatics application. There are two type of clustering, hierarchical clustering and partitioning clustering. In this paper, we combined two type clustering i.e. K-Means (partitioning clustering) and DIANA (hierarchical clustering), therefore it called Hybrid clustering. Application of hybrid clustering using Parallel K-Means algorithm and DIANA algorithm used to clustering DNA sequences of Human Papillomavirus (HPV). The clustering process is started with Collecting DNA sequences of HPV are obtained from NCBI (National Centre for Biotechnology Information), then performing characteristics extraction of DNA sequences. The characteristics extraction result is store in a matrix form, then normalize this matrix using Min-Max normalization and calculate genetic distance using Euclidian Distance. Furthermore, the hybrid clustering is applied by using implementation of Parallel K-Means algorithm and DIANA algorithm. The aim of using Hybrid Clustering is to obtain better clusters result. For validating the resulted clusters, to get optimum number of clusters, we use Davies-Bouldin Index (DBI). In this study, the result of implementation of Parallel K-Means clustering is data clustered become 5 clusters with minimal IDB value is 0.8741, and Hybrid Clustering clustered data become 13 sub-clusters with minimal IDB values = 0.8216, 0.6845, 0.3331, 0.1994 and 0.3952. The IDB value of hybrid clustering less than IBD value of Parallel K-Means clustering only that perform at 1ts stage. Its means clustering using Hybrid Clustering have the better result to clustered DNA sequence of HPV than perform parallel K-Means Clustering only.
Clustering algorithm for determining community structure in large networks
NASA Astrophysics Data System (ADS)
Pujol, Josep M.; Béjar, Javier; Delgado, Jordi
2006-07-01
We propose an algorithm to find the community structure in complex networks based on the combination of spectral analysis and modularity optimization. The clustering produced by our algorithm is as accurate as the best algorithms on the literature of modularity optimization; however, the main asset of the algorithm is its efficiency. The best match for our algorithm is Newman’s fast algorithm, which is the reference algorithm for clustering in large networks due to its efficiency. When both algorithms are compared, our algorithm outperforms the fast algorithm both in efficiency and accuracy of the clustering, in terms of modularity. Thus, the results suggest that the proposed algorithm is a good choice to analyze the community structure of medium and large networks in the range of tens and hundreds of thousand vertices.
Cota-Ruiz, Juan; Rosiles, Jose-Gerardo; Sifuentes, Ernesto; Rivas-Perea, Pablo
2012-01-01
This research presents a distributed and formula-based bilateration algorithm that can be used to provide initial set of locations. In this scheme each node uses distance estimates to anchors to solve a set of circle-circle intersection (CCI) problems, solved through a purely geometric formulation. The resulting CCIs are processed to pick those that cluster together and then take the average to produce an initial node location. The algorithm is compared in terms of accuracy and computational complexity with a Least-Squares localization algorithm, based on the Levenberg-Marquardt methodology. Results in accuracy vs. computational performance show that the bilateration algorithm is competitive compared with well known optimized localization algorithms.
NASA Astrophysics Data System (ADS)
Lin, Yen-Ting; Hsieh, Bau-Ching; Lin, Sheng-Chieh; Oguri, Masamune; Chen, Kai-Feng; Tanaka, Masayuki; Chiu, I.-Non; Huang, Song; Kodama, Tadayuki; Leauthaud, Alexie; More, Surhud; Nishizawa, Atsushi J.; Bundy, Kevin; Lin, Lihwai; Miyazaki, Satoshi
2017-12-01
The unprecedented depth and area surveyed by the Subaru Strategic Program with the Hyper Suprime-Cam (HSC-SSP) have enabled us to construct and publish the largest distant cluster sample out to z∼ 1 to date. In this exploratory study of cluster galaxy evolution from z = 1 to z = 0.3, we investigate the stellar mass assembly history of brightest cluster galaxies (BCGs), the evolution of stellar mass and luminosity distributions, the stellar mass surface density profile, as well as the population of radio galaxies. Our analysis is the first high-redshift application of the top N richest cluster selection, which is shown to allow us to trace the cluster galaxy evolution faithfully. Over the 230 deg2 area of the current HSC-SSP footprint, selecting the top 100 clusters in each of the four redshift bins allows us to observe the buildup of galaxy population in descendants of clusters whose z≈ 1 mass is about 2× {10}14 {M}ȯ . Our stellar mass is derived from a machine-learning algorithm, which is found to be unbiased and accurate with respect to the COSMOS data. We find very mild stellar mass growth in BCGs (about 35% between z = 1 and 0.3), and no evidence for evolution in both the total stellar mass–cluster mass correlation and the shape of the stellar mass surface density profile. We also present the first measurement of the radio luminosity distribution in clusters out to z∼ 1, and show hints of changes in the dominant accretion mode powering the cluster radio galaxies at z∼ 0.8.
NASA Technical Reports Server (NTRS)
Lennington, R. K.; Johnson, J. K.
1979-01-01
An efficient procedure which clusters data using a completely unsupervised clustering algorithm and then uses labeled pixels to label the resulting clusters or perform a stratified estimate using the clusters as strata is developed. Three clustering algorithms, CLASSY, AMOEBA, and ISOCLS, are compared for efficiency. Three stratified estimation schemes and three labeling schemes are also considered and compared.
`Inter-Arrival Time' Inspired Algorithm and its Application in Clustering and Molecular Phylogeny
NASA Astrophysics Data System (ADS)
Kolekar, Pandurang S.; Kale, Mohan M.; Kulkarni-Kale, Urmila
2010-10-01
Bioinformatics, being multidisciplinary field, involves applications of various methods from allied areas of Science for data mining using computational approaches. Clustering and molecular phylogeny is one of the key areas in Bioinformatics, which help in study of classification and evolution of organisms. Molecular phylogeny algorithms can be divided into distance based and character based methods. But most of these methods are dependent on pre-alignment of sequences and become computationally intensive with increase in size of data and hence demand alternative efficient approaches. `Inter arrival time distribution' (IATD) is a popular concept in the theory of stochastic system modeling but its potential in molecular data analysis has not been fully explored. The present study reports application of IATD in Bioinformatics for clustering and molecular phylogeny. The proposed method provides IATDs of nucleotides in genomic sequences. The distance function based on statistical parameters of IATDs is proposed and distance matrix thus obtained is used for the purpose of clustering and molecular phylogeny. The method is applied on a dataset of 3' non-coding region sequences (NCR) of Dengue virus type 3 (DENV-3), subtype III, reported in 2008. The phylogram thus obtained revealed the geographical distribution of DENV-3 isolates. Sri Lankan DENV-3 isolates were further observed to be clustered in two sub-clades corresponding to pre and post Dengue hemorrhagic fever emergence groups. These results are consistent with those reported earlier, which are obtained using pre-aligned sequence data as an input. These findings encourage applications of the IATD based method in molecular phylogenetic analysis in particular and data mining in general.
Clusternomics: Integrative context-dependent clustering for heterogeneous datasets
Wernisch, Lorenz
2017-01-01
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190
Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.
Gabasova, Evelina; Reid, John; Wernisch, Lorenz
2017-10-01
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.
CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks.
Li, Min; Li, Dongyan; Tang, Yu; Wu, Fangxiang; Wang, Jianxin
2017-08-31
Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a cytoscape plugin integrating six clustering algorithms, HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), and BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster.
CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks
Li, Min; Li, Dongyan; Tang, Yu; Wang, Jianxin
2017-01-01
Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a cytoscape plugin integrating six clustering algorithms, HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), and BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster. PMID:28858211
Multi-Parent Clustering Algorithms from Stochastic Grammar Data Models
NASA Technical Reports Server (NTRS)
Mjoisness, Eric; Castano, Rebecca; Gray, Alexander
1999-01-01
We introduce a statistical data model and an associated optimization-based clustering algorithm which allows data vectors to belong to zero, one or several "parent" clusters. For each data vector the algorithm makes a discrete decision among these alternatives. Thus, a recursive version of this algorithm would place data clusters in a Directed Acyclic Graph rather than a tree. We test the algorithm with synthetic data generated according to the statistical data model. We also illustrate the algorithm using real data from large-scale gene expression assays.
Fast detection of the fuzzy communities based on leader-driven algorithm
NASA Astrophysics Data System (ADS)
Fang, Changjian; Mu, Dejun; Deng, Zhenghong; Hu, Jun; Yi, Chen-He
2018-03-01
In this paper, we present the leader-driven algorithm (LDA) for learning community structure in networks. The algorithm allows one to find overlapping clusters in a network, an important aspect of real networks, especially social networks. The algorithm requires no input parameters and learns the number of clusters naturally from the network. It accomplishes this using leadership centrality in a clever manner. It identifies local minima of leadership centrality as followers which belong only to one cluster, and the remaining nodes are leaders which connect clusters. In this way, the number of clusters can be learned using only the network structure. The LDA is also an extremely fast algorithm, having runtime linear in the network size. Thus, this algorithm can be used to efficiently cluster extremely large networks.
NASA Astrophysics Data System (ADS)
Ebeling, H.; Edge, A. C.; Bohringer, H.; Allen, S. W.; Crawford, C. S.; Fabian, A. C.; Voges, W.; Huchra, J. P.
1998-12-01
We present a 90 per cent flux-complete sample of the 201 X-ray-brightest clusters of galaxies in the northern hemisphere (delta>=0 deg), at high Galactic latitudes (|b|>=20 deg), with measured redshifts z<=0.3 and fluxes higher than 4.4x10^-12 erg cm^-2 s^-1 in the 0.1-2.4 keV band. The sample, called the ROSAT Brightest Cluster Sample (BCS), is selected from ROSAT All-Sky Survey data and is the largest X-ray-selected cluster sample compiled to date. In addition to Abell clusters, which form the bulk of the sample, the BCS also contains the X-ray-brightest Zwicky clusters and other clusters selected from their X-ray properties alone. Effort has been made to ensure the highest possible completeness of the sample and the smallest possible contamination by non-cluster X-ray sources. X-ray fluxes are computed using an algorithm tailored for the detection and characterization of X-ray emission from galaxy clusters. These fluxes are accurate to better than 15 per cent (mean 1sigma error). We find the cumulative logN-logS distribution of clusters to follow a power law kappa S^alpha with alpha=1.31^+0.06_-0.03 (errors are the 10th and 90th percentiles) down to fluxes of 2x10^-12 erg cm^-2 s^-1, i.e. considerably below the BCS flux limit. Although our best-fitting slope disagrees formally with the canonical value of -1.5 for a Euclidean distribution, the BCS logN-logS distribution is consistent with a non-evolving cluster population if cosmological effects are taken into account. Our sample will allow us to examine large-scale structure in the northern hemisphere, determine the spatial cluster-cluster correlation function, investigate correlations between the X-ray and optical properties of the clusters, establish the X-ray luminosity function for galaxy clusters, and discuss the implications of the results for cluster evolution.
NASA Astrophysics Data System (ADS)
Majer, C. L.; Meyer, S.; Konrad, S.; Sarli, E.; Bartelmann, M.
2016-07-01
This paper continues a series in which we intend to show how all observables of galaxy clusters can be combined to recover the two-dimensional, projected gravitational potential of individual clusters. Our goal is to develop a non-parametric algorithm for joint cluster reconstruction taking all cluster observables into account. For this reason we focus on the line-of-sight projected gravitational potential, proportional to the lensing potential, in order to extend existing reconstruction algorithms. In this paper, we begin with the relation between the Compton-y parameter and the Newtonian gravitational potential, assuming hydrostatic equilibrium and a polytropic stratification of the intracluster gas. Extending our first publication we now consider a spheroidal rather than a spherical cluster symmetry. We show how a Richardson-Lucy deconvolution can be used to convert the intensity change of the CMB due to the thermal Sunyaev-Zel'dovich effect into an estimate for the two-dimensional gravitational potential. We apply our reconstruction method to a cluster based on an N-body/hydrodynamical simulation processed with the characteristics (resolution and noise) of the ALMA interferometer for which we achieve a relative error of ≲20 per cent for a large fraction of the virial radius. We further apply our method to an observation of the galaxy cluster RXJ1347 for which we can reconstruct the potential with a relative error of ≲20 per cent for the observable cluster range.
Mohanasundaram, Ranganathan; Periasamy, Pappampalayam Sanmugam
2015-01-01
The current high profile debate with regard to data storage and its growth have become strategic task in the world of networking. It mainly depends on the sensor nodes called producers, base stations, and also the consumers (users and sensor nodes) to retrieve and use the data. The main concern dealt here is to find an optimal data storage position in wireless sensor networks. The works that have been carried out earlier did not utilize swarm intelligence based optimization approaches to find the optimal data storage positions. To achieve this goal, an efficient swam intelligence approach is used to choose suitable positions for a storage node. Thus, hybrid particle swarm optimization algorithm has been used to find the suitable positions for storage nodes while the total energy cost of data transmission is minimized. Clustering-based distributed data storage is utilized to solve clustering problem using fuzzy-C-means algorithm. This research work also considers the data rates and locations of multiple producers and consumers to find optimal data storage positions. The algorithm is implemented in a network simulator and the experimental results show that the proposed clustering and swarm intelligence based ODS strategy is more effective than the earlier approaches.
Genetic algorithm enhanced by machine learning in dynamic aperture optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Yongjun; Cheng, Weixing; Yu, Li Hua
With the aid of machine learning techniques, the genetic algorithm has been enhanced and applied to the multi-objective optimization problem presented by the dynamic aperture of the National Synchrotron Light Source II (NSLS-II) Storage Ring. During the evolution processes employed by the genetic algorithm, the population is classified into different clusters in the search space. The clusters with top average fitness are given “elite” status. Intervention on the population is implemented by repopulating some potentially competitive candidates based on the experience learned from the accumulated data. These candidates replace randomly selected candidates among the original data pool. The average fitnessmore » of the population is therefore improved while diversity is not lost. Maintaining diversity ensures that the optimization is global rather than local. The quality of the population increases and produces more competitive descendants accelerating the evolution process significantly. When identifying the distribution of optimal candidates, they appear to be located in isolated islands within the search space. Some of these optimal candidates have been experimentally confirmed at the NSLS-II storage ring. Furthermore, the machine learning techniques that exploit the genetic algorithm can also be used in other population-based optimization problems such as particle swarm algorithm.« less
Genetic algorithm enhanced by machine learning in dynamic aperture optimization
NASA Astrophysics Data System (ADS)
Li, Yongjun; Cheng, Weixing; Yu, Li Hua; Rainer, Robert
2018-05-01
With the aid of machine learning techniques, the genetic algorithm has been enhanced and applied to the multi-objective optimization problem presented by the dynamic aperture of the National Synchrotron Light Source II (NSLS-II) Storage Ring. During the evolution processes employed by the genetic algorithm, the population is classified into different clusters in the search space. The clusters with top average fitness are given "elite" status. Intervention on the population is implemented by repopulating some potentially competitive candidates based on the experience learned from the accumulated data. These candidates replace randomly selected candidates among the original data pool. The average fitness of the population is therefore improved while diversity is not lost. Maintaining diversity ensures that the optimization is global rather than local. The quality of the population increases and produces more competitive descendants accelerating the evolution process significantly. When identifying the distribution of optimal candidates, they appear to be located in isolated islands within the search space. Some of these optimal candidates have been experimentally confirmed at the NSLS-II storage ring. The machine learning techniques that exploit the genetic algorithm can also be used in other population-based optimization problems such as particle swarm algorithm.
Genetic algorithm enhanced by machine learning in dynamic aperture optimization
Li, Yongjun; Cheng, Weixing; Yu, Li Hua; ...
2018-05-29
With the aid of machine learning techniques, the genetic algorithm has been enhanced and applied to the multi-objective optimization problem presented by the dynamic aperture of the National Synchrotron Light Source II (NSLS-II) Storage Ring. During the evolution processes employed by the genetic algorithm, the population is classified into different clusters in the search space. The clusters with top average fitness are given “elite” status. Intervention on the population is implemented by repopulating some potentially competitive candidates based on the experience learned from the accumulated data. These candidates replace randomly selected candidates among the original data pool. The average fitnessmore » of the population is therefore improved while diversity is not lost. Maintaining diversity ensures that the optimization is global rather than local. The quality of the population increases and produces more competitive descendants accelerating the evolution process significantly. When identifying the distribution of optimal candidates, they appear to be located in isolated islands within the search space. Some of these optimal candidates have been experimentally confirmed at the NSLS-II storage ring. Furthermore, the machine learning techniques that exploit the genetic algorithm can also be used in other population-based optimization problems such as particle swarm algorithm.« less
Topology control algorithm for wireless sensor networks based on Link forwarding
NASA Astrophysics Data System (ADS)
Pucuo, Cairen; Qi, Ai-qin
2018-03-01
The research of topology control could effectively save energy and increase the service life of network based on wireless sensor. In this paper, a arithmetic called LTHC (link transmit hybrid clustering) based on link transmit is proposed. It decreases expenditure of energy by changing the way of cluster-node’s communication. The idea is to establish a link between cluster and SINK node when the cluster is formed, and link-node must be non-cluster. Through the link, cluster sends information to SINK nodes. For the sake of achieving the uniform distribution of energy on the network, prolongate the network survival time, and improve the purpose of communication, the communication will cut down much more expenditure of energy for cluster which away from SINK node. In the two aspects of improving the traffic and network survival time, we find that the LTCH is far superior to the traditional LEACH by experiments.
NASA Astrophysics Data System (ADS)
Ekin, Ahmet; Jasinschi, Radu; van der Grond, Jeroen; van Buchem, Mark A.; van Muiswinkel, Arianne
2006-03-01
This paper introduces image processing methods to automatically detect the 3D volume-of-interest (VOI) and 2D region-of-interest (ROI) for deep gray matter organs (thalamus, globus pallidus, putamen, and caudate nucleus) of patients with suspected iron deposition from MR dual echo images. Prior to the VOI and ROI detection, cerebrospinal fluid (CSF) region is segmented by a clustering algorithm. For the segmentation, we automatically determine the cluster centers with the mean shift algorithm that can quickly identify the modes of a distribution. After the identification of the modes, we employ the K-Harmonic means clustering algorithm to segment the volumetric MR data into CSF and non-CSF. Having the CSF mask and observing that the frontal lobe of the lateral ventricle has more consistent shape accross age and pathological abnormalities, we propose a shape-directed landmark detection algorithm to detect the VOI in a speedy manner. The proposed landmark detection algorithm utilizes a novel shape model of the front lobe of the lateral ventricle for the slices where thalamus, globus pallidus, putamen, and caudate nucleus are expected to appear. After this step, for each slice in the VOI, we use horizontal and vertical projections of the CSF map to detect the approximate locations of the relevant organs to define the ROI. We demonstrate the robustness of the proposed VOI and ROI localization algorithms to pathologies, including severe amounts of iron accumulation as well as white matter lesions, and anatomical variations. The proposed algorithms achieved very high detection accuracy, 100% in the VOI detection , over a large set of a challenging MR dataset.
Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm
Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong
2016-01-01
In single-particle cryo-electron microscopy (cryo-EM), K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development on clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alterative to the traditional K-means algorithm in single-particle cryo-EM analysis. PMID:27959895
Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm.
Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong
2016-01-01
In single-particle cryo-electron microscopy (cryo-EM), K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development on clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alterative to the traditional K-means algorithm in single-particle cryo-EM analysis.
FTUC: A Flooding Tree Uneven Clustering Protocol for a Wireless Sensor Network.
He, Wei; Pillement, Sebastien; Xu, Du
2017-11-23
Clustering is an efficient approach in a wireless sensor network (WSN) to reduce the energy consumption of nodes and to extend the lifetime of the network. Unfortunately, this approach requires that all cluster heads (CHs) transmit their data to the base station (BS), which gives rise to the long distance communications problem, and in multi-hop routing, the CHs near the BS have to forward data from other nodes that lead those CHs to die prematurely, creating the hot zones problem. Unequal clustering has been proposed to solve these problems. Most of the current algorithms elect CH only by considering their competition radius, leading to unevenly distributed cluster heads. Furthermore, global distances values are needed when calculating the competition radius, which is a tedious task in large networks. To face these problems, we propose a flooding tree uneven clustering protocol (FTUC) suited for large networks. Based on the construction of a tree type sub-network to calculate the minimum and maximum distances values of the network, we then apply the unequal cluster theory. We also introduce referenced position circles to evenly elect cluster heads. Therefore, cluster heads are elected depending on the node's residual energy and their distance to a referenced circle. FTUC builds the best inter-cluster communications route by evaluating a cluster head cost function to find the best next hop to the BS. The simulation results show that the FTUC algorithm decreases the energy consumption of the nodes and balances the global energy consumption effectively, thus extending the lifetime of the network.
A Distributed Compressive Sensing Scheme for Event Capture in Wireless Visual Sensor Networks
NASA Astrophysics Data System (ADS)
Hou, Meng; Xu, Sen; Wu, Weiling; Lin, Fei
2018-01-01
Image signals which acquired by wireless visual sensor network can be used for specific event capture. This event capture is realized by image processing at the sink node. A distributed compressive sensing scheme is used for the transmission of these image signals from the camera nodes to the sink node. A measurement and joint reconstruction algorithm for these image signals are proposed in this paper. Make advantage of spatial correlation between images within a sensing area, the cluster head node which as the image decoder can accurately co-reconstruct these image signals. The subjective visual quality and the reconstruction error rate are used for the evaluation of reconstructed image quality. Simulation results show that the joint reconstruction algorithm achieves higher image quality at the same image compressive rate than the independent reconstruction algorithm.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Berlind, Andreas A.; Frieman, Joshua A.; Weinberg, David H.
2006-01-01
We identify galaxy groups and clusters in volume-limited samples of the SDSS redshift survey, using a redshift-space friends-of-friends algorithm. We optimize the friends-of-friends linking lengths to recover galaxy systems that occupy the same dark matter halos, using a set of mock catalogs created by populating halos of N-body simulations with galaxies. Extensive tests with these mock catalogs show that no combination of perpendicular and line-of-sight linking lengths is able to yield groups and clusters that simultaneously recover the true halo multiplicity function, projected size distribution, and velocity dispersion. We adopt a linking length combination that yields, for galaxy groups withmore » ten or more members: a group multiplicity function that is unbiased with respect to the true halo multiplicity function; an unbiased median relation between the multiplicities of groups and their associated halos; a spurious group fraction of less than {approx}1%; a halo completeness of more than {approx}97%; the correct projected size distribution as a function of multiplicity; and a velocity dispersion distribution that is {approx}20% too low at all multiplicities. These results hold over a range of mock catalogs that use different input recipes of populating halos with galaxies. We apply our group-finding algorithm to the SDSS data and obtain three group and cluster catalogs for three volume-limited samples that cover 3495.1 square degrees on the sky. We correct for incompleteness caused by fiber collisions and survey edges, and obtain measurements of the group multiplicity function, with errors calculated from realistic mock catalogs. These multiplicity function measurements provide a key constraint on the relation between galaxy populations and dark matter halos.« less
GDPC: Gravitation-based Density Peaks Clustering algorithm
NASA Astrophysics Data System (ADS)
Jiang, Jianhua; Hao, Dehao; Chen, Yujun; Parmar, Milan; Li, Keqin
2018-07-01
The Density Peaks Clustering algorithm, which we refer to as DPC, is a novel and efficient density-based clustering approach, and it is published in Science in 2014. The DPC has advantages of discovering clusters with varying sizes and varying densities, but has some limitations of detecting the number of clusters and identifying anomalies. We develop an enhanced algorithm with an alternative decision graph based on gravitation theory and nearby distance to identify centroids and anomalies accurately. We apply our method to some UCI and synthetic data sets. We report comparative clustering performances using F-Measure and 2-dimensional vision. We also compare our method to other clustering algorithms, such as K-Means, Affinity Propagation (AP) and DPC. We present F-Measure scores and clustering accuracies of our GDPC algorithm compared to K-Means, AP and DPC on different data sets. We show that the GDPC has the superior performance in its capability of: (1) detecting the number of clusters obviously; (2) aggregating clusters with varying sizes, varying densities efficiently; (3) identifying anomalies accurately.
Investigating the role of small vent volcanism during the development of Tharsis Province, Mars
NASA Astrophysics Data System (ADS)
Richardson, J. A.; Bleacher, J. E.; Connor, C.; Connor, L.; Glaze, L. S.
2014-12-01
Clusters of tens to hundreds of small volcanic vents have recently been recognized as a major component of Tharsis Province volcanism. These volcanic fields are formed from distributed-style, possibly monogenetic, volcanism and are composed of low sloped edifices with diameters of tens of kilometers and heights of tens to hundreds of meters. We report a new catalog of these small volcanic vents, now available through the USGS Astrogeology Science Center. This catalog was created with the use of gridded topographic data from the Mars Orbiter Laser Altimeter (MOLA) and images from the Thermal Emission Imaging System (THEMIS) and the High Resolution Stereo Camera (HRSC). We are now investigating isolated clusters of distributed volcanism in Tharsis with this dataset. We hypothesize that these clusters are formed from significant magmatic events that played a large role in the development of Tharsis. Currently, the catalog contains 1075 unique volcanic vents in the Tharsis Province. With the catalog, potentially isolated volcano clusters are identified with vent density estimation. Vent intensity for clusters is found to be 1 vent per 1000 sq km or less. Crater retention rates for one such cluster, Syria Planum, indicates that these distributed volcanic systems might continue as long as 700 Ma, or that monogenetic volcanic systems overprint older systems. Using a modified basal outlining algorithm with MOLA gridded data, shield volumes are found to be between 1-20 cubic km. Current results show distributed-style volcanism occuring in Tharsis orders of magnitude more dispersed than analogous volcano clusers on Earth, while individual edifices are found to be an order of magnitude larger than volcanoes in Earth clusters. Proof of concept results are reported for three identified clusters: Arsia Mons Caldera, Syria Planum, and Southern Pavonis Mons.
Bacciu, Davide; Starita, Antonina
2008-11-01
Determining a compact neural coding for a set of input stimuli is an issue that encompasses several biological memory mechanisms as well as various artificial neural network models. In particular, establishing the optimal network structure is still an open problem when dealing with unsupervised learning models. In this paper, we introduce a novel learning algorithm, named competitive repetition-suppression (CoRe) learning, inspired by a cortical memory mechanism called repetition suppression (RS). We show how such a mechanism is used, at various levels of the cerebral cortex, to generate compact neural representations of the visual stimuli. From the general CoRe learning model, we derive a clustering algorithm, named CoRe clustering, that can automatically estimate the unknown cluster number from the data without using a priori information concerning the input distribution. We illustrate how CoRe clustering, besides its biological plausibility, posses strong theoretical properties in terms of robustness to noise and outliers, and we provide an error function describing CoRe learning dynamics. Such a description is used to analyze CoRe relationships with the state-of-the art clustering models and to highlight CoRe similitude with rival penalized competitive learning (RPCL), showing how CoRe extends such a model by strengthening the rival penalization estimation by means of loss functions from robust statistics.
Mining the National Career Assessment Examination Result Using Clustering Algorithm
NASA Astrophysics Data System (ADS)
Pagudpud, M. V.; Palaoag, T. T.; Padirayon, L. M.
2018-03-01
Education is an essential process today which elicits authorities to discover and establish innovative strategies for educational improvement. This study applied data mining using clustering technique for knowledge extraction from the National Career Assessment Examination (NCAE) result in the Division of Quirino. The NCAE is an examination given to all grade 9 students in the Philippines to assess their aptitudes in the different domains. Clustering the students is helpful in identifying students’ learning considerations. With the use of the RapidMiner tool, clustering algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means, k-medoid, expectation maximization clustering, and support vector clustering algorithms were analyzed. The silhouette indexes of the said clustering algorithms were compared, and the result showed that the k-means algorithm with k = 3 and silhouette index equal to 0.196 is the most appropriate clustering algorithm to group the students. Three groups were formed having 477 students in the determined group (cluster 0), 310 proficient students (cluster 1) and 396 developing students (cluster 2). The data mining technique used in this study is essential in extracting useful information from the NCAE result to better understand the abilities of students which in turn is a good basis for adopting teaching strategies.
Automatic Clustering Using Multi-objective Particle Swarm and Simulated Annealing
Abubaker, Ahmad; Baharum, Adam; Alrefaei, Mahmoud
2015-01-01
This paper puts forward a new automatic clustering algorithm based on Multi-Objective Particle Swarm Optimization and Simulated Annealing, “MOPSOSA”. The proposed algorithm is capable of automatic clustering which is appropriate for partitioning datasets to a suitable number of clusters. MOPSOSA combines the features of the multi-objective based particle swarm optimization (PSO) and the Multi-Objective Simulated Annealing (MOSA). Three cluster validity indices were optimized simultaneously to establish the suitable number of clusters and the appropriate clustering for a dataset. The first cluster validity index is centred on Euclidean distance, the second on the point symmetry distance, and the last cluster validity index is based on short distance. A number of algorithms have been compared with the MOPSOSA algorithm in resolving clustering problems by determining the actual number of clusters and optimal clustering. Computational experiments were carried out to study fourteen artificial and five real life datasets. PMID:26132309
A new approach for the assessment of temporal clustering of extratropical wind storms
NASA Astrophysics Data System (ADS)
Schuster, Mareike; Eddounia, Fadoua; Kuhnel, Ivan; Ulbrich, Uwe
2017-04-01
A widely-used methodology to assess the clustering of storms in a region is based on dispersion statistics of a simple homogeneous Poisson process. This clustering measure is determined by the ratio of the variance and the mean of the local storm statistics per grid point. Resulting values larger than 1, i.e. when the variance is larger than the mean, indicate clustering; while values lower than 1 indicate a sequencing of storms that is more regular than a random process. However, a disadvantage of this methodology is that the characteristics are valid for a pre-defined climatological time period, and it is not possible to identify a temporal variability of clustering. Also, the absolute value of the dispersion statistics is not particularly intuitive. We have developed an approach to describe temporal clustering of storms which offers a more intuitive comprehension, and at the same time allows to assess temporal variations. The approach is based on the local distribution of waiting times between the occurrence of two individual storm events, the former being computed through the post-processing of individual windstorm tracks which in turn are obtained by an objective tracking algorithm. Based on this distribution a threshold can be set, either by the waiting time expected from a random process or by a quantile of the observed distribution. Thus, it can be determined if two consecutive wind storm events count as part of a (temporal) cluster. We analyze extratropical wind storms in a reanalysis dataset and compare the results of the traditional clustering measure with our new methodology. We assess what range of clustering events (in terms of duration and frequency) is covered and identify if the historically known clustered seasons are detectable by the new clustering measure in the reanalysis.
NASA Astrophysics Data System (ADS)
Gong, Lina; Xu, Tao; Zhang, Wei; Li, Xuhong; Wang, Xia; Pan, Wenwen
2017-03-01
The traditional microblog recommendation algorithm has the problems of low efficiency and modest effect in the era of big data. In the aim of solving these issues, this paper proposed a mixed recommendation algorithm with user clustering. This paper first introduced the situation of microblog marketing industry. Then, this paper elaborates the user interest modeling process and detailed advertisement recommendation methods. Finally, this paper compared the mixed recommendation algorithm with the traditional classification algorithm and mixed recommendation algorithm without user clustering. The results show that the mixed recommendation algorithm with user clustering has good accuracy and recall rate in the microblog advertisements promotion.
Procedure of Partitioning Data Into Number of Data Sets or Data Group - A Review
NASA Astrophysics Data System (ADS)
Kim, Tai-Hoon
The goal of clustering is to decompose a dataset into similar groups based on a objective function. Some already well established clustering algorithms are there for data clustering. Objective of these data clustering algorithms are to divide the data points of the feature space into a number of groups (or classes) so that a predefined set of criteria are satisfied. The article considers the comparative study about the effectiveness and efficiency of traditional data clustering algorithms. For evaluating the performance of the clustering algorithms, Minkowski score is used here for different data sets.
Android Malware Classification Using K-Means Clustering Algorithm
NASA Astrophysics Data System (ADS)
Hamid, Isredza Rahmi A.; Syafiqah Khalid, Nur; Azma Abdullah, Nurul; Rahman, Nurul Hidayah Ab; Chai Wen, Chuah
2017-08-01
Malware was designed to gain access or damage a computer system without user notice. Besides, attacker exploits malware to commit crime or fraud. This paper proposed Android malware classification approach based on K-Means clustering algorithm. We evaluate the proposed model in terms of accuracy using machine learning algorithms. Two datasets were selected to demonstrate the practicing of K-Means clustering algorithms that are Virus Total and Malgenome dataset. We classify the Android malware into three clusters which are ransomware, scareware and goodware. Nine features were considered for each types of dataset such as Lock Detected, Text Detected, Text Score, Encryption Detected, Threat, Porn, Law, Copyright and Moneypak. We used IBM SPSS Statistic software for data classification and WEKA tools to evaluate the built cluster. The proposed K-Means clustering algorithm shows promising result with high accuracy when tested using Random Forest algorithm.
Distributed Unmixing of Hyperspectral Datawith Sparsity Constraint
NASA Astrophysics Data System (ADS)
Khoshsokhan, S.; Rajabi, R.; Zayyani, H.
2017-09-01
Spectral unmixing (SU) is a data processing problem in hyperspectral remote sensing. The significant challenge in the SU problem is how to identify endmembers and their weights, accurately. For estimation of signature and fractional abundance matrices in a blind problem, nonnegative matrix factorization (NMF) and its developments are used widely in the SU problem. One of the constraints which was added to NMF is sparsity constraint that was regularized by L1/2 norm. In this paper, a new algorithm based on distributed optimization has been used for spectral unmixing. In the proposed algorithm, a network including single-node clusters has been employed. Each pixel in hyperspectral images considered as a node in this network. The distributed unmixing with sparsity constraint has been optimized with diffusion LMS strategy, and then the update equations for fractional abundance and signature matrices are obtained. Simulation results based on defined performance metrics, illustrate advantage of the proposed algorithm in spectral unmixing of hyperspectral data compared with other methods. The results show that the AAD and SAD of the proposed approach are improved respectively about 6 and 27 percent toward distributed unmixing in SNR=25dB.
Computation of neutron fluxes in clusters of fuel pins arranged in hexagonal assemblies (2D and 3D)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prabha, H.; Marleau, G.
2012-07-01
For computations of fluxes, we have used Carvik's method of collision probabilities. This method requires tracking algorithms. An algorithm to compute tracks (in 2D and 3D) has been developed for seven hexagonal geometries with cluster of fuel pins. This has been implemented in the NXT module of the code DRAGON. The flux distribution in cluster of pins has been computed by using this code. For testing the results, they are compared when possible with the EXCELT module of the code DRAGON. Tracks are plotted in the NXT module by using MATLAB, these plots are also presented here. Results are presentedmore » with increasing number of lines to show the convergence of these results. We have numerically computed volumes, surface areas and the percentage errors in these computations. These results show that 2D results converge faster than 3D results. The accuracy on the computation of fluxes up to second decimal is achieved with fewer lines. (authors)« less
Removal of impulse noise clusters from color images with local order statistics
NASA Astrophysics Data System (ADS)
Ruchay, Alexey; Kober, Vitaly
2017-09-01
This paper proposes a novel algorithm for restoring images corrupted with clusters of impulse noise. The noise clusters often occur when the probability of impulse noise is very high. The proposed noise removal algorithm consists of detection of bulky impulse noise in three color channels with local order statistics followed by removal of the detected clusters by means of vector median filtering. With the help of computer simulation we show that the proposed algorithm is able to effectively remove clustered impulse noise. The performance of the proposed algorithm is compared in terms of image restoration metrics with that of common successful algorithms.
Study of parameters of the nearest neighbour shared algorithm on clustering documents
NASA Astrophysics Data System (ADS)
Mustika Rukmi, Alvida; Budi Utomo, Daryono; Imro’atus Sholikhah, Neni
2018-03-01
Document clustering is one way of automatically managing documents, extracting of document topics and fastly filtering information. Preprocess of clustering documents processed by textmining consists of: keyword extraction using Rapid Automatic Keyphrase Extraction (RAKE) and making the document as concept vector using Latent Semantic Analysis (LSA). Furthermore, the clustering process is done so that the documents with the similarity of the topic are in the same cluster, based on the preprocesing by textmining performed. Shared Nearest Neighbour (SNN) algorithm is a clustering method based on the number of "nearest neighbors" shared. The parameters in the SNN Algorithm consist of: k nearest neighbor documents, ɛ shared nearest neighbor documents and MinT minimum number of similar documents, which can form a cluster. Characteristics The SNN algorithm is based on shared ‘neighbor’ properties. Each cluster is formed by keywords that are shared by the documents. SNN algorithm allows a cluster can be built more than one keyword, if the value of the frequency of appearing keywords in document is also high. Determination of parameter values on SNN algorithm affects document clustering results. The higher parameter value k, will increase the number of neighbor documents from each document, cause similarity of neighboring documents are lower. The accuracy of each cluster is also low. The higher parameter value ε, caused each document catch only neighbor documents that have a high similarity to build a cluster. It also causes more unclassified documents (noise). The higher the MinT parameter value cause the number of clusters will decrease, since the number of similar documents can not form clusters if less than MinT. Parameter in the SNN Algorithm determine performance of clustering result and the amount of noise (unclustered documents ). The Silhouette coeffisient shows almost the same result in many experiments, above 0.9, which means that SNN algorithm works well with different parameter values.
Algorithms of maximum likelihood data clustering with applications
NASA Astrophysics Data System (ADS)
Giada, Lorenzo; Marsili, Matteo
2002-12-01
We address the problem of data clustering by introducing an unsupervised, parameter-free approach based on maximum likelihood principle. Starting from the observation that data sets belonging to the same cluster share a common information, we construct an expression for the likelihood of any possible cluster structure. The likelihood in turn depends only on the Pearson's coefficient of the data. We discuss clustering algorithms that provide a fast and reliable approximation to maximum likelihood configurations. Compared to standard clustering methods, our approach has the advantages that (i) it is parameter free, (ii) the number of clusters need not be fixed in advance and (iii) the interpretation of the results is transparent. In order to test our approach and compare it with standard clustering algorithms, we analyze two very different data sets: time series of financial market returns and gene expression data. We find that different maximization algorithms produce similar cluster structures whereas the outcome of standard algorithms has a much wider variability.
A new clustering algorithm applicable to multispectral and polarimetric SAR images
NASA Technical Reports Server (NTRS)
Wong, Yiu-Fai; Posner, Edward C.
1993-01-01
We describe an application of a scale-space clustering algorithm to the classification of a multispectral and polarimetric SAR image of an agricultural site. After the initial polarimetric and radiometric calibration and noise cancellation, we extracted a 12-dimensional feature vector for each pixel from the scattering matrix. The clustering algorithm was able to partition a set of unlabeled feature vectors from 13 selected sites, each site corresponding to a distinct crop, into 13 clusters without any supervision. The cluster parameters were then used to classify the whole image. The classification map is much less noisy and more accurate than those obtained by hierarchical rules. Starting with every point as a cluster, the algorithm works by melting the system to produce a tree of clusters in the scale space. It can cluster data in any multidimensional space and is insensitive to variability in cluster densities, sizes and ellipsoidal shapes. This algorithm, more powerful than existing ones, may be useful for remote sensing for land use.
Design and implementation of a hybrid MPI-CUDA model for the Smith-Waterman algorithm.
Khaled, Heba; Faheem, Hossam El Deen Mostafa; El Gohary, Rania
2015-01-01
This paper provides a novel hybrid model for solving the multiple pair-wise sequence alignment problem combining message passing interface and CUDA, the parallel computing platform and programming model invented by NVIDIA. The proposed model targets homogeneous cluster nodes equipped with similar Graphical Processing Unit (GPU) cards. The model consists of the Master Node Dispatcher (MND) and the Worker GPU Nodes (WGN). The MND distributes the workload among the cluster working nodes and then aggregates the results. The WGN performs the multiple pair-wise sequence alignments using the Smith-Waterman algorithm. We also propose a modified implementation to the Smith-Waterman algorithm based on computing the alignment matrices row-wise. The experimental results demonstrate a considerable reduction in the running time by increasing the number of the working GPU nodes. The proposed model achieved a performance of about 12 Giga cell updates per second when we tested against the SWISS-PROT protein knowledge base running on four nodes.
Generalized fuzzy C-means clustering algorithm with improved fuzzy partitions.
Zhu, Lin; Chung, Fu-Lai; Wang, Shitong
2009-06-01
The fuzziness index m has important influence on the clustering result of fuzzy clustering algorithms, and it should not be forced to fix at the usual value m = 2. In view of its distinctive features in applications and its limitation in having m = 2 only, a recent advance of fuzzy clustering called fuzzy c-means clustering with improved fuzzy partitions (IFP-FCM) is extended in this paper, and a generalized algorithm called GIFP-FCM for more effective clustering is proposed. By introducing a novel membership constraint function, a new objective function is constructed, and furthermore, GIFP-FCM clustering is derived. Meanwhile, from the viewpoints of L(p) norm distance measure and competitive learning, the robustness and convergence of the proposed algorithm are analyzed. Furthermore, the classical fuzzy c-means algorithm (FCM) and IFP-FCM can be taken as two special cases of the proposed algorithm. Several experimental results including its application to noisy image texture segmentation are presented to demonstrate its average advantage over FCM and IFP-FCM in both clustering and robustness capabilities.
Incremental fuzzy C medoids clustering of time series data using dynamic time warping distance
Chen, Jingli; Wu, Shuai; Liu, Zhizhong; Chao, Hao
2018-01-01
Clustering time series data is of great significance since it could extract meaningful statistics and other characteristics. Especially in biomedical engineering, outstanding clustering algorithms for time series may help improve the health level of people. Considering data scale and time shifts of time series, in this paper, we introduce two incremental fuzzy clustering algorithms based on a Dynamic Time Warping (DTW) distance. For recruiting Single-Pass and Online patterns, our algorithms could handle large-scale time series data by splitting it into a set of chunks which are processed sequentially. Besides, our algorithms select DTW to measure distance of pair-wise time series and encourage higher clustering accuracy because DTW could determine an optimal match between any two time series by stretching or compressing segments of temporal data. Our new algorithms are compared to some existing prominent incremental fuzzy clustering algorithms on 12 benchmark time series datasets. The experimental results show that the proposed approaches could yield high quality clusters and were better than all the competitors in terms of clustering accuracy. PMID:29795600
Incremental fuzzy C medoids clustering of time series data using dynamic time warping distance.
Liu, Yongli; Chen, Jingli; Wu, Shuai; Liu, Zhizhong; Chao, Hao
2018-01-01
Clustering time series data is of great significance since it could extract meaningful statistics and other characteristics. Especially in biomedical engineering, outstanding clustering algorithms for time series may help improve the health level of people. Considering data scale and time shifts of time series, in this paper, we introduce two incremental fuzzy clustering algorithms based on a Dynamic Time Warping (DTW) distance. For recruiting Single-Pass and Online patterns, our algorithms could handle large-scale time series data by splitting it into a set of chunks which are processed sequentially. Besides, our algorithms select DTW to measure distance of pair-wise time series and encourage higher clustering accuracy because DTW could determine an optimal match between any two time series by stretching or compressing segments of temporal data. Our new algorithms are compared to some existing prominent incremental fuzzy clustering algorithms on 12 benchmark time series datasets. The experimental results show that the proposed approaches could yield high quality clusters and were better than all the competitors in terms of clustering accuracy.
Multi-Optimisation Consensus Clustering
NASA Astrophysics Data System (ADS)
Li, Jian; Swift, Stephen; Liu, Xiaohui
Ensemble Clustering has been developed to provide an alternative way of obtaining more stable and accurate clustering results. It aims to avoid the biases of individual clustering algorithms. However, it is still a challenge to develop an efficient and robust method for Ensemble Clustering. Based on an existing ensemble clustering method, Consensus Clustering (CC), this paper introduces an advanced Consensus Clustering algorithm called Multi-Optimisation Consensus Clustering (MOCC), which utilises an optimised Agreement Separation criterion and a Multi-Optimisation framework to improve the performance of CC. Fifteen different data sets are used for evaluating the performance of MOCC. The results reveal that MOCC can generate more accurate clustering results than the original CC algorithm.
Semiautomatic mapping of permafrost in the Yukon Flats, Alaska
NASA Astrophysics Data System (ADS)
Gulbrandsen, Mats Lundh; Minsley, Burke J.; Ball, Lyndsay B.; Hansen, Thomas Mejer
2016-12-01
Thawing of permafrost due to global warming can have major impacts on hydrogeological processes, climate feedback, arctic ecology, and local environments. To understand these effects and processes, it is crucial to know the distribution of permafrost. In this study we exploit the fact that airborne electromagnetic (AEM) data are sensitive to the distribution of permafrost and demonstrate how the distribution of permafrost in the Yukon Flats, Alaska, is mapped in an efficient (semiautomatic) way, using a combination of supervised and unsupervised (machine) learning algorithms, i.e., Smart Interpretation and K-means clustering. Clustering is used to sort unfrozen and frozen regions, and Smart Interpretation is used to predict the depth of permafrost based on expert interpretations. This workflow allows, for the first time, a quantitative and objective approach to efficiently map permafrost based on large amounts of AEM data.
Semiautomatic mapping of permafrost in the Yukon Flats, Alaska
Gulbrandsen, Mats Lundh; Minsley, Burke J.; Ball, Lyndsay B.; Hansen, Thomas Mejer
2016-01-01
Thawing of permafrost due to global warming can have major impacts on hydrogeological processes, climate feedback, arctic ecology, and local environments. To understand these effects and processes, it is crucial to know the distribution of permafrost. In this study we exploit the fact that airborne electromagnetic (AEM) data are sensitive to the distribution of permafrost and demonstrate how the distribution of permafrost in the Yukon Flats, Alaska, is mapped in an efficient (semiautomatic) way, using a combination of supervised and unsupervised (machine) learning algorithms, i.e., Smart Interpretation and K-means clustering. Clustering is used to sort unfrozen and frozen regions, and Smart Interpretation is used to predict the depth of permafrost based on expert interpretations. This workflow allows, for the first time, a quantitative and objective approach to efficiently map permafrost based on large amounts of AEM data.
Novel density-based and hierarchical density-based clustering algorithms for uncertain data.
Zhang, Xianchao; Liu, Han; Zhang, Xiaotong
2017-09-01
Uncertain data has posed a great challenge to traditional clustering algorithms. Recently, several algorithms have been proposed for clustering uncertain data, and among them density-based techniques seem promising for handling data uncertainty. However, some issues like losing uncertain information, high time complexity and nonadaptive threshold have not been addressed well in the previous density-based algorithm FDBSCAN and hierarchical density-based algorithm FOPTICS. In this paper, we firstly propose a novel density-based algorithm PDBSCAN, which improves the previous FDBSCAN from the following aspects: (1) it employs a more accurate method to compute the probability that the distance between two uncertain objects is less than or equal to a boundary value, instead of the sampling-based method in FDBSCAN; (2) it introduces new definitions of probability neighborhood, support degree, core object probability, direct reachability probability, thus reducing the complexity and solving the issue of nonadaptive threshold (for core object judgement) in FDBSCAN. Then, we modify the algorithm PDBSCAN to an improved version (PDBSCANi), by using a better cluster assignment strategy to ensure that every object will be assigned to the most appropriate cluster, thus solving the issue of nonadaptive threshold (for direct density reachability judgement) in FDBSCAN. Furthermore, as PDBSCAN and PDBSCANi have difficulties for clustering uncertain data with non-uniform cluster density, we propose a novel hierarchical density-based algorithm POPTICS by extending the definitions of PDBSCAN, adding new definitions of fuzzy core distance and fuzzy reachability distance, and employing a new clustering framework. POPTICS can reveal the cluster structures of the datasets with different local densities in different regions better than PDBSCAN and PDBSCANi, and it addresses the issues in FOPTICS. Experimental results demonstrate the superiority of our proposed algorithms over the existing algorithms in accuracy and efficiency. Copyright © 2017 Elsevier Ltd. All rights reserved.
Heterogeneous Tensor Decomposition for Clustering via Manifold Optimization.
Sun, Yanfeng; Gao, Junbin; Hong, Xia; Mishra, Bamdev; Yin, Baocai
2016-03-01
Tensor clustering is an important tool that exploits intrinsically rich structures in real-world multiarray or Tensor datasets. Often in dealing with those datasets, standard practice is to use subspace clustering that is based on vectorizing multiarray data. However, vectorization of tensorial data does not exploit complete structure information. In this paper, we propose a subspace clustering algorithm without adopting any vectorization process. Our approach is based on a novel heterogeneous Tucker decomposition model taking into account cluster membership information. We propose a new clustering algorithm that alternates between different modes of the proposed heterogeneous tensor model. All but the last mode have closed-form updates. Updating the last mode reduces to optimizing over the multinomial manifold for which we investigate second order Riemannian geometry and propose a trust-region algorithm. Numerical experiments show that our proposed algorithm compete effectively with state-of-the-art clustering algorithms that are based on tensor factorization.
Lo, Kenneth
2011-01-01
Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the more heavy-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, this provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components. PMID:22125375
Lo, Kenneth; Gottardo, Raphael
2012-01-01
Cluster analysis is the automated search for groups of homogeneous observations in a data set. A popular modeling approach for clustering is based on finite normal mixture models, which assume that each cluster is modeled as a multivariate normal distribution. However, the normality assumption that each component is symmetric is often unrealistic. Furthermore, normal mixture models are not robust against outliers; they often require extra components for modeling outliers and/or give a poor representation of the data. To address these issues, we propose a new class of distributions, multivariate t distributions with the Box-Cox transformation, for mixture modeling. This class of distributions generalizes the normal distribution with the more heavy-tailed t distribution, and introduces skewness via the Box-Cox transformation. As a result, this provides a unified framework to simultaneously handle outlier identification and data transformation, two interrelated issues. We describe an Expectation-Maximization algorithm for parameter estimation along with transformation selection. We demonstrate the proposed methodology with three real data sets and simulation studies. Compared with a wealth of approaches including the skew-t mixture model, the proposed t mixture model with the Box-Cox transformation performs favorably in terms of accuracy in the assignment of observations, robustness against model misspecification, and selection of the number of components.
Network Analysis Tools: from biological networks to clusters and pathways.
Brohée, Sylvain; Faust, Karoline; Lima-Mendez, Gipsi; Vanderstocken, Gilles; van Helden, Jacques
2008-01-01
Network Analysis Tools (NeAT) is a suite of computer tools that integrate various algorithms for the analysis of biological networks: comparison between graphs, between clusters, or between graphs and clusters; network randomization; analysis of degree distribution; network-based clustering and path finding. The tools are interconnected to enable a stepwise analysis of the network through a complete analytical workflow. In this protocol, we present a typical case of utilization, where the tasks above are combined to decipher a protein-protein interaction network retrieved from the STRING database. The results returned by NeAT are typically subnetworks, networks enriched with additional information (i.e., clusters or paths) or tables displaying statistics. Typical networks comprising several thousands of nodes and arcs can be analyzed within a few minutes. The complete protocol can be read and executed in approximately 1 h.
Anandakrishnan, Ramu; Onufriev, Alexey
2008-03-01
In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over accessible microstates of the system. In general, these calculations are computationally intractable since they involve summations over an exponentially large number of microstates. Clustering algorithms are one of the methods used to numerically approximate these sums. The most basic clustering algorithms first sub-divide the system into a set of smaller subsets (clusters). Then, interactions between particles within each cluster are treated exactly, while all interactions between different clusters are ignored. These smaller clusters have far fewer microstates, making the summation over these microstates, tractable. These algorithms have been previously used for biomolecular computations, but remain relatively unexplored in this context. Presented here, is a theoretical analysis of the error and computational complexity for the two most basic clustering algorithms that were previously applied in the context of biomolecular electrostatics. We derive a tight, computationally inexpensive, error bound for the equilibrium state of a particle computed via these clustering algorithms. For some practical applications, it is the root mean square error, which can be significantly lower than the error bound, that may be more important. We how that there is a strong empirical relationship between error bound and root mean square error, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms for practical applications. An example of error analysis for such an application-computation of average charge of ionizable amino-acids in proteins-is given, demonstrating that the clustering algorithm can be accurate enough for practical purposes.
NASA Astrophysics Data System (ADS)
Abdul-Nasir, Aimi Salihah; Mashor, Mohd Yusoff; Halim, Nurul Hazwani Abd; Mohamed, Zeehaida
2015-05-01
Malaria is a life-threatening parasitic infectious disease that corresponds for nearly one million deaths each year. Due to the requirement of prompt and accurate diagnosis of malaria, the current study has proposed an unsupervised pixel segmentation based on clustering algorithm in order to obtain the fully segmented red blood cells (RBCs) infected with malaria parasites based on the thin blood smear images of P. vivax species. In order to obtain the segmented infected cell, the malaria images are first enhanced by using modified global contrast stretching technique. Then, an unsupervised segmentation technique based on clustering algorithm has been applied on the intensity component of malaria image in order to segment the infected cell from its blood cells background. In this study, cascaded moving k-means (MKM) and fuzzy c-means (FCM) clustering algorithms has been proposed for malaria slide image segmentation. After that, median filter algorithm has been applied to smooth the image as well as to remove any unwanted regions such as small background pixels from the image. Finally, seeded region growing area extraction algorithm has been applied in order to remove large unwanted regions that are still appeared on the image due to their size in which cannot be cleaned by using median filter. The effectiveness of the proposed cascaded MKM and FCM clustering algorithms has been analyzed qualitatively and quantitatively by comparing the proposed cascaded clustering algorithm with MKM and FCM clustering algorithms. Overall, the results indicate that segmentation using the proposed cascaded clustering algorithm has produced the best segmentation performances by achieving acceptable sensitivity as well as high specificity and accuracy values compared to the segmentation results provided by MKM and FCM algorithms.
Fong, Simon; Deb, Suash; Yang, Xin-She; Zhuang, Yan
2014-01-01
Traditional K-means clustering algorithms have the drawback of getting stuck at local optima that depend on the random values of initial centroids. Optimization algorithms have their advantages in guiding iterative computation to search for global optima while avoiding local optima. The algorithms help speed up the clustering process by converging into a global optimum early with multiple search agents in action. Inspired by nature, some contemporary optimization algorithms which include Ant, Bat, Cuckoo, Firefly, and Wolf search algorithms mimic the swarming behavior allowing them to cooperatively steer towards an optimal objective within a reasonable time. It is known that these so-called nature-inspired optimization algorithms have their own characteristics as well as pros and cons in different applications. When these algorithms are combined with K-means clustering mechanism for the sake of enhancing its clustering quality by avoiding local optima and finding global optima, the new hybrids are anticipated to produce unprecedented performance. In this paper, we report the results of our evaluation experiments on the integration of nature-inspired optimization methods into K-means algorithms. In addition to the standard evaluation metrics in evaluating clustering quality, the extended K-means algorithms that are empowered by nature-inspired optimization methods are applied on image segmentation as a case study of application scenario.
Deb, Suash; Yang, Xin-She
2014-01-01
Traditional K-means clustering algorithms have the drawback of getting stuck at local optima that depend on the random values of initial centroids. Optimization algorithms have their advantages in guiding iterative computation to search for global optima while avoiding local optima. The algorithms help speed up the clustering process by converging into a global optimum early with multiple search agents in action. Inspired by nature, some contemporary optimization algorithms which include Ant, Bat, Cuckoo, Firefly, and Wolf search algorithms mimic the swarming behavior allowing them to cooperatively steer towards an optimal objective within a reasonable time. It is known that these so-called nature-inspired optimization algorithms have their own characteristics as well as pros and cons in different applications. When these algorithms are combined with K-means clustering mechanism for the sake of enhancing its clustering quality by avoiding local optima and finding global optima, the new hybrids are anticipated to produce unprecedented performance. In this paper, we report the results of our evaluation experiments on the integration of nature-inspired optimization methods into K-means algorithms. In addition to the standard evaluation metrics in evaluating clustering quality, the extended K-means algorithms that are empowered by nature-inspired optimization methods are applied on image segmentation as a case study of application scenario. PMID:25202730
Lei, Yang; Yu, Dai; Bin, Zhang; Yang, Yang
2017-01-01
Clustering algorithm as a basis of data analysis is widely used in analysis systems. However, as for the high dimensions of the data, the clustering algorithm may overlook the business relation between these dimensions especially in the medical fields. As a result, usually the clustering result may not meet the business goals of the users. Then, in the clustering process, if it can combine the knowledge of the users, that is, the doctor's knowledge or the analysis intent, the clustering result can be more satisfied. In this paper, we propose an interactive K -means clustering method to improve the user's satisfactions towards the result. The core of this method is to get the user's feedback of the clustering result, to optimize the clustering result. Then, a particle swarm optimization algorithm is used in the method to optimize the parameters, especially the weight settings in the clustering algorithm to make it reflect the user's business preference as possible. After that, based on the parameter optimization and adjustment, the clustering result can be closer to the user's requirement. Finally, we take an example in the breast cancer, to testify our method. The experiments show the better performance of our algorithm.
Network Modeling and Energy-Efficiency Optimization for Advanced Machine-to-Machine Sensor Networks
Jung, Sungmo; Kim, Jong Hyun; Kim, Seoksoo
2012-01-01
Wireless machine-to-machine sensor networks with multiple radio interfaces are expected to have several advantages, including high spatial scalability, low event detection latency, and low energy consumption. Here, we propose a network model design method involving network approximation and an optimized multi-tiered clustering algorithm that maximizes node lifespan by minimizing energy consumption in a non-uniformly distributed network. Simulation results show that the cluster scales and network parameters determined with the proposed method facilitate a more efficient performance compared to existing methods. PMID:23202190
Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic Programming.
Wang, Haizhou; Song, Mingzhou
2011-12-01
The heuristic k -means algorithm, widely used for cluster analysis, does not guarantee optimality. We developed a dynamic programming algorithm for optimal one-dimensional clustering. The algorithm is implemented as an R package called Ckmeans.1d.dp . We demonstrate its advantage in optimality and runtime over the standard iterative k -means algorithm.
Inference from clustering with application to gene-expression microarrays.
Dougherty, Edward R; Barrera, Junior; Brun, Marcel; Kim, Seungchan; Cesar, Roberto M; Chen, Yidong; Bittner, Michael; Trent, Jeffrey M
2002-01-01
There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics. A large amount of generated output is available over the web.
Implementation of spectral clustering on microarray data of carcinoma using k-means algorithm
NASA Astrophysics Data System (ADS)
Frisca, Bustamam, Alhadi; Siswantining, Titin
2017-03-01
Clustering is one of data analysis methods that aims to classify data which have similar characteristics in the same group. Spectral clustering is one of the most popular modern clustering algorithms. As an effective clustering technique, spectral clustering method emerged from the concepts of spectral graph theory. Spectral clustering method needs partitioning algorithm. There are some partitioning methods including PAM, SOM, Fuzzy c-means, and k-means. Based on the research that has been done by Capital and Choudhury in 2013, when using Euclidian distance k-means algorithm provide better accuracy than PAM algorithm. So in this paper we use k-means as our partition algorithm. The major advantage of spectral clustering is in reducing data dimension, especially in this case to reduce the dimension of large microarray dataset. Microarray data is a small-sized chip made of a glass plate containing thousands and even tens of thousands kinds of genes in the DNA fragments derived from doubling cDNA. Application of microarray data is widely used to detect cancer, for the example is carcinoma, in which cancer cells express the abnormalities in his genes. The purpose of this research is to classify the data that have high similarity in the same group and the data that have low similarity in the others. In this research, Carcinoma microarray data using 7457 genes. The result of partitioning using k-means algorithm is two clusters.
Using Grey Wolf Algorithm to Solve the Capacitated Vehicle Routing Problem
NASA Astrophysics Data System (ADS)
Korayem, L.; Khorsid, M.; Kassem, S. S.
2015-05-01
The capacitated vehicle routing problem (CVRP) is a class of the vehicle routing problems (VRPs). In CVRP a set of identical vehicles having fixed capacities are required to fulfill customers' demands for a single commodity. The main objective is to minimize the total cost or distance traveled by the vehicles while satisfying a number of constraints, such as: the capacity constraint of each vehicle, logical flow constraints, etc. One of the methods employed in solving the CVRP is the cluster-first route-second method. It is a technique based on grouping of customers into a number of clusters, where each cluster is served by one vehicle. Once clusters are formed, a route determining the best sequence to visit customers is established within each cluster. The recently bio-inspired grey wolf optimizer (GWO), introduced in 2014, has proven to be efficient in solving unconstrained, as well as, constrained optimization problems. In the current research, our main contributions are: combining GWO with the traditional K-means clustering algorithm to generate the ‘K-GWO’ algorithm, deriving a capacitated version of the K-GWO algorithm by incorporating a capacity constraint into the aforementioned algorithm, and finally, developing 2 new clustering heuristics. The resulting algorithm is used in the clustering phase of the cluster-first route-second method to solve the CVR problem. The algorithm is tested on a number of benchmark problems with encouraging results.
Cluster-based control of a separating flow over a smoothly contoured ramp
NASA Astrophysics Data System (ADS)
Kaiser, Eurika; Noack, Bernd R.; Spohn, Andreas; Cattafesta, Louis N.; Morzyński, Marek
2017-12-01
The ability to manipulate and control fluid flows is of great importance in many scientific and engineering applications. The proposed closed-loop control framework addresses a key issue of model-based control: The actuation effect often results from slow dynamics of strongly nonlinear interactions which the flow reveals at timescales much longer than the prediction horizon of any model. Hence, we employ a probabilistic approach based on a cluster-based discretization of the Liouville equation for the evolution of the probability distribution. The proposed methodology frames high-dimensional, nonlinear dynamics into low-dimensional, probabilistic, linear dynamics which considerably simplifies the optimal control problem while preserving nonlinear actuation mechanisms. The data-driven approach builds upon a state space discretization using a clustering algorithm which groups kinematically similar flow states into a low number of clusters. The temporal evolution of the probability distribution on this set of clusters is then described by a control-dependent Markov model. This Markov model can be used as predictor for the ergodic probability distribution for a particular control law. This probability distribution approximates the long-term behavior of the original system on which basis the optimal control law is determined. We examine how the approach can be used to improve the open-loop actuation in a separating flow dominated by Kelvin-Helmholtz shedding. For this purpose, the feature space, in which the model is learned, and the admissible control inputs are tailored to strongly oscillatory flows.
Chemodynamical Clustering Applied to APOGEE Data: Rediscovering Globular Clusters
NASA Astrophysics Data System (ADS)
Chen, Boquan; D’Onghia, Elena; Pardy, Stephen A.; Pasquali, Anna; Bertelli Motta, Clio; Hanlon, Bret; Grebel, Eva K.
2018-06-01
We have developed a novel technique based on a clustering algorithm that searches for kinematically and chemically clustered stars in the APOGEE DR12 Cannon data. As compared to classical chemical tagging, the kinematic information included in our methodology allows us to identify stars that are members of known globular clusters with greater confidence. We apply our algorithm to the entire APOGEE catalog of 150,615 stars whose chemical abundances are derived by the Cannon. Our methodology found anticorrelations between the elements Al and Mg, Na and O, and C and N previously identified in the optical spectra in globular clusters, even though we omit these elements in our algorithm. Our algorithm identifies globular clusters without a priori knowledge of their locations in the sky. Thus, not only does this technique promise to discover new globular clusters, but it also allows us to identify candidate streams of kinematically and chemically clustered stars in the Milky Way.
Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems
NASA Technical Reports Server (NTRS)
Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.
2000-01-01
The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many sub- domains called blocks, and solve the governing equations over these blocks. Dynamic load balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the chances in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running a NASA based code ADPAC to demonstrate the developed tools for dynamic load balancing.
Clustering performance comparison using K-means and expectation maximization algorithms.
Jung, Yong Gyu; Kang, Min Soo; Heo, Jun
2014-11-14
Clustering is an important means of data mining based on separating data categories by similar features. Unlike the classification algorithm, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K -means and the expectation maximization (EM) algorithm. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, the logistic regression analysis is applied to EM clusters and the K -means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.
Cooperative Learning for Distributed In-Network Traffic Classification
NASA Astrophysics Data System (ADS)
Joseph, S. B.; Loo, H. R.; Ismail, I.; Andromeda, T.; Marsono, M. N.
2017-04-01
Inspired by the concept of autonomic distributed/decentralized network management schemes, we consider the issue of information exchange among distributed network nodes to network performance and promote scalability for in-network monitoring. In this paper, we propose a cooperative learning algorithm for propagation and synchronization of network information among autonomic distributed network nodes for online traffic classification. The results show that network nodes with sharing capability perform better with a higher average accuracy of 89.21% (sharing data) and 88.37% (sharing clusters) compared to 88.06% for nodes without cooperative learning capability. The overall performance indicates that cooperative learning is promising for distributed in-network traffic classification.
Basic firefly algorithm for document clustering
NASA Astrophysics Data System (ADS)
Mohammed, Athraa Jasim; Yusof, Yuhanis; Husni, Husniza
2015-12-01
The Document clustering plays significant role in Information Retrieval (IR) where it organizes documents prior to the retrieval process. To date, various clustering algorithms have been proposed and this includes the K-means and Particle Swarm Optimization. Even though these algorithms have been widely applied in many disciplines due to its simplicity, such an approach tends to be trapped in a local minimum during its search for an optimal solution. To address the shortcoming, this paper proposes a Basic Firefly (Basic FA) algorithm to cluster text documents. The algorithm employs the Average Distance to Document Centroid (ADDC) as the objective function of the search. Experiments utilizing the proposed algorithm were conducted on the 20Newsgroups benchmark dataset. Results demonstrate that the Basic FA generates a more robust and compact clusters than the ones produced by K-means and Particle Swarm Optimization (PSO).
Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra.
Rieder, Vera; Schork, Karin U; Kerschke, Laura; Blank-Landeshammer, Bernhard; Sickmann, Albert; Rahnenführer, Jörg
2017-11-03
In proteomics, liquid chromatography-tandem mass spectrometry (LC-MS/MS) is established for identifying peptides and proteins. Duplicated spectra, that is, multiple spectra of the same peptide, occur both in single MS/MS runs and in large spectral libraries. Clustering tandem mass spectra is used to find consensus spectra, with manifold applications. First, it speeds up database searches, as performed for instance by Mascot. Second, it helps to identify novel peptides across species. Third, it is used for quality control to detect wrongly annotated spectra. We compare different clustering algorithms based on the cosine distance between spectra. CAST, MS-Cluster, and PRIDE Cluster are popular algorithms to cluster tandem mass spectra. We add well-known algorithms for large data sets, hierarchical clustering, DBSCAN, and connected components of a graph, as well as the new method N-Cluster. All algorithms are evaluated on real data with varied parameter settings. Cluster results are compared with each other and with peptide annotations based on validation measures such as purity. Quality control, regarding the detection of wrongly (un)annotated spectra, is discussed for exemplary resulting clusters. N-Cluster proves to be highly competitive. All clustering results benefit from the so-called DISMS2 filter that integrates additional information, for example, on precursor mass.
Weighted graph cuts without eigenvectors a multilevel approach.
Dhillon, Inderjit S; Guan, Yuqiang; Kulis, Brian
2007-11-01
A variety of clustering algorithms have recently been proposed to handle data that is not linearly separable; spectral clustering and kernel k-means are two of the main methods. In this paper, we discuss an equivalence between the objective functions used in these seemingly different methods--in particular, a general weighted kernel k-means objective is mathematically equivalent to a weighted graph clustering objective. We exploit this equivalence to develop a fast, high-quality multilevel algorithm that directly optimizes various weighted graph clustering objectives, such as the popular ratio cut, normalized cut, and ratio association criteria. This eliminates the need for any eigenvector computation for graph clustering problems, which can be prohibitive for very large graphs. Previous multilevel graph partitioning methods, such as Metis, have suffered from the restriction of equal-sized clusters; our multilevel algorithm removes this restriction by using kernel k-means to optimize weighted graph cuts. Experimental results show that our multilevel algorithm outperforms a state-of-the-art spectral clustering algorithm in terms of speed, memory usage, and quality. We demonstrate that our algorithm is applicable to large-scale clustering tasks such as image segmentation, social network analysis and gene network analysis.
Algorithms development for the GEM-based detection system
NASA Astrophysics Data System (ADS)
Czarski, T.; Chernyshova, M.; Malinowski, K.; Pozniak, K. T.; Kasprowicz, G.; Kolasinski, P.; Krawczyk, R.; Wojenski, A.; Zabolotny, W.
2016-09-01
The measurement system based on GEM - Gas Electron Multiplier detector - is developed for soft X-ray diagnostics of tokamak plasmas. The multi-channel setup is designed for estimation of the energy and the position distribution of an Xray source. The focal measuring issue is the charge cluster identification by its value and position estimation. The fast and accurate mode of the serial data acquisition is applied for the dynamic plasma diagnostics. The charge clusters are counted in the space determined by 2D position, charge value and time intervals. Radiation source characteristics are presented by histograms for a selected range of position, time intervals and cluster charge values corresponding to the energy spectra.
NASA Astrophysics Data System (ADS)
Boyko, Oleksiy; Zheleznyak, Mark
2015-04-01
The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI ( Todini et al, 1996-2014) is developed and implemented in Ukraine. The parallel version of the code has been developed recently to be used on multiprocessors systems - multicore/processors PC and clusters. Algorithm is based on binary-tree decomposition of the watershed for the balancing of the amount of computation for all processors/cores. Message passing interface (MPI) protocol is used as a parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated for the case studies for the flood predictions of the mountain watersheds of the Ukrainian Carpathian regions. The modeling results is compared with the predictions based on the lumped parameters models.
Model-based clustering for RNA-seq data.
Si, Yaqing; Liu, Peng; Li, Pinghua; Brutnell, Thomas P
2014-01-15
RNA-seq technology has been widely adopted as an attractive alternative to microarray-based methods to study global gene expression. However, robust statistical tools to analyze these complex datasets are still lacking. By grouping genes with similar expression profiles across treatments, cluster analysis provides insight into gene functions and networks, and hence is an important technique for RNA-seq data analysis. In this manuscript, we derive clustering algorithms based on appropriate probability models for RNA-seq data. An expectation-maximization algorithm and another two stochastic versions of expectation-maximization algorithms are described. In addition, a strategy for initialization based on likelihood is proposed to improve the clustering algorithms. Moreover, we present a model-based hybrid-hierarchical clustering method to generate a tree structure that allows visualization of relationships among clusters as well as flexibility of choosing the number of clusters. Results from both simulation studies and analysis of a maize RNA-seq dataset show that our proposed methods provide better clustering results than alternative methods such as the K-means algorithm and hierarchical clustering methods that are not based on probability models. An R package, MBCluster.Seq, has been developed to implement our proposed algorithms. This R package provides fast computation and is publicly available at http://www.r-project.org
Mohanasundaram, Ranganathan; Periasamy, Pappampalayam Sanmugam
2015-01-01
The current high profile debate with regard to data storage and its growth have become strategic task in the world of networking. It mainly depends on the sensor nodes called producers, base stations, and also the consumers (users and sensor nodes) to retrieve and use the data. The main concern dealt here is to find an optimal data storage position in wireless sensor networks. The works that have been carried out earlier did not utilize swarm intelligence based optimization approaches to find the optimal data storage positions. To achieve this goal, an efficient swam intelligence approach is used to choose suitable positions for a storage node. Thus, hybrid particle swarm optimization algorithm has been used to find the suitable positions for storage nodes while the total energy cost of data transmission is minimized. Clustering-based distributed data storage is utilized to solve clustering problem using fuzzy-C-means algorithm. This research work also considers the data rates and locations of multiple producers and consumers to find optimal data storage positions. The algorithm is implemented in a network simulator and the experimental results show that the proposed clustering and swarm intelligence based ODS strategy is more effective than the earlier approaches. PMID:25734182
A cluster pattern algorithm for the analysis of multiparametric cell assays.
Kaufman, Menachem; Bloch, David; Zurgil, Naomi; Shafran, Yana; Deutsch, Mordechai
2005-09-01
The issue of multiparametric analysis of complex single cell assays of both static and flow cytometry (SC and FC, respectively) has become common in recent years. In such assays, the analysis of changes, applying common statistical parameters and tests, often fails to detect significant differences between the investigated samples. The cluster pattern similarity (CPS) measure between two sets of gated clusters is based on computing the difference between their density distribution functions' set points. The CPS was applied for the discrimination between two observations in a four-dimensional parameter space. The similarity coefficient (r) ranges between 0 (perfect similarity) to 1 (dissimilar). Three CPS validation tests were carried out: on the same stock samples of fluorescent beads, yielding very low r's (0, 0.066); and on two cell models: mitogenic stimulation of peripheral blood mononuclear cells (PBMC), and apoptosis induction in Jurkat T cell line by H2O2. In both latter cases, r indicated similarity (r < 0.23) within the same group, and dissimilarity (r > 0.48) otherwise. This classification and algorithm approach offers a measure of similarity between samples. It relies on the multidimensional pattern of the sample parameters. The algorithm compensates for environmental drifts in this apparatus and assay; it also may be applied to more than four dimensions.
An improved initialization center k-means clustering algorithm based on distance and density
NASA Astrophysics Data System (ADS)
Duan, Yanling; Liu, Qun; Xia, Shuyin
2018-04-01
Aiming at the problem of the random initial clustering center of k means algorithm that the clustering results are influenced by outlier data sample and are unstable in multiple clustering, a method of central point initialization method based on larger distance and higher density is proposed. The reciprocal of the weighted average of distance is used to represent the sample density, and the data sample with the larger distance and the higher density are selected as the initial clustering centers to optimize the clustering results. Then, a clustering evaluation method based on distance and density is designed to verify the feasibility of the algorithm and the practicality, the experimental results on UCI data sets show that the algorithm has a certain stability and practicality.
Diametrical clustering for identifying anti-correlated gene clusters.
Dhillon, Inderjit S; Marcotte, Edward M; Roshan, Usman
2003-09-01
Clustering genes based upon their expression patterns allows us to predict gene function. Most existing clustering algorithms cluster genes together when their expression patterns show high positive correlation. However, it has been observed that genes whose expression patterns are strongly anti-correlated can also be functionally similar. Biologically, this is not unintuitive-genes responding to the same stimuli, regardless of the nature of the response, are more likely to operate in the same pathways. We present a new diametrical clustering algorithm that explicitly identifies anti-correlated clusters of genes. Our algorithm proceeds by iteratively (i). re-partitioning the genes and (ii). computing the dominant singular vector of each gene cluster; each singular vector serving as the prototype of a 'diametric' cluster. We empirically show the effectiveness of the algorithm in identifying diametrical or anti-correlated clusters. Testing the algorithm on yeast cell cycle data, fibroblast gene expression data, and DNA microarray data from yeast mutants reveals that opposed cellular pathways can be discovered with this method. We present systems whose mRNA expression patterns, and likely their functions, oppose the yeast ribosome and proteosome, along with evidence for the inverse transcriptional regulation of a number of cellular systems.
A novel harmony search-K means hybrid algorithm for clustering gene expression data
Nazeer, KA Abdul; Sebastian, MP; Kumar, SD Madhu
2013-01-01
Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. The DNA microarray technology makes it possible to simultaneously analyze large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns from large quantities of expression data that in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k- ¬means clustering algorithm is widely used for many practical applications. But the original k-¬means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-¬means algorithm. A meta-heuristic optimization algorithm named harmony search helps find out near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering the gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms. PMID:23390351
A novel harmony search-K means hybrid algorithm for clustering gene expression data.
Nazeer, Ka Abdul; Sebastian, Mp; Kumar, Sd Madhu
2013-01-01
Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. The DNA microarray technology makes it possible to simultaneously analyze large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns from large quantities of expression data that in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k- ¬means clustering algorithm is widely used for many practical applications. But the original k-¬means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-¬means algorithm. A meta-heuristic optimization algorithm named harmony search helps find out near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering the gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms.
Adaboost multi-view face detection based on YCgCr skin color model
NASA Astrophysics Data System (ADS)
Lan, Qi; Xu, Zhiyong
2016-09-01
Traditional Adaboost face detection algorithm uses Haar-like features training face classifiers, whose detection error rate is low in the face region. While under the complex background, the classifiers will make wrong detection easily to the background regions with the similar faces gray level distribution, which leads to the error detection rate of traditional Adaboost algorithm is high. As one of the most important features of a face, skin in YCgCr color space has good clustering. We can fast exclude the non-face areas through the skin color model. Therefore, combining with the advantages of the Adaboost algorithm and skin color detection algorithm, this paper proposes Adaboost face detection algorithm method that bases on YCgCr skin color model. Experiments show that, compared with traditional algorithm, the method we proposed has improved significantly in the detection accuracy and errors.
Unweighted least squares phase unwrapping by means of multigrid techniques
NASA Astrophysics Data System (ADS)
Pritt, Mark D.
1995-11-01
We present a multigrid algorithm for unweighted least squares phase unwrapping. This algorithm applies Gauss-Seidel relaxation schemes to solve the Poisson equation on smaller, coarser grids and transfers the intermediate results to the finer grids. This approach forms the basis of our multigrid algorithm for weighted least squares phase unwrapping, which is described in a separate paper. The key idea of our multigrid approach is to maintain the partial derivatives of the phase data in separate arrays and to correct these derivatives at the boundaries of the coarser grids. This maintains the boundary conditions necessary for rapid convergence to the correct solution. Although the multigrid algorithm is an iterative algorithm, we demonstrate that it is nearly as fast as the direct Fourier-based method. We also describe how to parallelize the algorithm for execution on a distributed-memory parallel processor computer or a network-cluster of workstations.
THE SWIFT AGN AND CLUSTER SURVEY. II. CLUSTER CONFIRMATION WITH SDSS DATA
DOE Office of Scientific and Technical Information (OSTI.GOV)
Griffin, Rhiannon D.; Dai, Xinyu; Kochanek, Christopher S.
2016-01-15
We study 203 (of 442) Swift AGN and Cluster Survey extended X-ray sources located in the SDSS DR8 footprint to search for galaxy over-densities in three-dimensional space using SDSS galaxy photometric redshifts and positions near the Swift cluster candidates. We find 104 Swift clusters with a >3σ galaxy over-density. The remaining targets are potentially located at higher redshifts and require deeper optical follow-up observations for confirmation as galaxy clusters. We present a series of cluster properties including the redshift, brightest cluster galaxy (BCG) magnitude, BCG-to-X-ray center offset, optical richness, and X-ray luminosity. We also detect red sequences in ∼85% ofmore » the 104 confirmed clusters. The X-ray luminosity and optical richness for the SDSS confirmed Swift clusters are correlated and follow previously established relations. The distribution of the separations between the X-ray centroids and the most likely BCG is also consistent with expectation. We compare the observed redshift distribution of the sample with a theoretical model, and find that our sample is complete for z ≲ 0.3 and is still 80% complete up to z ≃ 0.4, consistent with the SDSS survey depth. These analysis results suggest that our Swift cluster selection algorithm has yielded a statistically well-defined cluster sample for further study of cluster evolution and cosmology. We also match our SDSS confirmed Swift clusters to existing cluster catalogs, and find 42, 23, and 1 matches in optical, X-ray, and Sunyaev–Zel’dovich catalogs, respectively, and so the majority of these clusters are new detections.« less
m-BIRCH: an online clustering approach for computer vision applications
NASA Astrophysics Data System (ADS)
Madan, Siddharth K.; Dana, Kristin J.
2015-03-01
We adapt a classic online clustering algorithm called Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), to incrementally cluster large datasets of features commonly used in multimedia and computer vision. We call the adapted version modified-BIRCH (m-BIRCH). The algorithm uses only a fraction of the dataset memory to perform clustering, and updates the clustering decisions when new data comes in. Modifications made in m-BIRCH enable data driven parameter selection and effectively handle varying density regions in the feature space. Data driven parameter selection automatically controls the level of coarseness of the data summarization. Effective handling of varying density regions is necessary to well represent the different density regions in data summarization. We use m-BIRCH to cluster 840K color SIFT descriptors, and 60K outlier corrupted grayscale patches. We use the algorithm to cluster datasets consisting of challenging non-convex clustering patterns. Our implementation of the algorithm provides an useful clustering tool and is made publicly available.
A Self-Adaptive Fuzzy c-Means Algorithm for Determining the Optimal Number of Clusters
Wang, Zhihao; Yi, Jing
2016-01-01
For the shortcoming of fuzzy c-means algorithm (FCM) needing to know the number of clusters in advance, this paper proposed a new self-adaptive method to determine the optimal number of clusters. Firstly, a density-based algorithm was put forward. The algorithm, according to the characteristics of the dataset, automatically determined the possible maximum number of clusters instead of using the empirical rule n and obtained the optimal initial cluster centroids, improving the limitation of FCM that randomly selected cluster centroids lead the convergence result to the local minimum. Secondly, this paper, by introducing a penalty function, proposed a new fuzzy clustering validity index based on fuzzy compactness and separation, which ensured that when the number of clusters verged on that of objects in the dataset, the value of clustering validity index did not monotonically decrease and was close to zero, so that the optimal number of clusters lost robustness and decision function. Then, based on these studies, a self-adaptive FCM algorithm was put forward to estimate the optimal number of clusters by the iterative trial-and-error process. At last, experiments were done on the UCI, KDD Cup 1999, and synthetic datasets, which showed that the method not only effectively determined the optimal number of clusters, but also reduced the iteration of FCM with the stable clustering result. PMID:28042291
The Mucciardi-Gose Clustering Algorithm and Its Applications in Automatic Pattern Recognition.
A procedure known as the Mucciardi- Gose clustering algorithm, CLUSTR, for determining the geometrical or statistical relationships among groups of N...discussion of clustering algorithms is given; the particular advantages of the Mucciardi- Gose procedure are described. The mathematical basis for, and the
Security clustering algorithm based on reputation in hierarchical peer-to-peer network
NASA Astrophysics Data System (ADS)
Chen, Mei; Luo, Xin; Wu, Guowen; Tan, Yang; Kita, Kenji
2013-03-01
For the security problems of the hierarchical P2P network (HPN), the paper presents a security clustering algorithm based on reputation (CABR). In the algorithm, we take the reputation mechanism for ensuring the security of transaction and use cluster for managing the reputation mechanism. In order to improve security, reduce cost of network brought by management of reputation and enhance stability of cluster, we select reputation, the historical average online time, and the network bandwidth as the basic factors of the comprehensive performance of node. Simulation results showed that the proposed algorithm improved the security, reduced the network overhead, and enhanced stability of cluster.
Shah, Sohil Atul
2017-01-01
Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank. PMID:28851838
NASA Astrophysics Data System (ADS)
Zhang, Haiying; Bai, Jiaojiao; Li, Zhengjie; Liu, Yan; Liu, Kunhong
2017-06-01
The detection and discrimination of infrared small dim targets is a challenge in automatic target recognition (ATR), because there is no salient information of size, shape and texture. Many researchers focus on mining more discriminative information of targets in temporal-spatial. However, such information may not be available with the change of imaging environments, and the targets size and intensity keep changing in different imaging distance. So in this paper, we propose a novel research scheme using density-based clustering and backtracking strategy. In this scheme, the speeded up robust feature (SURF) detector is applied to capture candidate targets in single frame at first. And then, these points are mapped into one frame, so that target traces form a local aggregation pattern. In order to isolate the targets from noises, a newly proposed density-based clustering algorithm, fast search and find of density peak (FSFDP for short), is employed to cluster targets by the spatial intensive distribution. Two important factors of the algorithm, percent and γ , are exploited fully to determine the clustering scale automatically, so as to extract the trace with highest clutter suppression ratio. And at the final step, a backtracking algorithm is designed to detect and discriminate target trace as well as to eliminate clutter. The consistence and continuity of the short-time target trajectory in temporal-spatial is incorporated into the bounding function to speed up the pruning. Compared with several state-of-arts methods, our algorithm is more effective for the dim targets with lower signal-to clutter ratio (SCR). Furthermore, it avoids constructing the candidate target trajectory searching space, so its time complexity is limited to a polynomial level. The extensive experimental results show that it has superior performance in probability of detection (Pd) and false alarm suppressing rate aiming at variety of complex backgrounds.
The Universe at Moderate Redshift
NASA Technical Reports Server (NTRS)
Cen, Renyue; Ostriker, Jeremiah P.
1997-01-01
The report covers the work done in the past year and a wide range of fields including properties of clusters of galaxies; topological properties of galaxy distributions in terms of galaxy types; patterns of gravitational nonlinear clustering process; development of a ray tracing algorithm to study the gravitational lensing phenomenon by galaxies, clusters and large-scale structure, one of whose applications being the effects of weak gravitational lensing by large-scale structure on the determination of q(0); the origin of magnetic fields on the galactic and cluster scales; the topological properties of Ly(alpha) clouds the Ly(alpha) optical depth distribution; clustering properties of Ly(alpha) clouds; and a determination (lower bound) of Omega(b) based on the observed Ly(alpha) forest flux distribution. In the coming year, we plan to continue the investigation of Ly(alpha) clouds using larger dynamic range (about a factor of two) and better simulations (with more input physics included) than what we have now. We will study the properties of galaxies on 1 - 100h(sup -1) Mpc scales using our state-of-the-art large scale galaxy formation simulations of various cosmological models, which will have a resolution about a factor of 5 (in each dimension) better than our current, best simulations. We will plan to study the properties of X-ray clusters using unprecedented, very high dynamic range (20,000) simulations which will enable us to resolve the cores of clusters while keeping the simulation volume sufficiently large to ensure a statistically fair sample of the objects of interest. The details of the last year's works are now described.
Determining open cluster membership. A Bayesian framework for quantitative member classification
NASA Astrophysics Data System (ADS)
Stott, Jonathan J.
2018-01-01
Aims: My goal is to develop a quantitative algorithm for assessing open cluster membership probabilities. The algorithm is designed to work with single-epoch observations. In its simplest form, only one set of program images and one set of reference images are required. Methods: The algorithm is based on a two-stage joint astrometric and photometric assessment of cluster membership probabilities. The probabilities were computed within a Bayesian framework using any available prior information. Where possible, the algorithm emphasizes simplicity over mathematical sophistication. Results: The algorithm was implemented and tested against three observational fields using published survey data. M 67 and NGC 654 were selected as cluster examples while a third, cluster-free, field was used for the final test data set. The algorithm shows good quantitative agreement with the existing surveys and has a false-positive rate significantly lower than the astrometric or photometric methods used individually.
Random Walk Quantum Clustering Algorithm Based on Space
NASA Astrophysics Data System (ADS)
Xiao, Shufen; Dong, Yumin; Ma, Hongyang
2018-01-01
In the random quantum walk, which is a quantum simulation of the classical walk, data points interacted when selecting the appropriate walk strategy by taking advantage of quantum-entanglement features; thus, the results obtained when the quantum walk is used are different from those when the classical walk is adopted. A new quantum walk clustering algorithm based on space is proposed by applying the quantum walk to clustering analysis. In this algorithm, data points are viewed as walking participants, and similar data points are clustered using the walk function in the pay-off matrix according to a certain rule. The walk process is simplified by implementing a space-combining rule. The proposed algorithm is validated by a simulation test and is proved superior to existing clustering algorithms, namely, Kmeans, PCA + Kmeans, and LDA-Km. The effects of some of the parameters in the proposed algorithm on its performance are also analyzed and discussed. Specific suggestions are provided.
Kilborn, Joshua P; Jones, David L; Peebles, Ernst B; Naar, David F
2017-04-01
Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing-based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance-based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.
Robust generative asymmetric GMM for brain MR image segmentation.
Ji, Zexuan; Xia, Yong; Zheng, Yuhui
2017-11-01
Accurate segmentation of brain tissues from magnetic resonance (MR) images based on the unsupervised statistical models such as Gaussian mixture model (GMM) has been widely studied during last decades. However, most GMM based segmentation methods suffer from limited accuracy due to the influences of noise and intensity inhomogeneity in brain MR images. To further improve the accuracy for brain MR image segmentation, this paper presents a Robust Generative Asymmetric GMM (RGAGMM) for simultaneous brain MR image segmentation and intensity inhomogeneity correction. First, we develop an asymmetric distribution to fit the data shapes, and thus construct a spatial constrained asymmetric model. Then, we incorporate two pseudo-likelihood quantities and bias field estimation into the model's log-likelihood, aiming to exploit the neighboring priors of within-cluster and between-cluster and to alleviate the impact of intensity inhomogeneity, respectively. Finally, an expectation maximization algorithm is derived to iteratively maximize the approximation of the data log-likelihood function to overcome the intensity inhomogeneity in the image and segment the brain MR images simultaneously. To demonstrate the performances of the proposed algorithm, we first applied the proposed algorithm to a synthetic brain MR image to show the intermediate illustrations and the estimated distribution of the proposed algorithm. The next group of experiments is carried out in clinical 3T-weighted brain MR images which contain quite serious intensity inhomogeneity and noise. Then we quantitatively compare our algorithm to state-of-the-art segmentation approaches by using Dice coefficient (DC) on benchmark images obtained from IBSR and BrainWeb with different level of noise and intensity inhomogeneity. The comparison results on various brain MR images demonstrate the superior performances of the proposed algorithm in dealing with the noise and intensity inhomogeneity. In this paper, the RGAGMM algorithm is proposed which can simply and efficiently incorporate spatial constraints into an EM framework to simultaneously segment brain MR images and estimate the intensity inhomogeneity. The proposed algorithm is flexible to fit the data shapes, and can simultaneously overcome the influence of noise and intensity inhomogeneity, and hence is capable of improving over 5% segmentation accuracy comparing with several state-of-the-art algorithms. Copyright © 2017 Elsevier B.V. All rights reserved.
Random Partition Distribution Indexed by Pairwise Information
Dahl, David B.; Day, Ryan; Tsai, Jerry W.
2017-01-01
We propose a random partition distribution indexed by pairwise similarity information such that partitions compatible with the similarities are given more probability. The use of pairwise similarities, in the form of distances, is common in some clustering algorithms (e.g., hierarchical clustering), but we show how to use this type of information to define a prior partition distribution for flexible Bayesian modeling. A defining feature of the distribution is that it allocates probability among partitions within a given number of subsets, but it does not shift probability among sets of partitions with different numbers of subsets. Our distribution places more probability on partitions that group similar items yet keeps the total probability of partitions with a given number of subsets constant. The distribution of the number of subsets (and its moments) is available in closed-form and is not a function of the similarities. Our formulation has an explicit probability mass function (with a tractable normalizing constant) so the full suite of MCMC methods may be used for posterior inference. We compare our distribution with several existing partition distributions, showing that our formulation has attractive properties. We provide three demonstrations to highlight the features and relative performance of our distribution. PMID:29276318
NASA Astrophysics Data System (ADS)
Liu, Guo-Chin; Ichiki, Kiyotomo; Tashiro, Hiroyuki; Sugiyama, Naoshi
2016-07-01
Scattering of cosmic microwave background (CMB) radiation in galaxy clusters induces polarization signals determined by the quadrupole anisotropy in the photon distribution at the location of clusters. This `remote quadrupole' derived from the measurements of the induced polarization in galaxy clusters provides an opportunity to reconstruct local CMB temperature anisotropies. In this Letter, we develop an algorithm of the reconstruction through the estimation of the underlying primordial gravitational potential, which is the origin of the CMB temperature and polarization fluctuations and CMB induced polarization in galaxy clusters. We found a nice reconstruction for the quadrupole and octopole components of the CMB temperature anisotropies with the assistance of the CMB induced polarization signals. The reconstruction can be an important consistency test on the puzzles of CMB anomalies, especially for the low-quadrupole and axis-of-evil problems reported in Wilkinson Microwave Anisotropy Probe and Planck data.
Contributions to "k"-Means Clustering and Regression via Classification Algorithms
ERIC Educational Resources Information Center
Salman, Raied
2012-01-01
The dissertation deals with clustering algorithms and transforming regression problems into classification problems. The main contributions of the dissertation are twofold; first, to improve (speed up) the clustering algorithms and second, to develop a strict learning environment for solving regression problems as classification tasks by using…
The Spatial Distribution of Resolved Young Stars in Blue Compact Dwarf Galaxies
NASA Astrophysics Data System (ADS)
Murphy, K.; Crone, M. M.
2002-12-01
We present the first results from a survey of the distribution of resolved young stars in Blue Compact Dwarf Galaxies. In order to identify the dominant physical processes driving star formation in these puzzling galaxies, we use a multi-scale cluster-finding algorithm to quantify the characteristic scales and properties of star-forming regions, from sizes smaller than 10 pc up to the size of each entire galaxy. This project was partially funded by the Lubin Chair at Skidmore College.
Multi scales based sparse matrix spectral clustering image segmentation
NASA Astrophysics Data System (ADS)
Liu, Zhongmin; Chen, Zhicai; Li, Zhanming; Hu, Wenjin
2018-04-01
In image segmentation, spectral clustering algorithms have to adopt the appropriate scaling parameter to calculate the similarity matrix between the pixels, which may have a great impact on the clustering result. Moreover, when the number of data instance is large, computational complexity and memory use of the algorithm will greatly increase. To solve these two problems, we proposed a new spectral clustering image segmentation algorithm based on multi scales and sparse matrix. We devised a new feature extraction method at first, then extracted the features of image on different scales, at last, using the feature information to construct sparse similarity matrix which can improve the operation efficiency. Compared with traditional spectral clustering algorithm, image segmentation experimental results show our algorithm have better degree of accuracy and robustness.
An AK-LDMeans algorithm based on image clustering
NASA Astrophysics Data System (ADS)
Chen, Huimin; Li, Xingwei; Zhang, Yongbin; Chen, Nan
2018-03-01
Clustering is an effective analytical technique for handling unmarked data for value mining. Its ultimate goal is to mark unclassified data quickly and correctly. We use the roadmap for the current image processing as the experimental background. In this paper, we propose an AK-LDMeans algorithm to automatically lock the K value by designing the Kcost fold line, and then use the long-distance high-density method to select the clustering centers to further replace the traditional initial clustering center selection method, which further improves the efficiency and accuracy of the traditional K-Means Algorithm. And the experimental results are compared with the current clustering algorithm and the results are obtained. The algorithm can provide effective reference value in the fields of image processing, machine vision and data mining.
Hierarchical trie packet classification algorithm based on expectation-maximization clustering.
Bi, Xia-An; Zhao, Junxia
2017-01-01
With the development of computer network bandwidth, packet classification algorithms which are able to deal with large-scale rule sets are in urgent need. Among the existing algorithms, researches on packet classification algorithms based on hierarchical trie have become an important packet classification research branch because of their widely practical use. Although hierarchical trie is beneficial to save large storage space, it has several shortcomings such as the existence of backtracking and empty nodes. This paper proposes a new packet classification algorithm, Hierarchical Trie Algorithm Based on Expectation-Maximization Clustering (HTEMC). Firstly, this paper uses the formalization method to deal with the packet classification problem by means of mapping the rules and data packets into a two-dimensional space. Secondly, this paper uses expectation-maximization algorithm to cluster the rules based on their aggregate characteristics, and thereby diversified clusters are formed. Thirdly, this paper proposes a hierarchical trie based on the results of expectation-maximization clustering. Finally, this paper respectively conducts simulation experiments and real-environment experiments to compare the performances of our algorithm with other typical algorithms, and analyzes the results of the experiments. The hierarchical trie structure in our algorithm not only adopts trie path compression to eliminate backtracking, but also solves the problem of low efficiency of trie updates, which greatly improves the performance of the algorithm.
Energy Aware Cluster-Based Routing in Flying Ad-Hoc Networks.
Aadil, Farhan; Raza, Ali; Khan, Muhammad Fahad; Maqsood, Muazzam; Mehmood, Irfan; Rho, Seungmin
2018-05-03
Flying ad-hoc networks (FANETs) are a very vibrant research area nowadays. They have many military and civil applications. Limited battery energy and the high mobility of micro unmanned aerial vehicles (UAVs) represent their two main problems, i.e., short flight time and inefficient routing. In this paper, we try to address both of these problems by means of efficient clustering. First, we adjust the transmission power of the UAVs by anticipating their operational requirements. Optimal transmission range will have minimum packet loss ratio (PLR) and better link quality, which ultimately save the energy consumed during communication. Second, we use a variant of the K-Means Density clustering algorithm for selection of cluster heads. Optimal cluster heads enhance the cluster lifetime and reduce the routing overhead. The proposed model outperforms the state of the art artificial intelligence techniques such as Ant Colony Optimization-based clustering algorithm and Grey Wolf Optimization-based clustering algorithm. The performance of the proposed algorithm is evaluated in term of number of clusters, cluster building time, cluster lifetime and energy consumption.
A fuzzy clustering algorithm to detect planar and quadric shapes
NASA Technical Reports Server (NTRS)
Krishnapuram, Raghu; Frigui, Hichem; Nasraoui, Olfa
1992-01-01
In this paper, we introduce a new fuzzy clustering algorithm to detect an unknown number of planar and quadric shapes in noisy data. The proposed algorithm is computationally and implementationally simple, and it overcomes many of the drawbacks of the existing algorithms that have been proposed for similar tasks. Since the clustering is performed in the original image space, and since no features need to be computed, this approach is particularly suited for sparse data. The algorithm may also be used in pattern recognition applications.
Locating inefficient links in a large-scale transportation network
NASA Astrophysics Data System (ADS)
Sun, Li; Liu, Like; Xu, Zhongzhi; Jie, Yang; Wei, Dong; Wang, Pu
2015-02-01
Based on data from geographical information system (GIS) and daily commuting origin destination (OD) matrices, we estimated the distribution of traffic flow in the San Francisco road network and studied Braess's paradox in a large-scale transportation network with realistic travel demand. We measured the variation of total travel time Δ T when a road segment is closed, and found that | Δ T | follows a power-law distribution if Δ T < 0 or Δ T > 0. This implies that most roads have a negligible effect on the efficiency of the road network, while the failure of a few crucial links would result in severe travel delays, and closure of a few inefficient links would counter-intuitively reduce travel costs considerably. Generating three theoretical networks, we discovered that the heterogeneously distributed travel demand may be the origin of the observed power-law distributions of | Δ T | . Finally, a genetic algorithm was used to pinpoint inefficient link clusters in the road network. We found that closing specific road clusters would further improve the transportation efficiency.
U.S. stock market interaction network as learned by the Boltzmann machine
Borysov, Stanislav S.; Roudi, Yasser; Balatsky, Alexander V.
2015-12-07
Here, we study historical dynamics of joint equilibrium distribution of stock returns in the U.S. stock market using the Boltzmann distribution model being parametrized by external fields and pairwise couplings. Within Boltzmann learning framework for statistical inference, we analyze historical behavior of the parameters inferred using exact and approximate learning algorithms. Since the model and inference methods require use of binary variables, effect of this mapping of continuous returns to the discrete domain is studied. The presented results show that binarization preserves the correlation structure of the market. Properties of distributions of external fields and couplings as well as themore » market interaction network and industry sector clustering structure are studied for different historical dates and moving window sizes. We demonstrate that the observed positive heavy tail in distribution of couplings is related to the sparse clustering structure of the market. We also show that discrepancies between the model’s parameters might be used as a precursor of financial instabilities.« less
A Fast Density-Based Clustering Algorithm for Real-Time Internet of Things Stream
Ying Wah, Teh
2014-01-01
Data streams are continuously generated over time from Internet of Things (IoT) devices. The faster all of this data is analyzed, its hidden trends and patterns discovered, and new strategies created, the faster action can be taken, creating greater value for organizations. Density-based method is a prominent class in clustering data streams. It has the ability to detect arbitrary shape clusters, to handle outlier, and it does not need the number of clusters in advance. Therefore, density-based clustering algorithm is a proper choice for clustering IoT streams. Recently, several density-based algorithms have been proposed for clustering data streams. However, density-based clustering in limited time is still a challenging issue. In this paper, we propose a density-based clustering algorithm for IoT streams. The method has fast processing time to be applicable in real-time application of IoT devices. Experimental results show that the proposed approach obtains high quality results with low computation time on real and synthetic datasets. PMID:25110753
A fast density-based clustering algorithm for real-time Internet of Things stream.
Amini, Amineh; Saboohi, Hadi; Wah, Teh Ying; Herawan, Tutut
2014-01-01
Data streams are continuously generated over time from Internet of Things (IoT) devices. The faster all of this data is analyzed, its hidden trends and patterns discovered, and new strategies created, the faster action can be taken, creating greater value for organizations. Density-based method is a prominent class in clustering data streams. It has the ability to detect arbitrary shape clusters, to handle outlier, and it does not need the number of clusters in advance. Therefore, density-based clustering algorithm is a proper choice for clustering IoT streams. Recently, several density-based algorithms have been proposed for clustering data streams. However, density-based clustering in limited time is still a challenging issue. In this paper, we propose a density-based clustering algorithm for IoT streams. The method has fast processing time to be applicable in real-time application of IoT devices. Experimental results show that the proposed approach obtains high quality results with low computation time on real and synthetic datasets.
Punzo, Antonio; Ingrassia, Salvatore; Maruotti, Antonello
2018-04-22
A time-varying latent variable model is proposed to jointly analyze multivariate mixed-support longitudinal data. The proposal can be viewed as an extension of hidden Markov regression models with fixed covariates (HMRMFCs), which is the state of the art for modelling longitudinal data, with a special focus on the underlying clustering structure. HMRMFCs are inadequate for applications in which a clustering structure can be identified in the distribution of the covariates, as the clustering is independent from the covariates distribution. Here, hidden Markov regression models with random covariates are introduced by explicitly specifying state-specific distributions for the covariates, with the aim of improving the recovering of the clusters in the data with respect to a fixed covariates paradigm. The hidden Markov regression models with random covariates class is defined focusing on the exponential family, in a generalized linear model framework. Model identifiability conditions are sketched, an expectation-maximization algorithm is outlined for parameter estimation, and various implementation and operational issues are discussed. Properties of the estimators of the regression coefficients, as well as of the hidden path parameters, are evaluated through simulation experiments and compared with those of HMRMFCs. The method is applied to physical activity data. Copyright © 2018 John Wiley & Sons, Ltd.
A clustering algorithm for determining community structure in complex networks
NASA Astrophysics Data System (ADS)
Jin, Hong; Yu, Wei; Li, ShiJun
2018-02-01
Clustering algorithms are attractive for the task of community detection in complex networks. DENCLUE is a representative density based clustering algorithm which has a firm mathematical basis and good clustering properties allowing for arbitrarily shaped clusters in high dimensional datasets. However, this method cannot be directly applied to community discovering due to its inability to deal with network data. Moreover, it requires a careful selection of the density parameter and the noise threshold. To solve these issues, a new community detection method is proposed in this paper. First, we use a spectral analysis technique to map the network data into a low dimensional Euclidean Space which can preserve node structural characteristics. Then, DENCLUE is applied to detect the communities in the network. A mathematical method named Sheather-Jones plug-in is chosen to select the density parameter which can describe the intrinsic clustering structure accurately. Moreover, every node on the network is meaningful so there were no noise nodes as a result the noise threshold can be ignored. We test our algorithm on both benchmark and real-life networks, and the results demonstrate the effectiveness of our algorithm over other popularity density based clustering algorithms adopted to community detection.
Yi, Chucai; Tian, Yingli
2012-09-01
In this paper, we propose a novel framework to extract text regions from scene images with complex backgrounds and multiple text appearances. This framework consists of three main steps: boundary clustering (BC), stroke segmentation, and string fragment classification. In BC, we propose a new bigram-color-uniformity-based method to model both text and attachment surface, and cluster edge pixels based on color pairs and spatial positions into boundary layers. Then, stroke segmentation is performed at each boundary layer by color assignment to extract character candidates. We propose two algorithms to combine the structural analysis of text stroke with color assignment and filter out background interferences. Further, we design a robust string fragment classification based on Gabor-based text features. The features are obtained from feature maps of gradient, stroke distribution, and stroke width. The proposed framework of text localization is evaluated on scene images, born-digital images, broadcast video images, and images of handheld objects captured by blind persons. Experimental results on respective datasets demonstrate that the framework outperforms state-of-the-art localization algorithms.
Hierarchical Solution of the Traveling Salesman Problem with Random Dyadic Tilings
NASA Astrophysics Data System (ADS)
Kalmár-Nagy, Tamás; Bak, Bendegúz Dezső
We propose a hierarchical heuristic approach for solving the Traveling Salesman Problem (TSP) in the unit square. The points are partitioned with a random dyadic tiling and clusters are formed by the points located in the same tile. Each cluster is represented by its geometrical barycenter and a “coarse” TSP solution is calculated for these barycenters. Midpoints are placed at the middle of each edge in the coarse solution. Near-optimal (or optimal) minimum tours are computed for each cluster. The tours are concatenated using the midpoints yielding a solution for the original TSP. The method is tested on random TSPs (independent, identically distributed points in the unit square) up to 10,000 points as well as on a popular benchmark problem (att532 — coordinates of 532 American cities). Our solutions are 8-13% longer than the optimal ones. We also present an optimization algorithm for the partitioning to improve our solutions. This algorithm further reduces the solution errors (by several percent using 1000 iteration steps). The numerical experiments demonstrate the viability of the approach.
Collaborative filtering recommendation model based on fuzzy clustering algorithm
NASA Astrophysics Data System (ADS)
Yang, Ye; Zhang, Yunhua
2018-05-01
As one of the most widely used algorithms in recommender systems, collaborative filtering algorithm faces two serious problems, which are the sparsity of data and poor recommendation effect in big data environment. In traditional clustering analysis, the object is strictly divided into several classes and the boundary of this division is very clear. However, for most objects in real life, there is no strict definition of their forms and attributes of their class. Concerning the problems above, this paper proposes to improve the traditional collaborative filtering model through the hybrid optimization of implicit semantic algorithm and fuzzy clustering algorithm, meanwhile, cooperating with collaborative filtering algorithm. In this paper, the fuzzy clustering algorithm is introduced to fuzzy clustering the information of project attribute, which makes the project belong to different project categories with different membership degrees, and increases the density of data, effectively reduces the sparsity of data, and solves the problem of low accuracy which is resulted from the inaccuracy of similarity calculation. Finally, this paper carries out empirical analysis on the MovieLens dataset, and compares it with the traditional user-based collaborative filtering algorithm. The proposed algorithm has greatly improved the recommendation accuracy.
NASA Astrophysics Data System (ADS)
Feng, Jian-xin; Tang, Jia-fu; Wang, Guang-xing
2007-04-01
On the basis of the analysis of clustering algorithm that had been proposed for MANET, a novel clustering strategy was proposed in this paper. With the trust defined by statistical hypothesis in probability theory and the cluster head selected by node trust and node mobility, this strategy can realize the function of the malicious nodes detection which was neglected by other clustering algorithms and overcome the deficiency of being incapable of implementing the relative mobility metric of corresponding nodes in the MOBIC algorithm caused by the fact that the receiving power of two consecutive HELLO packet cannot be measured. It's an effective solution to cluster MANET securely.
NASA Astrophysics Data System (ADS)
Dalkilic, Turkan Erbay; Apaydin, Aysen
2009-11-01
In a regression analysis, it is assumed that the observations come from a single class in a data cluster and the simple functional relationship between the dependent and independent variables can be expressed using the general model; Y=f(X)+[epsilon]. However; a data cluster may consist of a combination of observations that have different distributions that are derived from different clusters. When faced with issues of estimating a regression model for fuzzy inputs that have been derived from different distributions, this regression model has been termed the [`]switching regression model' and it is expressed with . Here li indicates the class number of each independent variable and p is indicative of the number of independent variables [J.R. Jang, ANFIS: Adaptive-network-based fuzzy inference system, IEEE Transaction on Systems, Man and Cybernetics 23 (3) (1993) 665-685; M. Michel, Fuzzy clustering and switching regression models using ambiguity and distance rejects, Fuzzy Sets and Systems 122 (2001) 363-399; E.Q. Richard, A new approach to estimating switching regressions, Journal of the American Statistical Association 67 (338) (1972) 306-310]. In this study, adaptive networks have been used to construct a model that has been formed by gathering obtained models. There are methods that suggest the class numbers of independent variables heuristically. Alternatively, in defining the optimal class number of independent variables, the use of suggested validity criterion for fuzzy clustering has been aimed. In the case that independent variables have an exponential distribution, an algorithm has been suggested for defining the unknown parameter of the switching regression model and for obtaining the estimated values after obtaining an optimal membership function, which is suitable for exponential distribution.
NASA Astrophysics Data System (ADS)
Cheng, W. Y.; Kim, D.; Rowe, A.; Park, S.
2017-12-01
Despite the impact of mesoscale convective organization on the properties of convection (e.g., mixing between updrafts and environment), parameterizing the degree of convective organization has only recently been attempted in cumulus parameterization schemes (e.g., Unified Convection Scheme UNICON). Additionally, challenges remain in determining the degree of convective organization from observations and in comparing directly with the organization metrics in model simulations. This study addresses the need to objectively quantify the degree of mesoscale convective organization using high quality S-PolKa radar data from the DYNAMO field campaign. One of the most noticeable aspects of mesoscale convective organization in radar data is the degree of convective clustering, which can be characterized by the number and size distribution of convective echoes and the distance between them. We propose a method of defining contiguous convective echoes (CCEs) using precipitating convective echoes identified by a rain type classification algorithm. Two classification algorithms, Steiner et al. (1995) and Powell et al. (2016), are tested and evaluated against high-resolution WRF simulations to determine which method better represents the degree of convective clustering. Our results suggest that the CCEs based on Powell et al.'s algorithm better represent the dynamical properties of the convective updrafts and thus provide the basis of a metric for convective organization. Furthermore, through a comparison with the observational data, the WRF simulations driven by the DYNAMO large-scale forcing, similarly applied to UNICON Single Column Model simulations, will allow us to evaluate the ability of both WRF and UNICON to simulate convective clustering. This evaluation is based on the physical processes that are explicitly represented in WRF and UNICON, including the mechanisms leading to convective clustering, and the feedback to the convective properties.
Measuring Constraint-Set Utility for Partitional Clustering Algorithms
NASA Technical Reports Server (NTRS)
Davidson, Ian; Wagstaff, Kiri L.; Basu, Sugato
2006-01-01
Clustering with constraints is an active area of machine learning and data mining research. Previous empirical work has convincingly shown that adding constraints to clustering improves the performance of a variety of algorithms. However, in most of these experiments, results are averaged over different randomly chosen constraint sets from a given set of labels, thereby masking interesting properties of individual sets. We demonstrate that constraint sets vary significantly in how useful they are for constrained clustering; some constraint sets can actually decrease algorithm performance. We create two quantitative measures, informativeness and coherence, that can be used to identify useful constraint sets. We show that these measures can also help explain differences in performance for four particular constrained clustering algorithms.
An Improved Clustering Algorithm of Tunnel Monitoring Data for Cloud Computing
Zhong, Luo; Tang, KunHao; Li, Lin; Yang, Guang; Ye, JingJing
2014-01-01
With the rapid development of urban construction, the number of urban tunnels is increasing and the data they produce become more and more complex. It results in the fact that the traditional clustering algorithm cannot handle the mass data of the tunnel. To solve this problem, an improved parallel clustering algorithm based on k-means has been proposed. It is a clustering algorithm using the MapReduce within cloud computing that deals with data. It not only has the advantage of being used to deal with mass data but also is more efficient. Moreover, it is able to compute the average dissimilarity degree of each cluster in order to clean the abnormal data. PMID:24982971
Efficient Record Linkage Algorithms Using Complete Linkage Clustering.
Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar
2016-01-01
Data from different agencies share data of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. A large number of available algorithms for record linkage are prone to either time inefficiency or low-accuracy in finding matches and non-matches among the records. In this paper we propose efficient as well as reliable sequential and parallel algorithms for the record linkage problem employing hierarchical clustering methods. We employ complete linkage hierarchical clustering algorithms to address this problem. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a sub-routine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy consuming reasonable run times.
Efficient Record Linkage Algorithms Using Complete Linkage Clustering
Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar
2016-01-01
Data from different agencies share data of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. A large number of available algorithms for record linkage are prone to either time inefficiency or low-accuracy in finding matches and non-matches among the records. In this paper we propose efficient as well as reliable sequential and parallel algorithms for the record linkage problem employing hierarchical clustering methods. We employ complete linkage hierarchical clustering algorithms to address this problem. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a sub-routine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy consuming reasonable run times. PMID:27124604
NASA Astrophysics Data System (ADS)
Thanos, Konstantinos-Georgios; Thomopoulos, Stelios C. A.
2014-06-01
The study in this paper belongs to a more general research of discovering facial sub-clusters in different ethnicity face databases. These new sub-clusters along with other metadata (such as race, sex, etc.) lead to a vector for each face in the database where each vector component represents the likelihood of participation of a given face to each cluster. This vector is then used as a feature vector in a human identification and tracking system based on face and other biometrics. The first stage in this system involves a clustering method which evaluates and compares the clustering results of five different clustering algorithms (average, complete, single hierarchical algorithm, k-means and DIGNET), and selects the best strategy for each data collection. In this paper we present the comparative performance of clustering results of DIGNET and four clustering algorithms (average, complete, single hierarchical and k-means) on fabricated 2D and 3D samples, and on actual face images from various databases, using four different standard metrics. These metrics are the silhouette figure, the mean silhouette coefficient, the Hubert test Γ coefficient, and the classification accuracy for each clustering result. The results showed that, in general, DIGNET gives more trustworthy results than the other algorithms when the metrics values are above a specific acceptance threshold. However when the evaluation results metrics have values lower than the acceptance threshold but not too low (too low corresponds to ambiguous results or false results), then it is necessary for the clustering results to be verified by the other algorithms.
Tang, Jiqiang; Yang, Wu; Zhu, Lingyun; Wang, Dong; Feng, Xin
2017-04-26
In recent years, Wireless Sensor Networks with a Mobile Sink (WSN-MS) have been an active research topic due to the widespread use of mobile devices. However, how to get the balance between data delivery latency and energy consumption becomes a key issue of WSN-MS. In this paper, we study the clustering approach by jointly considering the Route planning for mobile sink and Clustering Problem (RCP) for static sensor nodes. We solve the RCP problem by using the minimum travel route clustering approach, which applies the minimum travel route of the mobile sink to guide the clustering process. We formulate the RCP problem as an Integer Non-Linear Programming (INLP) problem to shorten the travel route of the mobile sink under three constraints: the communication hops constraint, the travel route constraint and the loop avoidance constraint. We then propose an Imprecise Induction Algorithm (IIA) based on the property that the solution with a small hop count is more feasible than that with a large hop count. The IIA algorithm includes three processes: initializing travel route planning with a Traveling Salesman Problem (TSP) algorithm, transforming the cluster head to a cluster member and transforming the cluster member to a cluster head. Extensive experimental results show that the IIA algorithm could automatically adjust cluster heads according to the maximum hops parameter and plan a shorter travel route for the mobile sink. Compared with the Shortest Path Tree-based Data-Gathering Algorithm (SPT-DGA), the IIA algorithm has the characteristics of shorter route length, smaller cluster head count and faster convergence rate.
Efficient implementation of parallel three-dimensional FFT on clusters of PCs
NASA Astrophysics Data System (ADS)
Takahashi, Daisuke
2003-05-01
In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of PCs. The three-dimensional FFT algorithm can be altered into a block three-dimensional FFT algorithm to reduce the number of cache misses. We show that the block three-dimensional FFT algorithm improves performance by utilizing the cache memory effectively. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT algorithm. We succeeded in obtaining performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz PC SMP cluster.
Blessy, S A Praylin Selva; Sulochana, C Helen
2015-01-01
Segmentation of brain tumor from Magnetic Resonance Imaging (MRI) becomes very complicated due to the structural complexities of human brain and the presence of intensity inhomogeneities. To propose a method that effectively segments brain tumor from MR images and to evaluate the performance of unsupervised optimal fuzzy clustering (UOFC) algorithm for segmentation of brain tumor from MR images. Segmentation is done by preprocessing the MR image to standardize intensity inhomogeneities followed by feature extraction, feature fusion and clustering. Different validation measures are used to evaluate the performance of the proposed method using different clustering algorithms. The proposed method using UOFC algorithm produces high sensitivity (96%) and low specificity (4%) compared to other clustering methods. Validation results clearly show that the proposed method with UOFC algorithm effectively segments brain tumor from MR images.
An incremental DPMM-based method for trajectory clustering, modeling, and retrieval.
Hu, Weiming; Li, Xi; Tian, Guodong; Maybank, Stephen; Zhang, Zhongfei
2013-05-01
Trajectory analysis is the basis for many applications, such as indexing of motion events in videos, activity recognition, and surveillance. In this paper, the Dirichlet process mixture model (DPMM) is applied to trajectory clustering, modeling, and retrieval. We propose an incremental version of a DPMM-based clustering algorithm and apply it to cluster trajectories. An appropriate number of trajectory clusters is determined automatically. When trajectories belonging to new clusters arrive, the new clusters can be identified online and added to the model without any retraining using the previous data. A time-sensitive Dirichlet process mixture model (tDPMM) is applied to each trajectory cluster for learning the trajectory pattern which represents the time-series characteristics of the trajectories in the cluster. Then, a parameterized index is constructed for each cluster. A novel likelihood estimation algorithm for the tDPMM is proposed, and a trajectory-based video retrieval model is developed. The tDPMM-based probabilistic matching method and the DPMM-based model growing method are combined to make the retrieval model scalable and adaptable. Experimental comparisons with state-of-the-art algorithms demonstrate the effectiveness of our algorithm.
NASA Astrophysics Data System (ADS)
Di, Nur Faraidah Muhammad; Satari, Siti Zanariah
2017-05-01
Outlier detection in linear data sets has been done vigorously but only a small amount of work has been done for outlier detection in circular data. In this study, we proposed multiple outliers detection in circular regression models based on the clustering algorithm. Clustering technique basically utilizes distance measure to define distance between various data points. Here, we introduce the similarity distance based on Euclidean distance for circular model and obtain a cluster tree using the single linkage clustering algorithm. Then, a stopping rule for the cluster tree based on the mean direction and circular standard deviation of the tree height is proposed. We classify the cluster group that exceeds the stopping rule as potential outlier. Our aim is to demonstrate the effectiveness of proposed algorithms with the similarity distances in detecting the outliers. It is found that the proposed methods are performed well and applicable for circular regression model.
NASA Astrophysics Data System (ADS)
Li, Hongsong; Lyu, Hang; Liao, Ningfang; Wu, Wenmin
2016-12-01
The bidirectional reflectance distribution function (BRDF) data in the ultraviolet (UV) band are valuable for many applications including cultural heritage, material analysis, surface characterization, and trace detection. We present a BRDF measurement instrument working in the near- and middle-UV spectral range. The instrument includes a collimated UV light source, a rotation stage, a UV imaging spectrometer, and a control computer. The data captured by the proposed instrument describe spatial, spectral, and angular variations of the light scattering from a sample surface. Such a multidimensional dataset of an example sample is captured by the proposed instrument and analyzed by a k-mean clustering algorithm to separate surface regions with same material but different surface roughnesses. The clustering results show that the angular dimension of the dataset can be exploited for surface roughness characterization. The two clustered BRDFs are fitted to a theoretical BRDF model. The fitting results show good agreement between the measurement data and the theoretical model.
A Fast Implementation of the ISOCLUS Algorithm
NASA Technical Reports Server (NTRS)
Memarsadeghi, Nargess; Mount, David M.; Netanyahu, Nathan S.; LeMoigne, Jacqueline
2003-01-01
Unsupervised clustering is a fundamental tool in numerous image processing and remote sensing applications. For example, unsupervised clustering is often used to obtain vegetation maps of an area of interest. This approach is useful when reliable training data are either scarce or expensive, and when relatively little a priori information about the data is available. Unsupervised clustering methods play a significant role in the pursuit of unsupervised classification. One of the most popular and widely used clustering schemes for remote sensing applications is the ISOCLUS algorithm, which is based on the ISODATA method. The algorithm is given a set of n data points (or samples) in d-dimensional space, an integer k indicating the initial number of clusters, and a number of additional parameters. The general goal is to compute a set of cluster centers in d-space. Although there is no specific optimization criterion, the algorithm is similar in spirit to the well known k-means clustering method in which the objective is to minimize the average squared distance of each point to its nearest center, called the average distortion. One significant feature of ISOCLUS over k-means is that clusters may be merged or split, and so the final number of clusters may be different from the number k supplied as part of the input. This algorithm will be described in later in this paper. The ISOCLUS algorithm can run very slowly, particularly on large data sets. Given its wide use in remote sensing, its efficient computation is an important goal. We have developed a fast implementation of the ISOCLUS algorithm. Our improvement is based on a recent acceleration to the k-means algorithm, the filtering algorithm, by Kanungo et al.. They showed that, by storing the data in a kd-tree, it was possible to significantly reduce the running time of k-means. We have adapted this method for the ISOCLUS algorithm. For technical reasons, which are explained later, it is necessary to make a minor modification to the ISOCLUS specification. We provide empirical evidence, on both synthetic and Landsat image data sets, that our algorithm's performance is essentially the same as that of ISOCLUS, but with significantly lower running times. We show that our algorithm runs from 3 to 30 times faster than a straightforward implementation of ISOCLUS. Our adaptation of the filtering algorithm involves the efficient computation of a number of cluster statistics that are needed for ISOCLUS, but not for k-means.
Coverage maximization under resource constraints using a nonuniform proliferating random walk.
Saha, Sudipta; Ganguly, Niloy
2013-02-01
Information management services on networks, such as search and dissemination, play a key role in any large-scale distributed system. One of the most desirable features of these services is the maximization of the coverage, i.e., the number of distinctly visited nodes under constraints of network resources as well as time. However, redundant visits of nodes by different message packets (modeled, e.g., as walkers) initiated by the underlying algorithms for these services cause wastage of network resources. In this work, using results from analytical studies done in the past on a K-random-walk-based algorithm, we identify that redundancy quickly increases with an increase in the density of the walkers. Based on this postulate, we design a very simple distributed algorithm which dynamically estimates the density of the walkers and thereby carefully proliferates walkers in sparse regions. We use extensive computer simulations to test our algorithm in various kinds of network topologies whereby we find it to be performing particularly well in networks that are highly clustered as well as sparse.
NASA Astrophysics Data System (ADS)
Hernandez, F. J.; Lopez, A. M.; Vanacore, E. A.
2017-12-01
The presence of earthquake swarms and clusters in the north and northeast of the island of Puerto Rico in the northeastern Caribbean have been recorded by the Puerto Rico Seismic Network (PRSN) since it started operations in 1974. Although clusters in the Puerto Rico-Virgin Island (PRVI) block have been observed for over forty years, the nature of their enigmatic occurrence is still poorly understood. In this study, the entire seismic catalog of the PRSN, of approximately 31,000 seismic events, has been limited to a sub-set of 18,000 events located all along north of Puerto Rico in an effort to characterize and understand the underlying mechanism of these clusters. This research uses two de-clustering methods to identify cluster events in the PRVI block. The first method, known as Model Independent Stochastic Declustering (MISD), filters the catalog sub-set into cluster and background seismic events, while the second method uses a spatio-temporal algorithm applied to the catalog in order to link the separate seismic events into clusters. After using these two methods, identified clusters were classified into either earthquake swarms or seismic sequences. Once identified, each cluster was analyzed to identify correlations against other clusters in their geographic region. Results from this research seek to : (1) unravel their earthquake clustering behavior through the use of different statistical methods and (2) better understand the mechanism for these clustering of earthquakes. Preliminary results have allowed to identify and classify 128 clusters categorized in 11 distinctive regions based on their centers, and their spatio-temporal distribution have been used to determine intra- and interplate dynamics.
NASA Astrophysics Data System (ADS)
Farsadnia, Farhad; Ghahreman, Bijan
2016-04-01
Hydrologic homogeneous group identification is considered both fundamental and applied research in hydrology. Clustering methods are among conventional methods to assess the hydrological homogeneous regions. Recently, Self-Organizing feature Map (SOM) method has been applied in some studies. However, the main problem of this method is the interpretation on the output map of this approach. Therefore, SOM is used as input to other clustering algorithms. The aim of this study is to apply a two-level Self-Organizing feature map and Ward hierarchical clustering method to determine the hydrologic homogenous regions in North and Razavi Khorasan provinces. At first by principal component analysis, we reduced SOM input matrix dimension, then the SOM was used to form a two-dimensional features map. To determine homogeneous regions for flood frequency analysis, SOM output nodes were used as input into the Ward method. Generally, the regions identified by the clustering algorithms are not statistically homogeneous. Consequently, they have to be adjusted to improve their homogeneity. After adjustment of the homogeneity regions by L-moment tests, five hydrologic homogeneous regions were identified. Finally, adjusted regions were created by a two-level SOM and then the best regional distribution function and associated parameters were selected by the L-moment approach. The results showed that the combination of self-organizing maps and Ward hierarchical clustering by principal components as input is more effective than the hierarchical method, by principal components or standardized inputs to achieve hydrologic homogeneous regions.
Reducing Earth Topography Resolution for SMAP Mission Ground Tracks Using K-Means Clustering
NASA Technical Reports Server (NTRS)
Rizvi, Farheen
2013-01-01
The K-means clustering algorithm is used to reduce Earth topography resolution for the SMAP mission ground tracks. As SMAP propagates in orbit, knowledge of the radar antenna footprints on Earth is required for the antenna misalignment calibration. Each antenna footprint contains a latitude and longitude location pair on the Earth surface. There are 400 pairs in one data set for the calibration model. It is computationally expensive to calculate corresponding Earth elevation for these data pairs. Thus, the antenna footprint resolution is reduced. Similar topographical data pairs are grouped together with the K-means clustering algorithm. The resolution is reduced to the mean of each topographical cluster called the cluster centroid. The corresponding Earth elevation for each cluster centroid is assigned to the entire group. Results show that 400 data points are reduced to 60 while still maintaining algorithm performance and computational efficiency. In this work, sensitivity analysis is also performed to show a trade-off between algorithm performance versus computational efficiency as the number of cluster centroids and algorithm iterations are increased.
Accurate Grid-based Clustering Algorithm with Diagonal Grid Searching and Merging
NASA Astrophysics Data System (ADS)
Liu, Feng; Ye, Chengcheng; Zhu, Erzhou
2017-09-01
Due to the advent of big data, data mining technology has attracted more and more attentions. As an important data analysis method, grid clustering algorithm is fast but with relatively lower accuracy. This paper presents an improved clustering algorithm combined with grid and density parameters. The algorithm first divides the data space into the valid meshes and invalid meshes through grid parameters. Secondly, from the starting point located at the first point of the diagonal of the grids, the algorithm takes the direction of “horizontal right, vertical down” to merge the valid meshes. Furthermore, by the boundary grid processing, the invalid grids are searched and merged when the adjacent left, above, and diagonal-direction grids are all the valid ones. By doing this, the accuracy of clustering is improved. The experimental results have shown that the proposed algorithm is accuracy and relatively faster when compared with some popularly used algorithms.
NASA Astrophysics Data System (ADS)
Chuan, Zun Liang; Ismail, Noriszura; Shinyie, Wendy Ling; Lit Ken, Tan; Fam, Soo-Fen; Senawi, Azlyna; Yusoff, Wan Nur Syahidah Wan
2018-04-01
Due to the limited of historical precipitation records, agglomerative hierarchical clustering algorithms widely used to extrapolate information from gauged to ungauged precipitation catchments in yielding a more reliable projection of extreme hydro-meteorological events such as extreme precipitation events. However, identifying the optimum number of homogeneous precipitation catchments accurately based on the dendrogram resulted using agglomerative hierarchical algorithms are very subjective. The main objective of this study is to propose an efficient regionalized algorithm to identify the homogeneous precipitation catchments for non-stationary precipitation time series. The homogeneous precipitation catchments are identified using average linkage hierarchical clustering algorithm associated multi-scale bootstrap resampling, while uncentered correlation coefficient as the similarity measure. The regionalized homogeneous precipitation is consolidated using K-sample Anderson Darling non-parametric test. The analysis result shows the proposed regionalized algorithm performed more better compared to the proposed agglomerative hierarchical clustering algorithm in previous studies.
Conformational Clusters of Phosphorylated Tyrosine.
Abdelrasoul, Maha; Ponniah, Komala; Mao, Alice; Warden, Meghan S; Elhefnawy, Wessam; Li, Yaohang; Pascal, Steven M
2017-12-06
Tyrosine phosphorylation plays an important role in many cellular and intercellular processes including signal transduction, subcellular localization, and regulation of enzymatic activity. In 1999, Blom et al., using the limited number of protein data bank (PDB) structures available at that time, reported that the side chain structures of phosphorylated tyrosine (pY) are partitioned into two conserved conformational clusters ( Blom, N.; Gammeltoft, S.; Brunak, S. J. Mol. Biol. 1999 , 294 , 1351 - 1362 ). We have used the spectral clustering algorithm to cluster the increasingly growing number of protein structures with pY sites, and have found that the pY residues cluster into three distinct side chain conformations. Two of these pY conformational clusters associate strongly with a narrow range of tyrosine backbone conformation. The novel cluster also highly correlates with the identity of the n + 1 residue, and is strongly associated with a sequential pYpY conformation which places two adjacent pY side chains in a specific relative orientation. Further analysis shows that the three pY clusters are associated with distinct distributions of cognate protein kinases.
Robust MST-Based Clustering Algorithm.
Liu, Qidong; Zhang, Ruisheng; Zhao, Zhili; Wang, Zhenghai; Jiao, Mengyao; Wang, Guangjing
2018-06-01
Minimax similarity stresses the connectedness of points via mediating elements rather than favoring high mutual similarity. The grouping principle yields superior clustering results when mining arbitrarily-shaped clusters in data. However, it is not robust against noises and outliers in the data. There are two main problems with the grouping principle: first, a single object that is far away from all other objects defines a separate cluster, and second, two connected clusters would be regarded as two parts of one cluster. In order to solve such problems, we propose robust minimum spanning tree (MST)-based clustering algorithm in this letter. First, we separate the connected objects by applying a density-based coarsening phase, resulting in a low-rank matrix in which the element denotes the supernode by combining a set of nodes. Then a greedy method is presented to partition those supernodes through working on the low-rank matrix. Instead of removing the longest edges from MST, our algorithm groups the data set based on the minimax similarity. Finally, the assignment of all data points can be achieved through their corresponding supernodes. Experimental results on many synthetic and real-world data sets show that our algorithm consistently outperforms compared clustering algorithms.
NASA Astrophysics Data System (ADS)
Brenden, T. O.; Clark, R. D.; Wiley, M. J.; Seelbach, P. W.; Wang, L.
2005-05-01
Remote sensing and geographic information systems have made it possible to attribute variables for streams at increasingly detailed resolutions (e.g., individual river reaches). Nevertheless, management decisions still must be made at large scales because land and stream managers typically lack sufficient resources to manage on an individual reach basis. Managers thus require a method for identifying stream management units that are ecologically similar and that can be expected to respond similarly to management decisions. We have developed a spatially-constrained clustering algorithm that can merge neighboring river reaches with similar ecological characteristics into larger management units. The clustering algorithm is based on the Cluster Affinity Search Technique (CAST), which was developed for clustering gene expression data. Inputs to the clustering algorithm are the neighbor relationships of the reaches that comprise the digital river network, the ecological attributes of the reaches, and an affinity value, which identifies the minimum similarity for merging river reaches. In this presentation, we describe the clustering algorithm in greater detail and contrast its use with other methods (expert opinion, classification approach, regular clustering) for identifying management units using several Michigan watersheds as a backdrop.
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms.
Chen, Chunlei; He, Li; Zhang, Huixiang; Zheng, Hao; Wang, Lei
2017-01-01
Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering raise high demand on computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, the incremental clustering algorithm is facing a dilemma between clustering accuracy and parallelism when they are powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering like evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experiment results verified theoretical conclusions.
Hierarchical trie packet classification algorithm based on expectation-maximization clustering
Bi, Xia-an; Zhao, Junxia
2017-01-01
With the development of computer network bandwidth, packet classification algorithms which are able to deal with large-scale rule sets are in urgent need. Among the existing algorithms, researches on packet classification algorithms based on hierarchical trie have become an important packet classification research branch because of their widely practical use. Although hierarchical trie is beneficial to save large storage space, it has several shortcomings such as the existence of backtracking and empty nodes. This paper proposes a new packet classification algorithm, Hierarchical Trie Algorithm Based on Expectation-Maximization Clustering (HTEMC). Firstly, this paper uses the formalization method to deal with the packet classification problem by means of mapping the rules and data packets into a two-dimensional space. Secondly, this paper uses expectation-maximization algorithm to cluster the rules based on their aggregate characteristics, and thereby diversified clusters are formed. Thirdly, this paper proposes a hierarchical trie based on the results of expectation-maximization clustering. Finally, this paper respectively conducts simulation experiments and real-environment experiments to compare the performances of our algorithm with other typical algorithms, and analyzes the results of the experiments. The hierarchical trie structure in our algorithm not only adopts trie path compression to eliminate backtracking, but also solves the problem of low efficiency of trie updates, which greatly improves the performance of the algorithm. PMID:28704476
NASA Astrophysics Data System (ADS)
Gendron, Marlin Lee
During Mine Warfare (MIW) operations, MIW analysts perform change detection by visually comparing historical sidescan sonar imagery (SSI) collected by a sidescan sonar with recently collected SSI in an attempt to identify objects (which might be explosive mines) placed at sea since the last time the area was surveyed. This dissertation presents a data structure and three algorithms, developed by the author, that are part of an automated change detection and classification (ACDC) system. MIW analysts at the Naval Oceanographic Office, to reduce the amount of time to perform change detection, are currently using ACDC. The dissertation introductory chapter gives background information on change detection, ACDC, and describes how SSI is produced from raw sonar data. Chapter 2 presents the author's Geospatial Bitmap (GB) data structure, which is capable of storing information geographically and is utilized by the three algorithms. This chapter shows that a GB data structure used in a polygon-smoothing algorithm ran between 1.3--48.4x faster than a sparse matrix data structure. Chapter 3 describes the GB clustering algorithm, which is the author's repeatable, order-independent method for clustering. Results from tests performed in this chapter show that the time to cluster a set of points is not affected by the distribution or the order of the points. In Chapter 4, the author presents his real-time computer-aided detection (CAD) algorithm that automatically detects mine-like objects on the seafloor in SSI. The author ran his GB-based CAD algorithm on real SSI data, and results of these tests indicate that his real-time CAD algorithm performs comparably to or better than other non-real-time CAD algorithms. The author presents his computer-aided search (CAS) algorithm in Chapter 5. CAS helps MIW analysts locate mine-like features that are geospatially close to previously detected features. A comparison between the CAS and a great circle distance algorithm shows that the CAS performs geospatial searching 1.75x faster on large data sets. Finally, the concluding chapter of this dissertation gives important details on how the completed ACDC system will function, and discusses the author's future research to develop additional algorithms and data structures for ACDC.
A fast parallel clustering algorithm for molecular simulation trajectories.
Zhao, Yutong; Sheong, Fu Kit; Sun, Jian; Sander, Pedro; Huang, Xuhui
2013-01-15
We implemented a GPU-powered parallel k-centers algorithm to perform clustering on the conformations of molecular dynamics (MD) simulations. The algorithm is up to two orders of magnitude faster than the CPU implementation. We tested our algorithm on four protein MD simulation datasets ranging from the small Alanine Dipeptide to a 370-residue Maltose Binding Protein (MBP). It is capable of grouping 250,000 conformations of the MBP into 4000 clusters within 40 seconds. To achieve this, we effectively parallelized the code on the GPU and utilize the triangle inequality of metric spaces. Furthermore, the algorithm's running time is linear with respect to the number of cluster centers. In addition, we found the triangle inequality to be less effective in higher dimensions and provide a mathematical rationale. Finally, using Alanine Dipeptide as an example, we show a strong correlation between cluster populations resulting from the k-centers algorithm and the underlying density. © 2012 Wiley Periodicals, Inc. Copyright © 2012 Wiley Periodicals, Inc.
A Class of Manifold Regularized Multiplicative Update Algorithms for Image Clustering.
Yang, Shangming; Yi, Zhang; He, Xiaofei; Li, Xuelong
2015-12-01
Multiplicative update algorithms are important tools for information retrieval, image processing, and pattern recognition. However, when the graph regularization is added to the cost function, different classes of sample data may be mapped to the same subspace, which leads to the increase of data clustering error rate. In this paper, an improved nonnegative matrix factorization (NMF) cost function is introduced. Based on the cost function, a class of novel graph regularized NMF algorithms is developed, which results in a class of extended multiplicative update algorithms with manifold structure regularization. Analysis shows that in the learning, the proposed algorithms can efficiently minimize the rank of the data representation matrix. Theoretical results presented in this paper are confirmed by simulations. For different initializations and data sets, variation curves of cost functions and decomposition data are presented to show the convergence features of the proposed update rules. Basis images, reconstructed images, and clustering results are utilized to present the efficiency of the new algorithms. Last, the clustering accuracies of different algorithms are also investigated, which shows that the proposed algorithms can achieve state-of-the-art performance in applications of image clustering.
Long-term surface EMG monitoring using K-means clustering and compressive sensing
NASA Astrophysics Data System (ADS)
Balouchestani, Mohammadreza; Krishnan, Sridhar
2015-05-01
In this work, we present an advanced K-means clustering algorithm based on Compressed Sensing theory (CS) in combination with the K-Singular Value Decomposition (K-SVD) method for Clustering of long-term recording of surface Electromyography (sEMG) signals. The long-term monitoring of sEMG signals aims at recording of the electrical activity produced by muscles which are very useful procedure for treatment and diagnostic purposes as well as for detection of various pathologies. The proposed algorithm is examined for three scenarios of sEMG signals including healthy person (sEMG-Healthy), a patient with myopathy (sEMG-Myopathy), and a patient with neuropathy (sEMG-Neuropathr), respectively. The proposed algorithm can easily scan large sEMG datasets of long-term sEMG recording. We test the proposed algorithm with Principal Component Analysis (PCA) and Linear Correlation Coefficient (LCC) dimensionality reduction methods. Then, the output of the proposed algorithm is fed to K-Nearest Neighbours (K-NN) and Probabilistic Neural Network (PNN) classifiers in order to calclute the clustering performance. The proposed algorithm achieves a classification accuracy of 99.22%. This ability allows reducing 17% of Average Classification Error (ACE), 9% of Training Error (TE), and 18% of Root Mean Square Error (RMSE). The proposed algorithm also reduces 14% clustering energy consumption compared to the existing K-Means clustering algorithm.
NASA Astrophysics Data System (ADS)
Fan, Tian-E.; Shao, Gui-Fang; Ji, Qing-Shuang; Zheng, Ji-Wen; Liu, Tun-dong; Wen, Yu-Hua
2016-11-01
Theoretically, the determination of the structure of a cluster is to search the global minimum on its potential energy surface. The global minimization problem is often nondeterministic-polynomial-time (NP) hard and the number of local minima grows exponentially with the cluster size. In this article, a multi-populations multi-strategies differential evolution algorithm has been proposed to search the globally stable structure of Fe and Cr nanoclusters. The algorithm combines a multi-populations differential evolution with an elite pool scheme to keep the diversity of the solutions and avoid prematurely trapping into local optima. Moreover, multi-strategies such as growing method in initialization and three differential strategies in mutation are introduced to improve the convergence speed and lower the computational cost. The accuracy and effectiveness of our algorithm have been verified by comparing the results of Fe clusters with Cambridge Cluster Database. Meanwhile, the performance of our algorithm has been analyzed by comparing the convergence rate and energy evaluations with the classical DE algorithm. The multi-populations, multi-strategies mutation and growing method in initialization in our algorithm have been considered respectively. Furthermore, the structural growth pattern of Cr clusters has been predicted by this algorithm. The results show that the lowest-energy structure of Cr clusters contains many icosahedra, and the number of the icosahedral rings rises with increasing size.
Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra".
Griss, Johannes; Perez-Riverol, Yasset; The, Matthew; Käll, Lukas; Vizcaíno, Juan Antonio
2018-05-04
In the recent benchmarking article entitled "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra", Rieder et al. compared several different approaches to cluster MS/MS spectra. While we certainly recognize the value of the manuscript, here, we report some shortcomings detected in the original analyses. For most analyses, the authors clustered only single MS/MS runs. In one of the reported analyses, three MS/MS runs were processed together, which already led to computational performance issues in many of the tested approaches. This fact highlights the difficulties of using many of the tested algorithms on the nowadays produced average proteomics data sets. Second, the authors only processed identified spectra when merging MS runs. Thereby, all unidentified spectra that are of lower quality were already removed from the data set and could not influence the clustering results. Next, we found that the authors did not analyze the effect of chimeric spectra on the clustering results. In our analysis, we found that 3% of the spectra in the used data sets were chimeric, and this had marked effects on the behavior of the different clustering algorithms tested. Finally, the authors' choice to evaluate the MS-Cluster and spectra-cluster algorithms using a precursor tolerance of 5 Da for high-resolution Orbitrap data only was, in our opinion, not adequate to assess the performance of MS/MS clustering approaches.
A Novel Artificial Bee Colony Based Clustering Algorithm for Categorical Data
2015-01-01
Data with categorical attributes are ubiquitous in the real world. However, existing partitional clustering algorithms for categorical data are prone to fall into local optima. To address this issue, in this paper we propose a novel clustering algorithm, ABC-K-Modes (Artificial Bee Colony clustering based on K-Modes), based on the traditional k-modes clustering algorithm and the artificial bee colony approach. In our approach, we first introduce a one-step k-modes procedure, and then integrate this procedure with the artificial bee colony approach to deal with categorical data. In the search process performed by scout bees, we adopt the multi-source search inspired by the idea of batch processing to accelerate the convergence of ABC-K-Modes. The performance of ABC-K-Modes is evaluated by a series of experiments in comparison with that of the other popular algorithms for categorical data. PMID:25993469
A dynamic scheduling algorithm for singe-arm two-cluster tools with flexible processing times
NASA Astrophysics Data System (ADS)
Li, Xin; Fung, Richard Y. K.
2018-02-01
This article presents a dynamic algorithm for job scheduling in two-cluster tools producing multi-type wafers with flexible processing times. Flexible processing times mean that the actual times for processing wafers should be within given time intervals. The objective of the work is to minimize the completion time of the newly inserted wafer. To deal with this issue, a two-cluster tool is decomposed into three reduced single-cluster tools (RCTs) in a series based on a decomposition approach proposed in this article. For each single-cluster tool, a dynamic scheduling algorithm based on temporal constraints is developed to schedule the newly inserted wafer. Three experiments have been carried out to test the dynamic scheduling algorithm proposed, comparing with the results the 'earliest starting time' heuristic (EST) adopted in previous literature. The results show that the dynamic algorithm proposed in this article is effective and practical.
A novel artificial bee colony based clustering algorithm for categorical data.
Ji, Jinchao; Pang, Wei; Zheng, Yanlin; Wang, Zhe; Ma, Zhiqiang
2015-01-01
Data with categorical attributes are ubiquitous in the real world. However, existing partitional clustering algorithms for categorical data are prone to fall into local optima. To address this issue, in this paper we propose a novel clustering algorithm, ABC-K-Modes (Artificial Bee Colony clustering based on K-Modes), based on the traditional k-modes clustering algorithm and the artificial bee colony approach. In our approach, we first introduce a one-step k-modes procedure, and then integrate this procedure with the artificial bee colony approach to deal with categorical data. In the search process performed by scout bees, we adopt the multi-source search inspired by the idea of batch processing to accelerate the convergence of ABC-K-Modes. The performance of ABC-K-Modes is evaluated by a series of experiments in comparison with that of the other popular algorithms for categorical data.
Zhu, Bohui; Ding, Yongsheng; Hao, Kuangrong
2013-01-01
This paper presents a novel maximum margin clustering method with immune evolution (IEMMC) for automatic diagnosis of electrocardiogram (ECG) arrhythmias. This diagnostic system consists of signal processing, feature extraction, and the IEMMC algorithm for clustering of ECG arrhythmias. First, raw ECG signal is processed by an adaptive ECG filter based on wavelet transforms, and waveform of the ECG signal is detected; then, features are extracted from ECG signal to cluster different types of arrhythmias by the IEMMC algorithm. Three types of performance evaluation indicators are used to assess the effect of the IEMMC method for ECG arrhythmias, such as sensitivity, specificity, and accuracy. Compared with K-means and iterSVR algorithms, the IEMMC algorithm reflects better performance not only in clustering result but also in terms of global search ability and convergence ability, which proves its effectiveness for the detection of ECG arrhythmias. PMID:23690875
The Ordered Clustered Travelling Salesman Problem: A Hybrid Genetic Algorithm
Ahmed, Zakir Hussain
2014-01-01
The ordered clustered travelling salesman problem is a variation of the usual travelling salesman problem in which a set of vertices (except the starting vertex) of the network is divided into some prespecified clusters. The objective is to find the least cost Hamiltonian tour in which vertices of any cluster are visited contiguously and the clusters are visited in the prespecified order. The problem is NP-hard, and it arises in practical transportation and sequencing problems. This paper develops a hybrid genetic algorithm using sequential constructive crossover, 2-opt search, and a local search for obtaining heuristic solution to the problem. The efficiency of the algorithm has been examined against two existing algorithms for some asymmetric and symmetric TSPLIB instances of various sizes. The computational results show that the proposed algorithm is very effective in terms of solution quality and computational time. Finally, we present solution to some more symmetric TSPLIB instances. PMID:24701148
A genetic graph-based approach for partitional clustering.
Menéndez, Héctor D; Barrero, David F; Camacho, David
2014-05-01
Clustering is one of the most versatile tools for data analysis. In the recent years, clustering that seeks the continuity of data (in opposition to classical centroid-based approaches) has attracted an increasing research interest. It is a challenging problem with a remarkable practical interest. The most popular continuity clustering method is the spectral clustering (SC) algorithm, which is based on graph cut: It initially generates a similarity graph using a distance measure and then studies its graph spectrum to find the best cut. This approach is sensitive to the parameters of the metric, and a correct parameter choice is critical to the quality of the cluster. This work proposes a new algorithm, inspired by SC, that reduces the parameter dependency while maintaining the quality of the solution. The new algorithm, named genetic graph-based clustering (GGC), takes an evolutionary approach introducing a genetic algorithm (GA) to cluster the similarity graph. The experimental validation shows that GGC increases robustness of SC and has competitive performance in comparison with classical clustering methods, at least, in the synthetic and real dataset used in the experiments.
Exploratory Item Classification Via Spectral Graph Clustering
Chen, Yunxiao; Li, Xiaoou; Liu, Jingchen; Xu, Gongjun; Ying, Zhiliang
2017-01-01
Large-scale assessments are supported by a large item pool. An important task in test development is to assign items into scales that measure different characteristics of individuals, and a popular approach is cluster analysis of items. Classical methods in cluster analysis, such as the hierarchical clustering, K-means method, and latent-class analysis, often induce a high computational overhead and have difficulty handling missing data, especially in the presence of high-dimensional responses. In this article, the authors propose a spectral clustering algorithm for exploratory item cluster analysis. The method is computationally efficient, effective for data with missing or incomplete responses, easy to implement, and often outperforms traditional clustering algorithms in the context of high dimensionality. The spectral clustering algorithm is based on graph theory, a branch of mathematics that studies the properties of graphs. The algorithm first constructs a graph of items, characterizing the similarity structure among items. It then extracts item clusters based on the graphical structure, grouping similar items together. The proposed method is evaluated through simulations and an application to the revised Eysenck Personality Questionnaire. PMID:29033476
Symmetric nonnegative matrix factorization: algorithms and applications to probabilistic clustering.
He, Zhaoshui; Xie, Shengli; Zdunek, Rafal; Zhou, Guoxu; Cichocki, Andrzej
2011-12-01
Nonnegative matrix factorization (NMF) is an unsupervised learning method useful in various applications including image processing and semantic analysis of documents. This paper focuses on symmetric NMF (SNMF), which is a special case of NMF decomposition. Three parallel multiplicative update algorithms using level 3 basic linear algebra subprograms directly are developed for this problem. First, by minimizing the Euclidean distance, a multiplicative update algorithm is proposed, and its convergence under mild conditions is proved. Based on it, we further propose another two fast parallel methods: α-SNMF and β -SNMF algorithms. All of them are easy to implement. These algorithms are applied to probabilistic clustering. We demonstrate their effectiveness for facial image clustering, document categorization, and pattern clustering in gene expression.
Aerosol Models for the CALIPSO Lidar Inversion Algorithms
NASA Technical Reports Server (NTRS)
Omar, Ali H.; Winker, David M.; Won, Jae-Gwang
2003-01-01
We use measurements and models to develop aerosol models for use in the inversion algorithms for the Cloud Aerosol Lidar and Imager Pathfinder Spaceborne Observations (CALIPSO). Radiance measurements and inversions of the AErosol RObotic NETwork (AERONET1, 2) are used to group global atmospheric aerosols using optical and microphysical parameters. This study uses more than 105 records of radiance measurements, aerosol size distributions, and complex refractive indices to generate the optical properties of the aerosol at more 200 sites worldwide. These properties together with the radiance measurements are then classified using classical clustering methods to group the sites according to the type of aerosol with the greatest frequency of occurrence at each site. Six significant clusters are identified: desert dust, biomass burning, urban industrial pollution, rural background, marine, and dirty pollution. Three of these are used in the CALIPSO aerosol models to characterize desert dust, biomass burning, and polluted continental aerosols. The CALIPSO aerosol model also uses the coarse mode of desert dust and the fine mode of biomass burning to build a polluted dust model. For marine aerosol, the CALIPSO aerosol model uses measurements from the SEAS experiment 3. In addition to categorizing the aerosol types, the cluster analysis provides all the column optical and microphysical properties for each cluster.
Tang, Jiqiang; Yang, Wu; Zhu, Lingyun; Wang, Dong; Feng, Xin
2017-01-01
In recent years, Wireless Sensor Networks with a Mobile Sink (WSN-MS) have been an active research topic due to the widespread use of mobile devices. However, how to get the balance between data delivery latency and energy consumption becomes a key issue of WSN-MS. In this paper, we study the clustering approach by jointly considering the Route planning for mobile sink and Clustering Problem (RCP) for static sensor nodes. We solve the RCP problem by using the minimum travel route clustering approach, which applies the minimum travel route of the mobile sink to guide the clustering process. We formulate the RCP problem as an Integer Non-Linear Programming (INLP) problem to shorten the travel route of the mobile sink under three constraints: the communication hops constraint, the travel route constraint and the loop avoidance constraint. We then propose an Imprecise Induction Algorithm (IIA) based on the property that the solution with a small hop count is more feasible than that with a large hop count. The IIA algorithm includes three processes: initializing travel route planning with a Traveling Salesman Problem (TSP) algorithm, transforming the cluster head to a cluster member and transforming the cluster member to a cluster head. Extensive experimental results show that the IIA algorithm could automatically adjust cluster heads according to the maximum hops parameter and plan a shorter travel route for the mobile sink. Compared with the Shortest Path Tree-based Data-Gathering Algorithm (SPT-DGA), the IIA algorithm has the characteristics of shorter route length, smaller cluster head count and faster convergence rate. PMID:28445434
Nidheesh, N; Abdul Nazeer, K A; Ameer, P M
2017-12-01
Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids. We propose an improved, density based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select data points which belong to dense regions and which are adequately separated in feature space as the initial centroids. We compared the proposed algorithm to a set of eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm which is being used for cancer data classification, based on the performances on a set of datasets comprising ten cancer gene expression datasets. The proposed algorithm has shown better overall performance than the others. There is a pressing need in the Biomedical domain for simple, easy-to-use and more accurate Machine Learning tools for cancer subtype prediction. The proposed algorithm is simple, easy-to-use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data. Copyright © 2017 Elsevier Ltd. All rights reserved.
The ellipticity of galaxy cluster haloes from satellite galaxies and weak lensing
Shin, Tae-hyeon; Clampitt, Joseph; Jain, Bhuvnesh; ...
2018-01-04
Here, we study the ellipticity of galaxy cluster haloes as characterized by the distribution of cluster galaxies and as measured with weak lensing. We use Monte Carlo simulations of elliptical cluster density profiles to estimate and correct for Poisson noise bias, edge bias and projection effects. We apply our methodology to 10 428 Sloan Digital Sky Survey clusters identified by the redMaPPer algorithm with richness above 20. We find a mean ellipticity =0.271 ± 0.002 (stat) ±0.031 (sys) corresponding to an axis ratio = 0.573 ± 0.002 (stat) ±0.039 (sys). We compare this ellipticity of the satellites to the halomore » shape, through a stacked lensing measurement using optimal estimators of the lensing quadrupole based on Clampitt and Jain (2016). We find a best-fitting axis ratio of 0.56 ± 0.09 (stat) ±0.03 (sys), consistent with the ellipticity of the satellite distribution. Thus, cluster galaxies trace the shape of the dark matter halo to within our estimated uncertainties. Finally, we restack the satellite and lensing ellipticity measurements along the major axis of the cluster central galaxy's light distribution. From the lensing measurements, we infer a misalignment angle with an root-mean-square of 30° ± 10° when stacking on the central galaxy. We discuss applications of halo shape measurements to test the effects of the baryonic gas and active galactic nucleus feedback, as well as dark matter and gravity. The major improvements in signal-to-noise ratio expected with the ongoing Dark Energy Survey and future surveys from Large Synoptic Survey Telescope, Euclid, and Wide Field Infrared Survey Telescope will make halo shapes a useful probe of these effects.« less
Anisotropic thermal conduction with magnetic fields in galaxy clusters
NASA Astrophysics Data System (ADS)
Arth, Alexander; Dolag, Klaus; Beck, Alexander; Petkova, Margarita; Lesch, Harald
2015-08-01
Magnetic fields play an important role for the propagation and diffusion of charged particles, which are responsible for thermal conduction. In this poster, we present an implementation of thermal conduction including the anisotropic effects of magnetic fields for smoothed particle hydrodynamics (SPH). The anisotropic thermal conduction is mainly proceeding parallel to magnetic fields and suppressed perpendicular to the fields. We derive the SPH formalism for the anisotropic heat transport and solve the corresponding equation with an implicit conjugate gradient scheme. We discuss several issues of unphysical heat transport in the cases of extreme ansiotropies or unmagnetized regions and present possible numerical workarounds. We implement our algorithm into the cosmological simulation code GADGET and study its behaviour in several test cases. In general, we reproduce the analytical solutions of our idealised test problems, and obtain good results in cosmological simulations of galaxy cluster formations. Within galaxy clusters, the anisotropic conduction produces a net heat transport similar to an isotropic Spitzer conduction model with low efficiency. In contrast to isotropic conduction our new formalism allows small-scale structure in the temperature distribution to remain stable, because of their decoupling caused by magnetic field lines. Compared to observations, strong isotropic conduction leads to an oversmoothed temperature distribution within clusters, while the results obtained with anisotropic thermal conduction reproduce the observed temperature fluctuations well. A proper treatment of heat transport is crucial especially in the outskirts of clusters and also in high density regions. It's connection to the local dynamical state of the cluster also might contribute to the observed bimodal distribution of cool core and non cool core clusters. Our new scheme significantly advances the modelling of thermal conduction in numerical simulations and overall gives better results compared to observations.
The ellipticity of galaxy cluster haloes from satellite galaxies and weak lensing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shin, Tae-hyeon; Clampitt, Joseph; Jain, Bhuvnesh
Here, we study the ellipticity of galaxy cluster haloes as characterized by the distribution of cluster galaxies and as measured with weak lensing. We use Monte Carlo simulations of elliptical cluster density profiles to estimate and correct for Poisson noise bias, edge bias and projection effects. We apply our methodology to 10 428 Sloan Digital Sky Survey clusters identified by the redMaPPer algorithm with richness above 20. We find a mean ellipticity =0.271 ± 0.002 (stat) ±0.031 (sys) corresponding to an axis ratio = 0.573 ± 0.002 (stat) ±0.039 (sys). We compare this ellipticity of the satellites to the halomore » shape, through a stacked lensing measurement using optimal estimators of the lensing quadrupole based on Clampitt and Jain (2016). We find a best-fitting axis ratio of 0.56 ± 0.09 (stat) ±0.03 (sys), consistent with the ellipticity of the satellite distribution. Thus, cluster galaxies trace the shape of the dark matter halo to within our estimated uncertainties. Finally, we restack the satellite and lensing ellipticity measurements along the major axis of the cluster central galaxy's light distribution. From the lensing measurements, we infer a misalignment angle with an root-mean-square of 30° ± 10° when stacking on the central galaxy. We discuss applications of halo shape measurements to test the effects of the baryonic gas and active galactic nucleus feedback, as well as dark matter and gravity. The major improvements in signal-to-noise ratio expected with the ongoing Dark Energy Survey and future surveys from Large Synoptic Survey Telescope, Euclid, and Wide Field Infrared Survey Telescope will make halo shapes a useful probe of these effects.« less
The ellipticity of galaxy cluster haloes from satellite galaxies and weak lensing
NASA Astrophysics Data System (ADS)
Shin, Tae-hyeon; Clampitt, Joseph; Jain, Bhuvnesh; Bernstein, Gary; Neil, Andrew; Rozo, Eduardo; Rykoff, Eli
2018-04-01
We study the ellipticity of galaxy cluster haloes as characterized by the distribution of cluster galaxies and as measured with weak lensing. We use Monte Carlo simulations of elliptical cluster density profiles to estimate and correct for Poisson noise bias, edge bias and projection effects. We apply our methodology to 10 428 Sloan Digital Sky Survey clusters identified by the redMaPPer algorithm with richness above 20. We find a mean ellipticity =0.271 ± 0.002 (stat) ±0.031 (sys) corresponding to an axis ratio = 0.573 ± 0.002 (stat) ±0.039 (sys). We compare this ellipticity of the satellites to the halo shape, through a stacked lensing measurement using optimal estimators of the lensing quadrupole based on Clampitt and Jain (2016). We find a best-fitting axis ratio of 0.56 ± 0.09 (stat) ±0.03 (sys), consistent with the ellipticity of the satellite distribution. Thus, cluster galaxies trace the shape of the dark matter halo to within our estimated uncertainties. Finally, we restack the satellite and lensing ellipticity measurements along the major axis of the cluster central galaxy's light distribution. From the lensing measurements, we infer a misalignment angle with an root-mean-square of 30° ± 10° when stacking on the central galaxy. We discuss applications of halo shape measurements to test the effects of the baryonic gas and active galactic nucleus feedback, as well as dark matter and gravity. The major improvements in signal-to-noise ratio expected with the ongoing Dark Energy Survey and future surveys from Large Synoptic Survey Telescope, Euclid, and Wide Field Infrared Survey Telescope will make halo shapes a useful probe of these effects.
Determining the Number of Clusters in a Data Set Without Graphical Interpretation
NASA Technical Reports Server (NTRS)
Aguirre, Nathan S.; Davies, Misty D.
2011-01-01
Cluster analysis is a data mining technique that is meant ot simplify the process of classifying data points. The basic clustering process requires an input of data points and the number of clusters wanted. The clustering algorithm will then pick starting C points for the clusters, which can be either random spatial points or random data points. It then assigns each data point to the nearest C point where "nearest usually means Euclidean distance, but some algorithms use another criterion. The next step is determining whether the clustering arrangement this found is within a certain tolerance. If it falls within this tolerance, the process ends. Otherwise the C points are adjusted based on how many data points are in each cluster, and the steps repeat until the algorithm converges,
Bhattacharya, Anindya; De, Rajat K
2010-08-01
Distance based clustering algorithms can group genes that show similar expression values under multiple experimental conditions. They are unable to identify a group of genes that have similar pattern of variation in their expression values. Previously we developed an algorithm called divisive correlation clustering algorithm (DCCA) to tackle this situation, which is based on the concept of correlation clustering. But this algorithm may also fail for certain cases. In order to overcome these situations, we propose a new clustering algorithm, called average correlation clustering algorithm (ACCA), which is able to produce better clustering solution than that produced by some others. ACCA is able to find groups of genes having more common transcription factors and similar pattern of variation in their expression values. Moreover, ACCA is more efficient than DCCA with respect to the time of execution. Like DCCA, we use the concept of correlation clustering concept introduced by Bansal et al. ACCA uses the correlation matrix in such a way that all genes in a cluster have the highest average correlation values with the genes in that cluster. We have applied ACCA and some well-known conventional methods including DCCA to two artificial and nine gene expression datasets, and compared the performance of the algorithms. The clustering results of ACCA are found to be more significantly relevant to the biological annotations than those of the other methods. Analysis of the results show the superiority of ACCA over some others in determining a group of genes having more common transcription factors and with similar pattern of variation in their expression profiles. Availability of the software: The software has been developed using C and Visual Basic languages, and can be executed on the Microsoft Windows platforms. The software may be downloaded as a zip file from http://www.isical.ac.in/~rajat. Then it needs to be installed. Two word files (included in the zip file) need to be consulted before installation and execution of the software. Copyright 2010 Elsevier Inc. All rights reserved.
Reducing the time requirement of k-means algorithm.
Osamor, Victor Chukwudi; Adebiyi, Ezekiel Femi; Oyelade, Jelilli Olarenwaju; Doumbia, Seydou
2012-01-01
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R(d) and an integer k. The problem is to determine a set of k points in R(d), called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARI(HA)). We found that when k is close to d, the quality is good (ARI(HA)>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI(HA)>0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data.
Reducing the Time Requirement of k-Means Algorithm
Osamor, Victor Chukwudi; Adebiyi, Ezekiel Femi; Oyelade, Jelilli Olarenwaju; Doumbia, Seydou
2012-01-01
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space Rd and an integer k. The problem is to determine a set of k points in Rd, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARIHA). We found that when k is close to d, the quality is good (ARIHA>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARIHA>0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data. PMID:23239974
Novel probabilistic neuroclassifier
NASA Astrophysics Data System (ADS)
Hong, Jiang; Serpen, Gursel
2003-09-01
A novel probabilistic potential function neural network classifier algorithm to deal with classes which are multi-modally distributed and formed from sets of disjoint pattern clusters is proposed in this paper. The proposed classifier has a number of desirable properties which distinguish it from other neural network classifiers. A complete description of the algorithm in terms of its architecture and the pseudocode is presented. Simulation analysis of the newly proposed neuro-classifier algorithm on a set of benchmark problems is presented. Benchmark problems tested include IRIS, Sonar, Vowel Recognition, Two-Spiral, Wisconsin Breast Cancer, Cleveland Heart Disease and Thyroid Gland Disease. Simulation results indicate that the proposed neuro-classifier performs consistently better for a subset of problems for which other neural classifiers perform relatively poorly.
Block clustering based on difference of convex functions (DC) programming and DC algorithms.
Le, Hoai Minh; Le Thi, Hoai An; Dinh, Tao Pham; Huynh, Van Ngai
2013-10-01
We investigate difference of convex functions (DC) programming and the DC algorithm (DCA) to solve the block clustering problem in the continuous framework, which traditionally requires solving a hard combinatorial optimization problem. DC reformulation techniques and exact penalty in DC programming are developed to build an appropriate equivalent DC program of the block clustering problem. They lead to an elegant and explicit DCA scheme for the resulting DC program. Computational experiments show the robustness and efficiency of the proposed algorithm and its superiority over standard algorithms such as two-mode K-means, two-mode fuzzy clustering, and block classification EM.
Online clustering algorithms for radar emitter classification.
Liu, Jun; Lee, Jim P Y; Senior; Li, Lingjie; Luo, Zhi-Quan; Wong, K Max
2005-08-01
Radar emitter classification is a special application of data clustering for classifying unknown radar emitters from received radar pulse samples. The main challenges of this task are the high dimensionality of radar pulse samples, small sample group size, and closely located radar pulse clusters. In this paper, two new online clustering algorithms are developed for radar emitter classification: One is model-based using the Minimum Description Length (MDL) criterion and the other is based on competitive learning. Computational complexity is analyzed for each algorithm and then compared. Simulation results show the superior performance of the model-based algorithm over competitive learning in terms of better classification accuracy, flexibility, and stability.
CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms.
Kohlhoff, Kai J; Sosnick, Marc H; Hsu, William T; Pande, Vijay S; Altman, Russ B
2011-08-15
Data clustering techniques are an essential component of a good data analysis toolbox. Many current bioinformatics applications are inherently compute-intense and work with very large datasets. Sequential algorithms are inadequate for providing the necessary performance. For this reason, we have created Clustering Algorithms for Massively Parallel Architectures, Including GPU Nodes (CAMPAIGN), a central resource for data clustering algorithms and tools that are implemented specifically for execution on massively parallel processing architectures. CAMPAIGN is a library of data clustering algorithms and tools, written in 'C for CUDA' for Nvidia GPUs. The library provides up to two orders of magnitude speed-up over respective CPU-based clustering algorithms and is intended as an open-source resource. New modules from the community will be accepted into the library and the layout of it is such that it can easily be extended to promising future platforms such as OpenCL. Releases of the CAMPAIGN library are freely available for download under the LGPL from https://simtk.org/home/campaign. Source code can also be obtained through anonymous subversion access as described on https://simtk.org/scm/?group_id=453. kjk33@cantab.net.
Research on the precise positioning of customers in large data environment
NASA Astrophysics Data System (ADS)
Zhou, Xu; He, Lili
2018-04-01
Customer positioning has always been a problem that enterprises focus on. In this paper, FCM clustering algorithm is used to cluster customer groups. However, due to the traditional FCM clustering algorithm, which is susceptible to the influence of the initial clustering center and easy to fall into the local optimal problem, the short board of FCM is solved by the gray optimization algorithm (GWO) to achieve efficient and accurate handling of a large number of retailer data.
An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China.
Zou, Hui; Zou, Zhihong; Wang, Xiaojing
2015-11-12
The increase and the complexity of data caused by the uncertain environment is today's reality. In order to identify water quality effectively and reliably, this paper presents a modified fast clustering algorithm for water quality analysis. The algorithm has adopted a varying weights K-means cluster algorithm to analyze water monitoring data. The varying weights scheme was the best weighting indicator selected by a modified indicator weight self-adjustment algorithm based on K-means, which is named MIWAS-K-means. The new clustering algorithm avoids the margin of the iteration not being calculated in some cases. With the fast clustering analysis, we can identify the quality of water samples. The algorithm is applied in water quality analysis of the Haihe River (China) data obtained by the monitoring network over a period of eight years (2006-2013) with four indicators at seven different sites (2078 samples). Both the theoretical and simulated results demonstrate that the algorithm is efficient and reliable for water quality analysis of the Haihe River. In addition, the algorithm can be applied to more complex data matrices with high dimensionality.
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale
Kobourov, Stephen; Gallant, Mike; Börner, Katy
2016-01-01
Overview Notions of community quality underlie the clustering of networks. While studies surrounding network clustering are increasingly common, a precise understanding of the realtionship between different cluster quality metrics is unknown. In this paper, we examine the relationship between stand-alone cluster quality metrics and information recovery metrics through a rigorous analysis of four widely-used network clustering algorithms—Louvain, Infomap, label propagation, and smart local moving. We consider the stand-alone quality metrics of modularity, conductance, and coverage, and we consider the information recovery metrics of adjusted Rand score, normalized mutual information, and a variant of normalized mutual information used in previous work. Our study includes both synthetic graphs and empirical data sets of sizes varying from 1,000 to 1,000,000 nodes. Cluster Quality Metrics We find significant differences among the results of the different cluster quality metrics. For example, clustering algorithms can return a value of 0.4 out of 1 on modularity but score 0 out of 1 on information recovery. We find conductance, though imperfect, to be the stand-alone quality metric that best indicates performance on the information recovery metrics. Additionally, our study shows that the variant of normalized mutual information used in previous work cannot be assumed to differ only slightly from traditional normalized mutual information. Network Clustering Algorithms Smart local moving is the overall best performing algorithm in our study, but discrepancies between cluster evaluation metrics prevent us from declaring it an absolutely superior algorithm. Interestingly, Louvain performed better than Infomap in nearly all the tests in our study, contradicting the results of previous work in which Infomap was superior to Louvain. We find that although label propagation performs poorly when clusters are less clearly defined, it scales efficiently and accurately to large graphs with well-defined clusters. PMID:27391786
Guasom Analysis Of The Alhambra Survey
NASA Astrophysics Data System (ADS)
Garabato, Daniel; Manteiga, Minia; Dafonte, Carlos; Álvarez, Marco A.
2017-10-01
GUASOM is a data mining tool designed for knowledge discovery in large astronomical spectrophotometric archives developed in the framework of Gaia DPAC (Data Processing and Analysis Consortium). Our tool is based on a type of unsupervised learning Artificial Neural Networks named Self-organizing maps (SOMs). SOMs permit the grouping and visualization of big amount of data for which there is no a priori knowledge and hence they are very useful for analyzing the huge amount of information present in modern spectrophotometric surveys. SOMs are used to organize the information in clusters of objects, as homogeneously as possible according to their spectral energy distributions, and to project them onto a 2D grid where the data structure can be visualized. Each cluster has a representative, called prototype which is a virtual pattern that better represents or resembles the set of input patterns belonging to such a cluster. Prototypes make easier the task of determining the physical nature and properties of the objects populating each cluster. Our algorithm has been tested on the ALHAMBRA survey spectrophotometric observations, here we present our results concerning the survey segmentation, visualization of the data structure, separation between types of objects (stars and galaxies), data homogeneity of neurons, cluster prototypes, redshift distribution and crossmatch with other databases (Simbad).
A mesh partitioning algorithm for preserving spatial locality in arbitrary geometries
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nivarti, Girish V., E-mail: g.nivarti@alumni.ubc.ca; Salehi, M. Mahdi; Bushe, W. Kendal
2015-01-15
Highlights: •An algorithm for partitioning computational meshes is proposed. •The Morton order space-filling curve is modified to achieve improved locality. •A spatial locality metric is defined to compare results with existing approaches. •Results indicate improved performance of the algorithm in complex geometries. -- Abstract: A space-filling curve (SFC) is a proximity preserving linear mapping of any multi-dimensional space and is widely used as a clustering tool. Equi-sized partitioning of an SFC ignores the loss in clustering quality that occurs due to inaccuracies in the mapping. Often, this results in poor locality within partitions, especially for the conceptually simple, Morton ordermore » curves. We present a heuristic that improves partition locality in arbitrary geometries by slicing a Morton order curve at points where spatial locality is sacrificed. In addition, we develop algorithms that evenly distribute points to the extent possible while maintaining spatial locality. A metric is defined to estimate relative inter-partition contact as an indicator of communication in parallel computing architectures. Domain partitioning tests have been conducted on geometries relevant to turbulent reactive flow simulations. The results obtained highlight the performance of our method as an unsupervised and computationally inexpensive domain partitioning tool.« less
Luo, Ze; Baoping, Yan; Takekawa, John Y.; Prosser, Diann J.
2012-01-01
We propose a new method to help ornithologists and ecologists discover shared segments on the migratory pathway of the bar-headed geese by time-based plane-sweeping trajectory clustering. We present a density-based time parameterized line segment clustering algorithm, which extends traditional comparable clustering algorithms from temporal and spatial dimensions. We present a time-based plane-sweeping trajectory clustering algorithm to reveal the dynamic evolution of spatial-temporal object clusters and discover common motion patterns of bar-headed geese in the process of migration. Experiments are performed on GPS-based satellite telemetry data from bar-headed geese and results demonstrate our algorithms can correctly discover shared segments of the bar-headed geese migratory pathway. We also present findings on the migratory behavior of bar-headed geese determined from this new analytical approach.
Computational gene expression profiling under salt stress reveals patterns of co-expression
Sanchita; Sharma, Ashok
2016-01-01
Plants respond differently to environmental conditions. Among various abiotic stresses, salt stress is a condition where excess salt in soil causes inhibition of plant growth. To understand the response of plants to the stress conditions, identification of the responsible genes is required. Clustering is a data mining technique used to group the genes with similar expression. The genes of a cluster show similar expression and function. We applied clustering algorithms on gene expression data of Solanum tuberosum showing differential expression in Capsicum annuum under salt stress. The clusters, which were common in multiple algorithms were taken further for analysis. Principal component analysis (PCA) further validated the findings of other cluster algorithms by visualizing their clusters in three-dimensional space. Functional annotation results revealed that most of the genes were involved in stress related responses. Our findings suggest that these algorithms may be helpful in the prediction of the function of co-expressed genes. PMID:26981411
A scalable and practical one-pass clustering algorithm for recommender system
NASA Astrophysics Data System (ADS)
Khalid, Asra; Ghazanfar, Mustansar Ali; Azam, Awais; Alahmari, Saad Ali
2015-12-01
KMeans clustering-based recommendation algorithms have been proposed claiming to increase the scalability of recommender systems. One potential drawback of these algorithms is that they perform training offline and hence cannot accommodate the incremental updates with the arrival of new data, making them unsuitable for the dynamic environments. From this line of research, a new clustering algorithm called One-Pass is proposed, which is a simple, fast, and accurate. We show empirically that the proposed algorithm outperforms K-Means in terms of recommendation and training time while maintaining a good level of accuracy.
Ju, Chunhua; Xu, Chonghuan
2013-01-01
Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods.
Ju, Chunhua
2013-01-01
Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods. PMID:24381525
Convalescing Cluster Configuration Using a Superlative Framework
Sabitha, R.; Karthik, S.
2015-01-01
Competent data mining methods are vital to discover knowledge from databases which are built as a result of enormous growth of data. Various techniques of data mining are applied to obtain knowledge from these databases. Data clustering is one such descriptive data mining technique which guides in partitioning data objects into disjoint segments. K-means algorithm is a versatile algorithm among the various approaches used in data clustering. The algorithm and its diverse adaptation methods suffer certain problems in their performance. To overcome these issues a superlative algorithm has been proposed in this paper to perform data clustering. The specific feature of the proposed algorithm is discretizing the dataset, thereby improving the accuracy of clustering, and also adopting the binary search initialization method to generate cluster centroids. The generated centroids are fed as input to K-means approach which iteratively segments the data objects into respective clusters. The clustered results are measured for accuracy and validity. Experiments conducted by testing the approach on datasets from the UC Irvine Machine Learning Repository evidently show that the accuracy and validity measure is higher than the other two approaches, namely, simple K-means and Binary Search method. Thus, the proposed approach proves that discretization process will improve the efficacy of descriptive data mining tasks. PMID:26543895
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms
He, Li; Zheng, Hao; Wang, Lei
2017-01-01
Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering raise high demand on computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, the incremental clustering algorithm is facing a dilemma between clustering accuracy and parallelism when they are powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering like evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experiment results verified theoretical conclusions. PMID:29123546
A curvature-based weighted fuzzy c-means algorithm for point clouds de-noising
NASA Astrophysics Data System (ADS)
Cui, Xin; Li, Shipeng; Yan, Xiutian; He, Xinhua
2018-04-01
In order to remove the noise of three-dimensional scattered point cloud and smooth the data without damnify the sharp geometric feature simultaneity, a novel algorithm is proposed in this paper. The feature-preserving weight is added to fuzzy c-means algorithm which invented a curvature weighted fuzzy c-means clustering algorithm. Firstly, the large-scale outliers are removed by the statistics of r radius neighboring points. Then, the algorithm estimates the curvature of the point cloud data by using conicoid parabolic fitting method and calculates the curvature feature value. Finally, the proposed clustering algorithm is adapted to calculate the weighted cluster centers. The cluster centers are regarded as the new points. The experimental results show that this approach is efficient to different scale and intensities of noise in point cloud with a high precision, and perform a feature-preserving nature at the same time. Also it is robust enough to different noise model.
Liu, L L; Liu, M J; Ma, M
2015-09-28
The central task of this study was to mine the gene-to-medium relationship. Adequate knowledge of this relationship could potentially improve the accuracy of differentially expressed gene mining. One of the approaches to differentially expressed gene mining uses conventional clustering algorithms to identify the gene-to-medium relationship. Compared to conventional clustering algorithms, self-organization maps (SOMs) identify the nonlinear aspects of the gene-to-medium relationships by mapping the input space into another higher dimensional feature space. However, SOMs are not suitable for huge datasets consisting of millions of samples. Therefore, a new computational model, the Function Clustering Self-Organization Maps (FCSOMs), was developed. FCSOMs take advantage of the theory of granular computing as well as advanced statistical learning methodologies, and are built specifically for each information granule (a function cluster of genes), which are intelligently partitioned by the clustering algorithm provided by the DAVID_6.7 software platform. However, only the gene functions, and not their expression values, are considered in the fuzzy clustering algorithm of DAVID. Compared to the clustering algorithm of DAVID, these experimental results show a marked improvement in the accuracy of classification with the application of FCSOMs. FCSOMs can handle huge datasets and their complex classification problems, as each FCSOM (modeled for each function cluster) can be easily parallelized.
A Self-Organizing Spatial Clustering Approach to Support Large-Scale Network RTK Systems.
Shen, Lili; Guo, Jiming; Wang, Lei
2018-06-06
The network real-time kinematic (RTK) technique can provide centimeter-level real time positioning solutions and play a key role in geo-spatial infrastructure. With ever-increasing popularity, network RTK systems will face issues in the support of large numbers of concurrent users. In the past, high-precision positioning services were oriented towards professionals and only supported a few concurrent users. Currently, precise positioning provides a spatial foundation for artificial intelligence (AI), and countless smart devices (autonomous cars, unmanned aerial-vehicles (UAVs), robotic equipment, etc.) require precise positioning services. Therefore, the development of approaches to support large-scale network RTK systems is urgent. In this study, we proposed a self-organizing spatial clustering (SOSC) approach which automatically clusters online users to reduce the computational load on the network RTK system server side. The experimental results indicate that both the SOSC algorithm and the grid algorithm can reduce the computational load efficiently, while the SOSC algorithm gives a more elastic and adaptive clustering solution with different datasets. The SOSC algorithm determines the cluster number and the mean distance to cluster center (MDTCC) according to the data set, while the grid approaches are all predefined. The side-effects of clustering algorithms on the user side are analyzed with real global navigation satellite system (GNSS) data sets. The experimental results indicate that 10 km can be safely used as the cluster radius threshold for the SOSC algorithm without significantly reducing the positioning precision and reliability on the user side.
The applicability and effectiveness of cluster analysis
NASA Technical Reports Server (NTRS)
Ingram, D. S.; Actkinson, A. L.
1973-01-01
An insight into the characteristics which determine the performance of a clustering algorithm is presented. In order for the techniques which are examined to accurately cluster data, two conditions must be simultaneously satisfied. First the data must have a particular structure, and second the parameters chosen for the clustering algorithm must be correct. By examining the structure of the data from the Cl flight line, it is clear that no single set of parameters can be used to accurately cluster all the different crops. The effectiveness of either a noniterative or iterative clustering algorithm to accurately cluster data representative of the Cl flight line is questionable. Thus extensive a prior knowledge is required in order to use cluster analysis in its present form for applications like assisting in the definition of field boundaries and evaluating the homogeneity of a field. New or modified techniques are necessary for clustering to be a reliable tool.
Neji, Radhouène; Besbes, Ahmed; Komodakis, Nikos; Deux, Jean-François; Maatouk, Mezri; Rahmouni, Alain; Bassez, Guillaume; Fleury, Gilles; Paragios, Nikos
2009-01-01
In this paper, we present a manifold clustering method fo the classification of fibers obtained from diffusion tensor images (DTI) of the human skeletal muscle. Using a linear programming formulation of prototype-based clustering, we propose a novel fiber classification algorithm over manifolds that circumvents the necessity to embed the data in low dimensional spaces and determines automatically the number of clusters. Furthermore, we propose the use of angular Hilbertian metrics between multivariate normal distributions to define a family of distances between tensors that we generalize to fibers. These metrics are used to approximate the geodesic distances over the fiber manifold. We also discuss the case where only geodesic distances to a reduced set of landmark fibers are available. The experimental validation of the method is done using a manually annotated significant dataset of DTI of the calf muscle for healthy and diseased subjects.
Integrated Efforts for Analysis of Geophysical Measurements and Models.
1997-09-26
12b. DISTRIBUTION CODE 13. ABSTRACT ( Maximum 200 words) This contract supported investigations of integrated applications of physics, ephemerides...REGIONS AND GPS DATA VALIDATIONS 20 2.5 PL-SCINDA: VISUALIZATION AND ANALYSIS TECHNIQUES 22 2.5.1 View Controls 23 2.5.2 Map Selection...and IR data, about cloudy pixels. Clustering and maximum likelihood classification algorithms categorize up to four cloud layers into stratiform or
Hsu, Arthur L; Tang, Sen-Lin; Halgamuge, Saman K
2003-11-01
Current Self-Organizing Maps (SOMs) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches. The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, therefore labelled as uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in data (cancerous and normal). JAVA software of dynamic SOM tree algorithm is available upon request for academic use. A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf
Internal Cluster Validation on Earthquake Data in the Province of Bengkulu
NASA Astrophysics Data System (ADS)
Rini, D. S.; Novianti, P.; Fransiska, H.
2018-04-01
K-means method is an algorithm for cluster n object based on attribute to k partition, where k < n. There is a deficiency of algorithms that is before the algorithm is executed, k points are initialized randomly so that the resulting data clustering can be different. If the random value for initialization is not good, the clustering becomes less optimum. Cluster validation is a technique to determine the optimum cluster without knowing prior information from data. There are two types of cluster validation, which are internal cluster validation and external cluster validation. This study aims to examine and apply some internal cluster validation, including the Calinski-Harabasz (CH) Index, Sillhouette (S) Index, Davies-Bouldin (DB) Index, Dunn Index (D), and S-Dbw Index on earthquake data in the Bengkulu Province. The calculation result of optimum cluster based on internal cluster validation is CH index, S index, and S-Dbw index yield k = 2, DB Index with k = 6 and Index D with k = 15. Optimum cluster (k = 6) based on DB Index gives good results for clustering earthquake in the Bengkulu Province.
NASA Technical Reports Server (NTRS)
Mach, Douglas M.; Christian, Hugh J.; Blakeslee, Richard; Boccippio, Dennis J.; Goodman, Steve J.; Boeck, William
2006-01-01
We describe the clustering algorithm used by the Lightning Imaging Sensor (LIS) and the Optical Transient Detector (OTD) for combining the lightning pulse data into events, groups, flashes, and areas. Events are single pixels that exceed the LIS/OTD background level during a single frame (2 ms). Groups are clusters of events that occur within the same frame and in adjacent pixels. Flashes are clusters of groups that occur within 330 ms and either 5.5 km (for LIS) or 16.5 km (for OTD) of each other. Areas are clusters of flashes that occur within 16.5 km of each other. Many investigators are utilizing the LIS/OTD flash data; therefore, we test how variations in the algorithms for the event group and group-flash clustering affect the flash count for a subset of the LIS data. We divided the subset into areas with low (1-3), medium (4-15), high (16-63), and very high (64+) flashes to see how changes in the clustering parameters affect the flash rates in these different sizes of areas. We found that as long as the cluster parameters are within about a factor of two of the current values, the flash counts do not change by more than about 20%. Therefore, the flash clustering algorithm used by the LIS and OTD sensors create flash rates that are relatively insensitive to reasonable variations in the clustering algorithms.
NASA Astrophysics Data System (ADS)
Rahman, Md. Habibur; Matin, M. A.; Salma, Umma
2017-12-01
The precipitation patterns of seventeen locations in Bangladesh from 1961 to 2014 were studied using a cluster analysis and metric multidimensional scaling. In doing so, the current research applies four major hierarchical clustering methods to precipitation in conjunction with different dissimilarity measures and metric multidimensional scaling. A variety of clustering algorithms were used to provide multiple clustering dendrograms for a mixture of distance measures. The dendrogram of pre-monsoon rainfall for the seventeen locations formed five clusters. The pre-monsoon precipitation data for the areas of Srimangal and Sylhet were located in two clusters across the combination of five dissimilarity measures and four hierarchical clustering algorithms. The single linkage algorithm with Euclidian and Manhattan distances, the average linkage algorithm with the Minkowski distance, and Ward's linkage algorithm provided similar results with regard to monsoon precipitation. The results of the post-monsoon and winter precipitation data are shown in different types of dendrograms with disparate combinations of sub-clusters. The schematic geometrical representations of the precipitation data using metric multidimensional scaling showed that the post-monsoon rainfall of Cox's Bazar was located far from those of the other locations. The results of a box-and-whisker plot, different clustering techniques, and metric multidimensional scaling indicated that the precipitation behaviour of Srimangal and Sylhet during the pre-monsoon season, Cox's Bazar and Sylhet during the monsoon season, Maijdi Court and Cox's Bazar during the post-monsoon season, and Cox's Bazar and Khulna during the winter differed from those at other locations in Bangladesh.
An adaptive clustering algorithm for image matching based on corner feature
NASA Astrophysics Data System (ADS)
Wang, Zhe; Dong, Min; Mu, Xiaomin; Wang, Song
2018-04-01
The traditional image matching algorithm always can not balance the real-time and accuracy better, to solve the problem, an adaptive clustering algorithm for image matching based on corner feature is proposed in this paper. The method is based on the similarity of the matching pairs of vector pairs, and the adaptive clustering is performed on the matching point pairs. Harris corner detection is carried out first, the feature points of the reference image and the perceived image are extracted, and the feature points of the two images are first matched by Normalized Cross Correlation (NCC) function. Then, using the improved algorithm proposed in this paper, the matching results are clustered to reduce the ineffective operation and improve the matching speed and robustness. Finally, the Random Sample Consensus (RANSAC) algorithm is used to match the matching points after clustering. The experimental results show that the proposed algorithm can effectively eliminate the most wrong matching points while the correct matching points are retained, and improve the accuracy of RANSAC matching, reduce the computation load of whole matching process at the same time.
Service-Aware Clustering: An Energy-Efficient Model for the Internet-of-Things
Bagula, Antoine; Abidoye, Ademola Philip; Zodi, Guy-Alain Lusilao
2015-01-01
Current generation wireless sensor routing algorithms and protocols have been designed based on a myopic routing approach, where the motes are assumed to have the same sensing and communication capabilities. Myopic routing is not a natural fit for the IoT, as it may lead to energy imbalance and subsequent short-lived sensor networks, routing the sensor readings over the most service-intensive sensor nodes, while leaving the least active nodes idle. This paper revisits the issue of energy efficiency in sensor networks to propose a clustering model where sensor devices’ service delivery is mapped into an energy awareness model, used to design a clustering algorithm that finds service-aware clustering (SAC) configurations in IoT settings. The performance evaluation reveals the relative energy efficiency of the proposed SAC algorithm compared to related routing algorithms in terms of energy consumption, the sensor nodes’ life span and its traffic engineering efficiency in terms of throughput and delay. These include the well-known low energy adaptive clustering hierarchy (LEACH) and LEACH-centralized (LEACH-C) algorithms, as well as the most recent algorithms, such as DECSA and MOCRN. PMID:26703619
Service-Aware Clustering: An Energy-Efficient Model for the Internet-of-Things.
Bagula, Antoine; Abidoye, Ademola Philip; Zodi, Guy-Alain Lusilao
2015-12-23
Current generation wireless sensor routing algorithms and protocols have been designed based on a myopic routing approach, where the motes are assumed to have the same sensing and communication capabilities. Myopic routing is not a natural fit for the IoT, as it may lead to energy imbalance and subsequent short-lived sensor networks, routing the sensor readings over the most service-intensive sensor nodes, while leaving the least active nodes idle. This paper revisits the issue of energy efficiency in sensor networks to propose a clustering model where sensor devices' service delivery is mapped into an energy awareness model, used to design a clustering algorithm that finds service-aware clustering (SAC) configurations in IoT settings. The performance evaluation reveals the relative energy efficiency of the proposed SAC algorithm compared to related routing algorithms in terms of energy consumption, the sensor nodes' life span and its traffic engineering efficiency in terms of throughput and delay. These include the well-known low energy adaptive clustering hierarchy (LEACH) and LEACH-centralized (LEACH-C) algorithms, as well as the most recent algorithms, such as DECSA and MOCRN.
Convex Clustering: An Attractive Alternative to Hierarchical Clustering
Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth
2015-01-01
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340
Convex clustering: an attractive alternative to hierarchical clustering.
Chen, Gary K; Chi, Eric C; Ranola, John Michael O; Lange, Kenneth
2015-05-01
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/.
Mustapha, Ibrahim; Ali, Borhanuddin Mohd; Rasid, Mohd Fadlee A.; Sali, Aduwati; Mohamad, Hafizal
2015-01-01
It is well-known that clustering partitions network into logical groups of nodes in order to achieve energy efficiency and to enhance dynamic channel access in cognitive radio through cooperative sensing. While the topic of energy efficiency has been well investigated in conventional wireless sensor networks, the latter has not been extensively explored. In this paper, we propose a reinforcement learning-based spectrum-aware clustering algorithm that allows a member node to learn the energy and cooperative sensing costs for neighboring clusters to achieve an optimal solution. Each member node selects an optimal cluster that satisfies pairwise constraints, minimizes network energy consumption and enhances channel sensing performance through an exploration technique. We first model the network energy consumption and then determine the optimal number of clusters for the network. The problem of selecting an optimal cluster is formulated as a Markov Decision Process (MDP) in the algorithm and the obtained simulation results show convergence, learning and adaptability of the algorithm to dynamic environment towards achieving an optimal solution. Performance comparisons of our algorithm with the Groupwise Spectrum Aware (GWSA)-based algorithm in terms of Sum of Square Error (SSE), complexity, network energy consumption and probability of detection indicate improved performance from the proposed approach. The results further reveal that an energy savings of 9% and a significant Primary User (PU) detection improvement can be achieved with the proposed approach. PMID:26287191
Mustapha, Ibrahim; Mohd Ali, Borhanuddin; Rasid, Mohd Fadlee A; Sali, Aduwati; Mohamad, Hafizal
2015-08-13
It is well-known that clustering partitions network into logical groups of nodes in order to achieve energy efficiency and to enhance dynamic channel access in cognitive radio through cooperative sensing. While the topic of energy efficiency has been well investigated in conventional wireless sensor networks, the latter has not been extensively explored. In this paper, we propose a reinforcement learning-based spectrum-aware clustering algorithm that allows a member node to learn the energy and cooperative sensing costs for neighboring clusters to achieve an optimal solution. Each member node selects an optimal cluster that satisfies pairwise constraints, minimizes network energy consumption and enhances channel sensing performance through an exploration technique. We first model the network energy consumption and then determine the optimal number of clusters for the network. The problem of selecting an optimal cluster is formulated as a Markov Decision Process (MDP) in the algorithm and the obtained simulation results show convergence, learning and adaptability of the algorithm to dynamic environment towards achieving an optimal solution. Performance comparisons of our algorithm with the Groupwise Spectrum Aware (GWSA)-based algorithm in terms of Sum of Square Error (SSE), complexity, network energy consumption and probability of detection indicate improved performance from the proposed approach. The results further reveal that an energy savings of 9% and a significant Primary User (PU) detection improvement can be achieved with the proposed approach.
Balouchestani, Mohammadreza; Krishnan, Sridhar
2014-01-01
Long-term recording of Electrocardiogram (ECG) signals plays an important role in health care systems for diagnostic and treatment purposes of heart diseases. Clustering and classification of collecting data are essential parts for detecting concealed information of P-QRS-T waves in the long-term ECG recording. Currently used algorithms do have their share of drawbacks: 1) clustering and classification cannot be done in real time; 2) they suffer from huge energy consumption and load of sampling. These drawbacks motivated us in developing novel optimized clustering algorithm which could easily scan large ECG datasets for establishing low power long-term ECG recording. In this paper, we present an advanced K-means clustering algorithm based on Compressed Sensing (CS) theory as a random sampling procedure. Then, two dimensionality reduction methods: Principal Component Analysis (PCA) and Linear Correlation Coefficient (LCC) followed by sorting the data using the K-Nearest Neighbours (K-NN) and Probabilistic Neural Network (PNN) classifiers are applied to the proposed algorithm. We show our algorithm based on PCA features in combination with K-NN classifier shows better performance than other methods. The proposed algorithm outperforms existing algorithms by increasing 11% classification accuracy. In addition, the proposed algorithm illustrates classification accuracy for K-NN and PNN classifiers, and a Receiver Operating Characteristics (ROC) area of 99.98%, 99.83%, and 99.75% respectively.
Distributed cluster management techniques for unattended ground sensor networks
NASA Astrophysics Data System (ADS)
Essawy, Magdi A.; Stelzig, Chad A.; Bevington, James E.; Minor, Sharon
2005-05-01
Smart Sensor Networks are becoming important target detection and tracking tools. The challenging problems in such networks include the sensor fusion, data management and communication schemes. This work discusses techniques used to distribute sensor management and multi-target tracking responsibilities across an ad hoc, self-healing cluster of sensor nodes. Although miniaturized computing resources possess the ability to host complex tracking and data fusion algorithms, there still exist inherent bandwidth constraints on the RF channel. Therefore, special attention is placed on the reduction of node-to-node communications within the cluster by minimizing unsolicited messaging, and distributing the sensor fusion and tracking tasks onto local portions of the network. Several challenging problems are addressed in this work including track initialization and conflict resolution, track ownership handling, and communication control optimization. Emphasis is also placed on increasing the overall robustness of the sensor cluster through independent decision capabilities on all sensor nodes. Track initiation is performed using collaborative sensing within a neighborhood of sensor nodes, allowing each node to independently determine if initial track ownership should be assumed. This autonomous track initiation prevents the formation of duplicate tracks while eliminating the need for a central "management" node to assign tracking responsibilities. Track update is performed as an ownership node requests sensor reports from neighboring nodes based on track error covariance and the neighboring nodes geo-positional location. Track ownership is periodically recomputed using propagated track states to determine which sensing node provides the desired coverage characteristics. High fidelity multi-target simulation results are presented, indicating the distribution of sensor management and tracking capabilities to not only reduce communication bandwidth consumption, but to also simplify multi-target tracking within the cluster.
Kim, Hyoungrae; Jang, Cheongyun; Yadav, Dharmendra K; Kim, Mi-Hyun
2017-03-23
The accuracy of any 3D-QSAR, Pharmacophore and 3D-similarity based chemometric target fishing models are highly dependent on a reasonable sample of active conformations. Since a number of diverse conformational sampling algorithm exist, which exhaustively generate enough conformers, however model building methods relies on explicit number of common conformers. In this work, we have attempted to make clustering algorithms, which could find reasonable number of representative conformer ensembles automatically with asymmetric dissimilarity matrix generated from openeye tool kit. RMSD was the important descriptor (variable) of each column of the N × N matrix considered as N variables describing the relationship (network) between the conformer (in a row) and the other N conformers. This approach used to evaluate the performance of the well-known clustering algorithms by comparison in terms of generating representative conformer ensembles and test them over different matrix transformation functions considering the stability. In the network, the representative conformer group could be resampled for four kinds of algorithms with implicit parameters. The directed dissimilarity matrix becomes the only input to the clustering algorithms. Dunn index, Davies-Bouldin index, Eta-squared values and omega-squared values were used to evaluate the clustering algorithms with respect to the compactness and the explanatory power. The evaluation includes the reduction (abstraction) rate of the data, correlation between the sizes of the population and the samples, the computational complexity and the memory usage as well. Every algorithm could find representative conformers automatically without any user intervention, and they reduced the data to 14-19% of the original values within 1.13 s per sample at the most. The clustering methods are simple and practical as they are fast and do not ask for any explicit parameters. RCDTC presented the maximum Dunn and omega-squared values of the four algorithms in addition to consistent reduction rate between the population size and the sample size. The performance of the clustering algorithms was consistent over different transformation functions. Moreover, the clustering method can also be applied to molecular dynamics sampling simulation results.
A Modified MinMax k-Means Algorithm Based on PSO.
Wang, Xiaoyan; Bai, Yanping
The MinMax k -means algorithm is widely used to tackle the effect of bad initialization by minimizing the maximum intraclustering errors. Two parameters, including the exponent parameter and memory parameter, are involved in the executive process. Since different parameters have different clustering errors, it is crucial to choose appropriate parameters. In the original algorithm, a practical framework is given. Such framework extends the MinMax k -means to automatically adapt the exponent parameter to the data set. It has been believed that if the maximum exponent parameter has been set, then the programme can reach the lowest intraclustering errors. However, our experiments show that this is not always correct. In this paper, we modified the MinMax k -means algorithm by PSO to determine the proper values of parameters which can subject the algorithm to attain the lowest clustering errors. The proposed clustering method is tested on some favorite data sets in several different initial situations and is compared to the k -means algorithm and the original MinMax k -means algorithm. The experimental results indicate that our proposed algorithm can reach the lowest clustering errors automatically.
R, GeethaRamani; Balasubramanian, Lakshmi
2018-07-01
Macula segmentation and fovea localization is one of the primary tasks in retinal analysis as they are responsible for detailed vision. Existing approaches required segmentation of retinal structures viz. optic disc and blood vessels for this purpose. This work avoids knowledge of other retinal structures and attempts data mining techniques to segment macula. Unsupervised clustering algorithm is exploited for this purpose. Selection of initial cluster centres has a great impact on performance of clustering algorithms. A heuristic based clustering in which initial centres are selected based on measures defining statistical distribution of data is incorporated in the proposed methodology. The initial phase of proposed framework includes image cropping, green channel extraction, contrast enhancement and application of mathematical closing. Then, the pre-processed image is subjected to heuristic based clustering yielding a binary map. The binary image is post-processed to eliminate unwanted components. Finally, the component which possessed the minimum intensity is finalized as macula and its centre constitutes the fovea. The proposed approach outperforms existing works by reporting that 100%,of HRF, 100% of DRIVE, 96.92% of DIARETDB0, 97.75% of DIARETDB1, 98.81% of HEI-MED, 90% of STARE and 99.33% of MESSIDOR images satisfy the 1R criterion, a standard adopted for evaluating performance of macula and fovea identification. The proposed system thus helps the ophthalmologists in identifying the macula thereby facilitating to identify if any abnormality is present within the macula region. Copyright © 2018 Elsevier B.V. All rights reserved.
A Missing Link in Galaxy Evolution: The Mysteries of Dissolving Star Clusters
NASA Astrophysics Data System (ADS)
Pellerin, Anne; Meyer, Martin; Harris, Jason; Calzetti, Daniela
2007-05-01
Star-forming events in starbursts and normal galaxies have a direct impact on the global stellar content of galaxies. These events create numerous compact clusters where stars are produced in great number. These stars eventually end up in the star field background where they are smoothly distributed. However, due to instrumental limitations such as spatial resolution and sensitivity, the processes involved during the transition phase from the compact clusters to the star field background as well as the impact of the environment (spiral waves, bars, starburst) on the lifetime of clusters are still poorly constrained observationally. I will present our latest results on the physical properties of dissolving clusters directly detected in HST/ACS archival images of the three nearby galaxies IC 2574, NGC 1313, and IC 10 (D < 5 Mpc). The ACS has the capability to detect and spatially resolve individual stars in nearby galaxies within a large field-of-view. For all ACS images obtained in three filters (F435W, F555W or F606W, and F814W), we performed PSF stellar photometry in crowded field. Color-magnitude diagrams (CMD) allow us to identify the most massive stars more likely to be part of dissolving clusters (A-type and earlier), and to isolate them from the star field background. We then adapt and use a clustering algorithm on the selected stars to find groups of stars to reveal and quantify the properties of all star clusters (compactness, size, age, mass). With this algorithm, even the less compact clusters are revealed while they are being destroyed. Our sample of three galaxies covers an interesting range in gravitational potential well and explores a variety of galaxy morphological types, which allows us to discuss the dissolving cluster properties as a function of the host galaxy characteristics. The properties of the star field background will also be discussed.
Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.
Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias
2011-01-01
The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E.coli, Shigella and S.pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.
Learner Typologies Development Using OIndex and Data Mining Based Clustering Techniques
ERIC Educational Resources Information Center
Luan, Jing
2004-01-01
This explorative data mining project used distance based clustering algorithm to study 3 indicators, called OIndex, of student behavioral data and stabilized at a 6-cluster scenario following an exhaustive explorative study of 4, 5, and 6 cluster scenarios produced by K-Means and TwoStep algorithms. Using principles in data mining, the study…
Self-organization and clustering algorithms
NASA Technical Reports Server (NTRS)
Bezdek, James C.
1991-01-01
Kohonen's feature maps approach to clustering is often likened to the k or c-means clustering algorithms. Here, the author identifies some similarities and differences between the hard and fuzzy c-Means (HCM/FCM) or ISODATA algorithms and Kohonen's self-organizing approach. The author concludes that some differences are significant, but at the same time there may be some important unknown relationships between the two methodologies. Several avenues of research are proposed.
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale.
Emmons, Scott; Kobourov, Stephen; Gallant, Mike; Börner, Katy
2016-01-01
Notions of community quality underlie the clustering of networks. While studies surrounding network clustering are increasingly common, a precise understanding of the realtionship between different cluster quality metrics is unknown. In this paper, we examine the relationship between stand-alone cluster quality metrics and information recovery metrics through a rigorous analysis of four widely-used network clustering algorithms-Louvain, Infomap, label propagation, and smart local moving. We consider the stand-alone quality metrics of modularity, conductance, and coverage, and we consider the information recovery metrics of adjusted Rand score, normalized mutual information, and a variant of normalized mutual information used in previous work. Our study includes both synthetic graphs and empirical data sets of sizes varying from 1,000 to 1,000,000 nodes. We find significant differences among the results of the different cluster quality metrics. For example, clustering algorithms can return a value of 0.4 out of 1 on modularity but score 0 out of 1 on information recovery. We find conductance, though imperfect, to be the stand-alone quality metric that best indicates performance on the information recovery metrics. Additionally, our study shows that the variant of normalized mutual information used in previous work cannot be assumed to differ only slightly from traditional normalized mutual information. Smart local moving is the overall best performing algorithm in our study, but discrepancies between cluster evaluation metrics prevent us from declaring it an absolutely superior algorithm. Interestingly, Louvain performed better than Infomap in nearly all the tests in our study, contradicting the results of previous work in which Infomap was superior to Louvain. We find that although label propagation performs poorly when clusters are less clearly defined, it scales efficiently and accurately to large graphs with well-defined clusters.
Validating clustering of molecular dynamics simulations using polymer models.
Phillips, Joshua L; Colvin, Michael E; Newsam, Shawn
2011-11-14
Molecular dynamics (MD) simulation is a powerful technique for sampling the meta-stable and transitional conformations of proteins and other biomolecules. Computational data clustering has emerged as a useful, automated technique for extracting conformational states from MD simulation data. Despite extensive application, relatively little work has been done to determine if the clustering algorithms are actually extracting useful information. A primary goal of this paper therefore is to provide such an understanding through a detailed analysis of data clustering applied to a series of increasingly complex biopolymer models. We develop a novel series of models using basic polymer theory that have intuitive, clearly-defined dynamics and exhibit the essential properties that we are seeking to identify in MD simulations of real biomolecules. We then apply spectral clustering, an algorithm particularly well-suited for clustering polymer structures, to our models and MD simulations of several intrinsically disordered proteins. Clustering results for the polymer models provide clear evidence that the meta-stable and transitional conformations are detected by the algorithm. The results for the polymer models also help guide the analysis of the disordered protein simulations by comparing and contrasting the statistical properties of the extracted clusters. We have developed a framework for validating the performance and utility of clustering algorithms for studying molecular biopolymer simulations that utilizes several analytic and dynamic polymer models which exhibit well-behaved dynamics including: meta-stable states, transition states, helical structures, and stochastic dynamics. We show that spectral clustering is robust to anomalies introduced by structural alignment and that different structural classes of intrinsically disordered proteins can be reliably discriminated from the clustering results. To our knowledge, our framework is the first to utilize model polymers to rigorously test the utility of clustering algorithms for studying biopolymers.
Validating clustering of molecular dynamics simulations using polymer models
2011-01-01
Background Molecular dynamics (MD) simulation is a powerful technique for sampling the meta-stable and transitional conformations of proteins and other biomolecules. Computational data clustering has emerged as a useful, automated technique for extracting conformational states from MD simulation data. Despite extensive application, relatively little work has been done to determine if the clustering algorithms are actually extracting useful information. A primary goal of this paper therefore is to provide such an understanding through a detailed analysis of data clustering applied to a series of increasingly complex biopolymer models. Results We develop a novel series of models using basic polymer theory that have intuitive, clearly-defined dynamics and exhibit the essential properties that we are seeking to identify in MD simulations of real biomolecules. We then apply spectral clustering, an algorithm particularly well-suited for clustering polymer structures, to our models and MD simulations of several intrinsically disordered proteins. Clustering results for the polymer models provide clear evidence that the meta-stable and transitional conformations are detected by the algorithm. The results for the polymer models also help guide the analysis of the disordered protein simulations by comparing and contrasting the statistical properties of the extracted clusters. Conclusions We have developed a framework for validating the performance and utility of clustering algorithms for studying molecular biopolymer simulations that utilizes several analytic and dynamic polymer models which exhibit well-behaved dynamics including: meta-stable states, transition states, helical structures, and stochastic dynamics. We show that spectral clustering is robust to anomalies introduced by structural alignment and that different structural classes of intrinsically disordered proteins can be reliably discriminated from the clustering results. To our knowledge, our framework is the first to utilize model polymers to rigorously test the utility of clustering algorithms for studying biopolymers. PMID:22082218
Adham, Manal T; Bentley, Peter J
2016-08-01
This paper proposes and evaluates a solution to the truck redistribution problem prominent in London's Santander Cycle scheme. Due to the complexity of this NP-hard combinatorial optimisation problem, no efficient optimisation techniques are known to solve the problem exactly. This motivates our use of the heuristic Artificial Ecosystem Algorithm (AEA) to find good solutions in a reasonable amount of time. The AEA is designed to take advantage of highly distributed computer architectures and adapt to changing problems. In the AEA a problem is first decomposed into its relative sub-components; they then evolve solution building blocks that fit together to form a single optimal solution. Three variants of the AEA centred on evaluating clustering methods are presented: the baseline AEA, the community-based AEA which groups stations according to journey flows, and the Adaptive AEA which actively modifies clusters to cater for changes in demand. We applied these AEA variants to the redistribution problem prominent in bike share schemes (BSS). The AEA variants are empirically evaluated using historical data from Santander Cycles to validate the proposed approach and prove its potential effectiveness. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China
Zou, Hui; Zou, Zhihong; Wang, Xiaojing
2015-01-01
The increase and the complexity of data caused by the uncertain environment is today’s reality. In order to identify water quality effectively and reliably, this paper presents a modified fast clustering algorithm for water quality analysis. The algorithm has adopted a varying weights K-means cluster algorithm to analyze water monitoring data. The varying weights scheme was the best weighting indicator selected by a modified indicator weight self-adjustment algorithm based on K-means, which is named MIWAS-K-means. The new clustering algorithm avoids the margin of the iteration not being calculated in some cases. With the fast clustering analysis, we can identify the quality of water samples. The algorithm is applied in water quality analysis of the Haihe River (China) data obtained by the monitoring network over a period of eight years (2006–2013) with four indicators at seven different sites (2078 samples). Both the theoretical and simulated results demonstrate that the algorithm is efficient and reliable for water quality analysis of the Haihe River. In addition, the algorithm can be applied to more complex data matrices with high dimensionality. PMID:26569283
Ferraro Petrillo, Umberto; Roscigno, Gianluca; Cattaneo, Giuseppe; Giancarlo, Raffaele
2018-06-01
Information theoretic and compositional/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data available now in applications demands to resort to parallel and distributed computing. Indeed, those type of algorithms have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes. Following the well-established approach in many disciplines, and with a growing success also in bioinformatics, to resort to MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform concurrently informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications. The software, including instructions for running it over Amazon AWS, as well as the datasets are available at http://www.di-srv.unisa.it/KCH. umberto.ferraro@uniroma1.it. Supplementary data are available at Bioinformatics online.
Fault Identification by Unsupervised Learning Algorithm
NASA Astrophysics Data System (ADS)
Nandan, S.; Mannu, U.
2012-12-01
Contemporary fault identification techniques predominantly rely on the surface expression of the fault. This biased observation is inadequate to yield detailed fault structures in areas with surface cover like cities deserts vegetation etc and the changes in fault patterns with depth. Furthermore it is difficult to estimate faults structure which do not generate any surface rupture. Many disastrous events have been attributed to these blind faults. Faults and earthquakes are very closely related as earthquakes occur on faults and faults grow by accumulation of coseismic rupture. For a better seismic risk evaluation it is imperative to recognize and map these faults. We implement a novel approach to identify seismically active fault planes from three dimensional hypocenter distribution by making use of unsupervised learning algorithms. We employ K-means clustering algorithm and Expectation Maximization (EM) algorithm modified to identify planar structures in spatial distribution of hypocenter after filtering out isolated events. We examine difference in the faults reconstructed by deterministic assignment in K- means and probabilistic assignment in EM algorithm. The method is conceptually identical to methodologies developed by Ouillion et al (2008, 2010) and has been extensively tested on synthetic data. We determined the sensitivity of the methodology to uncertainties in hypocenter location, density of clustering and cross cutting fault structures. The method has been applied to datasets from two contrasting regions. While Kumaon Himalaya is a convergent plate boundary, Koyna-Warna lies in middle of the Indian Plate but has a history of triggered seismicity. The reconstructed faults were validated by examining the fault orientation of mapped faults and the focal mechanism of these events determined through waveform inversion. The reconstructed faults could be used to solve the fault plane ambiguity in focal mechanism determination and constrain the fault orientations for finite source inversions. The faults produced by the method exhibited good correlation with the fault planes obtained by focal mechanism solutions and previously mapped faults.
An effective fuzzy kernel clustering analysis approach for gene expression data.
Sun, Lin; Xu, Jiucheng; Yin, Jiaojiao
2015-01-01
Fuzzy clustering is an important tool for analyzing microarray data. A major problem in applying fuzzy clustering method to microarray gene expression data is the choice of parameters with cluster number and centers. This paper proposes a new approach to fuzzy kernel clustering analysis (FKCA) that identifies desired cluster number and obtains more steady results for gene expression data. First of all, to optimize characteristic differences and estimate optimal cluster number, Gaussian kernel function is introduced to improve spectrum analysis method (SAM). By combining subtractive clustering with max-min distance mean, maximum distance method (MDM) is proposed to determine cluster centers. Then, the corresponding steps of improved SAM (ISAM) and MDM are given respectively, whose superiority and stability are illustrated through performing experimental comparisons on gene expression data. Finally, by introducing ISAM and MDM into FKCA, an effective improved FKCA algorithm is proposed. Experimental results from public gene expression data and UCI database show that the proposed algorithms are feasible for cluster analysis, and the clustering accuracy is higher than the other related clustering algorithms.
NASA Astrophysics Data System (ADS)
Ebrahimi, A.; Pahlavani, P.; Masoumi, Z.
2017-09-01
Traffic monitoring and managing in urban intelligent transportation systems (ITS) can be carried out based on vehicular sensor networks. In a vehicular sensor network, vehicles equipped with sensors such as GPS, can act as mobile sensors for sensing the urban traffic and sending the reports to a traffic monitoring center (TMC) for traffic estimation. The energy consumption by the sensor nodes is a main problem in the wireless sensor networks (WSNs); moreover, it is the most important feature in designing these networks. Clustering the sensor nodes is considered as an effective solution to reduce the energy consumption of WSNs. Each cluster should have a Cluster Head (CH), and a number of nodes located within its supervision area. The cluster heads are responsible for gathering and aggregating the information of clusters. Then, it transmits the information to the data collection center. Hence, the use of clustering decreases the volume of transmitting information, and, consequently, reduces the energy consumption of network. In this paper, Fuzzy C-Means (FCM) and Fuzzy Subtractive algorithms are employed to cluster sensors and investigate their performance on the energy consumption of sensors. It can be seen that the FCM algorithm and Fuzzy Subtractive have been reduced energy consumption of vehicle sensors up to 90.68% and 92.18%, respectively. Comparing the performance of the algorithms implies the 1.5 percent improvement in Fuzzy Subtractive algorithm in comparison.
Investigation of stratigraphic mapping in paintings using micro-Raman spectroscopy
NASA Astrophysics Data System (ADS)
Karagiannis, Georgios Th.; Apostolidis, Georgios K.
2016-04-01
In this work, microRaman spectroscopy is used to investigate the stratigraphic mapping in paintings. The objective of mapping imaging is to segment the dataset, here spectra, into clusters each of which consisting spectra that have similar characteristics; hence, similar chemical composition. The spatial distribution of such clusters can be illustrated in pseudocolor images, in which each pixel of image is colored according to its cluster membership. Such mapping images convey information about the spatial distribution of the chemical substances in an object. Moreover, the laser light source that is used has excitation in 1064 nm, i.e., near infrared (NIR), allowing the penetration of the radiation in deeper layers. Thus, the mapping images that are produced by clustering the acquired spectra (specifying specific bands of Raman shifts) can provide stratigraphic information in the mapping images, i.e., images that convey information of the distribution of substances from deeper, as well. To cluster the spectra, unsupervised machine learning algorithms are applied, e.g., hierarchical clustering. Furthermore, the optical microscopy camera (×50), where the Raman probe (B and WTek iRaman EX) is plugged in, is attached to a computerized numerical control (CNC) system which is driven by a software that is specially developed for Raman mapping. This software except for the conventional CNC operation allows the user to parameterize the spectrometer and check each and every measurement to ensure proper acquisition. This facility is important in painting investigation because some materials are vulnerable to such specific parameterization that other materials demand. The technique is tested on a portable experimental overpainted icon of a known stratigraphy. Specifically, the under icon, i.e., the wavy hair of "Saint James", can be separated from upper icon, i.e., the halo of Mother of God in the "Descent of the Cross".
Paradeisos: A perfect hashing algorithm for many-body eigenvalue problems
NASA Astrophysics Data System (ADS)
Jia, C. J.; Wang, Y.; Mendl, C. B.; Moritz, B.; Devereaux, T. P.
2018-03-01
We describe an essentially perfect hashing algorithm for calculating the position of an element in an ordered list, appropriate for the construction and manipulation of many-body Hamiltonian, sparse matrices. Each element of the list corresponds to an integer value whose binary representation reflects the occupation of single-particle basis states for each element in the many-body Hilbert space. The algorithm replaces conventional methods, such as binary search, for locating the elements of the ordered list, eliminating the need to store the integer representation for each element, without increasing the computational complexity. Combined with the "checkerboard" decomposition of the Hamiltonian matrix for distribution over parallel computing environments, this leads to a substantial savings in aggregate memory. While the algorithm can be applied broadly to many-body, correlated problems, we demonstrate its utility in reducing total memory consumption for a series of fermionic single-band Hubbard model calculations on small clusters with progressively larger Hilbert space dimension.
A Decentralized Eigenvalue Computation Method for Spectrum Sensing Based on Average Consensus
NASA Astrophysics Data System (ADS)
Mohammadi, Jafar; Limmer, Steffen; Stańczak, Sławomir
2016-07-01
This paper considers eigenvalue estimation for the decentralized inference problem for spectrum sensing. We propose a decentralized eigenvalue computation algorithm based on the power method, which is referred to as generalized power method GPM; it is capable of estimating the eigenvalues of a given covariance matrix under certain conditions. Furthermore, we have developed a decentralized implementation of GPM by splitting the iterative operations into local and global computation tasks. The global tasks require data exchange to be performed among the nodes. For this task, we apply an average consensus algorithm to efficiently perform the global computations. As a special case, we consider a structured graph that is a tree with clusters of nodes at its leaves. For an accelerated distributed implementation, we propose to use computation over multiple access channel (CoMAC) as a building block of the algorithm. Numerical simulations are provided to illustrate the performance of the two algorithms.
NASA Astrophysics Data System (ADS)
Arimbi, Mentari Dian; Bustamam, Alhadi; Lestari, Dian
2017-03-01
Data clustering can be executed through partition or hierarchical method for many types of data including DNA sequences. Both clustering methods can be combined by processing partition algorithm in the first level and hierarchical in the second level, called hybrid clustering. In the partition phase some popular methods such as PAM, K-means, or Fuzzy c-means methods could be applied. In this study we selected partitioning around medoids (PAM) in our partition stage. Furthermore, following the partition algorithm, in hierarchical stage we applied divisive analysis algorithm (DIANA) in order to have more specific clusters and sub clusters structures. The number of main clusters is determined using Davies Bouldin Index (DBI) value. We choose the optimal number of clusters if the results minimize the DBI value. In this work, we conduct the clustering on 1252 HPV DNA sequences data from GenBank. The characteristic extraction is initially performed, followed by normalizing and genetic distance calculation using Euclidean distance. In our implementation, we used the hybrid PAM and DIANA using the R open source programming tool. In our results, we obtained 3 main clusters with average DBI value is 0.979, using PAM in the first stage. After executing DIANA in the second stage, we obtained 4 sub clusters for Cluster-1, 9 sub clusters for Cluster-2 and 2 sub clusters in Cluster-3, with the BDI value 0.972, 0.771, and 0.768 for each main cluster respectively. Since the second stage produce lower DBI value compare to the DBI value in the first stage, we conclude that this hybrid approach can improve the accuracy of our clustering results.
Detection of protein complex from protein-protein interaction network using Markov clustering
NASA Astrophysics Data System (ADS)
Ochieng, P. J.; Kusuma, W. A.; Haryanto, T.
2017-05-01
Detection of complexes, or groups of functionally related proteins, is an important challenge while analysing biological networks. However, existing algorithms to identify protein complexes are insufficient when applied to dense networks of experimentally derived interaction data. Therefore, we introduced a graph clustering method based on Markov clustering algorithm to identify protein complex within highly interconnected protein-protein interaction networks. Protein-protein interaction network was first constructed to develop geometrical network, the network was then partitioned using Markov clustering to detect protein complexes. The interest of the proposed method was illustrated by its application to Human Proteins associated to type II diabetes mellitus. Flow simulation of MCL algorithm was initially performed and topological properties of the resultant network were analysed for detection of the protein complex. The results indicated the proposed method successfully detect an overall of 34 complexes with 11 complexes consisting of overlapping modules and 20 non-overlapping modules. The major complex consisted of 102 proteins and 521 interactions with cluster modularity and density of 0.745 and 0.101 respectively. The comparison analysis revealed MCL out perform AP, MCODE and SCPS algorithms with high clustering coefficient (0.751) network density and modularity index (0.630). This demonstrated MCL was the most reliable and efficient graph clustering algorithm for detection of protein complexes from PPI networks.
SU-G-TeP3-14: Three-Dimensional Cluster Model in Inhomogeneous Dose Distribution
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wei, J; Penagaricano, J; Narayanasamy, G
2016-06-15
Purpose: We aim to investigate 3D cluster formation in inhomogeneous dose distribution to search for new models predicting radiation tissue damage and further leading to new optimization paradigm for radiotherapy planning. Methods: The aggregation of higher dose in the organ at risk (OAR) than a preset threshold was chosen as the cluster whose connectivity dictates the cluster structure. Upon the selection of the dose threshold, the fractional density defined as the fraction of voxels in the organ eligible to be part of the cluster was determined according to the dose volume histogram (DVH). A Monte Carlo method was implemented tomore » establish a case pertinent to the corresponding DVH. Ones and zeros were randomly assigned to each OAR voxel with the sampling probability equal to the fractional density. Ten thousand samples were randomly generated to ensure a sufficient number of cluster sets. A recursive cluster searching algorithm was developed to analyze the cluster with various connectivity choices like 1-, 2-, and 3-connectivity. The mean size of the largest cluster (MSLC) from the Monte Carlo samples was taken to be a function of the fractional density. Various OARs from clinical plans were included in the study. Results: Intensive Monte Carlo study demonstrates the inverse relationship between the MSLC and the cluster connectivity as anticipated and the cluster size does not change with fractional density linearly regardless of the connectivity types. An initially-slow-increase to exponential growth transition of the MSLC from low to high density was observed. The cluster sizes were found to vary within a large range and are relatively independent of the OARs. Conclusion: The Monte Carlo study revealed that the cluster size could serve as a suitable index of the tissue damage (percolation cluster) and the clinical outcome of the same DVH might be potentially different.« less
Community Detection Algorithm Combining Stochastic Block Model and Attribute Data Clustering
NASA Astrophysics Data System (ADS)
Kataoka, Shun; Kobayashi, Takuto; Yasuda, Muneki; Tanaka, Kazuyuki
2016-11-01
We propose a new algorithm to detect the community structure in a network that utilizes both the network structure and vertex attribute data. Suppose we have the network structure together with the vertex attribute data, that is, the information assigned to each vertex associated with the community to which it belongs. The problem addressed this paper is the detection of the community structure from the information of both the network structure and the vertex attribute data. Our approach is based on the Bayesian approach that models the posterior probability distribution of the community labels. The detection of the community structure in our method is achieved by using belief propagation and an EM algorithm. We numerically verified the performance of our method using computer-generated networks and real-world networks.
Study of phase clustering method for analyzing large volumes of meteorological observation data
NASA Astrophysics Data System (ADS)
Volkov, Yu. V.; Krutikov, V. A.; Botygin, I. A.; Sherstnev, V. S.; Sherstneva, A. I.
2017-11-01
The article describes an iterative parallel phase grouping algorithm for temperature field classification. The algorithm is based on modified method of structure forming by using analytic signal. The developed method allows to solve tasks of climate classification as well as climatic zoning for any time or spatial scale. When used to surface temperature measurement series, the developed algorithm allows to find climatic structures with correlated changes of temperature field, to make conclusion on climate uniformity in a given area and to overview climate changes over time by analyzing offset in type groups. The information on climate type groups specific for selected geographical areas is expanded by genetic scheme of class distribution depending on change in mutual correlation level between ground temperature monthly average.
Cluster and constraint analysis in tetrahedron packings
NASA Astrophysics Data System (ADS)
Jin, Weiwei; Lu, Peng; Liu, Lufeng; Li, Shuixiang
2015-04-01
The disordered packings of tetrahedra often show no obvious macroscopic orientational or positional order for a wide range of packing densities, and it has been found that the local order in particle clusters is the main order form of tetrahedron packings. Therefore, a cluster analysis is carried out to investigate the local structures and properties of tetrahedron packings in this work. We obtain a cluster distribution of differently sized clusters, and peaks are observed at two special clusters, i.e., dimer and wagon wheel. We then calculate the amounts of dimers and wagon wheels, which are observed to have linear or approximate linear correlations with packing density. Following our previous work, the amount of particles participating in dimers is used as an order metric to evaluate the order degree of the hierarchical packing structure of tetrahedra, and an order map is consequently depicted. Furthermore, a constraint analysis is performed to determine the isostatic or hyperstatic region in the order map. We employ a Monte Carlo algorithm to test jamming and then suggest a new maximally random jammed packing of hard tetrahedra from the order map with a packing density of 0.6337.
Deeper Insights into the Circumgalactic Medium using Multivariate Analysis Methods
NASA Astrophysics Data System (ADS)
Lewis, James; Churchill, Christopher W.; Nielsen, Nikole M.; Kacprzak, Glenn
2017-01-01
Drawing from a database of galaxies whose surrounding gas has absorption from MgII, called the MgII-Absorbing Galaxy Catalog (MAGIICAT, Neilsen et al 2013), we studied the circumgalactic medium (CGM) for a sample of 47 galaxies. Using multivariate analysis, in particular the k-means clustering algorithm, we determined that simultaneously examining column density (N), rest-frame B-K color, virial mass, and azimuthal angle (the projected angle between the galaxy major axis and the quasar line of sight) yields two distinct populations: (1) bluer, lower mass galaxies with higher column density along the minor axis, and (2) redder, higher mass galaxies with lower column density along the major axis. We support this grouping by running (i) two-sample, two-dimensional Kolmogorov-Smirnov (KS) tests on each of the six bivariate planes and (ii) two-sample KS tests on each of the four variables to show that the galaxies significantly cluster into two independent populations. To account for the fact that 16 of our 47 galaxies have upper limits on N, we performed Monte-Carlo tests whereby we replaced upper limits with random deviates drawn from a Schechter distribution fit, f(N). These tests strengthen the results of the KS tests. We examined the behavior of the MgII λ2796 absorption line equivalent width and velocity width for each galaxy population. We find that equivalent width and velocity width do not show similar characteristic distinctions between the two galaxy populations. We discuss the k-means clustering algorithm for optimizing the analysis of populations within datasets as opposed to using arbitrary bivariate subsample cuts. We also discuss the power of the k-means clustering algorithm in extracting deeper physical insight into the CGM in relationship to host galaxies.
Cooperative inversion of magnetotelluric and seismic data sets
NASA Astrophysics Data System (ADS)
Markovic, M.; Santos, F.
2012-04-01
Cooperative inversion of magnetotelluric and seismic data sets Milenko Markovic,Fernando Monteiro Santos IDL, Faculdade de Ciências da Universidade de Lisboa 1749-016 Lisboa Inversion of single geophysical data has well-known limitations due to the non-linearity of the fields and non-uniqueness of the model. There is growing need, both in academy and industry to use two or more different data sets and thus obtain subsurface property distribution. In our case ,we are dealing with magnetotelluric and seismic data sets. In our approach,we are developing algorithm based on fuzzy-c means clustering technique, for pattern recognition of geophysical data. Separate inversion is performed on every step, information exchanged for model integration. Interrelationships between parameters from different models is not required in analytical form. We are investigating how different number of clusters, affects zonation and spatial distribution of parameters. In our study optimization in fuzzy c-means clustering (for magnetotelluric and seismic data) is compared for two cases, firstly alternating optimization and then hybrid method (alternating optimization+ Quasi-Newton method). Acknowledgment: This work is supported by FCT Portugal
Optimized data fusion for K-means Laplacian clustering
Yu, Shi; Liu, Xinhai; Tranchevent, Léon-Charles; Glänzel, Wolfgang; Suykens, Johan A. K.; De Moor, Bart; Moreau, Yves
2011-01-01
Motivation: We propose a novel algorithm to combine multiple kernels and Laplacians for clustering analysis. The new algorithm is formulated on a Rayleigh quotient objective function and is solved as a bi-level alternating minimization procedure. Using the proposed algorithm, the coefficients of kernels and Laplacians can be optimized automatically. Results: Three variants of the algorithm are proposed. The performance is systematically validated on two real-life data fusion applications. The proposed Optimized Kernel Laplacian Clustering (OKLC) algorithms perform significantly better than other methods. Moreover, the coefficients of kernels and Laplacians optimized by OKLC show some correlation with the rank of performance of individual data source. Though in our evaluation the K values are predefined, in practical studies, the optimal cluster number can be consistently estimated from the eigenspectrum of the combined kernel Laplacian matrix. Availability: The MATLAB code of algorithms implemented in this paper is downloadable from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/oklc.html. Contact: shiyu@uchicago.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20980271
Improved fuzzy clustering algorithms in segmentation of DC-enhanced breast MRI.
Kannan, S R; Ramathilagam, S; Devi, Pandiyarajan; Sathya, A
2012-02-01
Segmentation of medical images is a difficult and challenging problem due to poor image contrast and artifacts that result in missing or diffuse organ/tissue boundaries. Many researchers have applied various techniques however fuzzy c-means (FCM) based algorithms is more effective compared to other methods. The objective of this work is to develop some robust fuzzy clustering segmentation systems for effective segmentation of DCE - breast MRI. This paper obtains the robust fuzzy clustering algorithms by incorporating kernel methods, penalty terms, tolerance of the neighborhood attraction, additional entropy term and fuzzy parameters. The initial centers are obtained using initialization algorithm to reduce the computation complexity and running time of proposed algorithms. Experimental works on breast images show that the proposed algorithms are effective to improve the similarity measurement, to handle large amount of noise, to have better results in dealing the data corrupted by noise, and other artifacts. The clustering results of proposed methods are validated using Silhouette Method.
Singh, Dadabhai T; Trehan, Rahul; Schmidt, Bertil; Bretschneider, Timo
2008-01-01
Preparedness for a possible global pandemic caused by viruses such as the highly pathogenic influenza A subtype H5N1 has become a global priority. In particular, it is critical to monitor the appearance of any new emerging subtypes. Comparative phyloinformatics can be used to monitor, analyze, and possibly predict the evolution of viruses. However, in order to utilize the full functionality of available analysis packages for large-scale phyloinformatics studies, a team of computer scientists, biostatisticians and virologists is needed--a requirement which cannot be fulfilled in many cases. Furthermore, the time complexities of many algorithms involved leads to prohibitive runtimes on sequential computer platforms. This has so far hindered the use of comparative phyloinformatics as a commonly applied tool in this area. In this paper the graphical-oriented workflow design system called Quascade and its efficient usage for comparative phyloinformatics are presented. In particular, we focus on how this task can be effectively performed in a distributed computing environment. As a proof of concept, the designed workflows are used for the phylogenetic analysis of neuraminidase of H5N1 isolates (micro level) and influenza viruses (macro level). The results of this paper are hence twofold. Firstly, this paper demonstrates the usefulness of a graphical user interface system to design and execute complex distributed workflows for large-scale phyloinformatics studies of virus genes. Secondly, the analysis of neuraminidase on different levels of complexity provides valuable insights of this virus's tendency for geographical based clustering in the phylogenetic tree and also shows the importance of glycan sites in its molecular evolution. The current study demonstrates the efficiency and utility of workflow systems providing a biologist friendly approach to complex biological dataset analysis using high performance computing. In particular, the utility of the platform Quascade for deploying distributed and parallelized versions of a variety of computationally intensive phylogenetic algorithms has been shown. Secondly, the analysis of the utilized H5N1 neuraminidase datasets at macro and micro levels has clearly indicated a pattern of spatial clustering of the H5N1 viral isolates based on geographical distribution rather than temporal or host range based clustering.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Churchill, R. Michael
Apache Spark is explored as a tool for analyzing large data sets from the magnetic fusion simulation code XGCI. Implementation details of Apache Spark on the NERSC Edison supercomputer are discussed, including binary file reading, and parameter setup. Here, an unsupervised machine learning algorithm, k-means clustering, is applied to XGCI particle distribution function data, showing that highly turbulent spatial regions do not have common coherent structures, but rather broad, ring-like structures in velocity space.
A hybrid algorithm for clustering of time series data based on affinity search technique.
Aghabozorgi, Saeed; Ying Wah, Teh; Herawan, Tutut; Jalab, Hamid A; Shaygan, Mohammad Amin; Jalali, Alireza
2014-01-01
Time series clustering is an important solution to various problems in numerous fields of research, including business, medical science, and finance. However, conventional clustering algorithms are not practical for time series data because they are essentially designed for static data. This impracticality results in poor clustering accuracy in several systems. In this paper, a new hybrid clustering algorithm is proposed based on the similarity in shape of time series data. Time series data are first grouped as subclusters based on similarity in time. The subclusters are then merged using the k-Medoids algorithm based on similarity in shape. This model has two contributions: (1) it is more accurate than other conventional and hybrid approaches and (2) it determines the similarity in shape among time series data with a low complexity. To evaluate the accuracy of the proposed model, the model is tested extensively using syntactic and real-world time series datasets.
Zhang, Junfeng; Chen, Wei; Gao, Mingyi; Shen, Gangxiang
2017-10-30
In this work, we proposed two k-means-clustering-based algorithms to mitigate the fiber nonlinearity for 64-quadrature amplitude modulation (64-QAM) signal, the training-sequence assisted k-means algorithm and the blind k-means algorithm. We experimentally demonstrated the proposed k-means-clustering-based fiber nonlinearity mitigation techniques in 75-Gb/s 64-QAM coherent optical communication system. The proposed algorithms have reduced clustering complexity and low data redundancy and they are able to quickly find appropriate initial centroids and select correctly the centroids of the clusters to obtain the global optimal solutions for large k value. We measured the bit-error-ratio (BER) performance of 64-QAM signal with different launched powers into the 50-km single mode fiber and the proposed techniques can greatly mitigate the signal impairments caused by the amplified spontaneous emission noise and the fiber Kerr nonlinearity and improve the BER performance.
A Hybrid Algorithm for Clustering of Time Series Data Based on Affinity Search Technique
Aghabozorgi, Saeed; Ying Wah, Teh; Herawan, Tutut; Jalab, Hamid A.; Shaygan, Mohammad Amin; Jalali, Alireza
2014-01-01
Time series clustering is an important solution to various problems in numerous fields of research, including business, medical science, and finance. However, conventional clustering algorithms are not practical for time series data because they are essentially designed for static data. This impracticality results in poor clustering accuracy in several systems. In this paper, a new hybrid clustering algorithm is proposed based on the similarity in shape of time series data. Time series data are first grouped as subclusters based on similarity in time. The subclusters are then merged using the k-Medoids algorithm based on similarity in shape. This model has two contributions: (1) it is more accurate than other conventional and hybrid approaches and (2) it determines the similarity in shape among time series data with a low complexity. To evaluate the accuracy of the proposed model, the model is tested extensively using syntactic and real-world time series datasets. PMID:24982966
An Enhanced PSO-Based Clustering Energy Optimization Algorithm for Wireless Sensor Network.
Vimalarani, C; Subramanian, R; Sivanandam, S N
2016-01-01
Wireless Sensor Network (WSN) is a network which formed with a maximum number of sensor nodes which are positioned in an application environment to monitor the physical entities in a target area, for example, temperature monitoring environment, water level, monitoring pressure, and health care, and various military applications. Mostly sensor nodes are equipped with self-supported battery power through which they can perform adequate operations and communication among neighboring nodes. Maximizing the lifetime of the Wireless Sensor networks, energy conservation measures are essential for improving the performance of WSNs. This paper proposes an Enhanced PSO-Based Clustering Energy Optimization (EPSO-CEO) algorithm for Wireless Sensor Network in which clustering and clustering head selection are done by using Particle Swarm Optimization (PSO) algorithm with respect to minimizing the power consumption in WSN. The performance metrics are evaluated and results are compared with competitive clustering algorithm to validate the reduction in energy consumption.
NASA Astrophysics Data System (ADS)
Sahraei, S.; Asadzadeh, M.
2017-12-01
Any modern multi-objective global optimization algorithm should be able to archive a well-distributed set of solutions. While the solution diversity in the objective space has been explored extensively in the literature, little attention has been given to the solution diversity in the decision space. Selection metrics such as the hypervolume contribution and crowding distance calculated in the objective space would guide the search toward solutions that are well-distributed across the objective space. In this study, the diversity of solutions in the decision-space is used as the main selection criteria beside the dominance check in multi-objective optimization. To this end, currently archived solutions are clustered in the decision space and the ones in less crowded clusters are given more chance to be selected for generating new solution. The proposed approach is first tested on benchmark mathematical test problems. Second, it is applied to a hydrologic model calibration problem with more than three objective functions. Results show that the chance of finding more sparse set of high-quality solutions increases, and therefore the analyst would receive a well-diverse set of options with maximum amount of information. Pareto Archived-Dynamically Dimensioned Search, which is an efficient and parsimonious multi-objective optimization algorithm for model calibration, is utilized in this study.
A parallel-processing approach to computing for the geographic sciences
Crane, Michael; Steinwand, Dan; Beckmann, Tim; Krpan, Greg; Haga, Jim; Maddox, Brian; Feller, Mark
2001-01-01
The overarching goal of this project is to build a spatially distributed infrastructure for information science research by forming a team of information science researchers and providing them with similar hardware and software tools to perform collaborative research. Four geographically distributed Centers of the U.S. Geological Survey (USGS) are developing their own clusters of low-cost personal computers into parallel computing environments that provide a costeffective way for the USGS to increase participation in the high-performance computing community. Referred to as Beowulf clusters, these hybrid systems provide the robust computing power required for conducting research into various areas, such as advanced computer architecture, algorithms to meet the processing needs for real-time image and data processing, the creation of custom datasets from seamless source data, rapid turn-around of products for emergency response, and support for computationally intense spatial and temporal modeling.
DISENTANGLING THE ICL WITH THE CHEFs: ABELL 2744 AS A CASE STUDY
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jiménez-Teja, Y.; Dupke, R., E-mail: yojite@iaa.es
Measurements of the intracluster light (ICL) are still prone to methodological ambiguities, and there are multiple techniques in the literature to address them, mostly based on the binding energy, the local density distribution, or the surface brightness. A common issue with these methods is the a priori assumption of a number of hypotheses on either the ICL morphology, its surface brightness level, or some properties of the brightest cluster galaxy (BCG). The discrepancy in the results is high, and numerical simulations just place a boundary on the ICL fraction in present-day galaxy clusters in the range 10%–50%. We developed amore » new algorithm based on the Chebyshev–Fourier functions to estimate the ICL fraction without relying on any a priori assumption about the physical or geometrical characteristics of the ICL. We are able to not only disentangle the ICL from the galactic luminosity but mark out the limits of the BCG from the ICL in a natural way. We test our technique with the recently released data of the cluster Abell 2744, observed by the Frontier Fields program. The complexity of this multiple merging cluster system and the formidable depth of these images make it a challenging test case to prove the efficiency of our algorithm. We found a final ICL fraction of 19.17 ± 2.87%, which is very consistent with numerical simulations.« less
NASA Astrophysics Data System (ADS)
Adya Zizwan, Putra; Zarlis, Muhammad; Budhiarti Nababan, Erna
2017-12-01
The determination of Centroid on K-Means Algorithm directly affects the quality of the clustering results. Determination of centroid by using random numbers has many weaknesses. The GenClust algorithm that combines the use of Genetic Algorithms and K-Means uses a genetic algorithm to determine the centroid of each cluster. The use of the GenClust algorithm uses 50% chromosomes obtained through deterministic calculations and 50% is obtained from the generation of random numbers. This study will modify the use of the GenClust algorithm in which the chromosomes used are 100% obtained through deterministic calculations. The results of this study resulted in performance comparisons expressed in Mean Square Error influenced by centroid determination on K-Means method by using GenClust method, modified GenClust method and also classic K-Means.
Du, Yuncheng; Budman, Hector M; Duever, Thomas A
2017-06-01
Accurate and fast quantitative analysis of living cells from fluorescence microscopy images is useful for evaluating experimental outcomes and cell culture protocols. An algorithm is developed in this work to automatically segment and distinguish apoptotic cells from normal cells. The algorithm involves three steps consisting of two segmentation steps and a classification step. The segmentation steps are: (i) a coarse segmentation, combining a range filter with a marching square method, is used as a prefiltering step to provide the approximate positions of cells within a two-dimensional matrix used to store cells' images and the count of the number of cells for a given image; and (ii) a fine segmentation step using the Active Contours Without Edges method is applied to the boundaries of cells identified in the coarse segmentation step. Although this basic two-step approach provides accurate edges when the cells in a given image are sparsely distributed, the occurrence of clusters of cells in high cell density samples requires further processing. Hence, a novel algorithm for clusters is developed to identify the edges of cells within clusters and to approximate their morphological features. Based on the segmentation results, a support vector machine classifier that uses three morphological features: the mean value of pixel intensities in the cellular regions, the variance of pixel intensities in the vicinity of cell boundaries, and the lengths of the boundaries, is developed for distinguishing apoptotic cells from normal cells. The algorithm is shown to be efficient in terms of computational time, quantitative analysis, and differentiation accuracy, as compared with the use of the active contours method without the proposed preliminary coarse segmentation step.
Hebbian self-organizing integrate-and-fire networks for data clustering.
Landis, Florian; Ott, Thomas; Stoop, Ruedi
2010-01-01
We propose a Hebbian learning-based data clustering algorithm using spiking neurons. The algorithm is capable of distinguishing between clusters and noisy background data and finds an arbitrary number of clusters of arbitrary shape. These properties render the approach particularly useful for visual scene segmentation into arbitrarily shaped homogeneous regions. We present several application examples, and in order to highlight the advantages and the weaknesses of our method, we systematically compare the results with those from standard methods such as the k-means and Ward's linkage clustering. The analysis demonstrates that not only the clustering ability of the proposed algorithm is more powerful than those of the two concurrent methods, the time complexity of the method is also more modest than that of its generally used strongest competitor.
Impact of heuristics in clustering large biological networks.
Shafin, Md Kishwar; Kabir, Kazi Lutful; Ridwan, Iffatur; Anannya, Tasmiah Tamzid; Karim, Rashid Saadman; Hoque, Mohammad Mozammel; Rahman, M Sohel
2015-12-01
Traditional clustering algorithms often exhibit poor performance for large networks. On the contrary, greedy algorithms are found to be relatively efficient while uncovering functional modules from large biological networks. The quality of the clusters produced by these greedy techniques largely depends on the underlying heuristics employed. Different heuristics based on different attributes and properties perform differently in terms of the quality of the clusters produced. This motivates us to design new heuristics for clustering large networks. In this paper, we have proposed two new heuristics and analyzed the performance thereof after incorporating those with three different combinations in a recently celebrated greedy clustering algorithm named SPICi. We have extensively analyzed the effectiveness of these new variants. The results are found to be promising. Copyright © 2015 Elsevier Ltd. All rights reserved.
Fornasaro, Stefano; Vicario, Annalisa; De Leo, Luigina; Bonifacio, Alois; Not, Tarcisio; Sergo, Valter
2018-05-14
Raman hyperspectral imaging is an emerging practice in biological and biomedical research for label free analysis of tissues and cells. Using this method, both spatial distribution and spectral information of analyzed samples can be obtained. The current study reports the first Raman microspectroscopic characterisation of colon tissues from patients with Coeliac Disease (CD). The aim was to assess if Raman imaging coupled with hyperspectral multivariate image analysis is capable of detecting the alterations in the biochemical composition of intestinal tissues associated with CD. The analytical approach was based on a multi-step methodology: duodenal biopsies from healthy and coeliac patients were measured and processed with Multivariate Curve Resolution Alternating Least Squares (MCR-ALS). Based on the distribution maps and the pure spectra of the image constituents obtained from MCR-ALS, interesting biochemical differences between healthy and coeliac patients has been derived. Noticeably, a reduced distribution of complex lipids in the pericryptic space, and a different distribution and abundance of proteins rich in beta-sheet structures was found in CD patients. The output of the MCR-ALS analysis was then used as a starting point for two clustering algorithms (k-means clustering and hierarchical clustering methods). Both methods converged with similar results providing precise segmentation over multiple Raman images of studied tissues.
Mai, Xiaofeng; Liu, Jie; Wu, Xiong; Zhang, Qun; Guo, Changjian; Yang, Yanfu; Li, Zhaohui
2017-02-06
A Stokes-space modulation format classification (MFC) technique is proposed for coherent optical receivers by using a non-iterative clustering algorithm. In the clustering algorithm, two simple parameters are calculated to help find the density peaks of the data points in Stokes space and no iteration is required. Correct MFC can be realized in numerical simulations among PM-QPSK, PM-8QAM, PM-16QAM, PM-32QAM and PM-64QAM signals within practical optical signal-to-noise ratio (OSNR) ranges. The performance of the proposed MFC algorithm is also compared with those of other schemes based on clustering algorithms. The simulation results show that good classification performance can be achieved using the proposed MFC scheme with moderate time complexity. Proof-of-concept experiments are finally implemented to demonstrate MFC among PM-QPSK/16QAM/64QAM signals, which confirm the feasibility of our proposed MFC scheme.
Optimization of wireless sensor networks based on chicken swarm optimization algorithm
NASA Astrophysics Data System (ADS)
Wang, Qingxi; Zhu, Lihua
2017-05-01
In order to reduce the energy consumption of wireless sensor network and improve the survival time of network, the clustering routing protocol of wireless sensor networks based on chicken swarm optimization algorithm was proposed. On the basis of LEACH agreement, it was improved and perfected that the points on the cluster and the selection of cluster head using the chicken group optimization algorithm, and update the location of chicken which fall into the local optimum by Levy flight, enhance population diversity, ensure the global search capability of the algorithm. The new protocol avoided the die of partial node of intensive using by making balanced use of the network nodes, improved the survival time of wireless sensor network. The simulation experiments proved that the protocol is better than LEACH protocol on energy consumption, also is better than that of clustering routing protocol based on particle swarm optimization algorithm.
Predicting the random drift of MEMS gyroscope based on K-means clustering and OLS RBF Neural Network
NASA Astrophysics Data System (ADS)
Wang, Zhen-yu; Zhang, Li-jie
2017-10-01
Measure error of the sensor can be effectively compensated with prediction. Aiming at large random drift error of MEMS(Micro Electro Mechanical System))gyroscope, an improved learning algorithm of Radial Basis Function(RBF) Neural Network(NN) based on K-means clustering and Orthogonal Least-Squares (OLS) is proposed in this paper. The algorithm selects the typical samples as the initial cluster centers of RBF NN firstly, candidates centers with K-means algorithm secondly, and optimizes the candidate centers with OLS algorithm thirdly, which makes the network structure simpler and makes the prediction performance better. Experimental results show that the proposed K-means clustering OLS learning algorithm can predict the random drift of MEMS gyroscope effectively, the prediction error of which is 9.8019e-007°/s and the prediction time of which is 2.4169e-006s
Fault Tolerant Frequent Pattern Mining
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shohdy, Sameh; Vishnu, Abhinav; Agrawal, Gagan
FP-Growth algorithm is a Frequent Pattern Mining (FPM) algorithm that has been extensively used to study correlations and patterns in large scale datasets. While several researchers have designed distributed memory FP-Growth algorithms, it is pivotal to consider fault tolerant FP-Growth, which can address the increasing fault rates in large scale systems. In this work, we propose a novel parallel, algorithm-level fault-tolerant FP-Growth algorithm. We leverage algorithmic properties and MPI advanced features to guarantee an O(1) space complexity, achieved by using the dataset memory space itself for checkpointing. We also propose a recovery algorithm that can use in-memory and disk-based checkpointing,more » though in many cases the recovery can be completed without any disk access, and incurring no memory overhead for checkpointing. We evaluate our FT algorithm on a large scale InfiniBand cluster with several large datasets using up to 2K cores. Our evaluation demonstrates excellent efficiency for checkpointing and recovery in comparison to the disk-based approach. We have also observed 20x average speed-up in comparison to Spark, establishing that a well designed algorithm can easily outperform a solution based on a general fault-tolerant programming model.« less
A Modified MinMax k-Means Algorithm Based on PSO
2016-01-01
The MinMax k-means algorithm is widely used to tackle the effect of bad initialization by minimizing the maximum intraclustering errors. Two parameters, including the exponent parameter and memory parameter, are involved in the executive process. Since different parameters have different clustering errors, it is crucial to choose appropriate parameters. In the original algorithm, a practical framework is given. Such framework extends the MinMax k-means to automatically adapt the exponent parameter to the data set. It has been believed that if the maximum exponent parameter has been set, then the programme can reach the lowest intraclustering errors. However, our experiments show that this is not always correct. In this paper, we modified the MinMax k-means algorithm by PSO to determine the proper values of parameters which can subject the algorithm to attain the lowest clustering errors. The proposed clustering method is tested on some favorite data sets in several different initial situations and is compared to the k-means algorithm and the original MinMax k-means algorithm. The experimental results indicate that our proposed algorithm can reach the lowest clustering errors automatically. PMID:27656201
NASA Astrophysics Data System (ADS)
Ma, Xiaoke; Wang, Bingbo; Yu, Liang
2018-01-01
Community detection is fundamental for revealing the structure-functionality relationship in complex networks, which involves two issues-the quantitative function for community as well as algorithms to discover communities. Despite significant research on either of them, few attempt has been made to establish the connection between the two issues. To attack this problem, a generalized quantification function is proposed for community in weighted networks, which provides a framework that unifies several well-known measures. Then, we prove that the trace optimization of the proposed measure is equivalent with the objective functions of algorithms such as nonnegative matrix factorization, kernel K-means as well as spectral clustering. It serves as the theoretical foundation for designing algorithms for community detection. On the second issue, a semi-supervised spectral clustering algorithm is developed by exploring the equivalence relation via combining the nonnegative matrix factorization and spectral clustering. Different from the traditional semi-supervised algorithms, the partial supervision is integrated into the objective of the spectral algorithm. Finally, through extensive experiments on both artificial and real world networks, we demonstrate that the proposed method improves the accuracy of the traditional spectral algorithms in community detection.
Computational strategies for three-dimensional flow simulations on distributed computer systems
NASA Technical Reports Server (NTRS)
Sankar, Lakshmi N.; Weed, Richard A.
1995-01-01
This research effort is directed towards an examination of issues involved in porting large computational fluid dynamics codes in use within the industry to a distributed computing environment. This effort addresses strategies for implementing the distributed computing in a device independent fashion and load balancing. A flow solver called TEAM presently in use at Lockheed Aeronautical Systems Company was acquired to start this effort. The following tasks were completed: (1) The TEAM code was ported to a number of distributed computing platforms including a cluster of HP workstations located in the School of Aerospace Engineering at Georgia Tech; a cluster of DEC Alpha Workstations in the Graphics visualization lab located at Georgia Tech; a cluster of SGI workstations located at NASA Ames Research Center; and an IBM SP-2 system located at NASA ARC. (2) A number of communication strategies were implemented. Specifically, the manager-worker strategy and the worker-worker strategy were tested. (3) A variety of load balancing strategies were investigated. Specifically, the static load balancing, task queue balancing and the Crutchfield algorithm were coded and evaluated. (4) The classical explicit Runge-Kutta scheme in the TEAM solver was replaced with an LU implicit scheme. And (5) the implicit TEAM-PVM solver was extensively validated through studies of unsteady transonic flow over an F-5 wing, undergoing combined bending and torsional motion. These investigations are documented in extensive detail in the dissertation, 'Computational Strategies for Three-Dimensional Flow Simulations on Distributed Computing Systems', enclosed as an appendix.
Parallel processing for scientific computations
NASA Technical Reports Server (NTRS)
Alkhatib, Hasan S.
1995-01-01
The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.
Computational strategies for three-dimensional flow simulations on distributed computer systems
NASA Astrophysics Data System (ADS)
Sankar, Lakshmi N.; Weed, Richard A.
1995-08-01
This research effort is directed towards an examination of issues involved in porting large computational fluid dynamics codes in use within the industry to a distributed computing environment. This effort addresses strategies for implementing the distributed computing in a device independent fashion and load balancing. A flow solver called TEAM presently in use at Lockheed Aeronautical Systems Company was acquired to start this effort. The following tasks were completed: (1) The TEAM code was ported to a number of distributed computing platforms including a cluster of HP workstations located in the School of Aerospace Engineering at Georgia Tech; a cluster of DEC Alpha Workstations in the Graphics visualization lab located at Georgia Tech; a cluster of SGI workstations located at NASA Ames Research Center; and an IBM SP-2 system located at NASA ARC. (2) A number of communication strategies were implemented. Specifically, the manager-worker strategy and the worker-worker strategy were tested. (3) A variety of load balancing strategies were investigated. Specifically, the static load balancing, task queue balancing and the Crutchfield algorithm were coded and evaluated. (4) The classical explicit Runge-Kutta scheme in the TEAM solver was replaced with an LU implicit scheme. And (5) the implicit TEAM-PVM solver was extensively validated through studies of unsteady transonic flow over an F-5 wing, undergoing combined bending and torsional motion. These investigations are documented in extensive detail in the dissertation, 'Computational Strategies for Three-Dimensional Flow Simulations on Distributed Computing Systems', enclosed as an appendix.
Parallel Wavefront Analysis for a 4D Interferometer
NASA Technical Reports Server (NTRS)
Rao, Shanti R.
2011-01-01
This software provides a programming interface for automating data collection with a PhaseCam interferometer from 4D Technology, and distributing the image-processing algorithm across a cluster of general-purpose computers. Multiple instances of 4Sight (4D Technology s proprietary software) run on a networked cluster of computers. Each connects to a single server (the controller) and waits for instructions. The controller directs the interferometer to several images, then assigns each image to a different computer for processing. When the image processing is finished, the server directs one of the computers to collate and combine the processed images, saving the resulting measurement in a file on a disk. The available software captures approximately 100 images and analyzes them immediately. This software separates the capture and analysis processes, so that analysis can be done at a different time and faster by running the algorithm in parallel across several processors. The PhaseCam family of interferometers can measure an optical system in milliseconds, but it takes many seconds to process the data so that it is usable. In characterizing an adaptive optics system, like the next generation of astronomical observatories, thousands of measurements are required, and the processing time quickly becomes excessive. A programming interface distributes data processing for a PhaseCam interferometer across a Windows computing cluster. A scriptable controller program coordinates data acquisition from the interferometer, storage on networked hard disks, and parallel processing. Idle time of the interferometer is minimized. This architecture is implemented in Python and JavaScript, and may be altered to fit a customer s needs.
Parallel Optimization of Polynomials for Large-scale Problems in Stability and Control
NASA Astrophysics Data System (ADS)
Kamyar, Reza
In this thesis, we focus on some of the NP-hard problems in control theory. Thanks to the converse Lyapunov theory, these problems can often be modeled as optimization over polynomials. To avoid the problem of intractability, we establish a trade off between accuracy and complexity. In particular, we develop a sequence of tractable optimization problems --- in the form of Linear Programs (LPs) and/or Semi-Definite Programs (SDPs) --- whose solutions converge to the exact solution of the NP-hard problem. However, the computational and memory complexity of these LPs and SDPs grow exponentially with the progress of the sequence - meaning that improving the accuracy of the solutions requires solving SDPs with tens of thousands of decision variables and constraints. Setting up and solving such problems is a significant challenge. The existing optimization algorithms and software are only designed to use desktop computers or small cluster computers --- machines which do not have sufficient memory for solving such large SDPs. Moreover, the speed-up of these algorithms does not scale beyond dozens of processors. This in fact is the reason we seek parallel algorithms for setting-up and solving large SDPs on large cluster- and/or super-computers. We propose parallel algorithms for stability analysis of two classes of systems: 1) Linear systems with a large number of uncertain parameters; 2) Nonlinear systems defined by polynomial vector fields. First, we develop a distributed parallel algorithm which applies Polya's and/or Handelman's theorems to some variants of parameter-dependent Lyapunov inequalities with parameters defined over the standard simplex. The result is a sequence of SDPs which possess a block-diagonal structure. We then develop a parallel SDP solver which exploits this structure in order to map the computation, memory and communication to a distributed parallel environment. Numerical tests on a supercomputer demonstrate the ability of the algorithm to efficiently utilize hundreds and potentially thousands of processors, and analyze systems with 100+ dimensional state-space. Furthermore, we extend our algorithms to analyze robust stability over more complicated geometries such as hypercubes and arbitrary convex polytopes. Our algorithms can be readily extended to address a wide variety of problems in control such as Hinfinity synthesis for systems with parametric uncertainty and computing control Lyapunov functions.
A Comparative Evaluation of Anomaly Detection Algorithms for Maritime Video Surveillance
2011-01-01
of k-means clustering and the k- NN Localized p-value Estimator ( KNN -LPE). K-means is a popular distance-based clustering algorithm while KNN -LPE...implemented the sparse cluster identification rule we described in Section 3.1. 2. k-NN Localized p-value Estimator ( KNN -LPE): We implemented this using...Average Density ( KNN -NAD): This was implemented as described in Section 3.4. Algorithm Parameter Settings The global and local density-based anomaly
NASA Astrophysics Data System (ADS)
Chang, Bingguo; Chen, Xiaofei
2018-05-01
Ultrasonography is an important examination for the diagnosis of chronic liver disease. The doctor gives the liver indicators and suggests the patient's condition according to the description of ultrasound report. With the rapid increase in the amount of data of ultrasound report, the workload of professional physician to manually distinguish ultrasound results significantly increases. In this paper, we use the spectral clustering method to cluster analysis of the description of the ultrasound report, and automatically generate the ultrasonic diagnostic diagnosis by machine learning. 110 groups ultrasound examination report of chronic liver disease were selected as test samples in this experiment, and the results were validated by spectral clustering and compared with k-means clustering algorithm. The results show that the accuracy of spectral clustering is 92.73%, which is higher than that of k-means clustering algorithm, which provides a powerful ultrasound-assisted diagnosis for patients with chronic liver disease.
Orbit Clustering Based on Transfer Cost
NASA Technical Reports Server (NTRS)
Gustafson, Eric D.; Arrieta-Camacho, Juan J.; Petropoulos, Anastassios E.
2013-01-01
We propose using cluster analysis to perform quick screening for combinatorial global optimization problems. The key missing component currently preventing cluster analysis from use in this context is the lack of a useable metric function that defines the cost to transfer between two orbits. We study several proposed metrics and clustering algorithms, including k-means and the expectation maximization algorithm. We also show that proven heuristic methods such as the Q-law can be modified to work with cluster analysis.
Clustering by reordering of similarity and Laplacian matrices: Application to galaxy clusters
NASA Astrophysics Data System (ADS)
Mahmoud, E.; Shoukry, A.; Takey, A.
2018-04-01
Similarity metrics, kernels and similarity-based algorithms have gained much attention due to their increasing applications in information retrieval, data mining, pattern recognition and machine learning. Similarity Graphs are often adopted as the underlying representation of similarity matrices and are at the origin of known clustering algorithms such as spectral clustering. Similarity matrices offer the advantage of working in object-object (two-dimensional) space where visualization of clusters similarities is available instead of object-features (multi-dimensional) space. In this paper, sparse ɛ-similarity graphs are constructed and decomposed into strong components using appropriate methods such as Dulmage-Mendelsohn permutation (DMperm) and/or Reverse Cuthill-McKee (RCM) algorithms. The obtained strong components correspond to groups (clusters) in the input (feature) space. Parameter ɛi is estimated locally, at each data point i from a corresponding narrow range of the number of nearest neighbors. Although more advanced clustering techniques are available, our method has the advantages of simplicity, better complexity and direct visualization of the clusters similarities in a two-dimensional space. Also, no prior information about the number of clusters is needed. We conducted our experiments on two and three dimensional, low and high-sized synthetic datasets as well as on an astronomical real-dataset. The results are verified graphically and analyzed using gap statistics over a range of neighbors to verify the robustness of the algorithm and the stability of the results. Combining the proposed algorithm with gap statistics provides a promising tool for solving clustering problems. An astronomical application is conducted for confirming the existence of 45 galaxy clusters around the X-ray positions of galaxy clusters in the redshift range [0.1..0.8]. We re-estimate the photometric redshifts of the identified galaxy clusters and obtain acceptable values compared to published spectroscopic redshifts with a 0.029 standard deviation of their differences.
Improving clustering with metabolic pathway data.
Milone, Diego H; Stegmayer, Georgina; López, Mariana; Kamenetzky, Laura; Carrari, Fernando
2014-04-10
It is a common practice in bioinformatics to validate each group returned by a clustering algorithm through manual analysis, according to a-priori biological knowledge. This procedure helps finding functionally related patterns to propose hypotheses for their behavior and the biological processes involved. Therefore, this knowledge is used only as a second step, after data are just clustered according to their expression patterns. Thus, it could be very useful to be able to improve the clustering of biological data by incorporating prior knowledge into the cluster formation itself, in order to enhance the biological value of the clusters. A novel training algorithm for clustering is presented, which evaluates the biological internal connections of the data points while the clusters are being formed. Within this training algorithm, the calculation of distances among data points and neurons centroids includes a new term based on information from well-known metabolic pathways. The standard self-organizing map (SOM) training versus the biologically-inspired SOM (bSOM) training were tested with two real data sets of transcripts and metabolites from Solanum lycopersicum and Arabidopsis thaliana species. Classical data mining validation measures were used to evaluate the clustering solutions obtained by both algorithms. Moreover, a new measure that takes into account the biological connectivity of the clusters was applied. The results of bSOM show important improvements in the convergence and performance for the proposed clustering method in comparison to standard SOM training, in particular, from the application point of view. Analyses of the clusters obtained with bSOM indicate that including biological information during training can certainly increase the biological value of the clusters found with the proposed method. It is worth to highlight that this fact has effectively improved the results, which can simplify their further analysis.The algorithm is available as a web-demo at http://fich.unl.edu.ar/sinc/web-demo/bsom-lite/. The source code and the data sets supporting the results of this article are available at http://sourceforge.net/projects/sourcesinc/files/bsom.
NASA Astrophysics Data System (ADS)
Chen, Xiao; Li, Yaan; Yu, Jing; Li, Yuxing
2018-01-01
For fast and more effective implementation of tracking multiple targets in a cluttered environment, we propose a multiple targets tracking (MTT) algorithm called maximum entropy fuzzy c-means clustering joint probabilistic data association that combines fuzzy c-means clustering and the joint probabilistic data association (PDA) algorithm. The algorithm uses the membership value to express the probability of the target originating from measurement. The membership value is obtained through fuzzy c-means clustering objective function optimized by the maximum entropy principle. When considering the effect of the public measurement, we use a correction factor to adjust the association probability matrix to estimate the state of the target. As this algorithm avoids confirmation matrix splitting, it can solve the high computational load problem of the joint PDA algorithm. The results of simulations and analysis conducted for tracking neighbor parallel targets and cross targets in a different density cluttered environment show that the proposed algorithm can realize MTT quickly and efficiently in a cluttered environment. Further, the performance of the proposed algorithm remains constant with increasing process noise variance. The proposed algorithm has the advantages of efficiency and low computational load, which can ensure optimum performance when tracking multiple targets in a dense cluttered environment.
A similarity based agglomerative clustering algorithm in networks
NASA Astrophysics Data System (ADS)
Liu, Zhiyuan; Wang, Xiujuan; Ma, Yinghong
2018-04-01
The detection of clusters is benefit for understanding the organizations and functions of networks. Clusters, or communities, are usually groups of nodes densely interconnected but sparsely linked with any other clusters. To identify communities, an efficient and effective community agglomerative algorithm based on node similarity is proposed. The proposed method initially calculates similarities between each pair of nodes, and form pre-partitions according to the principle that each node is in the same community as its most similar neighbor. After that, check each partition whether it satisfies community criterion. For the pre-partitions who do not satisfy, incorporate them with others that having the biggest attraction until there are no changes. To measure the attraction ability of a partition, we propose an attraction index that based on the linked node's importance in networks. Therefore, our proposed method can better exploit the nodes' properties and network's structure. To test the performance of our algorithm, both synthetic and empirical networks ranging in different scales are tested. Simulation results show that the proposed algorithm can obtain superior clustering results compared with six other widely used community detection algorithms.
Generalized Wishart Mixtures for Unsupervised Classification of PolSAR Data
NASA Astrophysics Data System (ADS)
Li, Lan; Chen, Erxue; Li, Zengyuan
2013-01-01
This paper presents an unsupervised clustering algorithm based upon the expectation maximization (EM) algorithm for finite mixture modelling, using the complex wishart probability density function (PDF) for the probabilities. The mixture model enables to consider heterogeneous thematic classes which could not be better fitted by the unimodal wishart distribution. In order to make it fast and robust to calculate, we use the recently proposed generalized gamma distribution (GΓD) for the single polarization intensity data to make the initial partition. Then we use the wishart probability density function for the corresponding sample covariance matrix to calculate the posterior class probabilities for each pixel. The posterior class probabilities are used for the prior probability estimates of each class and weights for all class parameter updates. The proposed method is evaluated and compared with the wishart H-Alpha-A classification. Preliminary results show that the proposed method has better performance.
Taxonomy and clustering in collaborative systems: The case of the on-line encyclopedia Wikipedia
NASA Astrophysics Data System (ADS)
Capocci, A.; Rao, F.; Caldarelli, G.
2008-01-01
In this paper we investigate the nature and structure of the relation between imposed classifications and real clustering in a particular case of a scale-free network given by the on-line encyclopedia Wikipedia. We find a statistical similarity in the distributions of community sizes both by using the top-down approach of the categories division present in the archive and in the bottom-up procedure of community detection given by an algorithm based on the spectral properties of the graph. Regardless of the statistically similar behaviour, the two methods provide a rather different division of the articles, thereby signaling that the nature and presence of power laws is a general feature for these systems and cannot be used as a benchmark to evaluate the suitability of a clustering method.
Revisiting Abell 2744: a powerful synergy of GLASS spectroscopy and HFF photometry
NASA Astrophysics Data System (ADS)
Wang, Xin; Wang
We present new emission line identifications and improve the lensing reconstruction of the mass distribution of galaxy cluster Abell 2744 using the Grism Lens-Amplified Survey from Space (GLASS) spectroscopy and the Hubble Frontier Fields (HFF) imaging. We performed blind and targeted searches for faint line emitters on all objects, including the arc sample, within the field of view (FoV) of GLASS prime pointings. We report 55 high quality spectroscopic redshifts, 5 of which are for arc images. We also present an extensive analysis based on the HFF photometry, measuring the colors and photometric redshifts of all objects within the FoV, and comparing the spectroscopic and photometric redshift estimates. In order to improve the lens model of Abell 2744, we develop a rigorous algorithm to screen arc images, based on their colors and morphology, and selecting the most reliable ones to use. As a result, 25 systems (corresponding to 72 images) pass the screening process and are used to reconstruct the gravitational potential of the cluster pixellated on an adaptive mesh. The resulting total mass distribution is compared with a stellar mass map obtained from the Spitzer Frontier Fields data in order to study the relative distribution of stars and dark matter in the cluster.
Lin, Nan; Jiang, Junhai; Guo, Shicheng; Xiong, Momiao
2015-01-01
Due to the advancement in sensor technology, the growing large medical image data have the ability to visualize the anatomical changes in biological tissues. As a consequence, the medical images have the potential to enhance the diagnosis of disease, the prediction of clinical outcomes and the characterization of disease progression. But in the meantime, the growing data dimensions pose great methodological and computational challenges for the representation and selection of features in image cluster analysis. To address these challenges, we first extend the functional principal component analysis (FPCA) from one dimension to two dimensions to fully capture the space variation of image the signals. The image signals contain a large number of redundant features which provide no additional information for clustering analysis. The widely used methods for removing the irrelevant features are sparse clustering algorithms using a lasso-type penalty to select the features. However, the accuracy of clustering using a lasso-type penalty depends on the selection of the penalty parameters and the threshold value. In practice, they are difficult to determine. Recently, randomized algorithms have received a great deal of attentions in big data analysis. This paper presents a randomized algorithm for accurate feature selection in image clustering analysis. The proposed method is applied to both the liver and kidney cancer histology image data from the TCGA database. The results demonstrate that the randomized feature selection method coupled with functional principal component analysis substantially outperforms the current sparse clustering algorithms in image cluster analysis. PMID:26196383
Clustering approaches to identifying gene expression patterns from DNA microarray data.
Do, Jin Hwan; Choi, Dong-Kug
2008-04-30
The analysis of microarray data is essential for large amounts of gene expression data. In this review we focus on clustering techniques. The biological rationale for this approach is the fact that many co-expressed genes are co-regulated, and identifying co-expressed genes could aid in functional annotation of novel genes, de novo identification of transcription factor binding sites and elucidation of complex biological pathways. Co-expressed genes are usually identified in microarray experiments by clustering techniques. There are many such methods, and the results obtained even for the same datasets may vary considerably depending on the algorithms and metrics for dissimilarity measures used, as well as on user-selectable parameters such as desired number of clusters and initial values. Therefore, biologists who want to interpret microarray data should be aware of the weakness and strengths of the clustering methods used. In this review, we survey the basic principles of clustering of DNA microarray data from crisp clustering algorithms such as hierarchical clustering, K-means and self-organizing maps, to complex clustering algorithms like fuzzy clustering.
Identification of chronic rhinosinusitis phenotypes using cluster analysis.
Soler, Zachary M; Hyer, J Madison; Ramakrishnan, Viswanathan; Smith, Timothy L; Mace, Jess; Rudmik, Luke; Schlosser, Rodney J
2015-05-01
Current clinical classifications of chronic rhinosinusitis (CRS) have been largely defined based upon preconceived notions of factors thought to be important, such as polyp or eosinophil status. Unfortunately, these classification systems have little correlation with symptom severity or treatment outcomes. Unsupervised clustering can be used to identify phenotypic subgroups of CRS patients, describe clinical differences in these clusters and define simple algorithms for classification. A multi-institutional, prospective study of 382 patients with CRS who had failed initial medical therapy completed the Sino-Nasal Outcome Test (SNOT-22), Rhinosinusitis Disability Index (RSDI), Medical Outcomes Study Short Form-12 (SF-12), Pittsburgh Sleep Quality Index (PSQI), and Patient Health Questionnaire (PHQ-2). Objective measures of CRS severity included Brief Smell Identification Test (B-SIT), CT, and endoscopy scoring. All variables were reduced and unsupervised hierarchical clustering was performed. After clusters were defined, variations in medication usage were analyzed. Discriminant analysis was performed to develop a simplified, clinically useful algorithm for clustering. Clustering was largely determined by age, severity of patient reported outcome measures, depression, and fibromyalgia. CT and endoscopy varied somewhat among clusters. Traditional clinical measures, including polyp/atopic status, prior surgery, B-SIT and asthma, did not vary among clusters. A simplified algorithm based upon productivity loss, SNOT-22 score, and age predicted clustering with 89% accuracy. Medication usage among clusters did vary significantly. A simplified algorithm based upon hierarchical clustering is able to classify CRS patients and predict medication usage. Further studies are warranted to determine if such clustering predicts treatment outcomes. © 2015 ARS-AAOA, LLC.
Quantum annealing for combinatorial clustering
NASA Astrophysics Data System (ADS)
Kumar, Vaibhaw; Bass, Gideon; Tomlin, Casey; Dulny, Joseph
2018-02-01
Clustering is a powerful machine learning technique that groups "similar" data points based on their characteristics. Many clustering algorithms work by approximating the minimization of an objective function, namely the sum of within-the-cluster distances between points. The straightforward approach involves examining all the possible assignments of points to each of the clusters. This approach guarantees the solution will be a global minimum; however, the number of possible assignments scales quickly with the number of data points and becomes computationally intractable even for very small datasets. In order to circumvent this issue, cost function minima are found using popular local search-based heuristic approaches such as k-means and hierarchical clustering. Due to their greedy nature, such techniques do not guarantee that a global minimum will be found and can lead to sub-optimal clustering assignments. Other classes of global search-based techniques, such as simulated annealing, tabu search, and genetic algorithms, may offer better quality results but can be too time-consuming to implement. In this work, we describe how quantum annealing can be used to carry out clustering. We map the clustering objective to a quadratic binary optimization problem and discuss two clustering algorithms which are then implemented on commercially available quantum annealing hardware, as well as on a purely classical solver "qbsolv." The first algorithm assigns N data points to K clusters, and the second one can be used to perform binary clustering in a hierarchical manner. We present our results in the form of benchmarks against well-known k-means clustering and discuss the advantages and disadvantages of the proposed techniques.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mackey, Lester; Nachman, Benjamin; Schwartzman, Ariel
Collimated streams of particles produced in high energy physics experiments are organized using clustering algorithms to form jets . To construct jets, the experimental collaborations based at the Large Hadron Collider (LHC) primarily use agglomerative hierarchical clustering schemes known as sequential recombination. We propose a new class of algorithms for clustering jets that use infrared and collinear safe mixture models. These new algorithms, known as fuzzy jets , are clustered using maximum likelihood techniques and can dynamically determine various properties of jets like their size. We show that the fuzzy jet size adds additional information to conventional jet tagging variablesmore » in boosted topologies. Furthermore, we study the impact of pileup and show that with some slight modifications to the algorithm, fuzzy jets can be stable up to high pileup interaction multiplicities.« less
Mackey, Lester; Nachman, Benjamin; Schwartzman, Ariel; ...
2016-06-01
Collimated streams of particles produced in high energy physics experiments are organized using clustering algorithms to form jets . To construct jets, the experimental collaborations based at the Large Hadron Collider (LHC) primarily use agglomerative hierarchical clustering schemes known as sequential recombination. We propose a new class of algorithms for clustering jets that use infrared and collinear safe mixture models. These new algorithms, known as fuzzy jets , are clustered using maximum likelihood techniques and can dynamically determine various properties of jets like their size. We show that the fuzzy jet size adds additional information to conventional jet tagging variablesmore » in boosted topologies. Furthermore, we study the impact of pileup and show that with some slight modifications to the algorithm, fuzzy jets can be stable up to high pileup interaction multiplicities.« less
Setting objective thresholds for rare event detection in flow cytometry
Richards, Adam J.; Staats, Janet; Enzor, Jennifer; McKinnon, Katherine; Frelinger, Jacob; Denny, Thomas N.; Weinhold, Kent J.; Chan, Cliburn
2014-01-01
The accurate identification of rare antigen-specific cytokine positive cells from peripheral blood mononuclear cells (PBMC) after antigenic stimulation in an intracellular staining (ICS) flow cytometry assay is challenging, as cytokine positive events may be fairly diffusely distributed and lack an obvious separation from the negative population. Traditionally, the approach by flow operators has been to manually set a positivity threshold to partition events into cytokine-positive and cytokine-negative. This approach suffers from subjectivity and inconsistency across different flow operators. The use of statistical clustering methods does not remove the need to find an objective threshold between between positive and negative events since consistent identification of rare event subsets is highly challenging for automated algorithms, especially when there is distributional overlap between the positive and negative events (“smear”). We present a new approach, based on the Fβ measure, that is similar to manual thresholding in providing a hard cutoff, but has the advantage of being determined objectively. The performance of this algorithm is compared with results obtained by expert visual gating. Several ICS data sets from the External Quality Assurance Program Oversight Laboratory (EQAPOL) proficiency program were used to make the comparisons. We first show that visually determined thresholds are difficult to reproduce and pose a problem when comparing results across operators or laboratories, as well as problems that occur with the use of commonly employed clustering algorithms. In contrast, a single parameterization for the Fβ method performs consistently across different centers, samples, and instruments because it optimizes the precision/recall tradeoff by using both negative and positive controls. PMID:24727143
NASA Technical Reports Server (NTRS)
Wright, Jeffrey; Thakur, Siddharth
2006-01-01
Loci-STREAM is an evolving computational fluid dynamics (CFD) software tool for simulating possibly chemically reacting, possibly unsteady flows in diverse settings, including rocket engines, turbomachines, oil refineries, etc. Loci-STREAM implements a pressure- based flow-solving algorithm that utilizes unstructured grids. (The benefit of low memory usage by pressure-based algorithms is well recognized by experts in the field.) The algorithm is robust for flows at all speeds from zero to hypersonic. The flexibility of arbitrary polyhedral grids enables accurate, efficient simulation of flows in complex geometries, including those of plume-impingement problems. The present version - Loci-STREAM version 0.9 - includes an interface with the Portable, Extensible Toolkit for Scientific Computation (PETSc) library for access to enhanced linear-equation-solving programs therein that accelerate convergence toward a solution. The name "Loci" reflects the creation of this software within the Loci computational framework, which was developed at Mississippi State University for the primary purpose of simplifying the writing of complex multidisciplinary application programs to run in distributed-memory computing environments including clusters of personal computers. Loci has been designed to relieve application programmers of the details of programming for distributed-memory computers.
Blooming Trees: Substructures and Surrounding Groups of Galaxy Clusters
NASA Astrophysics Data System (ADS)
Yu, Heng; Diaferio, Antonaldo; Serra, Ana Laura; Baldi, Marco
2018-06-01
We develop the Blooming Tree Algorithm, a new technique that uses spectroscopic redshift data alone to identify the substructures and the surrounding groups of galaxy clusters, along with their member galaxies. Based on the estimated binding energy of galaxy pairs, the algorithm builds a binary tree that hierarchically arranges all of the galaxies in the field of view. The algorithm searches for buds, corresponding to gravitational potential minima on the binary tree branches; for each bud, the algorithm combines the number of galaxies, their velocity dispersion, and their average pairwise distance into a parameter that discriminates between the buds that do not correspond to any substructure or group, and thus eventually die, and the buds that correspond to substructures and groups, and thus bloom into the identified structures. We test our new algorithm with a sample of 300 mock redshift surveys of clusters in different dynamical states; the clusters are extracted from a large cosmological N-body simulation of a ΛCDM model. We limit our analysis to substructures and surrounding groups identified in the simulation with mass larger than 1013 h ‑1 M ⊙. With mock redshift surveys with 200 galaxies within 6 h ‑1 Mpc from the cluster center, the technique recovers 80% of the real substructures and 60% of the surrounding groups; in 57% of the identified structures, at least 60% of the member galaxies of the substructures and groups belong to the same real structure. These results improve by roughly a factor of two the performance of the best substructure identification algorithm currently available, the σ plateau algorithm, and suggest that our Blooming Tree Algorithm can be an invaluable tool for detecting substructures of galaxy clusters and investigating their complex dynamics.
PCA based clustering for brain tumor segmentation of T1w MRI images.
Kaya, Irem Ersöz; Pehlivanlı, Ayça Çakmak; Sekizkardeş, Emine Gezmez; Ibrikci, Turgay
2017-03-01
Medical images are huge collections of information that are difficult to store and process consuming extensive computing time. Therefore, the reduction techniques are commonly used as a data pre-processing step to make the image data less complex so that a high-dimensional data can be identified by an appropriate low-dimensional representation. PCA is one of the most popular multivariate methods for data reduction. This paper is focused on T1-weighted MRI images clustering for brain tumor segmentation with dimension reduction by different common Principle Component Analysis (PCA) algorithms. Our primary aim is to present a comparison between different variations of PCA algorithms on MRIs for two cluster methods. Five most common PCA algorithms; namely the conventional PCA, Probabilistic Principal Component Analysis (PPCA), Expectation Maximization Based Principal Component Analysis (EM-PCA), Generalize Hebbian Algorithm (GHA), and Adaptive Principal Component Extraction (APEX) were applied to reduce dimensionality in advance of two clustering algorithms, K-Means and Fuzzy C-Means. In the study, the T1-weighted MRI images of the human brain with brain tumor were used for clustering. In addition to the original size of 512 lines and 512 pixels per line, three more different sizes, 256 × 256, 128 × 128 and 64 × 64, were included in the study to examine their effect on the methods. The obtained results were compared in terms of both the reconstruction errors and the Euclidean distance errors among the clustered images containing the same number of principle components. According to the findings, the PPCA obtained the best results among all others. Furthermore, the EM-PCA and the PPCA assisted K-Means algorithm to accomplish the best clustering performance in the majority as well as achieving significant results with both clustering algorithms for all size of T1w MRI images. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Automatic detection of erythemato-squamous diseases using k-means clustering.
Ubeyli, Elif Derya; Doğdu, Erdoğan
2010-04-01
A new approach based on the implementation of k-means clustering is presented for automated detection of erythemato-squamous diseases. The purpose of clustering techniques is to find a structure for the given data by finding similarities between data according to data characteristics. The studied domain contained records of patients with known diagnosis. The k-means clustering algorithm's task was to classify the data points, in this case the patients with attribute data, to one of the five clusters. The algorithm was used to detect the five erythemato-squamous diseases when 33 features defining five disease indications were used. The purpose is to determine an optimum classification scheme for this problem. The present research demonstrated that the features well represent the erythemato-squamous diseases and the k-means clustering algorithm's task achieved high classification accuracies for only five erythemato-squamous diseases.
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-07-01
UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.
Gaur, Pallavi; Chaturvedi, Anoop
2017-07-22
The clustering pattern and motifs give immense information about any biological data. An application of machine learning algorithms for clustering and candidate motif detection in miRNAs derived from exosomes is depicted in this paper. Recent progress in the field of exosome research and more particularly regarding exosomal miRNAs has led much bioinformatic-based research to come into existence. The information on clustering pattern and candidate motifs in miRNAs of exosomal origin would help in analyzing existing, as well as newly discovered miRNAs within exosomes. Along with obtaining clustering pattern and candidate motifs in exosomal miRNAs, this work also elaborates the usefulness of the machine learning algorithms that can be efficiently used and executed on various programming languages/platforms. Data were clustered and sequence candidate motifs were detected successfully. The results were compared and validated with some available web tools such as 'BLASTN' and 'MEME suite'. The machine learning algorithms for aforementioned objectives were applied successfully. This work elaborated utility of machine learning algorithms and language platforms to achieve the tasks of clustering and candidate motif detection in exosomal miRNAs. With the information on mentioned objectives, deeper insight would be gained for analyses of newly discovered miRNAs in exosomes which are considered to be circulating biomarkers. In addition, the execution of machine learning algorithms on various language platforms gives more flexibility to users to try multiple iterations according to their requirements. This approach can be applied to other biological data-mining tasks as well.
Wehmeyer, Christoph; Falk von Rudorff, Guido; Wolf, Sebastian; Kabbe, Gabriel; Schärf, Daniel; Kühne, Thomas D; Sebastiani, Daniel
2012-11-21
We present a stochastic, swarm intelligence-based optimization algorithm for the prediction of global minima on potential energy surfaces of molecular cluster structures. Our optimization approach is a modification of the artificial bee colony (ABC) algorithm which is inspired by the foraging behavior of honey bees. We apply our modified ABC algorithm to the problem of global geometry optimization of molecular cluster structures and show its performance for clusters with 2-57 particles and different interatomic interaction potentials.
NASA Astrophysics Data System (ADS)
Wehmeyer, Christoph; Falk von Rudorff, Guido; Wolf, Sebastian; Kabbe, Gabriel; Schärf, Daniel; Kühne, Thomas D.; Sebastiani, Daniel
2012-11-01
We present a stochastic, swarm intelligence-based optimization algorithm for the prediction of global minima on potential energy surfaces of molecular cluster structures. Our optimization approach is a modification of the artificial bee colony (ABC) algorithm which is inspired by the foraging behavior of honey bees. We apply our modified ABC algorithm to the problem of global geometry optimization of molecular cluster structures and show its performance for clusters with 2-57 particles and different interatomic interaction potentials.
An adaptive tracker for ShipIR/NTCS
NASA Astrophysics Data System (ADS)
Ramaswamy, Srinivasan; Vaitekunas, David A.
2015-05-01
A key component in any image-based tracking system is the adaptive tracking algorithm used to segment the image into potential targets, rank-and-select the best candidate target, and the gating of the selected target to further improve tracker performance. This paper will describe a new adaptive tracker algorithm added to the naval threat countermeasure simulator (NTCS) of the NATO-standard ship signature model (ShipIR). The new adaptive tracking algorithm is an optional feature used with any of the existing internal NTCS or user-defined seeker algorithms (e.g., binary centroid, intensity centroid, and threshold intensity centroid). The algorithm segments the detected pixels into clusters, and the smallest set of clusters that meet the detection criterion is obtained by using a knapsack algorithm to identify the set of clusters that should not be used. The rectangular area containing the chosen clusters defines an inner boundary, from which a weighted centroid is calculated as the aim-point. A track-gate is then positioned around the clusters, taking into account the rate of change of the bounding area and compensating for any gimbal displacement. A sequence of scenarios is used to test the new tracking algorithm on a generic unclassified DDG ShipIR model, with and without flares, and demonstrate how some of the key seeker signals are impacted by both the ship and flare intrinsic signatures.
Xue, Zhong; Shen, Dinggang; Li, Hai; Wong, Stephen
2010-01-01
The traditional fuzzy clustering algorithm and its extensions have been successfully applied in medical image segmentation. However, because of the variability of tissues and anatomical structures, the clustering results might be biased by the tissue population and intensity differences. For example, clustering-based algorithms tend to over-segment white matter tissues of MR brain images. To solve this problem, we introduce a tissue probability map constrained clustering algorithm and apply it to serial MR brain image segmentation, i.e., a series of 3-D MR brain images of the same subject at different time points. Using the new serial image segmentation algorithm in the framework of the CLASSIC framework, which iteratively segments the images and estimates the longitudinal deformations, we improved both accuracy and robustness for serial image computing, and at the mean time produced longitudinally consistent segmentation and stable measures. In the algorithm, the tissue probability maps consist of both the population-based and subject-specific segmentation priors. Experimental study using both simulated longitudinal MR brain data and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data confirmed that using both priors more accurate and robust segmentation results can be obtained. The proposed algorithm can be applied in longitudinal follow up studies of MR brain imaging with subtle morphological changes for neurological disorders. PMID:26566399
Mining subspace clusters from DNA microarray data using large itemset techniques.
Chang, Ye-In; Chen, Jiun-Rung; Tsai, Yueh-Chi
2009-05-01
Mining subspace clusters from the DNA microarrays could help researchers identify those genes which commonly contribute to a disease, where a subspace cluster indicates a subset of genes whose expression levels are similar under a subset of conditions. Since in a DNA microarray, the number of genes is far larger than the number of conditions, those previous proposed algorithms which compute the maximum dimension sets (MDSs) for any two genes will take a long time to mine subspace clusters. In this article, we propose the Large Itemset-Based Clustering (LISC) algorithm for mining subspace clusters. Instead of constructing MDSs for any two genes, we construct only MDSs for any two conditions. Then, we transform the task of finding the maximal possible gene sets into the problem of mining large itemsets from the condition-pair MDSs. Since we are only interested in those subspace clusters with gene sets as large as possible, it is desirable to pay attention to those gene sets which have reasonable large support values in the condition-pair MDSs. From our simulation results, we show that the proposed algorithm needs shorter processing time than those previous proposed algorithms which need to construct gene-pair MDSs.
NASA Astrophysics Data System (ADS)
Nguyen, Sy Dzung; Nguyen, Quoc Hung; Choi, Seung-Bok
2015-01-01
This paper presents a new algorithm for building an adaptive neuro-fuzzy inference system (ANFIS) from a training data set called B-ANFIS. In order to increase accuracy of the model, the following issues are executed. Firstly, a data merging rule is proposed to build and perform a data-clustering strategy. Subsequently, a combination of clustering processes in the input data space and in the joint input-output data space is presented. Crucial reason of this task is to overcome problems related to initialization and contradictory fuzzy rules, which usually happen when building ANFIS. The clustering process in the input data space is accomplished based on a proposed merging-possibilistic clustering (MPC) algorithm. The effectiveness of this process is evaluated to resume a clustering process in the joint input-output data space. The optimal parameters obtained after completion of the clustering process are used to build ANFIS. Simulations based on a numerical data, 'Daily Data of Stock A', and measured data sets of a smart damper are performed to analyze and estimate accuracy. In addition, convergence and robustness of the proposed algorithm are investigated based on both theoretical and testing approaches.
Blocked inverted indices for exact clustering of large chemical spaces.
Thiel, Philipp; Sach-Peltason, Lisa; Ottmann, Christian; Kohlbacher, Oliver
2014-09-22
The calculation of pairwise compound similarities based on fingerprints is one of the fundamental tasks in chemoinformatics. Methods for efficient calculation of compound similarities are of the utmost importance for various applications like similarity searching or library clustering. With the increasing size of public compound databases, exact clustering of these databases is desirable, but often computationally prohibitively expensive. We present an optimized inverted index algorithm for the calculation of all pairwise similarities on 2D fingerprints of a given data set. In contrast to other algorithms, it neither requires GPU computing nor yields a stochastic approximation of the clustering. The algorithm has been designed to work well with multicore architectures and shows excellent parallel speedup. As an application example of this algorithm, we implemented a deterministic clustering application, which has been designed to decompose virtual libraries comprising tens of millions of compounds in a short time on current hardware. Our results show that our implementation achieves more than 400 million Tanimoto similarity calculations per second on a common desktop CPU. Deterministic clustering of the available chemical space thus can be done on modern multicore machines within a few days.
Early dynamical evolution of substructured stellar clusters
NASA Astrophysics Data System (ADS)
Dorval, Julien; Boily, Christian
2015-08-01
It is now widely accepted that stellar clusters form with a high level of substructure (Kuhn et al. 2014, Bate 2009), inherited from the molecular cloud and the star formation process. Evidence from observations and simulations also indicate the stars in such young clusters form a subvirial system (Kirk et al. 2007, Maschberger et al. 2010). The subsequent dynamical evolution can cause important mass loss, ejecting a large part of the birth population in the field. It can also imprint the stellar population and still be inferred from observations of evolved clusters. Nbody simulations allow a better understanding of these early twists and turns, given realistic initial conditions. Nowadays, substructured, clumpy young clusters are usually obtained through pseudo-fractal growth (Goodwin et al. 2004) and velocity inheritance. Such models are visually realistics and are very useful, they are however somewhat artificial in their velocity distribution. I introduce a new way to create clumpy initial conditions through a "Hubble expansion" which naturally produces self consistent clumps, velocity-wise. A velocity distribution analysis shows the new method produces realistic models, consistent with the dynamical state of the newly created cores in hydrodynamic simulation of cluster formation (Klessen & Burkert 2000). I use these initial conditions to investigate the dynamical evolution of young subvirial clusters, up to 80000 stars. I find an overall soft evolution, with hierarchical merging leading to a high level of mass segregation. I investigate the influence of the mass function on the fate of the cluster, specifically on the amount of mass loss induced by the early violent relaxation. Using a new binary detection algorithm, I also find a strong processing of the native binary population.
A Game Theory Algorithm for Intra-Cluster Data Aggregation in a Vehicular Ad Hoc Network
Chen, Yuzhong; Weng, Shining; Guo, Wenzhong; Xiong, Naixue
2016-01-01
Vehicular ad hoc networks (VANETs) have an important role in urban management and planning. The effective integration of vehicle information in VANETs is critical to traffic analysis, large-scale vehicle route planning and intelligent transportation scheduling. However, given the limitations in the precision of the output information of a single sensor and the difficulty of information sharing among various sensors in a highly dynamic VANET, effectively performing data aggregation in VANETs remains a challenge. Moreover, current studies have mainly focused on data aggregation in large-scale environments but have rarely discussed the issue of intra-cluster data aggregation in VANETs. In this study, we propose a multi-player game theory algorithm for intra-cluster data aggregation in VANETs by analyzing the competitive and cooperative relationships among sensor nodes. Several sensor-centric metrics are proposed to measure the data redundancy and stability of a cluster. We then study the utility function to achieve efficient intra-cluster data aggregation by considering both data redundancy and cluster stability. In particular, we prove the existence of a unique Nash equilibrium in the game model, and conduct extensive experiments to validate the proposed algorithm. Results demonstrate that the proposed algorithm has advantages over typical data aggregation algorithms in both accuracy and efficiency. PMID:26907272
A Game Theory Algorithm for Intra-Cluster Data Aggregation in a Vehicular Ad Hoc Network.
Chen, Yuzhong; Weng, Shining; Guo, Wenzhong; Xiong, Naixue
2016-02-19
Vehicular ad hoc networks (VANETs) have an important role in urban management and planning. The effective integration of vehicle information in VANETs is critical to traffic analysis, large-scale vehicle route planning and intelligent transportation scheduling. However, given the limitations in the precision of the output information of a single sensor and the difficulty of information sharing among various sensors in a highly dynamic VANET, effectively performing data aggregation in VANETs remains a challenge. Moreover, current studies have mainly focused on data aggregation in large-scale environments but have rarely discussed the issue of intra-cluster data aggregation in VANETs. In this study, we propose a multi-player game theory algorithm for intra-cluster data aggregation in VANETs by analyzing the competitive and cooperative relationships among sensor nodes. Several sensor-centric metrics are proposed to measure the data redundancy and stability of a cluster. We then study the utility function to achieve efficient intra-cluster data aggregation by considering both data redundancy and cluster stability. In particular, we prove the existence of a unique Nash equilibrium in the game model, and conduct extensive experiments to validate the proposed algorithm. Results demonstrate that the proposed algorithm has advantages over typical data aggregation algorithms in both accuracy and efficiency.
Real Time Intelligent Target Detection and Analysis with Machine Vision
NASA Technical Reports Server (NTRS)
Howard, Ayanna; Padgett, Curtis; Brown, Kenneth
2000-01-01
We present an algorithm for detecting a specified set of targets for an Automatic Target Recognition (ATR) application. ATR involves processing images for detecting, classifying, and tracking targets embedded in a background scene. We address the problem of discriminating between targets and nontarget objects in a scene by evaluating 40x40 image blocks belonging to an image. Each image block is first projected onto a set of templates specifically designed to separate images of targets embedded in a typical background scene from those background images without targets. These filters are found using directed principal component analysis which maximally separates the two groups. The projected images are then clustered into one of n classes based on a minimum distance to a set of n cluster prototypes. These cluster prototypes have previously been identified using a modified clustering algorithm based on prior sensed data. Each projected image pattern is then fed into the associated cluster's trained neural network for classification. A detailed description of our algorithm will be given in this paper. We outline our methodology for designing the templates, describe our modified clustering algorithm, and provide details on the neural network classifiers. Evaluation of the overall algorithm demonstrates that our detection rates approach 96% with a false positive rate of less than 0.03%.
Adaptive block online learning target tracking based on super pixel segmentation
NASA Astrophysics Data System (ADS)
Cheng, Yue; Li, Jianzeng
2018-04-01
Video target tracking technology under the unremitting exploration of predecessors has made big progress, but there are still lots of problems not solved. This paper proposed a new algorithm of target tracking based on image segmentation technology. Firstly we divide the selected region using simple linear iterative clustering (SLIC) algorithm, after that, we block the area with the improved density-based spatial clustering of applications with noise (DBSCAN) clustering algorithm. Each sub-block independently trained classifier and tracked, then the algorithm ignore the failed tracking sub-block while reintegrate the rest of the sub-blocks into tracking box to complete the target tracking. The experimental results show that our algorithm can work effectively under occlusion interference, rotation change, scale change and many other problems in target tracking compared with the current mainstream algorithms.
Di Pietro, C; Di Pietro, V; Emmanuele, G; Ferro, A; Maugeri, T; Modica, E; Pigola, G; Pulvirenti, A; Purrello, M; Ragusa, M; Scalia, M; Shasha, D; Travali, S; Zimmitti, V
2003-01-01
In this paper we present a new Multiple Sequence Alignment (MSA) algorithm called AntiClusAl. The method makes use of the commonly use idea of aligning homologous sequences belonging to classes generated by some clustering algorithm, and then continue the alignment process ina bottom-up way along a suitable tree structure. The final result is then read at the root of the tree. Multiple sequence alignment in each cluster makes use of the progressive alignment with the 1-median (center) of the cluster. The 1-median of set S of sequences is the element of S which minimizes the average distance from any other sequence in S. Its exact computation requires quadratic time. The basic idea of our proposed algorithm is to make use of a simple and natural algorithmic technique based on randomized tournaments which has been successfully applied to large size search problems in general metric spaces. In particular a clustering algorithm called Antipole tree and an approximate linear 1-median computation are used. Our algorithm compared with Clustal W, a widely used tool to MSA, shows a better running time results with fully comparable alignment quality. A successful biological application showing high aminoacid conservation during evolution of Xenopus laevis SOD2 is also cited.
NASA Technical Reports Server (NTRS)
Wharton, S. W.
1980-01-01
An Interactive Cluster Analysis Procedure (ICAP) was developed to derive classifier training statistics from remotely sensed data. The algorithm interfaces the rapid numerical processing capacity of a computer with the human ability to integrate qualitative information. Control of the clustering process alternates between the algorithm, which creates new centroids and forms clusters and the analyst, who evaluate and elect to modify the cluster structure. Clusters can be deleted or lumped pairwise, or new centroids can be added. A summary of the cluster statistics can be requested to facilitate cluster manipulation. The ICAP was implemented in APL (A Programming Language), an interactive computer language. The flexibility of the algorithm was evaluated using data from different LANDSAT scenes to simulate two situations: one in which the analyst is assumed to have no prior knowledge about the data and wishes to have the clusters formed more or less automatically; and the other in which the analyst is assumed to have some knowledge about the data structure and wishes to use that information to closely supervise the clustering process. For comparison, an existing clustering method was also applied to the two data sets.
NASA Astrophysics Data System (ADS)
Zhou, Shuguang; Zhou, Kefa; Wang, Jinlin; Yang, Genfang; Wang, Shanshan
2017-12-01
Cluster analysis is a well-known technique that is used to analyze various types of data. In this study, cluster analysis is applied to geochemical data that describe 1444 stream sediment samples collected in northwestern Xinjiang with a sample spacing of approximately 2 km. Three algorithms (the hierarchical, k-means, and fuzzy c-means algorithms) and six data transformation methods (the z-score standardization, ZST; the logarithmic transformation, LT; the additive log-ratio transformation, ALT; the centered log-ratio transformation, CLT; the isometric log-ratio transformation, ILT; and no transformation, NT) are compared in terms of their effects on the cluster analysis of the geochemical compositional data. The study shows that, on the one hand, the ZST does not affect the results of column- or variable-based (R-type) cluster analysis, whereas the other methods, including the LT, the ALT, and the CLT, have substantial effects on the results. On the other hand, the results of the row- or observation-based (Q-type) cluster analysis obtained from the geochemical data after applying NT and the ZST are relatively poor. However, we derive some improved results from the geochemical data after applying the CLT, the ILT, the LT, and the ALT. Moreover, the k-means and fuzzy c-means clustering algorithms are more reliable than the hierarchical algorithm when they are used to cluster the geochemical data. We apply cluster analysis to the geochemical data to explore for Au deposits within the study area, and we obtain a good correlation between the results retrieved by combining the CLT or the ILT with the k-means or fuzzy c-means algorithms and the potential zones of Au mineralization. Therefore, we suggest that the combination of the CLT or the ILT with the k-means or fuzzy c-means algorithms is an effective tool to identify potential zones of mineralization from geochemical data.
A ground truth based comparative study on clustering of gene expression data.
Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue
2008-05-01
Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.
Enabling analytical and Modeling Tools for Enhanced Disease Surveillance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dawn K. Manley
2003-04-01
Early detection, identification, and warning are essential to minimize casualties from a biological attack. For covert attacks, sick people are likely to provide the first indication of an attack. An enhanced medical surveillance system that synthesizes distributed health indicator information and rapidly analyzes the information can dramatically increase the number of lives saved. Current surveillance methods to detect both biological attacks and natural outbreaks are hindered by factors such as distributed ownership of information, incompatible data storage and analysis programs, and patient privacy concerns. Moreover, because data are not widely shared, few data mining algorithms have been tested on andmore » applied to diverse health indicator data. This project addressed both integration of multiple data sources and development and integration of analytical tools for rapid detection of disease outbreaks. As a first prototype, we developed an application to query and display distributed patient records. This application incorporated need-to-know access control and incorporated data from standard commercial databases. We developed and tested two different algorithms for outbreak recognition. The first is a pattern recognition technique that searches for space-time data clusters that may signal a disease outbreak. The second is a genetic algorithm to design and train neural networks (GANN) that we applied toward disease forecasting. We tested these algorithms against influenza, respiratory illness, and Dengue Fever data. Through this LDRD in combination with other internal funding, we delivered a distributed simulation capability to synthesize disparate information and models for earlier recognition and improved decision-making in the event of a biological attack. The architecture incorporates user feedback and control so that a user's decision inputs can impact the scenario outcome as well as integrated security and role-based access-control for communicating between distributed data and analytical tools. This work included construction of interfaces to various commercial database products and to one of the data analysis algorithms developed through this LDRD.« less
Clustering Millions of Faces by Identity.
Otto, Charles; Wang, Dayong; Jain, Anil K
2018-02-01
Given a large collection of unlabeled face images, we address the problem of clustering faces into an unknown number of identities. This problem is of interest in social media, law enforcement, and other applications, where the number of faces can be of the order of hundreds of million, while the number of identities (clusters) can range from a few thousand to millions. To address the challenges of run-time complexity and cluster quality, we present an approximate Rank-Order clustering algorithm that performs better than popular clustering algorithms (k-Means and Spectral). Our experiments include clustering up to 123 million face images into over 10 million clusters. Clustering results are analyzed in terms of external (known face labels) and internal (unknown face labels) quality measures, and run-time. Our algorithm achieves an F-measure of 0.87 on the LFW benchmark (13 K faces of 5,749 individuals), which drops to 0.27 on the largest dataset considered (13 K faces in LFW + 123M distractor images). Additionally, we show that frames in the YouTube benchmark can be clustered with an F-measure of 0.71. An internal per-cluster quality measure is developed to rank individual clusters for manual exploration of high quality clusters that are compact and isolated.
Application of artificial intelligence to search ground-state geometry of clusters
NASA Astrophysics Data System (ADS)
Lemes, Maurício Ruv; Marim, L. R.; dal Pino, A.
2002-08-01
We introduce a global optimization procedure, the neural-assisted genetic algorithm (NAGA). It combines the power of an artificial neural network (ANN) with the versatility of the genetic algorithm. This method is suitable to solve optimization problems that depend on some kind of heuristics to limit the search space. If a reasonable amount of data is available, the ANN can ``understand'' the problem and provide the genetic algorithm with a selected population of elements that will speed up the search for the optimum solution. We tested the method in a search for the ground-state geometry of silicon clusters. We trained the ANN with information about the geometry and energetics of small silicon clusters. Next, the ANN learned how to restrict the configurational space for larger silicon clusters. For Si10 and Si20, we noticed that the NAGA is at least three times faster than the ``pure'' genetic algorithm. As the size of the cluster increases, it is expected that the gain in terms of time will increase as well.
Saro, A.
2015-10-12
In this study, we cross-match galaxy cluster candidates selected via their Sunyaev–Zel'dovich effect (SZE) signatures in 129.1 deg 2 of the South Pole Telescope 2500d SPT-SZ survey with optically identified clusters selected from the Dark Energy Survey science verification data. We identify 25 clusters between 0.1 ≲ z ≲ 0.8 in the union of the SPT-SZ and redMaPPer (RM) samples. RM is an optical cluster finding algorithm that also returns a richness estimate for each cluster. We model the richness λ-mass relation with the following function 500> ∝ B λlnM 500 + C λlnE(z) and use SPT-SZ cluster masses andmore » RM richnesses λ to constrain the parameters. We find B λ = 1.14 +0.21 –0.18 and C λ = 0.73 +0.77 –0.75. The associated scatter in mass at fixed richness is σ lnM|λ = 0.18 +0.08 –0.05 at a characteristic richness λ = 70. We demonstrate that our model provides an adequate description of the matched sample, showing that the fraction of SPT-SZ-selected clusters with RM counterparts is consistent with expectations and that the fraction of RM-selected clusters with SPT-SZ counterparts is in mild tension with expectation. We model the optical-SZE cluster positional offset distribution with the sum of two Gaussians, showing that it is consistent with a dominant, centrally peaked population and a subdominant population characterized by larger offsets. We also cross-match the RM catalogue with SPT-SZ candidates below the official catalogue threshold significance ξ = 4.5, using the RM catalogue to provide optical confirmation and redshifts for 15 additional clusters with ξ ϵ [4, 4.5].« less
Burns, Randal; Roncal, William Gray; Kleissas, Dean; Lillaney, Kunal; Manavalan, Priya; Perlman, Eric; Berger, Daniel R; Bock, Davi D; Chung, Kwanghun; Grosenick, Logan; Kasthuri, Narayanan; Weiler, Nicholas C; Deisseroth, Karl; Kazhdan, Michael; Lichtman, Jeff; Reid, R Clay; Smith, Stephen J; Szalay, Alexander S; Vogelstein, Joshua T; Vogelstein, R Jacob
2013-01-01
We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes - neural connectivity maps of the brain-using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at openconnecto.me. The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems-reads to parallel disk arrays and writes to solid-state storage-to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effec-tiveness of spatial data organization.
Burns, Randal; Roncal, William Gray; Kleissas, Dean; Lillaney, Kunal; Manavalan, Priya; Perlman, Eric; Berger, Daniel R.; Bock, Davi D.; Chung, Kwanghun; Grosenick, Logan; Kasthuri, Narayanan; Weiler, Nicholas C.; Deisseroth, Karl; Kazhdan, Michael; Lichtman, Jeff; Reid, R. Clay; Smith, Stephen J.; Szalay, Alexander S.; Vogelstein, Joshua T.; Vogelstein, R. Jacob
2013-01-01
We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes— neural connectivity maps of the brain—using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at openconnecto.me. The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems—reads to parallel disk arrays and writes to solid-state storage—to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effec-tiveness of spatial data organization. PMID:24401992
Critical exponents of the explosive percolation transition
NASA Astrophysics Data System (ADS)
da Costa, R. A.; Dorogovtsev, S. N.; Goltsev, A. V.; Mendes, J. F. F.
2014-04-01
In a new type of percolation phase transition, which was observed in a set of nonequilibrium models, each new connection between vertices is chosen from a number of possibilities by an Achlioptas-like algorithm. This causes preferential merging of small components and delays the emergence of the percolation cluster. First simulations led to a conclusion that a percolation cluster in this irreversible process is born discontinuously, by a discontinuous phase transition, which results in the term "explosive percolation transition." We have shown that this transition is actually continuous (second order) though with an anomalously small critical exponent of the percolation cluster. Here we propose an efficient numerical method enabling us to find the critical exponents and other characteristics of this second-order transition for a representative set of explosive percolation models with different number of choices. The method is based on gluing together the numerical solutions of evolution equations for the cluster size distribution and power-law asymptotics. For each of the models, with high precision, we obtain critical exponents and the critical point.
Charging of nanoparticles in stationary plasma in a gas aggregation cluster source
NASA Astrophysics Data System (ADS)
Blažek, J.; Kousal, J.; Biederman, H.; Kylián, O.; Hanuš, J.; Slavínská, D.
2015-10-01
Clusters that grow into nanoparticles near the magnetron target of the gas aggregation cluster source (GAS) may acquire electric charge by collecting electrons and ions or through other mechanisms like secondary- or photo-electron emissions. The region of the GAS close to magnetron may be considered as stationary plasma. The steady state charge distribution on nanoparticles can be determined by means of three possible models—fluid model, kinetic model and model employing Monte Carlo simulations—of cluster charging. In the paper the mathematical and numerical aspects of these models are analyzed in detail and close links between them are clarified. Among others it is shown that Monte Carlo simulation may be considered as a particular numerical technique of solving kinetic equations. Similarly the equations of the fluid model result, after some approximation, from averaged kinetic equations. A new algorithm solving an in principle unlimited set of kinetic equations is suggested. Its efficiency is verified on physical models based on experimental input data.
Automatic Clustering Using FSDE-Forced Strategy Differential Evolution
NASA Astrophysics Data System (ADS)
Yasid, A.
2018-01-01
Clustering analysis is important in datamining for unsupervised data, cause no adequate prior knowledge. One of the important tasks is defining the number of clusters without user involvement that is known as automatic clustering. This study intends on acquiring cluster number automatically utilizing forced strategy differential evolution (AC-FSDE). Two mutation parameters, namely: constant parameter and variable parameter are employed to boost differential evolution performance. Four well-known benchmark datasets were used to evaluate the algorithm. Moreover, the result is compared with other state of the art automatic clustering methods. The experiment results evidence that AC-FSDE is better or competitive with other existing automatic clustering algorithm.
Hybrid fuzzy cluster ensemble framework for tumor clustering from biomolecular data.
Yu, Zhiwen; Chen, Hantao; You, Jane; Han, Guoqiang; Li, Le
2013-01-01
Cancer class discovery using biomolecular data is one of the most important tasks for cancer diagnosis and treatment. Tumor clustering from gene expression data provides a new way to perform cancer class discovery. Most of the existing research works adopt single-clustering algorithms to perform tumor clustering is from biomolecular data that lack robustness, stability, and accuracy. To further improve the performance of tumor clustering from biomolecular data, we introduce the fuzzy theory into the cluster ensemble framework for tumor clustering from biomolecular data, and propose four kinds of hybrid fuzzy cluster ensemble frameworks (HFCEF), named as HFCEF-I, HFCEF-II, HFCEF-III, and HFCEF-IV, respectively, to identify samples that belong to different types of cancers. The difference between HFCEF-I and HFCEF-II is that they adopt different ensemble generator approaches to generate a set of fuzzy matrices in the ensemble. Specifically, HFCEF-I applies the affinity propagation algorithm (AP) to perform clustering on the sample dimension and generates a set of fuzzy matrices in the ensemble based on the fuzzy membership function and base samples selected by AP. HFCEF-II adopts AP to perform clustering on the attribute dimension, generates a set of subspaces, and obtains a set of fuzzy matrices in the ensemble by performing fuzzy c-means on subspaces. Compared with HFCEF-I and HFCEF-II, HFCEF-III and HFCEF-IV consider the characteristics of HFCEF-I and HFCEF-II. HFCEF-III combines HFCEF-I and HFCEF-II in a serial way, while HFCEF-IV integrates HFCEF-I and HFCEF-II in a concurrent way. HFCEFs adopt suitable consensus functions, such as the fuzzy c-means algorithm or the normalized cut algorithm (Ncut), to summarize generated fuzzy matrices, and obtain the final results. The experiments on real data sets from UCI machine learning repository and cancer gene expression profiles illustrate that 1) the proposed hybrid fuzzy cluster ensemble frameworks work well on real data sets, especially biomolecular data, and 2) the proposed approaches are able to provide more robust, stable, and accurate results when compared with the state-of-the-art single clustering algorithms and traditional cluster ensemble approaches.
ABCluster: the artificial bee colony algorithm for cluster global optimization.
Zhang, Jun; Dolg, Michael
2015-10-07
Global optimization of cluster geometries is of fundamental importance in chemistry and an interesting problem in applied mathematics. In this work, we introduce a relatively new swarm intelligence algorithm, i.e. the artificial bee colony (ABC) algorithm proposed in 2005, to this field. It is inspired by the foraging behavior of a bee colony, and only three parameters are needed to control it. We applied it to several potential functions of quite different nature, i.e., the Coulomb-Born-Mayer, Lennard-Jones, Morse, Z and Gupta potentials. The benchmarks reveal that for long-ranged potentials the ABC algorithm is very efficient in locating the global minimum, while for short-ranged ones it is sometimes trapped into a local minimum funnel on a potential energy surface of large clusters. We have released an efficient, user-friendly, and free program "ABCluster" to realize the ABC algorithm. It is a black-box program for non-experts as well as experts and might become a useful tool for chemists to study clusters.
Hybrid approach of selecting hyperparameters of support vector machine for regression.
Jeng, Jin-Tsong
2006-06-01
To select the hyperparameters of the support vector machine for regression (SVR), a hybrid approach is proposed to determine the kernel parameter of the Gaussian kernel function and the epsilon value of Vapnik's epsilon-insensitive loss function. The proposed hybrid approach includes a competitive agglomeration (CA) clustering algorithm and a repeated SVR (RSVR) approach. Since the CA clustering algorithm is used to find the nearly "optimal" number of clusters and the centers of clusters in the clustering process, the CA clustering algorithm is applied to select the Gaussian kernel parameter. Additionally, an RSVR approach that relies on the standard deviation of a training error is proposed to obtain an epsilon in the loss function. Finally, two functions, one real data set (i.e., a time series of quarterly unemployment rate for West Germany) and an identification of nonlinear plant are used to verify the usefulness of the hybrid approach.
Fuzzy Document Clustering Approach using WordNet Lexical Categories
NASA Astrophysics Data System (ADS)
Gharib, Tarek F.; Fouad, Mohammed M.; Aref, Mostafa M.
Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. This area is growing rapidly mainly because of the strong need for analysing the huge and large amount of textual data that reside on internal file systems and the Web. Text document clustering provides an effective navigation mechanism to organize this large amount of data by grouping their documents into a small number of meaningful classes. In this paper we proposed a fuzzy text document clustering approach using WordNet lexical categories and Fuzzy c-Means algorithm. Some experiments are performed to compare efficiency of the proposed approach with the recently reported approaches. Experimental results show that Fuzzy clustering leads to great performance results. Fuzzy c-means algorithm overcomes other classical clustering algorithms like k-means and bisecting k-means in both clustering quality and running time efficiency.
2015-01-01
Background Though cluster analysis has become a routine analytic task for bioinformatics research, it is still arduous for researchers to assess the quality of a clustering result. To select the best clustering method and its parameters for a dataset, researchers have to run multiple clustering algorithms and compare them. However, such a comparison task with multiple clustering results is cognitively demanding and laborious. Results In this paper, we present XCluSim, a visual analytics tool that enables users to interactively compare multiple clustering results based on the Visual Information Seeking Mantra. We build a taxonomy for categorizing existing techniques of clustering results visualization in terms of the Gestalt principles of grouping. Using the taxonomy, we choose the most appropriate interactive visualizations for presenting individual clustering results from different types of clustering algorithms. The efficacy of XCluSim is shown through case studies with a bioinformatician. Conclusions Compared to other relevant tools, XCluSim enables users to compare multiple clustering results in a more scalable manner. Moreover, XCluSim supports diverse clustering algorithms and dedicated visualizations and interactions for different types of clustering results, allowing more effective exploration of details on demand. Through case studies with a bioinformatics researcher, we received positive feedback on the functionalities of XCluSim, including its ability to help identify stably clustered items across multiple clustering results. PMID:26328893
Shi, Weifang; Zeng, Weihua
2013-01-01
Reducing human vulnerability to chemical hazards in the industrialized city is a matter of great urgency. Vulnerability mapping is an alternative approach for providing vulnerability-reducing interventions in a region. This study presents a method for mapping human vulnerability to chemical hazards by using clustering analysis for effective vulnerability reduction. Taking the city of Shanghai as the study area, we measure human exposure to chemical hazards by using the proximity model with additionally considering the toxicity of hazardous substances, and capture the sensitivity and coping capacity with corresponding indicators. We perform an improved k-means clustering approach on the basis of genetic algorithm by using a 500 m × 500 m geographical grid as basic spatial unit. The sum of squared errors and silhouette coefficient are combined to measure the quality of clustering and to determine the optimal clustering number. Clustering result reveals a set of six typical human vulnerability patterns that show distinct vulnerability dimension combinations. The vulnerability mapping of the study area reflects cluster-specific vulnerability characteristics and their spatial distribution. Finally, we suggest specific points that can provide new insights in rationally allocating the limited funds for the vulnerability reduction of each cluster. PMID:23787337
Clustering Binary Data in the Presence of Masking Variables
ERIC Educational Resources Information Center
Brusco, Michael J.
2004-01-01
A number of important applications require the clustering of binary data sets. Traditional nonhierarchical cluster analysis techniques, such as the popular K-means algorithm, can often be successfully applied to these data sets. However, the presence of masking variables in a data set can impede the ability of the K-means algorithm to recover the…
Wolf, Antje; Kirschner, Karl N
2013-02-01
With improvements in computer speed and algorithm efficiency, MD simulations are sampling larger amounts of molecular and biomolecular conformations. Being able to qualitatively and quantitatively sift these conformations into meaningful groups is a difficult and important task, especially when considering the structure-activity paradigm. Here we present a study that combines two popular techniques, principal component (PC) analysis and clustering, for revealing major conformational changes that occur in molecular dynamics (MD) simulations. Specifically, we explored how clustering different PC subspaces effects the resulting clusters versus clustering the complete trajectory data. As a case example, we used the trajectory data from an explicitly solvated simulation of a bacteria's L11·23S ribosomal subdomain, which is a target of thiopeptide antibiotics. Clustering was performed, using K-means and average-linkage algorithms, on data involving the first two to the first five PC subspace dimensions. For the average-linkage algorithm we found that data-point membership, cluster shape, and cluster size depended on the selected PC subspace data. In contrast, K-means provided very consistent results regardless of the selected subspace. Since we present results on a single model system, generalization concerning the clustering of different PC subspaces of other molecular systems is currently premature. However, our hope is that this study illustrates a) the complexities in selecting the appropriate clustering algorithm, b) the complexities in interpreting and validating their results, and c) by combining PC analysis with subsequent clustering valuable dynamic and conformational information can be obtained.
A Multilevel Gamma-Clustering Layout Algorithm for Visualization of Biological Networks
Hruz, Tomas; Lucas, Christoph; Laule, Oliver; Zimmermann, Philip
2013-01-01
Visualization of large complex networks has become an indispensable part of systems biology, where organisms need to be considered as one complex system. The visualization of the corresponding network is challenging due to the size and density of edges. In many cases, the use of standard visualization algorithms can lead to high running times and poorly readable visualizations due to many edge crossings. We suggest an approach that analyzes the structure of the graph first and then generates a new graph which contains specific semantic symbols for regular substructures like dense clusters. We propose a multilevel gamma-clustering layout visualization algorithm (MLGA) which proceeds in three subsequent steps: (i) a multilevel γ-clustering is used to identify the structure of the underlying network, (ii) the network is transformed to a tree, and (iii) finally, the resulting tree which shows the network structure is drawn using a variation of a force-directed algorithm. The algorithm has a potential to visualize very large networks because it uses modern clustering heuristics which are optimized for large graphs. Moreover, most of the edges are removed from the visual representation which allows keeping the overview over complex graphs with dense subgraphs. PMID:23864855
A Fast Implementation of the ISODATA Clustering Algorithm
NASA Technical Reports Server (NTRS)
Memarsadeghi, Nargess; Mount, David M.; Netanyahu, Nathan S.; LeMoigne, Jacqueline
2005-01-01
Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to ISODATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both theoretical and empirical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic data and remotely sensed Landsat and MODIS images that show that our approach has significantly lower running times.
A Fast Implementation of the Isodata Clustering Algorithm
NASA Technical Reports Server (NTRS)
Memarsadeghi, Nargess; Le Moigne, Jacqueline; Mount, David M.; Netanyahu, Nathan S.
2007-01-01
Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to IsoDATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both theoretical and empirical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic data and remotely sensed Landsat and MODIS images that show that our approach has significantly lower running times.
Unsupervised classification of multivariate geostatistical data: Two algorithms
NASA Astrophysics Data System (ADS)
Romary, Thomas; Ors, Fabien; Rivoirard, Jacques; Deraisme, Jacques
2015-12-01
With the increasing development of remote sensing platforms and the evolution of sampling facilities in mining and oil industry, spatial datasets are becoming increasingly large, inform a growing number of variables and cover wider and wider areas. Therefore, it is often necessary to split the domain of study to account for radically different behaviors of the natural phenomenon over the domain and to simplify the subsequent modeling step. The definition of these areas can be seen as a problem of unsupervised classification, or clustering, where we try to divide the domain into homogeneous domains with respect to the values taken by the variables in hand. The application of classical clustering methods, designed for independent observations, does not ensure the spatial coherence of the resulting classes. Image segmentation methods, based on e.g. Markov random fields, are not adapted to irregularly sampled data. Other existing approaches, based on mixtures of Gaussian random functions estimated via the expectation-maximization algorithm, are limited to reasonable sample sizes and a small number of variables. In this work, we propose two algorithms based on adaptations of classical algorithms to multivariate geostatistical data. Both algorithms are model free and can handle large volumes of multivariate, irregularly spaced data. The first one proceeds by agglomerative hierarchical clustering. The spatial coherence is ensured by a proximity condition imposed for two clusters to merge. This proximity condition relies on a graph organizing the data in the coordinates space. The hierarchical algorithm can then be seen as a graph-partitioning algorithm. Following this interpretation, a spatial version of the spectral clustering algorithm is also proposed. The performances of both algorithms are assessed on toy examples and a mining dataset.
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-01-01
Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il PMID:18586742
Competitive learning with pairwise constraints.
Covões, Thiago F; Hruschka, Eduardo R; Ghosh, Joydeep
2013-01-01
Constrained clustering has been an active research topic since the last decade. Most studies focus on batch-mode algorithms. This brief introduces two algorithms for on-line constrained learning, named on-line linear constrained vector quantization error (O-LCVQE) and constrained rival penalized competitive learning (C-RPCL). The former is a variant of the LCVQE algorithm for on-line settings, whereas the latter is an adaptation of the (on-line) RPCL algorithm to deal with constrained clustering. The accuracy results--in terms of the normalized mutual information (NMI)--from experiments with nine datasets show that the partitions induced by O-LCVQE are competitive with those found by the (batch-mode) LCVQE. Compared with this formidable baseline algorithm, it is surprising that C-RPCL can provide better partitions (in terms of the NMI) for most of the datasets. Also, experiments on a large dataset show that on-line algorithms for constrained clustering can significantly reduce the computational time.
Color image analysis technique for measuring of fat in meat: an application for the meat industry
NASA Astrophysics Data System (ADS)
Ballerini, Lucia; Hogberg, Anders; Lundstrom, Kerstin; Borgefors, Gunilla
2001-04-01
Intramuscular fat content in meat influences some important meat quality characteristics. The aim of the present study was to develop and apply image processing techniques to quantify intramuscular fat content in beefs together with the visual appearance of fat in meat (marbling). Color images of M. longissimus dorsi meat samples with a variability of intramuscular fat content and marbling were captured. Image analysis software was specially developed for the interpretation of these images. In particular, a segmentation algorithm (i.e. classification of different substances: fat, muscle and connective tissue) was optimized in order to obtain a proper classification and perform subsequent analysis. Segmentation of muscle from fat was achieved based on their characteristics in the 3D color space, and on the intrinsic fuzzy nature of these structures. The method is fully automatic and it combines a fuzzy clustering algorithm, the Fuzzy c-Means Algorithm, with a Genetic Algorithm. The percentages of various colors (i.e. substances) within the sample are then determined; the number, size distribution, and spatial distributions of the extracted fat flecks are measured. Measurements are correlated with chemical and sensory properties. Results so far show that advanced image analysis is useful for quantify the visual appearance of meat.