Cloud classification from satellite data using a fuzzy sets algorithm: A polar example
NASA Technical Reports Server (NTRS)
Key, J. R.; Maslanik, J. A.; Barry, R. G.
1988-01-01
Where spatial boundaries between phenomena are diffuse, classification methods which construct mutually exclusive clusters seem inappropriate. The Fuzzy c-means (FCM) algorithm assigns each observation to all clusters, with membership values as a function of distance to the cluster center. The FCM algorithm is applied to AVHRR data for the purpose of classifying polar clouds and surfaces. Careful analysis of the fuzzy sets can provide information on which spectral channels are best suited to the classification of particular features, and can help determine likely areas of misclassification. General agreement in the resulting classes and cloud fraction was found between the FCM algorithm, a manual classification, and an unsupervised maximum likelihood classifier.
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms.
Chen, Chunlei; He, Li; Zhang, Huixiang; Zheng, Hao; Wang, Lei
2017-01-01
Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering raise high demand on computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, the incremental clustering algorithm is facing a dilemma between clustering accuracy and parallelism when they are powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering like evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experiment results verified theoretical conclusions.
On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms
He, Li; Zheng, Hao; Wang, Lei
2017-01-01
Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering raise high demand on computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, the incremental clustering algorithm is facing a dilemma between clustering accuracy and parallelism when they are powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering like evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experiment results verified theoretical conclusions. PMID:29123546
Anandakrishnan, Ramu; Onufriev, Alexey
2008-03-01
In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over accessible microstates of the system. In general, these calculations are computationally intractable since they involve summations over an exponentially large number of microstates. Clustering algorithms are one of the methods used to numerically approximate these sums. The most basic clustering algorithms first sub-divide the system into a set of smaller subsets (clusters). Then, interactions between particles within each cluster are treated exactly, while all interactions between different clusters are ignored. These smaller clusters have far fewer microstates, making the summation over these microstates, tractable. These algorithms have been previously used for biomolecular computations, but remain relatively unexplored in this context. Presented here, is a theoretical analysis of the error and computational complexity for the two most basic clustering algorithms that were previously applied in the context of biomolecular electrostatics. We derive a tight, computationally inexpensive, error bound for the equilibrium state of a particle computed via these clustering algorithms. For some practical applications, it is the root mean square error, which can be significantly lower than the error bound, that may be more important. We how that there is a strong empirical relationship between error bound and root mean square error, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms for practical applications. An example of error analysis for such an application-computation of average charge of ionizable amino-acids in proteins-is given, demonstrating that the clustering algorithm can be accurate enough for practical purposes.
Spatial cluster detection using dynamic programming.
Sverchkov, Yuriy; Jiang, Xia; Cooper, Gregory F
2012-03-25
The task of spatial cluster detection involves finding spatial regions where some property deviates from the norm or the expected value. In a probabilistic setting this task can be expressed as finding a region where some event is significantly more likely than usual. Spatial cluster detection is of interest in fields such as biosurveillance, mining of astronomical data, military surveillance, and analysis of fMRI images. In almost all such applications we are interested both in the question of whether a cluster exists in the data, and if it exists, we are interested in finding the most accurate characterization of the cluster. We present a general dynamic programming algorithm for grid-based spatial cluster detection. The algorithm can be used for both Bayesian maximum a-posteriori (MAP) estimation of the most likely spatial distribution of clusters and Bayesian model averaging over a large space of spatial cluster distributions to compute the posterior probability of an unusual spatial clustering. The algorithm is explained and evaluated in the context of a biosurveillance application, specifically the detection and identification of Influenza outbreaks based on emergency department visits. A relatively simple underlying model is constructed for the purpose of evaluating the algorithm, and the algorithm is evaluated using the model and semi-synthetic test data. When compared to baseline methods, tests indicate that the new algorithm can improve MAP estimates under certain conditions: the greedy algorithm we compared our method to was found to be more sensitive to smaller outbreaks, while as the size of the outbreaks increases, in terms of area affected and proportion of individuals affected, our method overtakes the greedy algorithm in spatial precision and recall. The new algorithm performs on-par with baseline methods in the task of Bayesian model averaging. We conclude that the dynamic programming algorithm performs on-par with other available methods for spatial cluster detection and point to its low computational cost and extendability as advantages in favor of further research and use of the algorithm.
Spatial cluster detection using dynamic programming
2012-01-01
Background The task of spatial cluster detection involves finding spatial regions where some property deviates from the norm or the expected value. In a probabilistic setting this task can be expressed as finding a region where some event is significantly more likely than usual. Spatial cluster detection is of interest in fields such as biosurveillance, mining of astronomical data, military surveillance, and analysis of fMRI images. In almost all such applications we are interested both in the question of whether a cluster exists in the data, and if it exists, we are interested in finding the most accurate characterization of the cluster. Methods We present a general dynamic programming algorithm for grid-based spatial cluster detection. The algorithm can be used for both Bayesian maximum a-posteriori (MAP) estimation of the most likely spatial distribution of clusters and Bayesian model averaging over a large space of spatial cluster distributions to compute the posterior probability of an unusual spatial clustering. The algorithm is explained and evaluated in the context of a biosurveillance application, specifically the detection and identification of Influenza outbreaks based on emergency department visits. A relatively simple underlying model is constructed for the purpose of evaluating the algorithm, and the algorithm is evaluated using the model and semi-synthetic test data. Results When compared to baseline methods, tests indicate that the new algorithm can improve MAP estimates under certain conditions: the greedy algorithm we compared our method to was found to be more sensitive to smaller outbreaks, while as the size of the outbreaks increases, in terms of area affected and proportion of individuals affected, our method overtakes the greedy algorithm in spatial precision and recall. The new algorithm performs on-par with baseline methods in the task of Bayesian model averaging. Conclusions We conclude that the dynamic programming algorithm performs on-par with other available methods for spatial cluster detection and point to its low computational cost and extendability as advantages in favor of further research and use of the algorithm. PMID:22443103
Purpose-Driven Communities in Multiplex Networks: Thresholding User-Engaged Layer Aggregation
2016-06-01
dark networks is a non-trivial yet useful task. Because terrorists work hard to hide their relationships/network, analysts have an incomplete picture...them identify meaningful terrorist communities. This thesis introduces a general-purpose algorithm for community detection in multiplex dark networks...aggregation, dark networks, conductance, cluster adequacy, mod- ularity, Louvain method, shortest path interdiction 15. NUMBER OF PAGES 155 16. PRICE CODE
Noise-enhanced clustering and competitive learning algorithms.
Osoba, Osonde; Kosko, Bart
2013-01-01
Noise can provably speed up convergence in many centroid-based clustering algorithms. This includes the popular k-means clustering algorithm. The clustering noise benefit follows from the general noise benefit for the expectation-maximization algorithm because many clustering algorithms are special cases of the expectation-maximization algorithm. Simulations show that noise also speeds up convergence in stochastic unsupervised competitive learning, supervised competitive learning, and differential competitive learning. Copyright © 2012 Elsevier Ltd. All rights reserved.
Two generalizations of Kohonen clustering
NASA Technical Reports Server (NTRS)
Bezdek, James C.; Pal, Nikhil R.; Tsao, Eric C. K.
1993-01-01
The relationship between the sequential hard c-means (SHCM), learning vector quantization (LVQ), and fuzzy c-means (FCM) clustering algorithms is discussed. LVQ and SHCM suffer from several major problems. For example, they depend heavily on initialization. If the initial values of the cluster centers are outside the convex hull of the input data, such algorithms, even if they terminate, may not produce meaningful results in terms of prototypes for cluster representation. This is due in part to the fact that they update only the winning prototype for every input vector. The impact and interaction of these two families with Kohonen's self-organizing feature mapping (SOFM), which is not a clustering method, but which often leads ideas to clustering algorithms is discussed. Then two generalizations of LVQ that are explicitly designed as clustering algorithms are presented; these algorithms are referred to as generalized LVQ = GLVQ; and fuzzy LVQ = FLVQ. Learning rules are derived to optimize an objective function whose goal is to produce 'good clusters'. GLVQ/FLVQ (may) update every node in the clustering net for each input vector. Neither GLVQ nor FLVQ depends upon a choice for the update neighborhood or learning rate distribution - these are taken care of automatically. Segmentation of a gray tone image is used as a typical application of these algorithms to illustrate the performance of GLVQ/FLVQ.
Automatic detection of erythemato-squamous diseases using k-means clustering.
Ubeyli, Elif Derya; Doğdu, Erdoğan
2010-04-01
A new approach based on the implementation of k-means clustering is presented for automated detection of erythemato-squamous diseases. The purpose of clustering techniques is to find a structure for the given data by finding similarities between data according to data characteristics. The studied domain contained records of patients with known diagnosis. The k-means clustering algorithm's task was to classify the data points, in this case the patients with attribute data, to one of the five clusters. The algorithm was used to detect the five erythemato-squamous diseases when 33 features defining five disease indications were used. The purpose is to determine an optimum classification scheme for this problem. The present research demonstrated that the features well represent the erythemato-squamous diseases and the k-means clustering algorithm's task achieved high classification accuracies for only five erythemato-squamous diseases.
Clustering for Binary Data Sets by Using Genetic Algorithm-Incremental K-means
NASA Astrophysics Data System (ADS)
Saharan, S.; Baragona, R.; Nor, M. E.; Salleh, R. M.; Asrah, N. M.
2018-04-01
This research was initially driven by the lack of clustering algorithms that specifically focus in binary data. To overcome this gap in knowledge, a promising technique for analysing this type of data became the main subject in this research, namely Genetic Algorithms (GA). For the purpose of this research, GA was combined with the Incremental K-means (IKM) algorithm to cluster the binary data streams. In GAIKM, the objective function was based on a few sufficient statistics that may be easily and quickly calculated on binary numbers. The implementation of IKM will give an advantage in terms of fast convergence. The results show that GAIKM is an efficient and effective new clustering algorithm compared to the clustering algorithms and to the IKM itself. In conclusion, the GAIKM outperformed other clustering algorithms such as GCUK, IKM, Scalable K-means (SKM) and K-means clustering and paves the way for future research involving missing data and outliers.
Generalized fuzzy C-means clustering algorithm with improved fuzzy partitions.
Zhu, Lin; Chung, Fu-Lai; Wang, Shitong
2009-06-01
The fuzziness index m has important influence on the clustering result of fuzzy clustering algorithms, and it should not be forced to fix at the usual value m = 2. In view of its distinctive features in applications and its limitation in having m = 2 only, a recent advance of fuzzy clustering called fuzzy c-means clustering with improved fuzzy partitions (IFP-FCM) is extended in this paper, and a generalized algorithm called GIFP-FCM for more effective clustering is proposed. By introducing a novel membership constraint function, a new objective function is constructed, and furthermore, GIFP-FCM clustering is derived. Meanwhile, from the viewpoints of L(p) norm distance measure and competitive learning, the robustness and convergence of the proposed algorithm are analyzed. Furthermore, the classical fuzzy c-means algorithm (FCM) and IFP-FCM can be taken as two special cases of the proposed algorithm. Several experimental results including its application to noisy image texture segmentation are presented to demonstrate its average advantage over FCM and IFP-FCM in both clustering and robustness capabilities.
NASA Astrophysics Data System (ADS)
Konno, Yohko; Suzuki, Keiji
This paper describes an approach to development of a solution algorithm of a general-purpose for large scale problems using “Local Clustering Organization (LCO)” as a new solution for Job-shop scheduling problem (JSP). Using a performance effective large scale scheduling in the study of usual LCO, a solving JSP keep stability induced better solution is examined. In this study for an improvement of a performance of a solution for JSP, processes to a optimization by LCO is examined, and a scheduling solution-structure is extended to a new solution-structure based on machine-division. A solving method introduced into effective local clustering for the solution-structure is proposed as an extended LCO. An extended LCO has an algorithm which improves scheduling evaluation efficiently by clustering of parallel search which extends over plural machines. A result verified by an application of extended LCO on various scale of problems proved to conduce to minimizing make-span and improving on the stable performance.
NASA Astrophysics Data System (ADS)
Karimi, Hamed; Rosenberg, Gili; Katzgraber, Helmut G.
2017-10-01
We present and apply a general-purpose, multistart algorithm for improving the performance of low-energy samplers used for solving optimization problems. The algorithm iteratively fixes the value of a large portion of the variables to values that have a high probability of being optimal. The resulting problems are smaller and less connected, and samplers tend to give better low-energy samples for these problems. The algorithm is trivially parallelizable since each start in the multistart algorithm is independent, and could be applied to any heuristic solver that can be run multiple times to give a sample. We present results for several classes of hard problems solved using simulated annealing, path-integral quantum Monte Carlo, parallel tempering with isoenergetic cluster moves, and a quantum annealer, and show that the success metrics and the scaling are improved substantially. When combined with this algorithm, the quantum annealer's scaling was substantially improved for native Chimera graph problems. In addition, with this algorithm the scaling of the time to solution of the quantum annealer is comparable to the Hamze-de Freitas-Selby algorithm on the weak-strong cluster problems introduced by Boixo et al. Parallel tempering with isoenergetic cluster moves was able to consistently solve three-dimensional spin glass problems with 8000 variables when combined with our method, whereas without our method it could not solve any.
Implementation of MPEG-2 encoder to multiprocessor system using multiple MVPs (TMS320C80)
NASA Astrophysics Data System (ADS)
Kim, HyungSun; Boo, Kenny; Chung, SeokWoo; Choi, Geon Y.; Lee, YongJin; Jeon, JaeHo; Park, Hyun Wook
1997-05-01
This paper presents the efficient algorithm mapping for the real-time MPEG-2 encoding on the KAIST image computing system (KICS), which has a parallel architecture using five multimedia video processors (MVPs). The MVP is a general purpose digital signal processor (DSP) of Texas Instrument. It combines one floating-point processor and four fixed- point DSPs on a single chip. The KICS uses the MVP as a primary processing element (PE). Two PEs form a cluster, and there are two processing clusters in the KICS. Real-time MPEG-2 encoder is implemented through the spatial and the functional partitioning strategies. Encoding process of spatially partitioned half of the video input frame is assigned to ne processing cluster. Two PEs perform the functionally partitioned MPEG-2 encoding tasks in the pipelined operation mode. One PE of a cluster carries out the transform coding part and the other performs the predictive coding part of the MPEG-2 encoding algorithm. One MVP among five MVPs is used for system control and interface with host computer. This paper introduces an implementation of the MPEG-2 algorithm with a parallel processing architecture.
Sensitivity evaluation of dynamic speckle activity measurements using clustering methods.
Etchepareborda, Pablo; Federico, Alejandro; Kaufmann, Guillermo H
2010-07-01
We evaluate and compare the use of competitive neural networks, self-organizing maps, the expectation-maximization algorithm, K-means, and fuzzy C-means techniques as partitional clustering methods, when the sensitivity of the activity measurement of dynamic speckle images needs to be improved. The temporal history of the acquired intensity generated by each pixel is analyzed in a wavelet decomposition framework, and it is shown that the mean energy of its corresponding wavelet coefficients provides a suited feature space for clustering purposes. The sensitivity obtained by using the evaluated clustering techniques is also compared with the well-known methods of Konishi-Fujii, weighted generalized differences, and wavelet entropy. The performance of the partitional clustering approach is evaluated using simulated dynamic speckle patterns and also experimental data.
Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks.
Wang, Chenguang; Song, Yangqiu; El-Kishky, Ahmed; Roth, Dan; Zhang, Ming; Han, Jiawei
2015-08-01
One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider the framework to use the world knowledge as indirect supervision. World knowledge is general-purpose knowledge, which is not designed for any specific domain. Then the key challenges are how to adapt the world knowledge to domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and represent the data with world knowledge as a heterogeneous information network. Then we propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base automatically extracted from Wikipedia and maps knowledge to the linguistic knowledge base, Word-Net. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform the state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features.
Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks
Wang, Chenguang; Song, Yangqiu; El-Kishky, Ahmed; Roth, Dan; Zhang, Ming; Han, Jiawei
2015-01-01
One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider the framework to use the world knowledge as indirect supervision. World knowledge is general-purpose knowledge, which is not designed for any specific domain. Then the key challenges are how to adapt the world knowledge to domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and represent the data with world knowledge as a heterogeneous information network. Then we propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the experiments, we use two existing knowledge bases as our sources of world knowledge. One is Freebase, which is collaboratively collected knowledge about entities and their organizations. The other is YAGO2, a knowledge base automatically extracted from Wikipedia and maps knowledge to the linguistic knowledge base, Word-Net. Experimental results on two text benchmark datasets (20newsgroups and RCV1) show that incorporating world knowledge as indirect supervision can significantly outperform the state-of-the-art clustering algorithms as well as clustering algorithms enhanced with world knowledge features. PMID:26705504
Hardware accelerator design for change detection in smart camera
NASA Astrophysics Data System (ADS)
Singh, Sanjay; Dunga, Srinivasa Murali; Saini, Ravi; Mandal, A. S.; Shekhar, Chandra; Chaudhury, Santanu; Vohra, Anil
2011-10-01
Smart Cameras are important components in Human Computer Interaction. In any remote surveillance scenario, smart cameras have to take intelligent decisions to select frames of significant changes to minimize communication and processing overhead. Among many of the algorithms for change detection, one based on clustering based scheme was proposed for smart camera systems. However, such an algorithm could achieve low frame rate far from real-time requirements on a general purpose processors (like PowerPC) available on FPGAs. This paper proposes the hardware accelerator capable of detecting real time changes in a scene, which uses clustering based change detection scheme. The system is designed and simulated using VHDL and implemented on Xilinx XUP Virtex-IIPro FPGA board. Resulted frame rate is 30 frames per second for QVGA resolution in gray scale.
Data depth based clustering analysis
Jeong, Myeong -Hun; Cai, Yaping; Sullivan, Clair J.; ...
2016-01-01
Here, this paper proposes a new algorithm for identifying patterns within data, based on data depth. Such a clustering analysis has an enormous potential to discover previously unknown insights from existing data sets. Many clustering algorithms already exist for this purpose. However, most algorithms are not affine invariant. Therefore, they must operate with different parameters after the data sets are rotated, scaled, or translated. Further, most clustering algorithms, based on Euclidean distance, can be sensitive to noises because they have no global perspective. Parameter selection also significantly affects the clustering results of each algorithm. Unlike many existing clustering algorithms, themore » proposed algorithm, called data depth based clustering analysis (DBCA), is able to detect coherent clusters after the data sets are affine transformed without changing a parameter. It is also robust to noises because using data depth can measure centrality and outlyingness of the underlying data. Further, it can generate relatively stable clusters by varying the parameter. The experimental comparison with the leading state-of-the-art alternatives demonstrates that the proposed algorithm outperforms DBSCAN and HDBSCAN in terms of affine invariance, and exceeds or matches the ro-bustness to noises of DBSCAN or HDBSCAN. The robust-ness to parameter selection is also demonstrated through the case study of clustering twitter data.« less
NASA Astrophysics Data System (ADS)
Thanos, Konstantinos-Georgios; Thomopoulos, Stelios C. A.
2014-06-01
The study in this paper belongs to a more general research of discovering facial sub-clusters in different ethnicity face databases. These new sub-clusters along with other metadata (such as race, sex, etc.) lead to a vector for each face in the database where each vector component represents the likelihood of participation of a given face to each cluster. This vector is then used as a feature vector in a human identification and tracking system based on face and other biometrics. The first stage in this system involves a clustering method which evaluates and compares the clustering results of five different clustering algorithms (average, complete, single hierarchical algorithm, k-means and DIGNET), and selects the best strategy for each data collection. In this paper we present the comparative performance of clustering results of DIGNET and four clustering algorithms (average, complete, single hierarchical and k-means) on fabricated 2D and 3D samples, and on actual face images from various databases, using four different standard metrics. These metrics are the silhouette figure, the mean silhouette coefficient, the Hubert test Γ coefficient, and the classification accuracy for each clustering result. The results showed that, in general, DIGNET gives more trustworthy results than the other algorithms when the metrics values are above a specific acceptance threshold. However when the evaluation results metrics have values lower than the acceptance threshold but not too low (too low corresponds to ambiguous results or false results), then it is necessary for the clustering results to be verified by the other algorithms.
Graphics Processing Unit Assisted Thermographic Compositing
NASA Technical Reports Server (NTRS)
Ragasa, Scott; McDougal, Matthew; Russell, Sam
2012-01-01
Objective: To develop a software application utilizing general purpose graphics processing units (GPUs) for the analysis of large sets of thermographic data. Background: Over the past few years, an increasing effort among scientists and engineers to utilize the GPU in a more general purpose fashion is allowing for supercomputer level results at individual workstations. As data sets grow, the methods to work them grow at an equal, and often great, pace. Certain common computations can take advantage of the massively parallel and optimized hardware constructs of the GPU to allow for throughput that was previously reserved for compute clusters. These common computations have high degrees of data parallelism, that is, they are the same computation applied to a large set of data where the result does not depend on other data elements. Signal (image) processing is one area were GPUs are being used to greatly increase the performance of certain algorithms and analysis techniques. Technical Methodology/Approach: Apply massively parallel algorithms and data structures to the specific analysis requirements presented when working with thermographic data sets.
Implementation of spectral clustering on microarray data of carcinoma using k-means algorithm
NASA Astrophysics Data System (ADS)
Frisca, Bustamam, Alhadi; Siswantining, Titin
2017-03-01
Clustering is one of data analysis methods that aims to classify data which have similar characteristics in the same group. Spectral clustering is one of the most popular modern clustering algorithms. As an effective clustering technique, spectral clustering method emerged from the concepts of spectral graph theory. Spectral clustering method needs partitioning algorithm. There are some partitioning methods including PAM, SOM, Fuzzy c-means, and k-means. Based on the research that has been done by Capital and Choudhury in 2013, when using Euclidian distance k-means algorithm provide better accuracy than PAM algorithm. So in this paper we use k-means as our partition algorithm. The major advantage of spectral clustering is in reducing data dimension, especially in this case to reduce the dimension of large microarray dataset. Microarray data is a small-sized chip made of a glass plate containing thousands and even tens of thousands kinds of genes in the DNA fragments derived from doubling cDNA. Application of microarray data is widely used to detect cancer, for the example is carcinoma, in which cancer cells express the abnormalities in his genes. The purpose of this research is to classify the data that have high similarity in the same group and the data that have low similarity in the others. In this research, Carcinoma microarray data using 7457 genes. The result of partitioning using k-means algorithm is two clusters.
Data mining in bioinformatics using Weka.
Frank, Eibe; Hall, Mark; Trigg, Len; Holmes, Geoffrey; Witten, Ian H
2004-10-12
The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection-common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. http://www.cs.waikato.ac.nz/ml/weka.
Polarization transformation as an algorithm for automatic generalization and quality assessment
NASA Astrophysics Data System (ADS)
Qian, Haizhong; Meng, Liqiu
2007-06-01
Since decades it has been a dream of cartographers to computationally mimic the generalization processes in human brains for the derivation of various small-scale target maps or databases from a large-scale source map or database. This paper addresses in a systematic way the polarization transformation (PT) - a new algorithm that serves both the purpose of automatic generalization of discrete features and the quality assurance. By means of PT, two dimensional point clusters or line networks in the Cartesian system can be transformed into a polar coordinate system, which then can be unfolded as a single spectrum line r = f(α), where r and a stand for the polar radius and the polar angle respectively. After the transformation, the original features will correspond to nodes on the spectrum line delimited between 0° and 360° along the horizontal axis, and between the minimum and maximum polar radius along the vertical axis. Since PT is a lossless transformation, it allows a straighforward analysis and comparison of the original and generalized distributions, thus automatic generalization and quality assurance can be down in this way. Examples illustrate that PT algorithm meets with the requirement of generalization of discrete spatial features and is more scientific.
Weighted graph cuts without eigenvectors a multilevel approach.
Dhillon, Inderjit S; Guan, Yuqiang; Kulis, Brian
2007-11-01
A variety of clustering algorithms have recently been proposed to handle data that is not linearly separable; spectral clustering and kernel k-means are two of the main methods. In this paper, we discuss an equivalence between the objective functions used in these seemingly different methods--in particular, a general weighted kernel k-means objective is mathematically equivalent to a weighted graph clustering objective. We exploit this equivalence to develop a fast, high-quality multilevel algorithm that directly optimizes various weighted graph clustering objectives, such as the popular ratio cut, normalized cut, and ratio association criteria. This eliminates the need for any eigenvector computation for graph clustering problems, which can be prohibitive for very large graphs. Previous multilevel graph partitioning methods, such as Metis, have suffered from the restriction of equal-sized clusters; our multilevel algorithm removes this restriction by using kernel k-means to optimize weighted graph cuts. Experimental results show that our multilevel algorithm outperforms a state-of-the-art spectral clustering algorithm in terms of speed, memory usage, and quality. We demonstrate that our algorithm is applicable to large-scale clustering tasks such as image segmentation, social network analysis and gene network analysis.
Long-term surface EMG monitoring using K-means clustering and compressive sensing
NASA Astrophysics Data System (ADS)
Balouchestani, Mohammadreza; Krishnan, Sridhar
2015-05-01
In this work, we present an advanced K-means clustering algorithm based on Compressed Sensing theory (CS) in combination with the K-Singular Value Decomposition (K-SVD) method for Clustering of long-term recording of surface Electromyography (sEMG) signals. The long-term monitoring of sEMG signals aims at recording of the electrical activity produced by muscles which are very useful procedure for treatment and diagnostic purposes as well as for detection of various pathologies. The proposed algorithm is examined for three scenarios of sEMG signals including healthy person (sEMG-Healthy), a patient with myopathy (sEMG-Myopathy), and a patient with neuropathy (sEMG-Neuropathr), respectively. The proposed algorithm can easily scan large sEMG datasets of long-term sEMG recording. We test the proposed algorithm with Principal Component Analysis (PCA) and Linear Correlation Coefficient (LCC) dimensionality reduction methods. Then, the output of the proposed algorithm is fed to K-Nearest Neighbours (K-NN) and Probabilistic Neural Network (PNN) classifiers in order to calclute the clustering performance. The proposed algorithm achieves a classification accuracy of 99.22%. This ability allows reducing 17% of Average Classification Error (ACE), 9% of Training Error (TE), and 18% of Root Mean Square Error (RMSE). The proposed algorithm also reduces 14% clustering energy consumption compared to the existing K-Means clustering algorithm.
NASA Astrophysics Data System (ADS)
Ghaffarian, Saman; Ghaffarian, Salar
2014-11-01
This paper proposes an improved FastICA model named as Purposive FastICA (PFICA) with initializing by a simple color space transformation and a novel masking approach to automatically detect buildings from high resolution Google Earth imagery. ICA and FastICA algorithms are defined as Blind Source Separation (BSS) techniques for unmixing source signals using the reference data sets. In order to overcome the limitations of the ICA and FastICA algorithms and make them purposeful, we developed a novel method involving three main steps: 1-Improving the FastICA algorithm using Moore-Penrose pseudo inverse matrix model, 2-Automated seeding of the PFICA algorithm based on LUV color space and proposed simple rules to split image into three regions; shadow + vegetation, baresoil + roads and buildings, respectively, 3-Masking out the final building detection results from PFICA outputs utilizing the K-means clustering algorithm with two number of clusters and conducting simple morphological operations to remove noises. Evaluation of the results illustrates that buildings detected from dense and suburban districts with divers characteristics and color combinations using our proposed method have 88.6% and 85.5% overall pixel-based and object-based precision performances, respectively.
Towards a PTAS for the generalized TSP in grid clusters
NASA Astrophysics Data System (ADS)
Khachay, Michael; Neznakhina, Katherine
2016-10-01
The Generalized Traveling Salesman Problem (GTSP) is a combinatorial optimization problem, which is to find a minimum cost cycle visiting one point (city) from each cluster exactly. We consider a geometric case of this problem, where n nodes are given inside the integer grid (in the Euclidean plane), each grid cell is a unit square. Clusters are induced by cells `populated' by nodes of the given instance. Even in this special setting, the GTSP remains intractable enclosing the classic Euclidean TSP on the plane. Recently, it was shown that the problem has (1.5+8√2+ɛ)-approximation algorithm with complexity bound depending on n and k polynomially, where k is the number of clusters. In this paper, we propose two approximation algorithms for the Euclidean GTSP on grid clusters. For any fixed k, both algorithms are PTAS. Time complexity of the first one remains polynomial for k = O(log n) while the second one is a PTAS, when k = n - O(log n).
Balouchestani, Mohammadreza; Krishnan, Sridhar
2014-01-01
Long-term recording of Electrocardiogram (ECG) signals plays an important role in health care systems for diagnostic and treatment purposes of heart diseases. Clustering and classification of collecting data are essential parts for detecting concealed information of P-QRS-T waves in the long-term ECG recording. Currently used algorithms do have their share of drawbacks: 1) clustering and classification cannot be done in real time; 2) they suffer from huge energy consumption and load of sampling. These drawbacks motivated us in developing novel optimized clustering algorithm which could easily scan large ECG datasets for establishing low power long-term ECG recording. In this paper, we present an advanced K-means clustering algorithm based on Compressed Sensing (CS) theory as a random sampling procedure. Then, two dimensionality reduction methods: Principal Component Analysis (PCA) and Linear Correlation Coefficient (LCC) followed by sorting the data using the K-Nearest Neighbours (K-NN) and Probabilistic Neural Network (PNN) classifiers are applied to the proposed algorithm. We show our algorithm based on PCA features in combination with K-NN classifier shows better performance than other methods. The proposed algorithm outperforms existing algorithms by increasing 11% classification accuracy. In addition, the proposed algorithm illustrates classification accuracy for K-NN and PNN classifiers, and a Receiver Operating Characteristics (ROC) area of 99.98%, 99.83%, and 99.75% respectively.
Hebbian self-organizing integrate-and-fire networks for data clustering.
Landis, Florian; Ott, Thomas; Stoop, Ruedi
2010-01-01
We propose a Hebbian learning-based data clustering algorithm using spiking neurons. The algorithm is capable of distinguishing between clusters and noisy background data and finds an arbitrary number of clusters of arbitrary shape. These properties render the approach particularly useful for visual scene segmentation into arbitrarily shaped homogeneous regions. We present several application examples, and in order to highlight the advantages and the weaknesses of our method, we systematically compare the results with those from standard methods such as the k-means and Ward's linkage clustering. The analysis demonstrates that not only the clustering ability of the proposed algorithm is more powerful than those of the two concurrent methods, the time complexity of the method is also more modest than that of its generally used strongest competitor.
Lukashin, A V; Fuchs, R
2001-05-01
Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm guarantees to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search of the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. The employment of this statistically rigorous test has shown that our algorithm places greater than 90% genes into correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.
Multi-Threaded Algorithms for GPGPU in the ATLAS High Level Trigger
NASA Astrophysics Data System (ADS)
Conde Muíño, P.; ATLAS Collaboration
2017-10-01
General purpose Graphics Processor Units (GPGPU) are being evaluated for possible future inclusion in an upgraded ATLAS High Level Trigger farm. We have developed a demonstrator including GPGPU implementations of Inner Detector and Muon tracking and Calorimeter clustering within the ATLAS software framework. ATLAS is a general purpose particle physics experiment located on the LHC collider at CERN. The ATLAS Trigger system consists of two levels, with Level-1 implemented in hardware and the High Level Trigger implemented in software running on a farm of commodity CPU. The High Level Trigger reduces the trigger rate from the 100 kHz Level-1 acceptance rate to 1.5 kHz for recording, requiring an average per-event processing time of ∼ 250 ms for this task. The selection in the high level trigger is based on reconstructing tracks in the Inner Detector and Muon Spectrometer and clusters of energy deposited in the Calorimeter. Performing this reconstruction within the available farm resources presents a significant challenge that will increase significantly with future LHC upgrades. During the LHC data taking period starting in 2021, luminosity will reach up to three times the original design value. Luminosity will increase further to 7.5 times the design value in 2026 following LHC and ATLAS upgrades. Corresponding improvements in the speed of the reconstruction code will be needed to provide the required trigger selection power within affordable computing resources. Key factors determining the potential benefit of including GPGPU as part of the HLT processor farm are: the relative speed of the CPU and GPGPU algorithm implementations; the relative execution times of the GPGPU algorithms and serial code remaining on the CPU; the number of GPGPU required, and the relative financial cost of the selected GPGPU. We give a brief overview of the algorithms implemented and present new measurements that compare the performance of various configurations exploiting GPGPU cards.
NASA Astrophysics Data System (ADS)
Franke, R.
2016-11-01
In many networks discovered in biology, medicine, neuroscience and other disciplines special properties like a certain degree distribution and hierarchical cluster structure (also called communities) can be observed as general organizing principles. Detecting the cluster structure of an unknown network promises to identify functional subdivisions, hierarchy and interactions on a mesoscale. It is not trivial choosing an appropriate detection algorithm because there are multiple network, cluster and algorithmic properties to be considered. Edges can be weighted and/or directed, clusters overlap or build a hierarchy in several ways. Algorithms differ not only in runtime, memory requirements but also in allowed network and cluster properties. They are based on a specific definition of what a cluster is, too. On the one hand, a comprehensive network creation model is needed to build a large variety of benchmark networks with different reasonable structures to compare algorithms. On the other hand, if a cluster structure is already known, it is desirable to separate effects of this structure from other network properties. This can be done with null model networks that mimic an observed cluster structure to improve statistics on other network features. A third important application is the general study of properties in networks with different cluster structures, possibly evolving over time. Currently there are good benchmark and creation models available. But what is left is a precise sandbox model to build hierarchical, overlapping and directed clusters for undirected or directed, binary or weighted complex random networks on basis of a sophisticated blueprint. This gap shall be closed by the model CHIMERA (Cluster Hierarchy Interconnection Model for Evaluation, Research and Analysis) which will be introduced and described here for the first time.
Stream Clustering of Growing Objects
NASA Astrophysics Data System (ADS)
Siddiqui, Zaigham Faraz; Spiliopoulou, Myra
We study incremental clustering of objects that grow and accumulate over time. The objects come from a multi-table stream e.g. streams of
Kamali, Tahereh; Stashuk, Daniel
2016-10-01
Robust and accurate segmentation of brain white matter (WM) fiber bundles assists in diagnosing and assessing progression or remission of neuropsychiatric diseases such as schizophrenia, autism and depression. Supervised segmentation methods are infeasible in most applications since generating gold standards is too costly. Hence, there is a growing interest in designing unsupervised methods. However, most conventional unsupervised methods require the number of clusters be known in advance which is not possible in most applications. The purpose of this study is to design an unsupervised segmentation algorithm for brain white matter fiber bundles which can automatically segment fiber bundles using intrinsic diffusion tensor imaging data information without considering any prior information or assumption about data distributions. Here, a new density based clustering algorithm called neighborhood distance entropy consistency (NDEC), is proposed which discovers natural clusters within data by simultaneously utilizing both local and global density information. The performance of NDEC is compared with other state of the art clustering algorithms including chameleon, spectral clustering, DBSCAN and k-means using Johns Hopkins University publicly available diffusion tensor imaging data. The performance of NDEC and other employed clustering algorithms were evaluated using dice ratio as an external evaluation criteria and density based clustering validation (DBCV) index as an internal evaluation metric. Across all employed clustering algorithms, NDEC obtained the highest average dice ratio (0.94) and DBCV value (0.71). NDEC can find clusters with arbitrary shapes and densities and consequently can be used for WM fiber bundle segmentation where there is no distinct boundary between various bundles. NDEC may also be used as an effective tool in other pattern recognition and medical diagnostic systems in which discovering natural clusters within data is a necessity. Copyright © 2016 Elsevier B.V. All rights reserved.
Collective translational and rotational Monte Carlo cluster move for general pairwise interaction
NASA Astrophysics Data System (ADS)
Růžička, Štěpán; Allen, Michael P.
2014-09-01
Virtual move Monte Carlo is a cluster algorithm which was originally developed for strongly attractive colloidal, molecular, or atomistic systems in order to both approximate the collective dynamics and avoid sampling of unphysical kinetic traps. In this paper, we present the algorithm in the form, which selects the moving cluster through a wider class of virtual states and which is applicable to general pairwise interactions, including hard-core repulsion. The newly proposed way of selecting the cluster increases the acceptance probability by up to several orders of magnitude, especially for rotational moves. The results have their applications in simulations of systems interacting via anisotropic potentials both to enhance the sampling of the phase space and to approximate the dynamics.
BioImageXD: an open, general-purpose and high-throughput image-processing platform.
Kankaanpää, Pasi; Paavolainen, Lassi; Tiitta, Silja; Karjalainen, Mikko; Päivärinne, Joacim; Nieminen, Jonna; Marjomäki, Varpu; Heino, Jyrki; White, Daniel J
2012-06-28
BioImageXD puts open-source computer science tools for three-dimensional visualization and analysis into the hands of all researchers, through a user-friendly graphical interface tuned to the needs of biologists. BioImageXD has no restrictive licenses or undisclosed algorithms and enables publication of precise, reproducible and modifiable workflows. It allows simple construction of processing pipelines and should enable biologists to perform challenging analyses of complex processes. We demonstrate its performance in a study of integrin clustering in response to selected inhibitors.
Electronic and magnetic properties of small rhodium clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Soon, Yee Yeen; Yoon, Tiem Leong; Lim, Thong Leng
2015-04-24
We report a theoretical study of the electronic and magnetic properties of rhodium-atomic clusters. The lowest energy structures at the semi-empirical level of rhodium clusters are first obtained from a novel global-minimum search algorithm, known as PTMBHGA, where Gupta potential is used to describe the atomic interaction among the rhodium atoms. The structures are then re-optimized at the density functional theory (DFT) level with exchange-correlation energy approximated by Perdew-Burke-Ernzerhof generalized gradient approximation. For the purpose of calculating the magnetic moment of a given cluster, we calculate the optimized structure as a function of the spin multiplicity within the DFT framework.more » The resultant magnetic moments with the lowest energies so obtained allow us to work out the magnetic moment as a function of cluster size. Rhodium atomic clusters are found to display a unique variation in the magnetic moment as the cluster size varies. However, Rh{sub 4} and Rh{sub 6} are found to be nonmagnetic. Electronic structures of the magnetic ground-state structures are also investigated within the DFT framework. The results are compared against those based on different theoretical approaches available in the literature.« less
Fuzzy Document Clustering Approach using WordNet Lexical Categories
NASA Astrophysics Data System (ADS)
Gharib, Tarek F.; Fouad, Mohammed M.; Aref, Mostafa M.
Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. This area is growing rapidly mainly because of the strong need for analysing the huge and large amount of textual data that reside on internal file systems and the Web. Text document clustering provides an effective navigation mechanism to organize this large amount of data by grouping their documents into a small number of meaningful classes. In this paper we proposed a fuzzy text document clustering approach using WordNet lexical categories and Fuzzy c-Means algorithm. Some experiments are performed to compare efficiency of the proposed approach with the recently reported approaches. Experimental results show that Fuzzy clustering leads to great performance results. Fuzzy c-means algorithm overcomes other classical clustering algorithms like k-means and bisecting k-means in both clustering quality and running time efficiency.
NASA Astrophysics Data System (ADS)
Cui, Jia; Hong, Bei; Jiang, Xuepeng; Chen, Qinghua
2017-05-01
With the purpose of reinforcing correlation analysis of risk assessment threat factors, a dynamic assessment method of safety risks based on particle filtering is proposed, which takes threat analysis as the core. Based on the risk assessment standards, the method selects threat indicates, applies a particle filtering algorithm to calculate influencing weight of threat indications, and confirms information system risk levels by combining with state estimation theory. In order to improve the calculating efficiency of the particle filtering algorithm, the k-means cluster algorithm is introduced to the particle filtering algorithm. By clustering all particles, the author regards centroid as the representative to operate, so as to reduce calculated amount. The empirical experience indicates that the method can embody the relation of mutual dependence and influence in risk elements reasonably. Under the circumstance of limited information, it provides the scientific basis on fabricating a risk management control strategy.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Campbell, W; Miften, M; Jones, B
Purpose: Pancreatic SBRT relies on extremely accurate delivery of ablative radiation doses to the target, and intra-fractional tracking of fiducial markers can facilitate improvements in dose delivery. However, this requires algorithms that are able to find fiducial markers with high speed and accuracy. The purpose of this study was to develop a novel marker tracking algorithm that is robust against many of the common errors seen with traditional template matching techniques. Methods: Using CBCT projection images, a method was developed to create detailed template images of fiducial marker clusters without prior knowledge of the number of markers, their positions, ormore » their orientations. Briefly, the method (i) enhances markers in projection images, (ii) stabilizes the cluster’s position, (iii) reconstructs the cluster in 3D, and (iv) precomputes a set of static template images dependent on gantry angle. Furthermore, breathing data were used to produce 4D reconstructions of clusters, yielding dynamic template images dependent on gantry angle and breathing amplitude. To test these two approaches, static and dynamic templates were used to track the motion of marker clusters in more than 66,000 projection images from 75 CBCT scans of 15 pancreatic SBRT patients. Results: For both static and dynamic templates, the new technique was able to locate marker clusters present in projection images 100% of the time. The algorithm was also able to correctly locate markers in several instances where only some of the markers were visible due to insufficient field-of-view. In cases where clusters exhibited deformation and/or rotation during breathing, dynamic templates resulted in cross-correlation scores up to 70% higher than static templates. Conclusion: Patient-specific templates provided complete tracking of fiducial marker clusters in CBCT scans, and dynamic templates helped to provide higher cross-correlation scores for deforming/rotating clusters. This novel algorithm provides an extremely accurate method to detect fiducial markers during treatment. Research funding provided by Varian Medical Systems to Miften and Jones.« less
NASA Astrophysics Data System (ADS)
Ma, Xiaoke; Wang, Bingbo; Yu, Liang
2018-01-01
Community detection is fundamental for revealing the structure-functionality relationship in complex networks, which involves two issues-the quantitative function for community as well as algorithms to discover communities. Despite significant research on either of them, few attempt has been made to establish the connection between the two issues. To attack this problem, a generalized quantification function is proposed for community in weighted networks, which provides a framework that unifies several well-known measures. Then, we prove that the trace optimization of the proposed measure is equivalent with the objective functions of algorithms such as nonnegative matrix factorization, kernel K-means as well as spectral clustering. It serves as the theoretical foundation for designing algorithms for community detection. On the second issue, a semi-supervised spectral clustering algorithm is developed by exploring the equivalence relation via combining the nonnegative matrix factorization and spectral clustering. Different from the traditional semi-supervised algorithms, the partial supervision is integrated into the objective of the spectral algorithm. Finally, through extensive experiments on both artificial and real world networks, we demonstrate that the proposed method improves the accuracy of the traditional spectral algorithms in community detection.
Handwritten text line segmentation by spectral clustering
NASA Astrophysics Data System (ADS)
Han, Xuecheng; Yao, Hui; Zhong, Guoqiang
2017-02-01
Since handwritten text lines are generally skewed and not obviously separated, text line segmentation of handwritten document images is still a challenging problem. In this paper, we propose a novel text line segmentation algorithm based on the spectral clustering. Given a handwritten document image, we convert it to a binary image first, and then compute the adjacent matrix of the pixel points. We apply spectral clustering on this similarity metric and use the orthogonal kmeans clustering algorithm to group the text lines. Experiments on Chinese handwritten documents database (HIT-MW) demonstrate the effectiveness of the proposed method.
Effective traffic features selection algorithm for cyber-attacks samples
NASA Astrophysics Data System (ADS)
Li, Yihong; Liu, Fangzheng; Du, Zhenyu
2018-05-01
By studying the defense scheme of Network attacks, this paper propose an effective traffic features selection algorithm based on k-means++ clustering to deal with the problem of high dimensionality of traffic features which extracted from cyber-attacks samples. Firstly, this algorithm divide the original feature set into attack traffic feature set and background traffic feature set by the clustering. Then, we calculates the variation of clustering performance after removing a certain feature. Finally, evaluating the degree of distinctiveness of the feature vector according to the result. Among them, the effective feature vector is whose degree of distinctiveness exceeds the set threshold. The purpose of this paper is to select out the effective features from the extracted original feature set. In this way, it can reduce the dimensionality of the features so as to reduce the space-time overhead of subsequent detection. The experimental results show that the proposed algorithm is feasible and it has some advantages over other selection algorithms.
NASA Technical Reports Server (NTRS)
Leutenegger, Scott T.; Horton, Graham
1994-01-01
Recently the Multi-Level algorithm was introduced as a general purpose solver for the solution of steady state Markov chains. In this paper, we consider the performance of the Multi-Level algorithm for solving Nearly Completely Decomposable (NCD) Markov chains, for which special-purpose iteractive aggregation/disaggregation algorithms such as the Koury-McAllister-Stewart (KMS) method have been developed that can exploit the decomposability of the the Markov chain. We present experimental results indicating that the general-purpose Multi-Level algorithm is competitive, and can be significantly faster than the special-purpose KMS algorithm when Gauss-Seidel and Gaussian Elimination are used for solving the individual blocks.
Graphics Processing Unit Assisted Thermographic Compositing
NASA Technical Reports Server (NTRS)
Ragasa, Scott; McDougal, Matthew; Russell, Sam
2013-01-01
Objective: To develop a software application utilizing general purpose graphics processing units (GPUs) for the analysis of large sets of thermographic data. Background: Over the past few years, an increasing effort among scientists and engineers to utilize the GPU in a more general purpose fashion is allowing for supercomputer level results at individual workstations. As data sets grow, the methods to work them grow at an equal, and often greater, pace. Certain common computations can take advantage of the massively parallel and optimized hardware constructs of the GPU to allow for throughput that was previously reserved for compute clusters. These common computations have high degrees of data parallelism, that is, they are the same computation applied to a large set of data where the result does not depend on other data elements. Signal (image) processing is one area were GPUs are being used to greatly increase the performance of certain algorithms and analysis techniques.
An agglomerative hierarchical clustering approach to visualisation in Bayesian clustering problems
Dawson, Kevin J.; Belkhir, Khalid
2009-01-01
Clustering problems (including the clustering of individuals into outcrossing populations, hybrid generations, full-sib families and selfing lines) have recently received much attention in population genetics. In these clustering problems, the parameter of interest is a partition of the set of sampled individuals, - the sample partition. In a fully Bayesian approach to clustering problems of this type, our knowledge about the sample partition is represented by a probability distribution on the space of possible sample partitions. Since the number of possible partitions grows very rapidly with the sample size, we can not visualise this probability distribution in its entirety, unless the sample is very small. As a solution to this visualisation problem, we recommend using an agglomerative hierarchical clustering algorithm, which we call the exact linkage algorithm. This algorithm is a special case of the maximin clustering algorithm that we introduced previously. The exact linkage algorithm is now implemented in our software package Partition View. The exact linkage algorithm takes the posterior co-assignment probabilities as input, and yields as output a rooted binary tree, - or more generally, a forest of such trees. Each node of this forest defines a set of individuals, and the node height is the posterior co-assignment probability of this set. This provides a useful visual representation of the uncertainty associated with the assignment of individuals to categories. It is also a useful starting point for a more detailed exploration of the posterior distribution in terms of the co-assignment probabilities. PMID:19337306
NASA Astrophysics Data System (ADS)
Dayananda, Karanam Ravichandran; Straub, Jeremy
2017-05-01
This paper proposes a new hybrid algorithm for security, which incorporates both distributed and hierarchal approaches. It uses a mobile data collector (MDC) to collect information in order to save energy of sensor nodes in a wireless sensor network (WSN) as, in most networks, these sensor nodes have limited energy. Wireless sensor networks are prone to security problems because, among other things, it is possible to use a rogue sensor node to eavesdrop on or alter the information being transmitted. To prevent this, this paper introduces a security algorithm for MDC-based WSNs. A key use of this algorithm is to protect the confidentiality of the information sent by the sensor nodes. The sensor nodes are deployed in a random fashion and form group structures called clusters. Each cluster has a cluster head. The cluster head collects data from the other nodes using the time-division multiple access protocol. The sensor nodes send their data to the cluster head for transmission to the base station node for further processing. The MDC acts as an intermediate node between the cluster head and base station. The MDC, using its dynamic acyclic graph path, collects the data from the cluster head and sends it to base station. This approach is useful for applications including warfighting, intelligent building and medicine. To assess the proposed system, the paper presents a comparison of its performance with other approaches and algorithms that can be used for similar purposes.
A ground truth based comparative study on clustering of gene expression data.
Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue
2008-05-01
Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.
Parallel Density-Based Clustering for Discovery of Ionospheric Phenomena
NASA Astrophysics Data System (ADS)
Pankratius, V.; Gowanlock, M.; Blair, D. M.
2015-12-01
Ionospheric total electron content maps derived from global networks of dual-frequency GPS receivers can reveal a plethora of ionospheric features in real-time and are key to space weather studies and natural hazard monitoring. However, growing data volumes from expanding sensor networks are making manual exploratory studies challenging. As the community is heading towards Big Data ionospheric science, automation and Computer-Aided Discovery become indispensable tools for scientists. One problem of machine learning methods is that they require domain-specific adaptations in order to be effective and useful for scientists. Addressing this problem, our Computer-Aided Discovery approach allows scientists to express various physical models as well as perturbation ranges for parameters. The search space is explored through an automated system and parallel processing of batched workloads, which finds corresponding matches and similarities in empirical data. We discuss density-based clustering as a particular method we employ in this process. Specifically, we adapt Density-Based Spatial Clustering of Applications with Noise (DBSCAN). This algorithm groups geospatial data points based on density. Clusters of points can be of arbitrary shape, and the number of clusters is not predetermined by the algorithm; only two input parameters need to be specified: (1) a distance threshold, (2) a minimum number of points within that threshold. We discuss an implementation of DBSCAN for batched workloads that is amenable to parallelization on manycore architectures such as Intel's Xeon Phi accelerator with 60+ general-purpose cores. This manycore parallelization can cluster large volumes of ionospheric total electronic content data quickly. Potential applications for cluster detection include the visualization, tracing, and examination of traveling ionospheric disturbances or other propagating phenomena. Acknowledgments. We acknowledge support from NSF ACI-1442997 (PI V. Pankratius).
Fast detection of vascular plaque in optical coherence tomography images using a reduced feature set
NASA Astrophysics Data System (ADS)
Prakash, Ammu; Ocana Macias, Mariano; Hewko, Mark; Sowa, Michael; Sherif, Sherif
2018-03-01
Optical coherence tomography (OCT) images are capable of detecting vascular plaque by using the full set of 26 Haralick textural features and a standard K-means clustering algorithm. However, the use of the full set of 26 textural features is computationally expensive and may not be feasible for real time implementation. In this work, we identified a reduced set of 3 textural feature which characterizes vascular plaque and used a generalized Fuzzy C-means clustering algorithm. Our work involves three steps: 1) the reduction of a full set 26 textural feature to a reduced set of 3 textural features by using genetic algorithm (GA) optimization method 2) the implementation of an unsupervised generalized clustering algorithm (Fuzzy C-means) on the reduced feature space, and 3) the validation of our results using histology and actual photographic images of vascular plaque. Our results show an excellent match with histology and actual photographic images of vascular tissue. Therefore, our results could provide an efficient pre-clinical tool for the detection of vascular plaque in real time OCT imaging.
Optimizing Cluster Heads for Energy Efficiency in Large-Scale Heterogeneous Wireless Sensor Networks
Gu, Yi; Wu, Qishi; Rao, Nageswara S. V.
2010-01-01
Many complex sensor network applications require deploying a large number of inexpensive and small sensors in a vast geographical region to achieve quality through quantity. Hierarchical clustering is generally considered as an efficient and scalable way to facilitate the management and operation of such large-scale networks and minimize the total energy consumption for prolonged lifetime. Judicious selection of cluster heads for data integration and communication is critical to the success of applications based on hierarchical sensor networks organized as layered clusters. We investigate the problem of selecting sensor nodes in a predeployed sensor network to be the cluster heads tomore » minimize the total energy needed for data gathering. We rigorously derive an analytical formula to optimize the number of cluster heads in sensor networks under uniform node distribution, and propose a Distance-based Crowdedness Clustering algorithm to determine the cluster heads in sensor networks under general node distribution. The results from an extensive set of experiments on a large number of simulated sensor networks illustrate the performance superiority of the proposed solution over the clustering schemes based on k -means algorithm.« less
GRAPE-6A: A Single-Card GRAPE-6 for Parallel PC-GRAPE Cluster Systems
NASA Astrophysics Data System (ADS)
Fukushige, Toshiyuki; Makino, Junichiro; Kawai, Atsushi
2005-12-01
In this paper, we describe the design and performance of GRAPE-6A, a special-purpose computer for gravitational many-body simulations. It was designed to be used with a PC cluster, in which each node has one GRAPE-6A. Such a configuration is particularly cost-effective in running parallel tree algorithms. Though the use of parallel tree algorithms was possible with the original GRAPE-6 hardware, it was not very cost-effective since a single GRAPE-6 board was still too fast and too expensive. Therefore, we designed GRAPE-6A as a single PCI card to minimize the reproduction cost and to optimize the computing speed. The peak performance is 130 Gflops for one GRAPE-6A board and 3.1 Tflops for our 24 node cluster. We describe the implementation of the tree, TreePM and individual timestep algorithms on both a single GRAPE-6A system and GRAPE-6A cluster. Using the tree algorithm on our 16-node GRAPE-6A system, we can complete a collisionless simulation with 100 million particles (8000 steps) within 10 days.
Machine-learned cluster identification in high-dimensional data.
Ultsch, Alfred; Lötsch, Jörn
2017-02-01
High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the used cluster algorithm works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogenously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM). Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM the distance structure in the high dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms including single linkage, Ward and k-means. Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. However, ESOM/U-matrix was correct in identifying clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high dimensional biomedical data. The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Hsu, Arthur L; Tang, Sen-Lin; Halgamuge, Saman K
2003-11-01
Current Self-Organizing Maps (SOMs) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches. The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, therefore labelled as uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in data (cancerous and normal). JAVA software of dynamic SOM tree algorithm is available upon request for academic use. A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf
Fokkema, M; Smits, N; Zeileis, A; Hothorn, T; Kelderman, H
2017-10-25
Identification of subgroups of patients for whom treatment A is more effective than treatment B, and vice versa, is of key importance to the development of personalized medicine. Tree-based algorithms are helpful tools for the detection of such interactions, but none of the available algorithms allow for taking into account clustered or nested dataset structures, which are particularly common in psychological research. Therefore, we propose the generalized linear mixed-effects model tree (GLMM tree) algorithm, which allows for the detection of treatment-subgroup interactions, while accounting for the clustered structure of a dataset. The algorithm uses model-based recursive partitioning to detect treatment-subgroup interactions, and a GLMM to estimate the random-effects parameters. In a simulation study, GLMM trees show higher accuracy in recovering treatment-subgroup interactions, higher predictive accuracy, and lower type II error rates than linear-model-based recursive partitioning and mixed-effects regression trees. Also, GLMM trees show somewhat higher predictive accuracy than linear mixed-effects models with pre-specified interaction effects, on average. We illustrate the application of GLMM trees on an individual patient-level data meta-analysis on treatments for depression. We conclude that GLMM trees are a promising exploratory tool for the detection of treatment-subgroup interactions in clustered datasets.
Applications of modern statistical methods to analysis of data in physical science
NASA Astrophysics Data System (ADS)
Wicker, James Eric
Modern methods of statistical and computational analysis offer solutions to dilemmas confronting researchers in physical science. Although the ideas behind modern statistical and computational analysis methods were originally introduced in the 1970's, most scientists still rely on methods written during the early era of computing. These researchers, who analyze increasingly voluminous and multivariate data sets, need modern analysis methods to extract the best results from their studies. The first section of this work showcases applications of modern linear regression. Since the 1960's, many researchers in spectroscopy have used classical stepwise regression techniques to derive molecular constants. However, problems with thresholds of entry and exit for model variables plagues this analysis method. Other criticisms of this kind of stepwise procedure include its inefficient searching method, the order in which variables enter or leave the model and problems with overfitting data. We implement an information scoring technique that overcomes the assumptions inherent in the stepwise regression process to calculate molecular model parameters. We believe that this kind of information based model evaluation can be applied to more general analysis situations in physical science. The second section proposes new methods of multivariate cluster analysis. The K-means algorithm and the EM algorithm, introduced in the 1960's and 1970's respectively, formed the basis of multivariate cluster analysis methodology for many years. However, several shortcomings of these methods include strong dependence on initial seed values and inaccurate results when the data seriously depart from hypersphericity. We propose new cluster analysis methods based on genetic algorithms that overcomes the strong dependence on initial seed values. In addition, we propose a generalization of the Genetic K-means algorithm which can accurately identify clusters with complex hyperellipsoidal covariance structures. We then use this new algorithm in a genetic algorithm based Expectation-Maximization process that can accurately calculate parameters describing complex clusters in a mixture model routine. Using the accuracy of this GEM algorithm, we assign information scores to cluster calculations in order to best identify the number of mixture components in a multivariate data set. We will showcase how these algorithms can be used to process multivariate data from astronomical observations.
Zhang, Zhaoyang; Fang, Hua; Wang, Honggang
2016-06-01
Web-delivered trials are an important component in eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal, high dimensional with missing values. Unsupervised learning methods have been widely applied in this area, however, validating the optimal number of clusters has been challenging. Built upon our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we proposed a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate MI-based Xie and Beni index for fuzzy-clustering are more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services.
Zhang, Zhaoyang; Wang, Honggang
2016-01-01
Web-delivered trials are an important component in eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal, high dimensional with missing values. Unsupervised learning methods have been widely applied in this area, however, validating the optimal number of clusters has been challenging. Built upon our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we proposed a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate MI-based Xie and Beni index for fuzzy-clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services. PMID:27126063
Di Pietro, C; Di Pietro, V; Emmanuele, G; Ferro, A; Maugeri, T; Modica, E; Pigola, G; Pulvirenti, A; Purrello, M; Ragusa, M; Scalia, M; Shasha, D; Travali, S; Zimmitti, V
2003-01-01
In this paper we present a new Multiple Sequence Alignment (MSA) algorithm called AntiClusAl. The method makes use of the commonly use idea of aligning homologous sequences belonging to classes generated by some clustering algorithm, and then continue the alignment process ina bottom-up way along a suitable tree structure. The final result is then read at the root of the tree. Multiple sequence alignment in each cluster makes use of the progressive alignment with the 1-median (center) of the cluster. The 1-median of set S of sequences is the element of S which minimizes the average distance from any other sequence in S. Its exact computation requires quadratic time. The basic idea of our proposed algorithm is to make use of a simple and natural algorithmic technique based on randomized tournaments which has been successfully applied to large size search problems in general metric spaces. In particular a clustering algorithm called Antipole tree and an approximate linear 1-median computation are used. Our algorithm compared with Clustal W, a widely used tool to MSA, shows a better running time results with fully comparable alignment quality. A successful biological application showing high aminoacid conservation during evolution of Xenopus laevis SOD2 is also cited.
Personalized Medicine in Veterans with Traumatic Brain Injuries
2013-05-01
Pair-Group Method using Arithmetic averages ( UPGMA ) based on cosine correlation of row mean centered log2 signal values; this was the top 50%-tile...cluster- ing was performed by the UPGMA method using Cosine correlation as the similarity metric. For comparative purposes, clustered heat maps included...non-mTBI cases were subjected to unsupervised hierarchical clustering analysis using the UPGMA algorithm with cosine correlation as the similarity
`Inter-Arrival Time' Inspired Algorithm and its Application in Clustering and Molecular Phylogeny
NASA Astrophysics Data System (ADS)
Kolekar, Pandurang S.; Kale, Mohan M.; Kulkarni-Kale, Urmila
2010-10-01
Bioinformatics, being multidisciplinary field, involves applications of various methods from allied areas of Science for data mining using computational approaches. Clustering and molecular phylogeny is one of the key areas in Bioinformatics, which help in study of classification and evolution of organisms. Molecular phylogeny algorithms can be divided into distance based and character based methods. But most of these methods are dependent on pre-alignment of sequences and become computationally intensive with increase in size of data and hence demand alternative efficient approaches. `Inter arrival time distribution' (IATD) is a popular concept in the theory of stochastic system modeling but its potential in molecular data analysis has not been fully explored. The present study reports application of IATD in Bioinformatics for clustering and molecular phylogeny. The proposed method provides IATDs of nucleotides in genomic sequences. The distance function based on statistical parameters of IATDs is proposed and distance matrix thus obtained is used for the purpose of clustering and molecular phylogeny. The method is applied on a dataset of 3' non-coding region sequences (NCR) of Dengue virus type 3 (DENV-3), subtype III, reported in 2008. The phylogram thus obtained revealed the geographical distribution of DENV-3 isolates. Sri Lankan DENV-3 isolates were further observed to be clustered in two sub-clades corresponding to pre and post Dengue hemorrhagic fever emergence groups. These results are consistent with those reported earlier, which are obtained using pre-aligned sequence data as an input. These findings encourage applications of the IATD based method in molecular phylogenetic analysis in particular and data mining in general.
Unsupervised tattoo segmentation combining bottom-up and top-down cues
NASA Astrophysics Data System (ADS)
Allen, Josef D.; Zhao, Nan; Yuan, Jiangbo; Liu, Xiuwen
2011-06-01
Tattoo segmentation is challenging due to the complexity and large variance in tattoo structures. We have developed a segmentation algorithm for finding tattoos in an image. Our basic idea is split-merge: split each tattoo image into clusters through a bottom-up process, learn to merge the clusters containing skin and then distinguish tattoo from the other skin via top-down prior in the image itself. Tattoo segmentation with unknown number of clusters is transferred to a figureground segmentation. We have applied our segmentation algorithm on a tattoo dataset and the results have shown that our tattoo segmentation system is efficient and suitable for further tattoo classification and retrieval purpose.
Clustering with Missing Values: No Imputation Required
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri
2004-01-01
Clustering algorithms can identify groups in large data sets, such as star catalogs and hyperspectral images. In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization). Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them. In contrast, we present a method for encoding partially observed features as a set of supplemental soft constraints and introduce the KSC algorithm, which incorporates constraints into the clustering process. In experiments on artificial data and data from the Sloan Digital Sky Survey, we show that soft constraints are an effective way to enable clustering with missing values.
Banerjee, Arindam; Ghosh, Joydeep
2004-05-01
Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered as a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produced high-quality and well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all the three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques. Index Terms-Balanced clustering, expectation maximization (EM), frequency-sensitive competitive learning (FSCL), high-dimensional clustering, kmeans, normalized data, scalable clustering, streaming data, text clustering.
Improvements on a privacy-protection algorithm for DNA sequences with generalization lattices.
Li, Guang; Wang, Yadong; Su, Xiaohong
2012-10-01
When developing personal DNA databases, there must be an appropriate guarantee of anonymity, which means that the data cannot be related back to individuals. DNA lattice anonymization (DNALA) is a successful method for making personal DNA sequences anonymous. However, it uses time-consuming multiple sequence alignment and a low-accuracy greedy clustering algorithm. Furthermore, DNALA is not an online algorithm, and so it cannot quickly return results when the database is updated. This study improves the DNALA method. Specifically, we replaced the multiple sequence alignment in DNALA with global pairwise sequence alignment to save time, and we designed a hybrid clustering algorithm comprised of a maximum weight matching (MWM)-based algorithm and an online algorithm. The MWM-based algorithm is more accurate than the greedy algorithm in DNALA and has the same time complexity. The online algorithm can process data quickly when the database is updated. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
A Fast Implementation of the ISOCLUS Algorithm
NASA Technical Reports Server (NTRS)
Memarsadeghi, Nargess; Mount, David M.; Netanyahu, Nathan S.; LeMoigne, Jacqueline
2003-01-01
Unsupervised clustering is a fundamental tool in numerous image processing and remote sensing applications. For example, unsupervised clustering is often used to obtain vegetation maps of an area of interest. This approach is useful when reliable training data are either scarce or expensive, and when relatively little a priori information about the data is available. Unsupervised clustering methods play a significant role in the pursuit of unsupervised classification. One of the most popular and widely used clustering schemes for remote sensing applications is the ISOCLUS algorithm, which is based on the ISODATA method. The algorithm is given a set of n data points (or samples) in d-dimensional space, an integer k indicating the initial number of clusters, and a number of additional parameters. The general goal is to compute a set of cluster centers in d-space. Although there is no specific optimization criterion, the algorithm is similar in spirit to the well known k-means clustering method in which the objective is to minimize the average squared distance of each point to its nearest center, called the average distortion. One significant feature of ISOCLUS over k-means is that clusters may be merged or split, and so the final number of clusters may be different from the number k supplied as part of the input. This algorithm will be described in later in this paper. The ISOCLUS algorithm can run very slowly, particularly on large data sets. Given its wide use in remote sensing, its efficient computation is an important goal. We have developed a fast implementation of the ISOCLUS algorithm. Our improvement is based on a recent acceleration to the k-means algorithm, the filtering algorithm, by Kanungo et al.. They showed that, by storing the data in a kd-tree, it was possible to significantly reduce the running time of k-means. We have adapted this method for the ISOCLUS algorithm. For technical reasons, which are explained later, it is necessary to make a minor modification to the ISOCLUS specification. We provide empirical evidence, on both synthetic and Landsat image data sets, that our algorithm's performance is essentially the same as that of ISOCLUS, but with significantly lower running times. We show that our algorithm runs from 3 to 30 times faster than a straightforward implementation of ISOCLUS. Our adaptation of the filtering algorithm involves the efficient computation of a number of cluster statistics that are needed for ISOCLUS, but not for k-means.
NASA Technical Reports Server (NTRS)
Janich, Karl W.
2005-01-01
The At-Least version of the Generalized Minimum Spanning Tree Problem (L-GMST) is a problem in which the optimal solution connects all defined clusters of nodes in a given network at a minimum cost. The L-GMST is NPHard; therefore, metaheuristic algorithms have been used to find reasonable solutions to the problem as opposed to computationally feasible exact algorithms, which many believe do not exist for such a problem. One such metaheuristic uses a swarm-intelligent Ant Colony System (ACS) algorithm, in which agents converge on a solution through the weighing of local heuristics, such as the shortest available path and the number of agents that recently used a given path. However, in a network using a solution derived from the ACS algorithm, some nodes may move around to different clusters and cause small changes in the network makeup. Rerunning the algorithm from the start would be somewhat inefficient due to the significance of the changes, so a genetic algorithm based on the top few solutions found in the ACS algorithm is proposed to quickly and efficiently adapt the network to these small changes.
[Automatic Sleep Stage Classification Based on an Improved K-means Clustering Algorithm].
Xiao, Shuyuan; Wang, Bei; Zhang, Jian; Zhang, Qunfeng; Zou, Junzhong
2016-10-01
Sleep stage scoring is a hotspot in the field of medicine and neuroscience.Visual inspection of sleep is laborious and the results may be subjective to different clinicians.Automatic sleep stage classification algorithm can be used to reduce the manual workload.However,there are still limitations when it encounters complicated and changeable clinical cases.The purpose of this paper is to develop an automatic sleep staging algorithm based on the characteristics of actual sleep data.In the proposed improved K-means clustering algorithm,points were selected as the initial centers by using a concept of density to avoid the randomness of the original K-means algorithm.Meanwhile,the cluster centers were updated according to the‘Three-Sigma Rule’during the iteration to abate the influence of the outliers.The proposed method was tested and analyzed on the overnight sleep data of the healthy persons and patients with sleep disorders after continuous positive airway pressure(CPAP)treatment.The automatic sleep stage classification results were compared with the visual inspection by qualified clinicians and the averaged accuracy reached 76%.With the analysis of morphological diversity of sleep data,it was proved that the proposed improved K-means algorithm was feasible and valid for clinical practice.
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-07-01
UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.
Anatomisation with slicing: a new privacy preservation approach for multiple sensitive attributes.
Susan, V Shyamala; Christopher, T
2016-01-01
An enormous quantity of personal health information is available in recent decades and tampering of any part of this information imposes a great risk to the health care field. Existing anonymization methods are only apt for single sensitive and low dimensional data to keep up with privacy specifically like generalization and bucketization. In this paper, an anonymization technique is proposed that is a combination of the benefits of anatomization, and enhanced slicing approach adhering to the principle of k-anonymity and l-diversity for the purpose of dealing with high dimensional data along with multiple sensitive data. The anatomization approach dissociates the correlation observed between the quasi identifier attributes and sensitive attributes (SA) and yields two separate tables with non-overlapping attributes. In the enhanced slicing algorithm, vertical partitioning does the grouping of the correlated SA in ST together and thereby minimizes the dimensionality by employing the advanced clustering algorithm. In order to get the optimal size of buckets, tuple partitioning is conducted by MFA. The experimental outcomes indicate that the proposed method can preserve privacy of data with numerous SA. The anatomization approach minimizes the loss of information and slicing algorithm helps in the preservation of correlation and utility which in turn results in reducing the data dimensionality and information loss. The advanced clustering algorithms prove its efficiency by minimizing the time and complexity. Furthermore, this work sticks to the principle of k-anonymity, l-diversity and thus avoids privacy threats like membership, identity and attributes disclosure.
Srinivasan, A.; Galbán, C.J.; Johnson, T.D.; Chenevert, T.L.; Ross, B.D.; Mukherji, S.K.
2014-01-01
Purpose The objective of our study was to analyze the differences between apparent diffusion coefficient (ADC) partitions (created using the K-Means algorithm) between benign and malignant neck lesions and evaluate its benefit in distinguishing these entities. Material and methods MRI studies of 10 benign and 10 malignant proven neck pathologies were post-processed on a PC using in-house software developed in MATLAB (The MathWorks, Inc., Natick, MA). Lesions were manually contoured by two neuroradiologists with the ADC values within each lesion clustered into two (low ADC-ADCL, high ADC-ADCH) and three partitions (ADCL, intermediate ADC-ADCI, ADCH) using the K-Means clustering algorithm. An unpaired two-tailed Student’s t-test was performed for all metrics to determine statistical differences in the means between the benign and malignant pathologies. Results Statistically significant difference between the mean ADCL clusters in benign and malignant pathologies was seen in the 3 cluster models of both readers (p=0.03, 0.022 respectively) and the 2 cluster model of reader 2 (p=0.04) with the other metrics (ADCH, ADCI, whole lesion mean ADC) not revealing any significant differences. Receiver operating characteristics curves demonstrated the quantitative difference in mean ADCH and ADCL in both the 2 and 3 cluster models to be predictive of malignancy (2 clusters: p=0.008, area under curve=0.850, 3 clusters: p=0.01, area under curve=0.825). Conclusion The K-Means clustering algorithm that generates partitions of large datasets may provide a better characterization of neck pathologies and may be of additional benefit in distinguishing benign and malignant neck pathologies compared to whole lesion mean ADC alone. PMID:20007723
High-dimensional cluster analysis with the Masked EM Algorithm
Kadir, Shabnam N.; Goodman, Dan F. M.; Harris, Kenneth D.
2014-01-01
Cluster analysis faces two problems in high dimensions: first, the “curse of dimensionality” that can lead to overfitting and poor generalization performance; and second, the sheer time taken for conventional algorithms to process large amounts of high-dimensional data. We describe a solution to these problems, designed for the application of “spike sorting” for next-generation high channel-count neural probes. In this problem, only a small subset of features provide information about the cluster member-ship of any one data vector, but this informative feature subset is not the same for all data points, rendering classical feature selection ineffective. We introduce a “Masked EM” algorithm that allows accurate and time-efficient clustering of up to millions of points in thousands of dimensions. We demonstrate its applicability to synthetic data, and to real-world high-channel-count spike sorting data. PMID:25149694
NASA Astrophysics Data System (ADS)
Schaefer, A. M.; Daniell, J. E.; Wenzel, F.
2014-12-01
Earthquake clustering tends to be an increasingly important part of general earthquake research especially in terms of seismic hazard assessment and earthquake forecasting and prediction approaches. The distinct identification and definition of foreshocks, aftershocks, mainshocks and secondary mainshocks is taken into account using a point based spatio-temporal clustering algorithm originating from the field of classic machine learning. This can be further applied for declustering purposes to separate background seismicity from triggered seismicity. The results are interpreted and processed to assemble 3D-(x,y,t) earthquake clustering maps which are based on smoothed seismicity records in space and time. In addition, multi-dimensional Gaussian functions are used to capture clustering parameters for spatial distribution and dominant orientations. Clusters are further processed using methodologies originating from geostatistics, which have been mostly applied and developed in mining projects during the last decades. A 2.5D variogram analysis is applied to identify spatio-temporal homogeneity in terms of earthquake density and energy output. The results are mitigated using Kriging to provide an accurate mapping solution for clustering features. As a case study, seismic data of New Zealand and the United States is used, covering events since the 1950s, from which an earthquake cluster catalogue is assembled for most of the major events, including a detailed analysis of the Landers and Christchurch sequences.
Spatial pattern recognition of seismic events in South West Colombia
NASA Astrophysics Data System (ADS)
Benítez, Hernán D.; Flórez, Juan F.; Duque, Diana P.; Benavides, Alberto; Lucía Baquero, Olga; Quintero, Jiber
2013-09-01
Recognition of seismogenic zones in geographical regions supports seismic hazard studies. This recognition is usually based on visual, qualitative and subjective analysis of data. Spatial pattern recognition provides a well founded means to obtain relevant information from large amounts of data. The purpose of this work is to identify and classify spatial patterns in instrumental data of the South West Colombian seismic database. In this research, clustering tendency analysis validates whether seismic database possesses a clustering structure. A non-supervised fuzzy clustering algorithm creates groups of seismic events. Given the sensitivity of fuzzy clustering algorithms to centroid initial positions, we proposed a methodology to initialize centroids that generates stable partitions with respect to centroid initialization. As a result of this work, a public software tool provides the user with the routines developed for clustering methodology. The analysis of the seismogenic zones obtained reveals meaningful spatial patterns in South-West Colombia. The clustering analysis provides a quantitative location and dispersion of seismogenic zones that facilitates seismological interpretations of seismic activities in South West Colombia.
A semi-supervised classification algorithm using the TAD-derived background as training data
NASA Astrophysics Data System (ADS)
Fan, Lei; Ambeau, Brittany; Messinger, David W.
2013-05-01
In general, spectral image classification algorithms fall into one of two categories: supervised and unsupervised. In unsupervised approaches, the algorithm automatically identifies clusters in the data without a priori information about those clusters (except perhaps the expected number of them). Supervised approaches require an analyst to identify training data to learn the characteristics of the clusters such that they can then classify all other pixels into one of the pre-defined groups. The classification algorithm presented here is a semi-supervised approach based on the Topological Anomaly Detection (TAD) algorithm. The TAD algorithm defines background components based on a mutual k-Nearest Neighbor graph model of the data, along with a spectral connected components analysis. Here, the largest components produced by TAD are used as regions of interest (ROI's),or training data for a supervised classification scheme. By combining those ROI's with a Gaussian Maximum Likelihood (GML) or a Minimum Distance to the Mean (MDM) algorithm, we are able to achieve a semi supervised classification method. We test this classification algorithm against data collected by the HyMAP sensor over the Cooke City, MT area and University of Pavia scene.
Wolf, Antje; Kirschner, Karl N
2013-02-01
With improvements in computer speed and algorithm efficiency, MD simulations are sampling larger amounts of molecular and biomolecular conformations. Being able to qualitatively and quantitatively sift these conformations into meaningful groups is a difficult and important task, especially when considering the structure-activity paradigm. Here we present a study that combines two popular techniques, principal component (PC) analysis and clustering, for revealing major conformational changes that occur in molecular dynamics (MD) simulations. Specifically, we explored how clustering different PC subspaces effects the resulting clusters versus clustering the complete trajectory data. As a case example, we used the trajectory data from an explicitly solvated simulation of a bacteria's L11·23S ribosomal subdomain, which is a target of thiopeptide antibiotics. Clustering was performed, using K-means and average-linkage algorithms, on data involving the first two to the first five PC subspace dimensions. For the average-linkage algorithm we found that data-point membership, cluster shape, and cluster size depended on the selected PC subspace data. In contrast, K-means provided very consistent results regardless of the selected subspace. Since we present results on a single model system, generalization concerning the clustering of different PC subspaces of other molecular systems is currently premature. However, our hope is that this study illustrates a) the complexities in selecting the appropriate clustering algorithm, b) the complexities in interpreting and validating their results, and c) by combining PC analysis with subsequent clustering valuable dynamic and conformational information can be obtained.
NASA Astrophysics Data System (ADS)
Rasim; Junaeti, E.; Wirantika, R.
2018-01-01
Accurate forecasting for the sale of a product depends on the forecasting method used. The purpose of this research is to build motorcycle sales forecasting application using Fuzzy Time Series method combined with interval determination using automatic clustering algorithm. Forecasting is done using the sales data of motorcycle sales in the last ten years. Then the error rate of forecasting is measured using Means Percentage Error (MPE) and Means Absolute Percentage Error (MAPE). The results of forecasting in the one-year period obtained in this study are included in good accuracy.
PCA based clustering for brain tumor segmentation of T1w MRI images.
Kaya, Irem Ersöz; Pehlivanlı, Ayça Çakmak; Sekizkardeş, Emine Gezmez; Ibrikci, Turgay
2017-03-01
Medical images are huge collections of information that are difficult to store and process consuming extensive computing time. Therefore, the reduction techniques are commonly used as a data pre-processing step to make the image data less complex so that a high-dimensional data can be identified by an appropriate low-dimensional representation. PCA is one of the most popular multivariate methods for data reduction. This paper is focused on T1-weighted MRI images clustering for brain tumor segmentation with dimension reduction by different common Principle Component Analysis (PCA) algorithms. Our primary aim is to present a comparison between different variations of PCA algorithms on MRIs for two cluster methods. Five most common PCA algorithms; namely the conventional PCA, Probabilistic Principal Component Analysis (PPCA), Expectation Maximization Based Principal Component Analysis (EM-PCA), Generalize Hebbian Algorithm (GHA), and Adaptive Principal Component Extraction (APEX) were applied to reduce dimensionality in advance of two clustering algorithms, K-Means and Fuzzy C-Means. In the study, the T1-weighted MRI images of the human brain with brain tumor were used for clustering. In addition to the original size of 512 lines and 512 pixels per line, three more different sizes, 256 × 256, 128 × 128 and 64 × 64, were included in the study to examine their effect on the methods. The obtained results were compared in terms of both the reconstruction errors and the Euclidean distance errors among the clustered images containing the same number of principle components. According to the findings, the PPCA obtained the best results among all others. Furthermore, the EM-PCA and the PPCA assisted K-Means algorithm to accomplish the best clustering performance in the majority as well as achieving significant results with both clustering algorithms for all size of T1w MRI images. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal
2008-01-01
Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il PMID:18586742
An acceleration framework for synthetic aperture radar algorithms
NASA Astrophysics Data System (ADS)
Kim, Youngsoo; Gloster, Clay S.; Alexander, Winser E.
2017-04-01
Algorithms for radar signal processing, such as Synthetic Aperture Radar (SAR) are computationally intensive and require considerable execution time on a general purpose processor. Reconfigurable logic can be used to off-load the primary computational kernel onto a custom computing machine in order to reduce execution time by an order of magnitude as compared to kernel execution on a general purpose processor. Specifically, Field Programmable Gate Arrays (FPGAs) can be used to accelerate these kernels using hardware-based custom logic implementations. In this paper, we demonstrate a framework for algorithm acceleration. We used SAR as a case study to illustrate the potential for algorithm acceleration offered by FPGAs. Initially, we profiled the SAR algorithm and implemented a homomorphic filter using a hardware implementation of the natural logarithm. Experimental results show a linear speedup by adding reasonably small processing elements in Field Programmable Gate Array (FPGA) as opposed to using a software implementation running on a typical general purpose processor.
Using clustering and a modified classification algorithm for automatic text summarization
NASA Astrophysics Data System (ADS)
Aries, Abdelkrime; Oufaida, Houda; Nouali, Omar
2013-01-01
In this paper we describe a modified classification method destined for extractive summarization purpose. The classification in this method doesn't need a learning corpus; it uses the input text to do that. First, we cluster the document sentences to exploit the diversity of topics, then we use a learning algorithm (here we used Naive Bayes) on each cluster considering it as a class. After obtaining the classification model, we calculate the score of a sentence in each class, using a scoring model derived from classification algorithm. These scores are used, then, to reorder the sentences and extract the first ones as the output summary. We conducted some experiments using a corpus of scientific papers, and we have compared our results to another summarization system called UNIS.1 Also, we experiment the impact of clustering threshold tuning, on the resulted summary, as well as the impact of adding more features to the classifier. We found that this method is interesting, and gives good performance, and the addition of new features (which is simple using this method) can improve summary's accuracy.
Identifying multiple influential spreaders based on generalized closeness centrality
NASA Astrophysics Data System (ADS)
Liu, Huan-Li; Ma, Chuang; Xiang, Bing-Bing; Tang, Ming; Zhang, Hai-Feng
2018-02-01
To maximize the spreading influence of multiple spreaders in complex networks, one important fact cannot be ignored: the multiple spreaders should be dispersively distributed in networks, which can effectively reduce the redundance of information spreading. For this purpose, we define a generalized closeness centrality (GCC) index by generalizing the closeness centrality index to a set of nodes. The problem converts to how to identify multiple spreaders such that an objective function has the minimal value. By comparing with the K-means clustering algorithm, we find that the optimization problem is very similar to the problem of minimizing the objective function in the K-means method. Therefore, how to find multiple nodes with the highest GCC value can be approximately solved by the K-means method. Two typical transmission dynamics-epidemic spreading process and rumor spreading process are implemented in real networks to verify the good performance of our proposed method.
Clustering methods for the optimization of atomic cluster structure
NASA Astrophysics Data System (ADS)
Bagattini, Francesco; Schoen, Fabio; Tigli, Luca
2018-04-01
In this paper, we propose a revised global optimization method and apply it to large scale cluster conformation problems. In the 1990s, the so-called clustering methods were considered among the most efficient general purpose global optimization techniques; however, their usage has quickly declined in recent years, mainly due to the inherent difficulties of clustering approaches in large dimensional spaces. Inspired from the machine learning literature, we redesigned clustering methods in order to deal with molecular structures in a reduced feature space. Our aim is to show that by suitably choosing a good set of geometrical features coupled with a very efficient descent method, an effective optimization tool is obtained which is capable of finding, with a very high success rate, all known putative optima for medium size clusters without any prior information, both for Lennard-Jones and Morse potentials. The main result is that, beyond being a reliable approach, the proposed method, based on the idea of starting a computationally expensive deep local search only when it seems worth doing so, is capable of saving a huge amount of searches with respect to an analogous algorithm which does not employ a clustering phase. In this paper, we are not claiming the superiority of the proposed method compared to specific, refined, state-of-the-art procedures, but rather indicating a quite straightforward way to save local searches by means of a clustering scheme working in a reduced variable space, which might prove useful when included in many modern methods.
Hsu, David
2015-09-27
Clustering methods are often used to model energy consumption for two reasons. First, clustering is often used to process data and to improve the predictive accuracy of subsequent energy models. Second, stable clusters that are reproducible with respect to non-essential changes can be used to group, target, and interpret observed subjects. However, it is well known that clustering methods are highly sensitive to the choice of algorithms and variables. This can lead to misleading assessments of predictive accuracy and mis-interpretation of clusters in policymaking. This paper therefore introduces two methods to the modeling of energy consumption in buildings: clusterwise regression,more » also known as latent class regression, which integrates clustering and regression simultaneously; and cluster validation methods to measure stability. Using a large dataset of multifamily buildings in New York City, clusterwise regression is compared to common two-stage algorithms that use K-means and model-based clustering with linear regression. Predictive accuracy is evaluated using 20-fold cross validation, and the stability of the perturbed clusters is measured using the Jaccard coefficient. These results show that there seems to be an inherent tradeoff between prediction accuracy and cluster stability. This paper concludes by discussing which clustering methods may be appropriate for different analytical purposes.« less
NASA Astrophysics Data System (ADS)
Ruske, S. T.; Topping, D. O.; Foot, V. E.; Kaye, P. H.; Stanley, W. R.; Morse, A. P.; Crawford, I.; Gallagher, M. W.
2016-12-01
Characterisation of bio-aerosols has important implications within Environment and Public Health sectors. Recent developments in Ultra-Violet Light Induced Fluorescence (UV-LIF) detectors such as the Wideband Integrated bio-aerosol Spectrometer (WIBS) and the newly introduced Multiparameter bio-aerosol Spectrometer (MBS) has allowed for the real time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal Spores and pollen. This new generation of instruments has enabled ever-larger data sets to be compiled with the aim of studying more complex environments, yet the algorithms used for specie classification remain largely invalidated. It is therefore imperative that we validate the performance of different algorithms that can be used for the task of classification, which is the focus of this study. For unsupervised learning we test Hierarchical Agglomerative Clustering with various different linkages. For supervised learning, ten methods were tested; including decision trees, ensemble methods: Random Forests, Gradient Boosting and AdaBoost; two implementations for support vector machines: libsvm and liblinear; Gaussian methods: Gaussian naïve Bayesian, quadratic and linear discriminant analysis and finally the k-nearest neighbours algorithm. The methods were applied to two different data sets measured using a new Multiparameter bio-aerosol Spectrometer. We find that clustering, in general, performs slightly worse than the supervised learning methods correctly classifying, at best, only 72.7 and 91.1 percent for the two data sets. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 88.1 and 97.8 percent of the testing data respectively across the two data sets. We discuss the wider relevance of these results with regards to challenging existing classification in real-world environments.
Performance evaluation of PCA-based spike sorting algorithms.
Adamos, Dimitrios A; Kosmidis, Efstratios K; Theophilidis, George
2008-09-01
Deciphering the electrical activity of individual neurons from multi-unit noisy recordings is critical for understanding complex neural systems. A widely used spike sorting algorithm is being evaluated for single-electrode nerve trunk recordings. The algorithm is based on principal component analysis (PCA) for spike feature extraction. In the neuroscience literature it is generally assumed that the use of the first two or most commonly three principal components is sufficient. We estimate the optimum PCA-based feature space by evaluating the algorithm's performance on simulated series of action potentials. A number of modifications are made to the open source nev2lkit software to enable systematic investigation of the parameter space. We introduce a new metric to define clustering error considering over-clustering more favorable than under-clustering as proposed by experimentalists for our data. Both the program patch and the metric are available online. Correlated and white Gaussian noise processes are superimposed to account for biological and artificial jitter in the recordings. We report that the employment of more than three principal components is in general beneficial for all noise cases considered. Finally, we apply our results to experimental data and verify that the sorting process with four principal components is in agreement with a panel of electrophysiology experts.
Cluster and propensity based approximation of a network
2013-01-01
Background The models in this article generalize current models for both correlation networks and multigraph networks. Correlation networks are widely applied in genomics research. In contrast to general networks, it is straightforward to test the statistical significance of an edge in a correlation network. It is also easy to decompose the underlying correlation matrix and generate informative network statistics such as the module eigenvector. However, correlation networks only capture the connections between numeric variables. An open question is whether one can find suitable decompositions of the similarity measures employed in constructing general networks. Multigraph networks are attractive because they support likelihood based inference. Unfortunately, it is unclear how to adjust current statistical methods to detect the clusters inherent in many data sets. Results Here we present an intuitive and parsimonious parametrization of a general similarity measure such as a network adjacency matrix. The cluster and propensity based approximation (CPBA) of a network not only generalizes correlation network methods but also multigraph methods. In particular, it gives rise to a novel and more realistic multigraph model that accounts for clustering and provides likelihood based tests for assessing the significance of an edge after controlling for clustering. We present a novel Majorization-Minimization (MM) algorithm for estimating the parameters of the CPBA. To illustrate the practical utility of the CPBA of a network, we apply it to gene expression data and to a bi-partite network model for diseases and disease genes from the Online Mendelian Inheritance in Man (OMIM). Conclusions The CPBA of a network is theoretically appealing since a) it generalizes correlation and multigraph network methods, b) it improves likelihood based significance tests for edge counts, c) it directly models higher-order relationships between clusters, and d) it suggests novel clustering algorithms. The CPBA of a network is implemented in Fortran 95 and bundled in the freely available R package PropClust. PMID:23497424
Task-specific image partitioning.
Kim, Sungwoong; Nowozin, Sebastian; Kohli, Pushmeet; Yoo, Chang D
2013-02-01
Image partitioning is an important preprocessing step for many of the state-of-the-art algorithms used for performing high-level computer vision tasks. Typically, partitioning is conducted without regard to the task in hand. We propose a task-specific image partitioning framework to produce a region-based image representation that will lead to a higher task performance than that reached using any task-oblivious partitioning framework and existing supervised partitioning framework, albeit few in number. The proposed method partitions the image by means of correlation clustering, maximizing a linear discriminant function defined over a superpixel graph. The parameters of the discriminant function that define task-specific similarity/dissimilarity among superpixels are estimated based on structured support vector machine (S-SVM) using task-specific training data. The S-SVM learning leads to a better generalization ability while the construction of the superpixel graph used to define the discriminant function allows a rich set of features to be incorporated to improve discriminability and robustness. We evaluate the learned task-aware partitioning algorithms on three benchmark datasets. Results show that task-aware partitioning leads to better labeling performance than the partitioning computed by the state-of-the-art general-purpose and supervised partitioning algorithms. We believe that the task-specific image partitioning paradigm is widely applicable to improving performance in high-level image understanding tasks.
Astrophysical data mining with GPU. A case study: Genetic classification of globular clusters
NASA Astrophysics Data System (ADS)
Cavuoti, S.; Garofalo, M.; Brescia, M.; Paolillo, M.; Pescape', A.; Longo, G.; Ventre, G.
2014-01-01
We present a multi-purpose genetic algorithm, designed and implemented with GPGPU/CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource, http://dame.dsf.unina.it/beta_info.html), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200× in the training phase with respect to the CPU based version.
NASA Astrophysics Data System (ADS)
Kumar, Rakesh; Chandrawat, Rajesh Kumar; Garg, B. P.; Joshi, Varun
2017-07-01
Opening the new firm or branch with desired execution is very relevant to facility location problem. Along the lines to locate the new ambulances and firehouses, the government desires to minimize average response time for emergencies from all residents of cities. So finding the best location is biggest challenge in day to day life. These type of problems were named as facility location problems. A lot of algorithms have been developed to handle these problems. In this paper, we review five algorithms that were applied to facility location problems. The significance of clustering in facility location problems is also presented. First we compare Fuzzy c-means clustering (FCM) algorithm with alternating heuristic (AH) algorithm, then with Particle Swarm Optimization (PSO) algorithms using different type of distance function. The data was clustered with the help of FCM and then we apply median model and min-max problem model on that data. After finding optimized locations using these algorithms we find the distance from optimized location point to the demanded point with different distance techniques and compare the results. At last, we design a general example to validate the feasibility of the five algorithms for facilities location optimization, and authenticate the advantages and drawbacks of them.
Orientation domains: A mobile grid clustering algorithm with spherical corrections
NASA Astrophysics Data System (ADS)
Mencos, Joana; Gratacós, Oscar; Farré, Mercè; Escalante, Joan; Arbués, Pau; Muñoz, Josep Anton
2012-12-01
An algorithm has been designed and tested which was devised as a tool assisting the analysis of geological structures solely from orientation data. More specifically, the algorithm was intended for the analysis of geological structures that can be approached as planar and piecewise features, like many folded strata. Input orientation data is expressed as pairs of angles (azimuth and dip). The algorithm starts by considering the data in Cartesian coordinates. This is followed by a search for an initial clustering solution, which is achieved by comparing the results output from the systematic shift of a regular rigid grid over the data. This initial solution is optimal (achieves minimum square error) once the grid size and the shift increment are fixed. Finally, the algorithm corrects for the variable spread that is generally expected from the data type using a reshaped non-rigid grid. The algorithm is size-oriented, which implies the application of conditions over cluster size through all the process in contrast to density-oriented algorithms, also widely used when dealing with spatial data. Results are derived in few seconds and, when tested over synthetic examples, they were found to be consistent and reliable. This makes the algorithm a valuable alternative to the time-consuming traditional approaches available to geologists.
Local Higher-Order Graph Clustering
Yin, Hao; Benson, Austin R.; Leskovec, Jure; Gleich, David F.
2018-01-01
Local graph clustering methods aim to find a cluster of nodes by exploring a small region of the graph. These methods are attractive because they enable targeted clustering around a given seed node and are faster than traditional global graph clustering methods because their runtime does not depend on the size of the input graph. However, current local graph partitioning methods are not designed to account for the higher-order structures crucial to the network, nor can they effectively handle directed networks. Here we introduce a new class of local graph clustering methods that address these issues by incorporating higher-order network information captured by small subgraphs, also called network motifs. We develop the Motif-based Approximate Personalized PageRank (MAPPR) algorithm that finds clusters containing a seed node with minimal motif conductance, a generalization of the conductance metric for network motifs. We generalize existing theory to prove the fast running time (independent of the size of the graph) and obtain theoretical guarantees on the cluster quality (in terms of motif conductance). We also develop a theory of node neighborhoods for finding sets that have small motif conductance, and apply these results to the case of finding good seed nodes to use as input to the MAPPR algorithm. Experimental validation on community detection tasks in both synthetic and real-world networks, shows that our new framework MAPPR outperforms the current edge-based personalized PageRank methodology. PMID:29770258
Improved Ant Colony Clustering Algorithm and Its Performance Study
Gao, Wei
2016-01-01
Clustering analysis is used in many disciplines and applications; it is an important tool that descriptively identifies homogeneous groups of objects based on attribute values. The ant colony clustering algorithm is a swarm-intelligent method used for clustering problems that is inspired by the behavior of ant colonies that cluster their corpses and sort their larvae. A new abstraction ant colony clustering algorithm using a data combination mechanism is proposed to improve the computational efficiency and accuracy of the ant colony clustering algorithm. The abstraction ant colony clustering algorithm is used to cluster benchmark problems, and its performance is compared with the ant colony clustering algorithm and other methods used in existing literature. Based on similar computational difficulties and complexities, the results show that the abstraction ant colony clustering algorithm produces results that are not only more accurate but also more efficiently determined than the ant colony clustering algorithm and the other methods. Thus, the abstraction ant colony clustering algorithm can be used for efficient multivariate data clustering. PMID:26839533
Identification of piecewise affine systems based on fuzzy PCA-guided robust clustering technique
NASA Astrophysics Data System (ADS)
Khanmirza, Esmaeel; Nazarahari, Milad; Mousavi, Alireza
2016-12-01
Hybrid systems are a class of dynamical systems whose behaviors are based on the interaction between discrete and continuous dynamical behaviors. Since a general method for the analysis of hybrid systems is not available, some researchers have focused on specific types of hybrid systems. Piecewise affine (PWA) systems are one of the subsets of hybrid systems. The identification of PWA systems includes the estimation of the parameters of affine subsystems and the coefficients of the hyperplanes defining the partition of the state-input domain. In this paper, we have proposed a PWA identification approach based on a modified clustering technique. By using a fuzzy PCA-guided robust k-means clustering algorithm along with neighborhood outlier detection, the two main drawbacks of the well-known clustering algorithms, i.e., the poor initialization and the presence of outliers, are eliminated. Furthermore, this modified clustering technique enables us to determine the number of subsystems without any prior knowledge about system. In addition, applying the structure of the state-input domain, that is, considering the time sequence of input-output pairs, provides a more efficient clustering algorithm, which is the other novelty of this work. Finally, the proposed algorithm has been evaluated by parameter identification of an IGV servo actuator. Simulation together with experiment analysis has proved the effectiveness of the proposed method.
Identifying At-Risk Students in General Chemistry via Cluster Analysis of Affective Characteristics
ERIC Educational Resources Information Center
Chan, Julia Y. K.; Bauer, Christopher F.
2014-01-01
The purpose of this study is to identify academically at-risk students in first-semester general chemistry using affective characteristics via cluster analysis. Through the clustering of six preselected affective variables, three distinct affective groups were identified: low (at-risk), medium, and high. Students in the low affective group…
Effect of data truncation in an implementation of pixel clustering on a custom computing machine
NASA Astrophysics Data System (ADS)
Leeser, Miriam E.; Theiler, James P.; Estlick, Michael; Kitaryeva, Natalya V.; Szymanski, John J.
2000-10-01
We investigate the effect of truncating the precision of hyperspectral image data for the purpose of more efficiently segmenting the image using a variant of k-means clustering. We describe the implementation of the algorithm on field-programmable gate array (FPGA) hardware. Truncating the data to only a few bits per pixel in each spectral channel permits a more compact hardware design, enabling greater parallelism, and ultimately a more rapid execution. It also enables the storage of larger images in the onboard memory. In exchange for faster clustering, however, one trades off the quality of the produced segmentation. We find, however, that the clustering algorithm can tolerate considerable data truncation with little degradation in cluster quality. This robustness to truncated data can be extended by computing the cluster centers to a few more bits of precision than the data. Since there are so many more pixels than centers, the more aggressive data truncation leads to significant gains in the number of pixels that can be stored in memory and processed in hardware concurrently.
Clustering of financial time series
NASA Astrophysics Data System (ADS)
D'Urso, Pierpaolo; Cappelli, Carmela; Di Lallo, Dario; Massari, Riccardo
2013-05-01
This paper addresses the topic of classifying financial time series in a fuzzy framework proposing two fuzzy clustering models both based on GARCH models. In general clustering of financial time series, due to their peculiar features, needs the definition of suitable distance measures. At this aim, the first fuzzy clustering model exploits the autoregressive representation of GARCH models and employs, in the framework of a partitioning around medoids algorithm, the classical autoregressive metric. The second fuzzy clustering model, also based on partitioning around medoids algorithm, uses the Caiado distance, a Mahalanobis-like distance, based on estimated GARCH parameters and covariances that takes into account the information about the volatility structure of time series. In order to illustrate the merits of the proposed fuzzy approaches an application to the problem of classifying 29 time series of Euro exchange rates against international currencies is presented and discussed, also comparing the fuzzy models with their crisp version.
Community structure from spectral properties in complex networks
NASA Astrophysics Data System (ADS)
Servedio, V. D. P.; Colaiori, F.; Capocci, A.; Caldarelli, G.
2005-06-01
We analyze the spectral properties of complex networks focusing on their relation to the community structure, and develop an algorithm based on correlations among components of different eigenvectors. The algorithm applies to general weighted networks, and, in a suitably modified version, to the case of directed networks. Our method allows to correctly detect communities in sharply partitioned graphs, however it is useful to the analysis of more complex networks, without a well defined cluster structure, as social and information networks. As an example, we test the algorithm on a large scale data-set from a psychological experiment of free word association, where it proves to be successful both in clustering words, and in uncovering mental association patterns.
A novel complex networks clustering algorithm based on the core influence of nodes.
Tong, Chao; Niu, Jianwei; Dai, Bin; Xie, Zhongyu
2014-01-01
In complex networks, cluster structure, identified by the heterogeneity of nodes, has become a common and important topological property. Network clustering methods are thus significant for the study of complex networks. Currently, many typical clustering algorithms have some weakness like inaccuracy and slow convergence. In this paper, we propose a clustering algorithm by calculating the core influence of nodes. The clustering process is a simulation of the process of cluster formation in sociology. The algorithm detects the nodes with core influence through their betweenness centrality, and builds the cluster's core structure by discriminant functions. Next, the algorithm gets the final cluster structure after clustering the rest of the nodes in the network by optimizing method. Experiments on different datasets show that the clustering accuracy of this algorithm is superior to the classical clustering algorithm (Fast-Newman algorithm). It clusters faster and plays a positive role in revealing the real cluster structure of complex networks precisely.
NASA Astrophysics Data System (ADS)
Cannata, A.; Montalto, P.; Aliotta, M.; Cassisi, C.; Pulvirenti, A.; Privitera, E.; Patanè, D.
2011-04-01
Active volcanoes generate sonic and infrasonic signals, whose investigation provides useful information for both monitoring purposes and the study of the dynamics of explosive phenomena. At Mt. Etna volcano (Italy), a pattern recognition system based on infrasonic waveform features has been developed. First, by a parametric power spectrum method, the features describing and characterizing the infrasound events were extracted: peak frequency and quality factor. Then, together with the peak-to-peak amplitude, these features constituted a 3-D ‘feature space’; by Density-Based Spatial Clustering of Applications with Noise algorithm (DBSCAN) three clusters were recognized inside it. After the clustering process, by using a common location method (semblance method) and additional volcanological information concerning the intensity of the explosive activity, we were able to associate each cluster to a particular source vent and/or a kind of volcanic activity. Finally, for automatic event location, clusters were used to train a model based on Support Vector Machine, calculating optimal hyperplanes able to maximize the margins of separation among the clusters. After the training phase this system automatically allows recognizing the active vent with no location algorithm and by using only a single station.
Change detection in synthetic aperture radar images based on image fusion and fuzzy clustering.
Gong, Maoguo; Zhou, Zhiqiang; Ma, Jingjing
2012-04-01
This paper presents an unsupervised distribution-free change detection approach for synthetic aperture radar (SAR) images based on an image fusion strategy and a novel fuzzy clustering algorithm. The image fusion technique is introduced to generate a difference image by using complementary information from a mean-ratio image and a log-ratio image. In order to restrain the background information and enhance the information of changed regions in the fused difference image, wavelet fusion rules based on an average operator and minimum local area energy are chosen to fuse the wavelet coefficients for a low-frequency band and a high-frequency band, respectively. A reformulated fuzzy local-information C-means clustering algorithm is proposed for classifying changed and unchanged regions in the fused difference image. It incorporates the information about spatial context in a novel fuzzy way for the purpose of enhancing the changed information and of reducing the effect of speckle noise. Experiments on real SAR images show that the image fusion strategy integrates the advantages of the log-ratio operator and the mean-ratio operator and gains a better performance. The change detection results obtained by the improved fuzzy clustering algorithm exhibited lower error than its preexistences.
Sequence spaces [Formula: see text] and [Formula: see text] with application in clustering.
Khan, Mohd Shoaib; Alamri, Badriah As; Mursaleen, M; Lohani, Qm Danish
2017-01-01
Distance measures play a central role in evolving the clustering technique. Due to the rich mathematical background and natural implementation of [Formula: see text] distance measures, researchers were motivated to use them in almost every clustering process. Beside [Formula: see text] distance measures, there exist several distance measures. Sargent introduced a special type of distance measures [Formula: see text] and [Formula: see text] which is closely related to [Formula: see text]. In this paper, we generalized the Sargent sequence spaces through introduction of [Formula: see text] and [Formula: see text] sequence spaces. Moreover, it is shown that both spaces are BK -spaces, and one is a dual of another. Further, we have clustered the two-moon dataset by using an induced [Formula: see text]-distance measure (induced by the Sargent sequence space [Formula: see text]) in the k-means clustering algorithm. The clustering result established the efficacy of replacing the Euclidean distance measure by the [Formula: see text]-distance measure in the k-means algorithm.
Pourhassan, Mojgan; Neumann, Frank
2018-06-22
The generalized travelling salesperson problem is an important NP-hard combinatorial optimization problem for which meta-heuristics, such as local search and evolutionary algorithms, have been used very successfully. Two hierarchical approaches with different neighbourhood structures, namely a Cluster-Based approach and a Node-Based approach, have been proposed by Hu and Raidl (2008) for solving this problem. In this paper, local search algorithms and simple evolutionary algorithms based on these approaches are investigated from a theoretical perspective. For local search algorithms, we point out the complementary abilities of the two approaches by presenting instances where they mutually outperform each other. Afterwards, we introduce an instance which is hard for both approaches when initialized on a particular point of the search space, but where a variable neighbourhood search combining them finds the optimal solution in polynomial time. Then we turn our attention to analysing the behaviour of simple evolutionary algorithms that use these approaches. We show that the Node-Based approach solves the hard instance of the Cluster-Based approach presented in Corus et al. (2016) in polynomial time. Furthermore, we prove an exponential lower bound on the optimization time of the Node-Based approach for a class of Euclidean instances.
A Genetic Algorithm That Exchanges Neighboring Centers for Fuzzy c-Means Clustering
ERIC Educational Resources Information Center
Chahine, Firas Safwan
2012-01-01
Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited for applications with large datasets than hierarchical clustering algorithms. K-means is among the most popular partitional clustering algorithm, but has a major…
Chen, Zhijia; Zhu, Yuanchang; Di, Yanqiang; Feng, Shaochong
2015-01-01
In IaaS (infrastructure as a service) cloud environment, users are provisioned with virtual machines (VMs). To allocate resources for users dynamically and effectively, accurate resource demands predicting is essential. For this purpose, this paper proposes a self-adaptive prediction method using ensemble model and subtractive-fuzzy clustering based fuzzy neural network (ESFCFNN). We analyze the characters of user preferences and demands. Then the architecture of the prediction model is constructed. We adopt some base predictors to compose the ensemble model. Then the structure and learning algorithm of fuzzy neural network is researched. To obtain the number of fuzzy rules and the initial value of the premise and consequent parameters, this paper proposes the fuzzy c-means combined with subtractive clustering algorithm, that is, the subtractive-fuzzy clustering. Finally, we adopt different criteria to evaluate the proposed method. The experiment results show that the method is accurate and effective in predicting the resource demands. PMID:25691896
NASA Astrophysics Data System (ADS)
Teramae, Tatsuya; Kushida, Daisuke; Takemori, Fumiaki; Kitamura, Akira
Authors proposed the estimation method combining k-means algorithm and NN for evaluating massage. However, this estimation method has a problem that discrimination ratio is decreased to new user. There are two causes of this problem. One is that generalization of NN is bad. Another one is that clustering result by k-means algorithm has not high correlation coefficient in a class. Then, this research proposes k-means algorithm according to correlation coefficient and incremental learning for NN. The proposed k-means algorithm is method included evaluation function based on correlation coefficient. Incremental learning is method that NN is learned by new data and initialized weight based on the existing data. The effect of proposed methods are verified by estimation result using EEG data when testee is given massage.
NASA Technical Reports Server (NTRS)
Lennington, R. K.; Rassbach, M. E.
1979-01-01
Discussed in this report is the clustering algorithm CLASSY, including detailed descriptions of its general structure and mathematical background and of the various major subroutines. The report provides a development of the logic and equations used with specific reference to program variables. Some comments on timing and proposed optimization techniques are included.
A traveling salesman approach for predicting protein functions.
Johnson, Olin; Liu, Jing
2006-10-12
Protein-protein interaction information can be used to predict unknown protein functions and to help study biological pathways. Here we present a new approach utilizing the classic Traveling Salesman Problem to study the protein-protein interactions and to predict protein functions in budding yeast Saccharomyces cerevisiae. We apply the global optimization tool from combinatorial optimization algorithms to cluster the yeast proteins based on the global protein interaction information. We then use this clustering information to help us predict protein functions. We use our algorithm together with the direct neighbor algorithm 1 on characterized proteins and compare the prediction accuracy of the two methods. We show our algorithm can produce better predictions than the direct neighbor algorithm, which only considers the immediate neighbors of the query protein. Our method is a promising one to be used as a general tool to predict functions of uncharacterized proteins and a successful sample of using computer science knowledge and algorithms to study biological problems.
A traveling salesman approach for predicting protein functions
Johnson, Olin; Liu, Jing
2006-01-01
Background Protein-protein interaction information can be used to predict unknown protein functions and to help study biological pathways. Results Here we present a new approach utilizing the classic Traveling Salesman Problem to study the protein-protein interactions and to predict protein functions in budding yeast Saccharomyces cerevisiae. We apply the global optimization tool from combinatorial optimization algorithms to cluster the yeast proteins based on the global protein interaction information. We then use this clustering information to help us predict protein functions. We use our algorithm together with the direct neighbor algorithm [1] on characterized proteins and compare the prediction accuracy of the two methods. We show our algorithm can produce better predictions than the direct neighbor algorithm, which only considers the immediate neighbors of the query protein. Conclusion Our method is a promising one to be used as a general tool to predict functions of uncharacterized proteins and a successful sample of using computer science knowledge and algorithms to study biological problems. PMID:17147783
NASA Astrophysics Data System (ADS)
Alfarizy, A. D.; Indahwati; Sartono, B.
2017-03-01
Indonesia is the largest Hollywood movie industry target market in Southeast Asia in 2015. Hollywood movies distributed in Indonesia targeted people in all range of ages including children. Low awareness of guiding children while watching movies make them could watch any rated films even the unsuitable ones for their ages. Even after being translated into Bahasa and passed the censorship phase, words that uncomfortable for children to watch still exist. The purpose of this research is to cluster box office Hollywood movies based on Indonesian subtitle, revenue, IMDb user rating and genres as one of the reference for adults to choose right movies for their children to watch. Text mining is used to extract words from the subtitles and count the frequency for three group of words (bad words, sexual words and terror words), while Partition Around Medoids (PAM) Algorithm with Gower similarity coefficient as proximity matrix is used as clustering method. We clustered 624 movies from 2006 until first half of 2016 from IMDb. Cluster with highest silhouette coefficient value (0.36) is the one with 5 clusters. Animation, Adventure and Comedy movies with high revenue like in cluster 5 is recommended for children to watch, while Comedy movies with high revenue like in cluster 4 should be avoided to watch.
SOTXTSTREAM: Density-based self-organizing clustering of text streams.
Bryant, Avory C; Cios, Krzysztof J
2017-01-01
A streaming data clustering algorithm is presented building upon the density-based self-organizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets.
Detection and clustering of features in aerial images by neuron network-based algorithm
NASA Astrophysics Data System (ADS)
Vozenilek, Vit
2015-12-01
The paper presents the algorithm for detection and clustering of feature in aerial photographs based on artificial neural networks. The presented approach is not focused on the detection of specific topographic features, but on the combination of general features analysis and their use for clustering and backward projection of clusters to aerial image. The basis of the algorithm is a calculation of the total error of the network and a change of weights of the network to minimize the error. A classic bipolar sigmoid was used for the activation function of the neurons and the basic method of backpropagation was used for learning. To verify that a set of features is able to represent the image content from the user's perspective, the web application was compiled (ASP.NET on the Microsoft .NET platform). The main achievements include the knowledge that man-made objects in aerial images can be successfully identified by detection of shapes and anomalies. It was also found that the appropriate combination of comprehensive features that describe the colors and selected shapes of individual areas can be useful for image analysis.
A segmentation/clustering model for the analysis of array CGH data.
Picard, F; Robin, S; Lebarbier, E; Daudin, J-J
2007-09-01
Microarray-CGH (comparative genomic hybridization) experiments are used to detect and map chromosomal imbalances. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose representative sequences share the same relative copy number on average. Segmentation methods constitute a natural framework for the analysis, but they do not provide a biological status for the detected segments. We propose a new model for this segmentation/clustering problem, combining a segmentation model with a mixture model. We present a new hybrid algorithm called dynamic programming-expectation maximization (DP-EM) to estimate the parameters of the model by maximum likelihood. This algorithm combines DP and the EM algorithm. We also propose a model selection heuristic to select the number of clusters and the number of segments. An example of our procedure is presented, based on publicly available data sets. We compare our method to segmentation methods and to hidden Markov models, and we show that the new segmentation/clustering model is a promising alternative that can be applied in the more general context of signal processing.
R, GeethaRamani; Balasubramanian, Lakshmi
2018-07-01
Macula segmentation and fovea localization is one of the primary tasks in retinal analysis as they are responsible for detailed vision. Existing approaches required segmentation of retinal structures viz. optic disc and blood vessels for this purpose. This work avoids knowledge of other retinal structures and attempts data mining techniques to segment macula. Unsupervised clustering algorithm is exploited for this purpose. Selection of initial cluster centres has a great impact on performance of clustering algorithms. A heuristic based clustering in which initial centres are selected based on measures defining statistical distribution of data is incorporated in the proposed methodology. The initial phase of proposed framework includes image cropping, green channel extraction, contrast enhancement and application of mathematical closing. Then, the pre-processed image is subjected to heuristic based clustering yielding a binary map. The binary image is post-processed to eliminate unwanted components. Finally, the component which possessed the minimum intensity is finalized as macula and its centre constitutes the fovea. The proposed approach outperforms existing works by reporting that 100%,of HRF, 100% of DRIVE, 96.92% of DIARETDB0, 97.75% of DIARETDB1, 98.81% of HEI-MED, 90% of STARE and 99.33% of MESSIDOR images satisfy the 1R criterion, a standard adopted for evaluating performance of macula and fovea identification. The proposed system thus helps the ophthalmologists in identifying the macula thereby facilitating to identify if any abnormality is present within the macula region. Copyright © 2018 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Wagstaff, Kiri L.
2012-03-01
On obtaining a new data set, the researcher is immediately faced with the challenge of obtaining a high-level understanding from the observations. What does a typical item look like? What are the dominant trends? How many distinct groups are included in the data set, and how is each one characterized? Which observable values are common, and which rarely occur? Which items stand out as anomalies or outliers from the rest of the data? This challenge is exacerbated by the steady growth in data set size [11] as new instruments push into new frontiers of parameter space, via improvements in temporal, spatial, and spectral resolution, or by the desire to "fuse" observations from different modalities and instruments into a larger-picture understanding of the same underlying phenomenon. Data clustering algorithms provide a variety of solutions for this task. They can generate summaries, locate outliers, compress data, identify dense or sparse regions of feature space, and build data models. It is useful to note up front that "clusters" in this context refer to groups of items within some descriptive feature space, not (necessarily) to "galaxy clusters" which are dense regions in physical space. The goal of this chapter is to survey a variety of data clustering methods, with an eye toward their applicability to astronomical data analysis. In addition to improving the individual researcher’s understanding of a given data set, clustering has led directly to scientific advances, such as the discovery of new subclasses of stars [14] and gamma-ray bursts (GRBs) [38]. All clustering algorithms seek to identify groups within a data set that reflect some observed, quantifiable structure. Clustering is traditionally an unsupervised approach to data analysis, in the sense that it operates without any direct guidance about which items should be assigned to which clusters. There has been a recent trend in the clustering literature toward supporting semisupervised or constrained clustering, in which some partial information about item assignments or other components of the resulting output are already known and must be accommodated by the solution. Some algorithms seek a partition of the data set into distinct clusters, while others build a hierarchy of nested clusters that can capture taxonomic relationships. Some produce a single optimal solution, while others construct a probabilistic model of cluster membership. More formally, clustering algorithms operate on a data set X composed of items represented by one or more features (dimensions). These could include physical location, such as right ascension and declination, as well as other properties such as brightness, color, temporal change, size, texture, and so on. Let D be the number of dimensions used to represent each item, xi ∈ RD. The clustering goal is to produce an organization P of the items in X that optimizes an objective function f : P -> R, which quantifies the quality of solution P. Often f is defined so as to maximize similarity within a cluster and minimize similarity between clusters. To that end, many algorithms make use of a measure d : X x X -> R of the distance between two items. A partitioning algorithm produces a set of clusters P = {c1, . . . , ck} such that the clusters are nonoverlapping (c_i intersected with c_j = empty set, i != j) subsets of the data set (Union_i c_i=X). Hierarchical algorithms produce a series of partitions P = {p1, . . . , pn }. For a complete hierarchy, the number of partitions n’= n, the number of items in the data set; the top partition is a single cluster containing all items, and the bottom partition contains n clusters, each containing a single item. For model-based clustering, each cluster c_j is represented by a model m_j , such as the cluster center or a Gaussian distribution. The wide array of available clustering algorithms may seem bewildering, and covering all of them is beyond the scope of this chapter. Choosing among them for a particular application involves considerations of the kind of data being analyzed, algorithm runtime efficiency, and how much prior knowledge is available about the problem domain, which can dictate the nature of clusters sought. Fundamentally, the clustering method and its representations of clusters carries with it a definition of what a cluster is, and it is important that this be aligned with the analysis goals for the problem at hand. In this chapter, I emphasize this point by identifying for each algorithm the cluster representation as a model, m_j , even for algorithms that are not typically thought of as creating a “model.” This chapter surveys a basic collection of clustering methods useful to any practitioner who is interested in applying clustering to a new data set. The algorithms include k-means (Section 25.2), EM (Section 25.3), agglomerative (Section 25.4), and spectral (Section 25.5) clustering, with side mentions of variants such as kernel k-means and divisive clustering. The chapter also discusses each algorithm’s strengths and limitations and provides pointers to additional in-depth reading for each subject. Section 25.6 discusses methods for incorporating domain knowledge into the clustering process. This chapter concludes with a brief survey of interesting applications of clustering methods to astronomy data (Section 25.7). The chapter begins with k-means because it is both generally accessible and so widely used that understanding it can be considered a necessary prerequisite for further work in the field. EM can be viewed as a more sophisticated version of k-means that uses a generative model for each cluster and probabilistic item assignments. Agglomerative clustering is the most basic form of hierarchical clustering and provides a basis for further exploration of algorithms in that vein. Spectral clustering permits a departure from feature-vector-based clustering and can operate on data sets instead represented as affinity, or similarity matrices—cases in which only pairwise information is known. The list of algorithms covered in this chapter is representative of those most commonly in use, but it is by no means comprehensive. There is an extensive collection of existing books on clustering that provide additional background and depth. Three early books that remain useful today are Anderberg’s Cluster Analysis for Applications [3], Hartigan’s Clustering Algorithms [25], and Gordon’s Classification [22]. The latter covers basics on similarity measures, partitioning and hierarchical algorithms, fuzzy clustering, overlapping clustering, conceptual clustering, validations methods, and visualization or data reduction techniques such as principal components analysis (PCA),multidimensional scaling, and self-organizing maps. More recently, Jain et al. provided a useful and informative survey [27] of a variety of different clustering algorithms, including those mentioned here as well as fuzzy, graph-theoretic, and evolutionary clustering. Everitt’s Cluster Analysis [19] provides a modern overview of algorithms, similarity measures, and evaluation methods.
Clustering-based energy-saving algorithm in ultra-dense network
NASA Astrophysics Data System (ADS)
Huang, Junwei; Zhou, Pengguang; Teng, Deyang; Zhang, Renchi; Xu, Hao
2017-06-01
In Ultra-dense Networks (UDN), dense deployment of low power small base stations will cause serious small cells interference and a large amount of energy consumption. The purpose of this paper is to explore the method of reducing small cells interference and energy saving system in UDN, and we innovatively propose a sleep-waking-active (SWA) scheme. The scheme decreases the user outage causing by failure to detect users’ service requests, shortens the opening time of active base stations directly switching to sleep mode; we further proposes a Vertex Surrounding Clustering(VSC) algorithm, which first colours the small cells with the most strongest interference and next extends to the adjacent small cells. VSC algorithm can use the least colour to stain the small cell, reduce the number of iterations and promote the efficiency of colouring. The simulation results show that SWA scheme can effectively improve the system Energy Efficiency (EE), the VSC algorithm can reduce the small cells interference and optimize the users’ Spectrum Efficiency (SE) and throughput.
A robust multilevel simultaneous eigenvalue solver
NASA Technical Reports Server (NTRS)
Costiner, Sorin; Taasan, Shlomo
1993-01-01
Multilevel (ML) algorithms for eigenvalue problems are often faced with several types of difficulties such as: the mixing of approximated eigenvectors by the solution process, the approximation of incomplete clusters of eigenvectors, the poor representation of solution on coarse levels, and the existence of close or equal eigenvalues. Algorithms that do not treat appropriately these difficulties usually fail, or their performance degrades when facing them. These issues motivated the development of a robust adaptive ML algorithm which treats these difficulties, for the calculation of a few eigenvectors and their corresponding eigenvalues. The main techniques used in the new algorithm include: the adaptive completion and separation of the relevant clusters on different levels, the simultaneous treatment of solutions within each cluster, and the robustness tests which monitor the algorithm's efficiency and convergence. The eigenvectors' separation efficiency is based on a new ML projection technique generalizing the Rayleigh Ritz projection, combined with a technique, the backrotations. These separation techniques, when combined with an FMG formulation, in many cases lead to algorithms of O(qN) complexity, for q eigenvectors of size N on the finest level. Previously developed ML algorithms are less focused on the mentioned difficulties. Moreover, algorithms which employ fine level separation techniques are of O(q(sub 2)N) complexity and usually do not overcome all these difficulties. Computational examples are presented where Schrodinger type eigenvalue problems in 2-D and 3-D, having equal and closely clustered eigenvalues, are solved with the efficiency of the Poisson multigrid solver. A second order approximation is obtained in O(qN) work, where the total computational work is equivalent to only a few fine level relaxations per eigenvector.
An Efficient Optimization Method for Solving Unsupervised Data Classification Problems.
Shabanzadeh, Parvaneh; Yusof, Rubiyah
2015-01-01
Unsupervised data classification (or clustering) analysis is one of the most useful tools and a descriptive task in data mining that seeks to classify homogeneous groups of objects based on similarity and is used in many medical disciplines and various applications. In general, there is no single algorithm that is suitable for all types of data, conditions, and applications. Each algorithm has its own advantages, limitations, and deficiencies. Hence, research for novel and effective approaches for unsupervised data classification is still active. In this paper a heuristic algorithm, Biogeography-Based Optimization (BBO) algorithm, was adapted for data clustering problems by modifying the main operators of BBO algorithm, which is inspired from the natural biogeography distribution of different species. Similar to other population-based algorithms, BBO algorithm starts with an initial population of candidate solutions to an optimization problem and an objective function that is calculated for them. To evaluate the performance of the proposed algorithm assessment was carried on six medical and real life datasets and was compared with eight well known and recent unsupervised data classification algorithms. Numerical results demonstrate that the proposed evolutionary optimization algorithm is efficient for unsupervised data classification.
Simplification of multiple Fourier series - An example of algorithmic approach
NASA Technical Reports Server (NTRS)
Ng, E. W.
1981-01-01
This paper describes one example of multiple Fourier series which originate from a problem of spectral analysis of time series data. The example is exercised here with an algorithmic approach which can be generalized for other series manipulation on a computer. The generalized approach is presently pursued towards applications to a variety of multiple series and towards a general purpose algorithm for computer algebra implementation.
The global Minmax k-means algorithm.
Wang, Xiaoyan; Bai, Yanping
2016-01-01
The global k -means algorithm is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure from suitable initial positions, and employs k -means to minimize the sum of the intra-cluster variances. However the global k -means algorithm sometimes results singleton clusters and the initial positions sometimes are bad, after a bad initialization, poor local optimal can be easily obtained by k -means algorithm. In this paper, we modified the global k -means algorithm to eliminate the singleton clusters at first, and then we apply MinMax k -means clustering error method to global k -means algorithm to overcome the effect of bad initialization, proposed the global Minmax k -means algorithm. The proposed clustering method is tested on some popular data sets and compared to the k -means algorithm, the global k -means algorithm and the MinMax k -means algorithm. The experiment results show our proposed algorithm outperforms other algorithms mentioned in the paper.
NASA Technical Reports Server (NTRS)
Socolovsky, Eduardo A.; Bushnell, Dennis M. (Technical Monitor)
2002-01-01
The cosine or correlation measures of similarity used to cluster high dimensional data are interpreted as projections, and the orthogonal components are used to define a complementary dissimilarity measure to form a similarity-dissimilarity measure pair. Using a geometrical approach, a number of properties of this pair is established. This approach is also extended to general inner-product spaces of any dimension. These properties include the triangle inequality for the defined dissimilarity measure, error estimates for the triangle inequality and bounds on both measures that can be obtained with a few floating-point operations from previously computed values of the measures. The bounds and error estimates for the similarity and dissimilarity measures can be used to reduce the computational complexity of clustering algorithms and enhance their scalability, and the triangle inequality allows the design of clustering algorithms for high dimensional distributed data.
Dipnall, Joanna F; Pasco, Julie A; Berk, Michael; Williams, Lana J; Dodd, Seetal; Jacka, Felice N; Meyer, Denny
2016-01-01
Depression is commonly comorbid with many other somatic diseases and symptoms. Identification of individuals in clusters with comorbid symptoms may reveal new pathophysiological mechanisms and treatment targets. The aim of this research was to combine machine-learning (ML) algorithms with traditional regression techniques by utilising self-reported medical symptoms to identify and describe clusters of individuals with increased rates of depression from a large cross-sectional community based population epidemiological study. A multi-staged methodology utilising ML and traditional statistical techniques was performed using the community based population National Health and Nutrition Examination Study (2009-2010) (N = 3,922). A Self-organised Mapping (SOM) ML algorithm, combined with hierarchical clustering, was performed to create participant clusters based on 68 medical symptoms. Binary logistic regression, controlling for sociodemographic confounders, was used to then identify the key clusters of participants with higher levels of depression (PHQ-9≥10, n = 377). Finally, a Multiple Additive Regression Tree boosted ML algorithm was run to identify the important medical symptoms for each key cluster within 17 broad categories: heart, liver, thyroid, respiratory, diabetes, arthritis, fractures and osteoporosis, skeletal pain, blood pressure, blood transfusion, cholesterol, vision, hearing, psoriasis, weight, bowels and urinary. Five clusters of participants, based on medical symptoms, were identified to have significantly increased rates of depression compared to the cluster with the lowest rate: odds ratios ranged from 2.24 (95% CI 1.56, 3.24) to 6.33 (95% CI 1.67, 24.02). The ML boosted regression algorithm identified three key medical condition categories as being significantly more common in these clusters: bowel, pain and urinary symptoms. Bowel-related symptoms was found to dominate the relative importance of symptoms within the five key clusters. This methodology shows promise for the identification of conditions in general populations and supports the current focus on the potential importance of bowel symptoms and the gut in mental health research.
The composite sequential clustering technique for analysis of multispectral scanner data
NASA Technical Reports Server (NTRS)
Su, M. Y.
1972-01-01
The clustering technique consists of two parts: (1) a sequential statistical clustering which is essentially a sequential variance analysis, and (2) a generalized K-means clustering. In this composite clustering technique, the output of (1) is a set of initial clusters which are input to (2) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by traditional supervised maximum likelihood classification techniques. The mathematical algorithms for the composite sequential clustering program and a detailed computer program description with job setup are given.
A Fast Implementation of the ISOCLUS Algorithm
NASA Technical Reports Server (NTRS)
Memarsadeghi, Nargess; Mount, David M.; Netanyahu, Nathan S.; LeMoigne, Jacqueline
2003-01-01
Unsupervised clustering is a fundamental building block in numerous image processing applications. One of the most popular and widely used clustering schemes for remote sensing applications is the ISOCLUS algorithm, which is based on the ISODATA method. The algorithm is given a set of n data points in d-dimensional space, an integer k indicating the initial number of clusters, and a number of additional parameters. The general goal is to compute the coordinates of a set of cluster centers in d-space, such that those centers minimize the mean squared distance from each data point to its nearest center. This clustering algorithm is similar to another well-known clustering method, called k-means. One significant feature of ISOCLUS over k-means is that the actual number of clusters reported might be fewer or more than the number supplied as part of the input. The algorithm uses different heuristics to determine whether to merge lor split clusters. As ISOCLUS can run very slowly, particularly on large data sets, there has been a growing .interest in the remote sensing community in computing it efficiently. We have developed a faster implementation of the ISOCLUS algorithm. Our improvement is based on a recent acceleration to the k-means algorithm of Kanungo, et al. They showed that, by using a kd-tree data structure for storing the data, it is possible to reduce the running time of k-means. We have adapted this method for the ISOCLUS algorithm, and we show that it is possible to achieve essentially the same results as ISOCLUS on large data sets, but with significantly lower running times. This adaptation involves computing a number of cluster statistics that are needed for ISOCLUS but not for k-means. Both the k-means and ISOCLUS algorithms are based on iterative schemes, in which nearest neighbors are calculated until some convergence criterion is satisfied. Each iteration requires that the nearest center for each data point be computed. Naively, this requires O(kn) time, where k denotes the current number of centers. Traditional techniques for accelerating nearest neighbor searching involve storing the k centers in a data structure. However, because of the iterative nature of the algorithm, this data structure would need to be rebuilt with each new iteration. Our approach is to store the data points in a kd-tree data structure. The assignment of points to nearest neighbors is carried out by a filtering process, which successively eliminates centers that can not possibly be the nearest neighbor for a given region of space. This algorithm is significantly faster, because large groups of data points can be assigned to their nearest center in a single operation. Preliminary results on a number of real Landsat datasets show that our revised ISOCLUS-like scheme runs about twice as fast.
Clustering PPI data by combining FA and SHC method.
Lei, Xiujuan; Ying, Chao; Wu, Fang-Xiang; Xu, Jin
2015-01-01
Clustering is one of main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless traditional clustering methods may not be effective for clustering PPI data. In this paper, we proposed a novel method for clustering PPI data by combining firefly algorithm (FA) and synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC) which transforms the high-dimensional similarity matrix into a low dimension matrix. Then the SHC algorithm is used to perform clustering. In SHC algorithm, hierarchical clustering is achieved by enlarging the neighborhood radius of synchronized objects continuously, while the hierarchical search is very difficult to find the optimal neighborhood radius of synchronization and the efficiency is not high. So we adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than the traditional algorithms in precision, recall and f-measure value.
Clustering PPI data by combining FA and SHC method
2015-01-01
Clustering is one of main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless traditional clustering methods may not be effective for clustering PPI data. In this paper, we proposed a novel method for clustering PPI data by combining firefly algorithm (FA) and synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC) which transforms the high-dimensional similarity matrix into a low dimension matrix. Then the SHC algorithm is used to perform clustering. In SHC algorithm, hierarchical clustering is achieved by enlarging the neighborhood radius of synchronized objects continuously, while the hierarchical search is very difficult to find the optimal neighborhood radius of synchronization and the efficiency is not high. So we adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than the traditional algorithms in precision, recall and f-measure value. PMID:25707632
Vigelius, Matthias; Meyer, Bernd
2012-01-01
For many biological applications, a macroscopic (deterministic) treatment of reaction-drift-diffusion systems is insufficient. Instead, one has to properly handle the stochastic nature of the problem and generate true sample paths of the underlying probability distribution. Unfortunately, stochastic algorithms are computationally expensive and, in most cases, the large number of participating particles renders the relevant parameter regimes inaccessible. In an attempt to address this problem we present a genuine stochastic, multi-dimensional algorithm that solves the inhomogeneous, non-linear, drift-diffusion problem on a mesoscopic level. Our method improves on existing implementations in being multi-dimensional and handling inhomogeneous drift and diffusion. The algorithm is well suited for an implementation on data-parallel hardware architectures such as general-purpose graphics processing units (GPUs). We integrate the method into an operator-splitting approach that decouples chemical reactions from the spatial evolution. We demonstrate the validity and applicability of our algorithm with a comprehensive suite of standard test problems that also serve to quantify the numerical accuracy of the method. We provide a freely available, fully functional GPU implementation. Integration into Inchman, a user-friendly web service, that allows researchers to perform parallel simulations of reaction-drift-diffusion systems on GPU clusters is underway. PMID:22506001
Class imbalance in unsupervised change detection - A diagnostic analysis from urban remote sensing
NASA Astrophysics Data System (ADS)
Leichtle, Tobias; Geiß, Christian; Lakes, Tobia; Taubenböck, Hannes
2017-08-01
Automatic monitoring of changes on the Earth's surface is an intrinsic capability and simultaneously a persistent methodological challenge in remote sensing, especially regarding imagery with very-high spatial resolution (VHR) and complex urban environments. In order to enable a high level of automatization, the change detection problem is solved in an unsupervised way to alleviate efforts associated with collection of properly encoded prior knowledge. In this context, this paper systematically investigates the nature and effects of class distribution and class imbalance in an unsupervised binary change detection application based on VHR imagery over urban areas. For this purpose, a diagnostic framework for sensitivity analysis of a large range of possible degrees of class imbalance is presented, which is of particular importance with respect to unsupervised approaches where the content of images and thus the occurrence and the distribution of classes are generally unknown a priori. Furthermore, this framework can serve as a general technique to evaluate model transferability in any two-class classification problem. The applied change detection approach is based on object-based difference features calculated from VHR imagery and subsequent unsupervised two-class clustering using k-means, genetic k-means and self-organizing map (SOM) clustering. The results from two test sites with different structural characteristics of the built environment demonstrated that classification performance is generally worse in imbalanced class distribution settings while best results were reached in balanced or close to balanced situations. Regarding suitable accuracy measures for evaluating model performance in imbalanced settings, this study revealed that the Kappa statistics show significant response to class distribution while the true skill statistic was widely insensitive to imbalanced classes. In general, the genetic k-means clustering algorithm achieved the most robust results with respect to class imbalance while the SOM clustering exhibited a distinct optimization towards a balanced distribution of classes.
Parallel Wavefront Analysis for a 4D Interferometer
NASA Technical Reports Server (NTRS)
Rao, Shanti R.
2011-01-01
This software provides a programming interface for automating data collection with a PhaseCam interferometer from 4D Technology, and distributing the image-processing algorithm across a cluster of general-purpose computers. Multiple instances of 4Sight (4D Technology s proprietary software) run on a networked cluster of computers. Each connects to a single server (the controller) and waits for instructions. The controller directs the interferometer to several images, then assigns each image to a different computer for processing. When the image processing is finished, the server directs one of the computers to collate and combine the processed images, saving the resulting measurement in a file on a disk. The available software captures approximately 100 images and analyzes them immediately. This software separates the capture and analysis processes, so that analysis can be done at a different time and faster by running the algorithm in parallel across several processors. The PhaseCam family of interferometers can measure an optical system in milliseconds, but it takes many seconds to process the data so that it is usable. In characterizing an adaptive optics system, like the next generation of astronomical observatories, thousands of measurements are required, and the processing time quickly becomes excessive. A programming interface distributes data processing for a PhaseCam interferometer across a Windows computing cluster. A scriptable controller program coordinates data acquisition from the interferometer, storage on networked hard disks, and parallel processing. Idle time of the interferometer is minimized. This architecture is implemented in Python and JavaScript, and may be altered to fit a customer s needs.
General purpose molecular dynamics simulations fully implemented on graphics processing units
NASA Astrophysics Data System (ADS)
Anderson, Joshua A.; Lorenz, Chris D.; Travesset, A.
2008-05-01
Graphics processing units (GPUs), originally developed for rendering real-time effects in computer games, now provide unprecedented computational power for scientific applications. In this paper, we develop a general purpose molecular dynamics code that runs entirely on a single GPU. It is shown that our GPU implementation provides a performance equivalent to that of fast 30 processor core distributed memory cluster. Our results show that GPUs already provide an inexpensive alternative to such clusters and discuss implications for the future.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lalonde, Michel, E-mail: mlalonde15@rogers.com; Wassenaar, Richard; Wells, R. Glenn
2014-07-15
Purpose: Phase analysis of single photon emission computed tomography (SPECT) radionuclide angiography (RNA) has been investigated for its potential to predict the outcome of cardiac resynchronization therapy (CRT). However, phase analysis may be limited in its potential at predicting CRT outcome as valuable information may be lost by assuming that time-activity curves (TAC) follow a simple sinusoidal shape. A new method, cluster analysis, is proposed which directly evaluates the TACs and may lead to a better understanding of dyssynchrony patterns and CRT outcome. Cluster analysis algorithms were developed and optimized to maximize their ability to predict CRT response. Methods: Aboutmore » 49 patients (N = 27 ischemic etiology) received a SPECT RNA scan as well as positron emission tomography (PET) perfusion and viability scans prior to undergoing CRT. A semiautomated algorithm sampled the left ventricle wall to produce 568 TACs from SPECT RNA data. The TACs were then subjected to two different cluster analysis techniques, K-means, and normal average, where several input metrics were also varied to determine the optimal settings for the prediction of CRT outcome. Each TAC was assigned to a cluster group based on the comparison criteria and global and segmental cluster size and scores were used as measures of dyssynchrony and used to predict response to CRT. A repeated random twofold cross-validation technique was used to train and validate the cluster algorithm. Receiver operating characteristic (ROC) analysis was used to calculate the area under the curve (AUC) and compare results to those obtained for SPECT RNA phase analysis and PET scar size analysis methods. Results: Using the normal average cluster analysis approach, the septal wall produced statistically significant results for predicting CRT results in the ischemic population (ROC AUC = 0.73;p < 0.05 vs. equal chance ROC AUC = 0.50) with an optimal operating point of 71% sensitivity and 60% specificity. Cluster analysis results were similar to SPECT RNA phase analysis (ROC AUC = 0.78, p = 0.73 vs cluster AUC; sensitivity/specificity = 59%/89%) and PET scar size analysis (ROC AUC = 0.73, p = 1.0 vs cluster AUC; sensitivity/specificity = 76%/67%). Conclusions: A SPECT RNA cluster analysis algorithm was developed for the prediction of CRT outcome. Cluster analysis results produced results equivalent to those obtained from Fourier and scar analysis.« less
NASA Astrophysics Data System (ADS)
Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos
2011-01-01
General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.
Agounad, Said; Aassif, El Houcein; Khandouch, Younes; Maze, Gérard; Décultot, Dominique
2018-02-01
The acoustic scattering of a plane wave by an elastic cylindrical shell is studied. A new approach is developed to predict the form function of an immersed cylindrical shell of the radius ratio b/a ('b' is the inner radius and 'a' is the outer radius). The prediction of the backscattered form function is investigated by a combined approach between fuzzy clustering algorithms and bio-inspired algorithms. Four famous fuzzy clustering algorithms: the fuzzy c-means (FCM), the Gustafson-Kessel algorithm (GK), the fuzzy c-regression model (FCRM) and the Gath-Geva algorithm (GG) are combined with particle swarm optimization and genetic algorithm. The symmetric and antisymmetric circumferential waves A, S 0 , A 1 , S 1 and S 2 are investigated in a reduced frequency (k 1 a) range extends over 0.1
A formal concept analysis approach to consensus clustering of multi-experiment expression data
2014-01-01
Background Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. An approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution which increases the confidence in the common features of all the datasets and reveals the important differences among them. Results We propose a novel generic consensus clustering technique that applies Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from multi-experiment study examining the global cell-cycle control of fission yeast. The FCA results derived from both methods demonstrate that, although both algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals. Conclusions The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce good quality clustering solution that is representative for the whole set of expression matrices. PMID:24885407
Information Clustering Based on Fuzzy Multisets.
ERIC Educational Resources Information Center
Miyamoto, Sadaaki
2003-01-01
Proposes a fuzzy multiset model for information clustering with application to information retrieval on the World Wide Web. Highlights include search engines; term clustering; document clustering; algorithms for calculating cluster centers; theoretical properties concerning clustering algorithms; and examples to show how the algorithms work.…
Vision based obstacle detection and grouping for helicopter guidance
NASA Technical Reports Server (NTRS)
Sridhar, Banavar; Chatterji, Gano
1993-01-01
Electro-optical sensors can be used to compute range to objects in the flight path of a helicopter. The computation is based on the optical flow/motion at different points in the image. The motion algorithms provide a sparse set of ranges to discrete features in the image sequence as a function of azimuth and elevation. For obstacle avoidance guidance and display purposes, these discrete set of ranges, varying from a few hundreds to several thousands, need to be grouped into sets which correspond to objects in the real world. This paper presents a new method for object segmentation based on clustering the sparse range information provided by motion algorithms together with the spatial relation provided by the static image. The range values are initially grouped into clusters based on depth. Subsequently, the clusters are modified by using the K-means algorithm in the inertial horizontal plane and the minimum spanning tree algorithms in the image plane. The object grouping allows interpolation within a group and enables the creation of dense range maps. Researchers in robotics have used densely scanned sequence of laser range images to build three-dimensional representation of the outside world. Thus, modeling techniques developed for dense range images can be extended to sparse range images. The paper presents object segmentation results for a sequence of flight images.
An improved clustering algorithm based on reverse learning in intelligent transportation
NASA Astrophysics Data System (ADS)
Qiu, Guoqing; Kou, Qianqian; Niu, Ting
2017-05-01
With the development of artificial intelligence and data mining technology, big data has gradually entered people's field of vision. In the process of dealing with large data, clustering is an important processing method. By introducing the reverse learning method in the clustering process of PAM clustering algorithm, to further improve the limitations of one-time clustering in unsupervised clustering learning, and increase the diversity of clustering clusters, so as to improve the quality of clustering. The algorithm analysis and experimental results show that the algorithm is feasible.
A roadmap of clustering algorithms: finding a match for a biomedical application.
Andreopoulos, Bill; An, Aijun; Wang, Xiaogang; Schroeder, Michael
2009-05-01
Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.
An image-guided tool to prevent hospital acquired infections
NASA Astrophysics Data System (ADS)
Nagy, Melinda; Szilágyi, László; Lehotsky, Ákos; Haidegger, Tamás; Benyó, Balázs
2011-03-01
Hospital Acquired Infections (HAI) represent the fourth leading cause of death in the United States, and claims hundreds of thousands of lives annually in the rest of the world. This paper presents a novel low-cost mobile device|called Stery-Hand|that helps to avoid HAI by improving hand hygiene control through providing an objective evaluation of the quality of hand washing. The use of the system is intuitive: having performed hand washing with a soap mixed with UV re ective powder, the skin appears brighter in UV illumination on the disinfected surfaces. Washed hands are inserted into the Stery-Hand box, where a digital image is taken under UV lighting. Automated image processing algorithms are employed in three steps to evaluate the quality of hand washing. First, the contour of the hand is extracted in order to distinguish the hand from the background. Next, a semi-supervised clustering algorithm classies the pixels of the hand into three groups, corresponding to clean, partially clean and dirty areas. The clustering algorithm is derived from the histogram-based quick fuzzy c-means approach, using a priori information extracted from reference images, evaluated by experts. Finally, the identied areas are adjusted to suppress shading eects, and quantied in order to give a verdict on hand disinfection quality. The proposed methodology was validated through tests using hundreds of images recorded in our laboratory. The proposed system was found robust and accurate, producing correct estimation for over 98% of the test cases. Stery-Hand may be employed in general practice, and it may also serve educational purposes.
Efficient clustering aggregation based on data fragments.
Wu, Ou; Hu, Weiming; Maybank, Stephen J; Zhu, Mingliang; Li, Bing
2012-06-01
Clustering aggregation, known as clustering ensembles, has emerged as a powerful technique for combining different clustering results to obtain a single better clustering. Existing clustering aggregation algorithms are applied directly to data points, in what is referred to as the point-based approach. The algorithms are inefficient if the number of data points is large. We define an efficient approach for clustering aggregation based on data fragments. In this fragment-based approach, a data fragment is any subset of the data that is not split by any of the clustering results. To establish the theoretical bases of the proposed approach, we prove that clustering aggregation can be performed directly on data fragments under two widely used goodness measures for clustering aggregation taken from the literature. Three new clustering aggregation algorithms are described. The experimental results obtained using several public data sets show that the new algorithms have lower computational complexity than three well-known existing point-based clustering aggregation algorithms (Agglomerative, Furthest, and LocalSearch); nevertheless, the new algorithms do not sacrifice the accuracy.
Improvement and speed optimization of numerical tsunami modelling program using OpenMP technology
NASA Astrophysics Data System (ADS)
Chernov, A.; Zaytsev, A.; Yalciner, A.; Kurkin, A.
2009-04-01
Currently, the basic problem of tsunami modeling is low speed of calculations which is unacceptable for services of the operative notification. Existing algorithms of numerical modeling of hydrodynamic processes of tsunami waves are developed without taking the opportunities of modern computer facilities. There is an opportunity to have considerable acceleration of process of calculations by using parallel algorithms. We discuss here new approach to parallelization tsunami modeling code using OpenMP Technology (for multiprocessing systems with the general memory). Nowadays, multiprocessing systems are easily accessible for everyone. The cost of the use of such systems becomes much lower comparing to the costs of clusters. This opportunity also benefits all programmers to apply multithreading algorithms on desktop computers of researchers. Other important advantage of the given approach is the mechanism of the general memory - there is no necessity to send data on slow networks (for example Ethernet). All memory is the common for all computing processes; it causes almost linear scalability of the program and processes. In the new version of NAMI DANCE using OpenMP technology and multi-threading algorithm provide 80% gain in speed in comparison with the one-thread version for dual-processor unit. The speed increased and 320% gain was attained for four core processor unit of PCs. Thus, it was possible to reduce considerably time of performance of calculations on the scientific workstations (desktops) without complete change of the program and user interfaces. The further modernization of algorithms of preparation of initial data and processing of results using OpenMP looks reasonable. The final version of NAMI DANCE with the increased computational speed can be used not only for research purposes but also in real time Tsunami Warning Systems.
MINE: Module Identification in Networks
2011-01-01
Background Graphical models of network associations are useful for both visualizing and integrating multiple types of association data. Identifying modules, or groups of functionally related gene products, is an important challenge in analyzing biological networks. However, existing tools to identify modules are insufficient when applied to dense networks of experimentally derived interaction data. To address this problem, we have developed an agglomerative clustering method that is able to identify highly modular sets of gene products within highly interconnected molecular interaction networks. Results MINE outperforms MCODE, CFinder, NEMO, SPICi, and MCL in identifying non-exclusive, high modularity clusters when applied to the C. elegans protein-protein interaction network. The algorithm generally achieves superior geometric accuracy and modularity for annotated functional categories. In comparison with the most closely related algorithm, MCODE, the top clusters identified by MINE are consistently of higher density and MINE is less likely to designate overlapping modules as a single unit. MINE offers a high level of granularity with a small number of adjustable parameters, enabling users to fine-tune cluster results for input networks with differing topological properties. Conclusions MINE was created in response to the challenge of discovering high quality modules of gene products within highly interconnected biological networks. The algorithm allows a high degree of flexibility and user-customisation of results with few adjustable parameters. MINE outperforms several popular clustering algorithms in identifying modules with high modularity and obtains good overall recall and precision of functional annotations in protein-protein interaction networks from both S. cerevisiae and C. elegans. PMID:21605434
BiCluE - Exact and heuristic algorithms for weighted bi-cluster editing of biomedical data
2013-01-01
Background The explosion of biological data has dramatically reformed today's biology research. The biggest challenge to biologists and bioinformaticians is the integration and analysis of large quantity of data to provide meaningful insights. One major problem is the combined analysis of data from different types. Bi-cluster editing, as a special case of clustering, which partitions two different types of data simultaneously, might be used for several biomedical scenarios. However, the underlying algorithmic problem is NP-hard. Results Here we contribute with BiCluE, a software package designed to solve the weighted bi-cluster editing problem. It implements (1) an exact algorithm based on fixed-parameter tractability and (2) a polynomial-time greedy heuristics based on solving the hardest part, edge deletions, first. We evaluated its performance on artificial graphs. Afterwards we exemplarily applied our implementation on real world biomedical data, GWAS data in this case. BiCluE generally works on any kind of data types that can be modeled as (weighted or unweighted) bipartite graphs. Conclusions To our knowledge, this is the first software package solving the weighted bi-cluster editing problem. BiCluE as well as the supplementary results are available online at http://biclue.mpi-inf.mpg.de. PMID:24565035
A clustering method of Chinese medicine prescriptions based on modified firefly algorithm.
Yuan, Feng; Liu, Hong; Chen, Shou-Qiang; Xu, Liang
2016-12-01
This paper is aimed to study the clustering method for Chinese medicine (CM) medical cases. The traditional K-means clustering algorithm had shortcomings such as dependence of results on the selection of initial value, trapping in local optimum when processing prescriptions form CM medical cases. Therefore, a new clustering method based on the collaboration of firefly algorithm and simulated annealing algorithm was proposed. This algorithm dynamically determined the iteration of firefly algorithm and simulates sampling of annealing algorithm by fitness changes, and increased the diversity of swarm through expansion of the scope of the sudden jump, thereby effectively avoiding premature problem. The results from confirmatory experiments for CM medical cases suggested that, comparing with traditional K-means clustering algorithms, this method was greatly improved in the individual diversity and the obtained clustering results, the computing results from this method had a certain reference value for cluster analysis on CM prescriptions.
ClusterViz: A Cytoscape APP for Cluster Analysis of Biological Network.
Wang, Jianxin; Zhong, Jiancheng; Chen, Gang; Li, Min; Wu, Fang-xiang; Pan, Yi
2015-01-01
Cluster analysis of biological networks is one of the most important approaches for identifying functional modules and predicting protein functions. Furthermore, visualization of clustering results is crucial to uncover the structure of biological networks. In this paper, ClusterViz, an APP of Cytoscape 3 for cluster analysis and visualization, has been developed. In order to reduce complexity and enable extendibility for ClusterViz, we designed the architecture of ClusterViz based on the framework of Open Services Gateway Initiative. According to the architecture, the implementation of ClusterViz is partitioned into three modules including interface of ClusterViz, clustering algorithms and visualization and export. ClusterViz fascinates the comparison of the results of different algorithms to do further related analysis. Three commonly used clustering algorithms, FAG-EC, EAGLE and MCODE, are included in the current version. Due to adopting the abstract interface of algorithms in module of the clustering algorithms, more clustering algorithms can be included for the future use. To illustrate usability of ClusterViz, we provided three examples with detailed steps from the important scientific articles, which show that our tool has helped several research teams do their research work on the mechanism of the biological networks.
Jothi, R; Mohanty, Sraban Kumar; Ojha, Aparajita
2016-04-01
Gene expression data clustering is an important biological process in DNA microarray analysis. Although there have been many clustering algorithms for gene expression analysis, finding a suitable and effective clustering algorithm is always a challenging problem due to the heterogeneous nature of gene profiles. Minimum Spanning Tree (MST) based clustering algorithms have been successfully employed to detect clusters of varying shapes and sizes. This paper proposes a novel clustering algorithm using Eigenanalysis on Minimum Spanning Tree based neighborhood graph (E-MST). As MST of a set of points reflects the similarity of the points with their neighborhood, the proposed algorithm employs a similarity graph obtained from k(') rounds of MST (k(')-MST neighborhood graph). By studying the spectral properties of the similarity matrix obtained from k(')-MST graph, the proposed algorithm achieves improved clustering results. We demonstrate the efficacy of the proposed algorithm on 12 gene expression datasets. Experimental results show that the proposed algorithm performs better than the standard clustering algorithms. Copyright © 2016 Elsevier Ltd. All rights reserved.
Sun, Liping; Luo, Yonglong; Ding, Xintao; Zhang, Ji
2014-01-01
An important component of a spatial clustering algorithm is the distance measure between sample points in object space. In this paper, the traditional Euclidean distance measure is replaced with innovative obstacle distance measure for spatial clustering under obstacle constraints. Firstly, we present a path searching algorithm to approximate the obstacle distance between two points for dealing with obstacles and facilitators. Taking obstacle distance as similarity metric, we subsequently propose the artificial immune clustering with obstacle entity (AICOE) algorithm for clustering spatial point data in the presence of obstacles and facilitators. Finally, the paper presents a comparative analysis of AICOE algorithm and the classical clustering algorithms. Our clustering model based on artificial immune system is also applied to the case of public facility location problem in order to establish the practical applicability of our approach. By using the clone selection principle and updating the cluster centers based on the elite antibodies, the AICOE algorithm is able to achieve the global optimum and better clustering effect.
Fast Constrained Spectral Clustering and Cluster Ensemble with Random Projection
Liu, Wenfen
2017-01-01
Constrained spectral clustering (CSC) method can greatly improve the clustering accuracy with the incorporation of constraint information into spectral clustering and thus has been paid academic attention widely. In this paper, we propose a fast CSC algorithm via encoding landmark-based graph construction into a new CSC model and applying random sampling to decrease the data size after spectral embedding. Compared with the original model, the new algorithm has the similar results with the increase of its model size asymptotically; compared with the most efficient CSC algorithm known, the new algorithm runs faster and has a wider range of suitable data sets. Meanwhile, a scalable semisupervised cluster ensemble algorithm is also proposed via the combination of our fast CSC algorithm and dimensionality reduction with random projection in the process of spectral ensemble clustering. We demonstrate by presenting theoretical analysis and empirical results that the new cluster ensemble algorithm has advantages in terms of efficiency and effectiveness. Furthermore, the approximate preservation of random projection in clustering accuracy proved in the stage of consensus clustering is also suitable for the weighted k-means clustering and thus gives the theoretical guarantee to this special kind of k-means clustering where each point has its corresponding weight. PMID:29312447
Algorithm of reducing the false positives in IDS based on correlation Analysis
NASA Astrophysics Data System (ADS)
Liu, Jianyi; Li, Sida; Zhang, Ru
2018-03-01
This paper proposes an algorithm of reducing the false positives in IDS based on correlation Analysis. Firstly, the algorithm analyzes the distinguishing characteristics of false positives and real alarms, and preliminary screen the false positives; then use the method of attribute similarity clustering to the alarms and further reduces the amount of alarms; finally, according to the characteristics of multi-step attack, associated it by the causal relationship. The paper also proposed a reverse causation algorithm based on the attack association method proposed by the predecessors, turning alarm information into a complete attack path. Experiments show that the algorithm simplifies the number of alarms, improve the efficiency of alarm processing, and contribute to attack purposes identification and alarm accuracy improvement.
Techniques to derive geometries for image-based Eulerian computations
Dillard, Seth; Buchholz, James; Vigmostad, Sarah; Kim, Hyunggun; Udaykumar, H.S.
2014-01-01
Purpose The performance of three frequently used level set-based segmentation methods is examined for the purpose of defining features and boundary conditions for image-based Eulerian fluid and solid mechanics models. The focus of the evaluation is to identify an approach that produces the best geometric representation from a computational fluid/solid modeling point of view. In particular, extraction of geometries from a wide variety of imaging modalities and noise intensities, to supply to an immersed boundary approach, is targeted. Design/methodology/approach Two- and three-dimensional images, acquired from optical, X-ray CT, and ultrasound imaging modalities, are segmented with active contours, k-means, and adaptive clustering methods. Segmentation contours are converted to level sets and smoothed as necessary for use in fluid/solid simulations. Results produced by the three approaches are compared visually and with contrast ratio, signal-to-noise ratio, and contrast-to-noise ratio measures. Findings While the active contours method possesses built-in smoothing and regularization and produces continuous contours, the clustering methods (k-means and adaptive clustering) produce discrete (pixelated) contours that require smoothing using speckle-reducing anisotropic diffusion (SRAD). Thus, for images with high contrast and low to moderate noise, active contours are generally preferable. However, adaptive clustering is found to be far superior to the other two methods for images possessing high levels of noise and global intensity variations, due to its more sophisticated use of local pixel/voxel intensity statistics. Originality/value It is often difficult to know a priori which segmentation will perform best for a given image type, particularly when geometric modeling is the ultimate goal. This work offers insight to the algorithm selection process, as well as outlining a practical framework for generating useful geometric surfaces in an Eulerian setting. PMID:25750470
Large-scale seismic signal analysis with Hadoop
Addair, T. G.; Dodge, D. A.; Walter, W. R.; ...
2014-02-11
In seismology, waveform cross correlation has been used for years to produce high-precision hypocenter locations and for sensitive detectors. Because correlated seismograms generally are found only at small hypocenter separation distances, correlation detectors have historically been reserved for spotlight purposes. However, many regions have been found to produce large numbers of correlated seismograms, and there is growing interest in building next-generation pipelines that employ correlation as a core part of their operation. In an effort to better understand the distribution and behavior of correlated seismic events, we have cross correlated a global dataset consisting of over 300 million seismograms. Thismore » was done using a conventional distributed cluster, and required 42 days. In anticipation of processing much larger datasets, we have re-architected the system to run as a series of MapReduce jobs on a Hadoop cluster. In doing so we achieved a factor of 19 performance increase on a test dataset. We found that fundamental algorithmic transformations were required to achieve the maximum performance increase. Whereas in the original IO-bound implementation, we went to great lengths to minimize IO, in the Hadoop implementation where IO is cheap, we were able to greatly increase the parallelism of our algorithms by performing a tiered series of very fine-grained (highly parallelizable) transformations on the data. Each of these MapReduce jobs required reading and writing large amounts of data.« less
Large-scale seismic signal analysis with Hadoop
DOE Office of Scientific and Technical Information (OSTI.GOV)
Addair, T. G.; Dodge, D. A.; Walter, W. R.
In seismology, waveform cross correlation has been used for years to produce high-precision hypocenter locations and for sensitive detectors. Because correlated seismograms generally are found only at small hypocenter separation distances, correlation detectors have historically been reserved for spotlight purposes. However, many regions have been found to produce large numbers of correlated seismograms, and there is growing interest in building next-generation pipelines that employ correlation as a core part of their operation. In an effort to better understand the distribution and behavior of correlated seismic events, we have cross correlated a global dataset consisting of over 300 million seismograms. Thismore » was done using a conventional distributed cluster, and required 42 days. In anticipation of processing much larger datasets, we have re-architected the system to run as a series of MapReduce jobs on a Hadoop cluster. In doing so we achieved a factor of 19 performance increase on a test dataset. We found that fundamental algorithmic transformations were required to achieve the maximum performance increase. Whereas in the original IO-bound implementation, we went to great lengths to minimize IO, in the Hadoop implementation where IO is cheap, we were able to greatly increase the parallelism of our algorithms by performing a tiered series of very fine-grained (highly parallelizable) transformations on the data. Each of these MapReduce jobs required reading and writing large amounts of data.« less
Karayiannis, N B
2000-01-01
This paper presents the development and investigates the properties of ordered weighted learning vector quantization (LVQ) and clustering algorithms. These algorithms are developed by using gradient descent to minimize reformulation functions based on aggregation operators. An axiomatic approach provides conditions for selecting aggregation operators that lead to admissible reformulation functions. Minimization of admissible reformulation functions based on ordered weighted aggregation operators produces a family of soft LVQ and clustering algorithms, which includes fuzzy LVQ and clustering algorithms as special cases. The proposed LVQ and clustering algorithms are used to perform segmentation of magnetic resonance (MR) images of the brain. The diagnostic value of the segmented MR images provides the basis for evaluating a variety of ordered weighted LVQ and clustering algorithms.
A General Event Location Algorithm with Applications to Eclipse and Station Line-of-Sight
NASA Technical Reports Server (NTRS)
Parker, Joel J. K.; Hughes, Steven P.
2011-01-01
A general-purpose algorithm for the detection and location of orbital events is developed. The proposed algorithm reduces the problem to a global root-finding problem by mapping events of interest (such as eclipses, station access events, etc.) to continuous, differentiable event functions. A stepping algorithm and a bracketing algorithm are used to detect and locate the roots. Examples of event functions and the stepping/bracketing algorithms are discussed, along with results indicating performance and accuracy in comparison to commercial tools across a variety of trajectories.
A General Event Location Algorithm with Applications to Eclispe and Station Line-of-Sight
NASA Technical Reports Server (NTRS)
Parker, Joel J. K.; Hughes, Steven P.
2011-01-01
A general-purpose algorithm for the detection and location of orbital events is developed. The proposed algorithm reduces the problem to a global root-finding problem by mapping events of interest (such as eclipses, station access events, etc.) to continuous, differentiable event functions. A stepping algorithm and a bracketing algorithm are used to detect and locate the roots. Examples of event functions and the stepping/bracketing algorithms are discussed, along with results indicating performance and accuracy in comparison to commercial tools across a variety of trajectories.
Hierarchical Dirichlet process model for gene expression clustering
2013-01-01
Clustering is an important data processing tool for interpreting microarray data and genomic network inference. In this article, we propose a clustering algorithm based on the hierarchical Dirichlet processes (HDP). The HDP clustering introduces a hierarchical structure in the statistical model which captures the hierarchical features prevalent in biological data such as the gene express data. We develop a Gibbs sampling algorithm based on the Chinese restaurant metaphor for the HDP clustering. We apply the proposed HDP algorithm to both regulatory network segmentation and gene expression clustering. The HDP algorithm is shown to outperform several popular clustering algorithms by revealing the underlying hierarchical structure of the data. For the yeast cell cycle data, we compare the HDP result to the standard result and show that the HDP algorithm provides more information and reduces the unnecessary clustering fragments. PMID:23587447
Canonical PSO Based K-Means Clustering Approach for Real Datasets.
Dey, Lopamudra; Chakraborty, Sanjay
2014-01-01
"Clustering" the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms.
Topology control algorithm for wireless sensor networks based on Link forwarding
NASA Astrophysics Data System (ADS)
Pucuo, Cairen; Qi, Ai-qin
2018-03-01
The research of topology control could effectively save energy and increase the service life of network based on wireless sensor. In this paper, a arithmetic called LTHC (link transmit hybrid clustering) based on link transmit is proposed. It decreases expenditure of energy by changing the way of cluster-node’s communication. The idea is to establish a link between cluster and SINK node when the cluster is formed, and link-node must be non-cluster. Through the link, cluster sends information to SINK nodes. For the sake of achieving the uniform distribution of energy on the network, prolongate the network survival time, and improve the purpose of communication, the communication will cut down much more expenditure of energy for cluster which away from SINK node. In the two aspects of improving the traffic and network survival time, we find that the LTCH is far superior to the traditional LEACH by experiments.
A hybrid monkey search algorithm for clustering analysis.
Chen, Xin; Zhou, Yongquan; Luo, Qifang
2014-01-01
Clustering is a popular data analysis and data mining technique. The k-means clustering algorithm is one of the most commonly used methods. However, it highly depends on the initial solution and is easy to fall into local optimum solution. In view of the disadvantages of the k-means method, this paper proposed a hybrid monkey algorithm based on search operator of artificial bee colony algorithm for clustering analysis and experiment on synthetic and real life datasets to show that the algorithm has a good performance than that of the basic monkey algorithm for clustering analysis.
Dynamic PROOF clusters with PoD: architecture and user experience
NASA Astrophysics Data System (ADS)
Manafov, Anar
2011-12-01
PROOF on Demand (PoD) is a tool-set, which sets up a PROOF cluster on any resource management system. PoD is a user oriented product with an easy to use GUI and a command-line interface. It is fully automated. No administrative privileges or special knowledge is required to use it. PoD utilizes a plug-in system, to use different job submission front-ends. The current PoD distribution is shipped with LSF, Torque (PBS), Grid Engine, Condor, gLite, and SSH plug-ins. The product is to be extended. We therefore plan to implement a plug-in for AliEn Grid as well. Recently developed algorithms made it possible to efficiently maintain two types of connections: packet-forwarding and native PROOF connections. This helps to properly handle most kinds of workers, with and without firewalls. PoD maintains the PROOF environment automatically and, for example, prevents resource misusage in case when workers idle for too long. As PoD matures as a product and provides more plug-ins, it's used as a standard for setting up dynamic PROOF clusters in many different institutions. The GSI Analysis Facility (GSIAF) is in production since 2007. The static PROOF cluster has been phased out end of 2009. GSIAF is now completely based on PoD. Users create private dynamic PROOF clusters on the general purpose batch farm. This provides an easier resource sharing between interactive local batch and Grid usage. The main user communities are FAIR and ALICE.
Image processing via VLSI: A concept paper
NASA Technical Reports Server (NTRS)
Nathan, R.
1982-01-01
Implementing specific image processing algorithms via very large scale integrated systems offers a potent solution to the problem of handling high data rates. Two algorithms stand out as being particularly critical -- geometric map transformation and filtering or correlation. These two functions form the basis for data calibration, registration and mosaicking. VLSI presents itself as an inexpensive ancillary function to be added to almost any general purpose computer and if the geometry and filter algorithms are implemented in VLSI, the processing rate bottleneck would be significantly relieved. A set of image processing functions that limit present systems to deal with future throughput needs, translates these functions to algorithms, implements via VLSI technology and interfaces the hardware to a general purpose digital computer is developed.
Java implementation of Class Association Rule algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tamura, Makio
2007-08-30
Java implementation of three Class Association Rule mining algorithms, NETCAR, CARapriori, and clustering based rule mining. NETCAR algorithm is a novel algorithm developed by Makio Tamura. The algorithm is discussed in a paper: UCRL-JRNL-232466-DRAFT, and would be published in a peer review scientific journal. The software is used to extract combinations of genes relevant with a phenotype from a phylogenetic profile and a phenotype profile. The phylogenetic profiles is represented by a binary matrix and a phenotype profile is represented by a binary vector. The present application of this software will be in genome analysis, however, it could be appliedmore » more generally.« less
Canonical PSO Based K-Means Clustering Approach for Real Datasets
Dey, Lopamudra; Chakraborty, Sanjay
2014-01-01
“Clustering” the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms. PMID:27355083
NASA Astrophysics Data System (ADS)
Tamiminia, Haifa; Homayouni, Saeid; McNairn, Heather; Safari, Abdoreza
2017-06-01
Polarimetric Synthetic Aperture Radar (PolSAR) data, thanks to their specific characteristics such as high resolution, weather and daylight independence, have become a valuable source of information for environment monitoring and management. The discrimination capability of observations acquired by these sensors can be used for land cover classification and mapping. The aim of this paper is to propose an optimized kernel-based C-means clustering algorithm for agriculture crop mapping from multi-temporal PolSAR data. Firstly, several polarimetric features are extracted from preprocessed data. These features are linear polarization intensities, and several statistical and physical based decompositions such as Cloude-Pottier, Freeman-Durden and Yamaguchi techniques. Then, the kernelized version of hard and fuzzy C-means clustering algorithms are applied to these polarimetric features in order to identify crop types. The kernel function, unlike the conventional partitioning clustering algorithms, simplifies the non-spherical and non-linearly patterns of data structure, to be clustered easily. In addition, in order to enhance the results, Particle Swarm Optimization (PSO) algorithm is used to tune the kernel parameters, cluster centers and to optimize features selection. The efficiency of this method was evaluated by using multi-temporal UAVSAR L-band images acquired over an agricultural area near Winnipeg, Manitoba, Canada, during June and July in 2012. The results demonstrate more accurate crop maps using the proposed method when compared to the classical approaches, (e.g. 12% improvement in general). In addition, when the optimization technique is used, greater improvement is observed in crop classification, e.g. 5% in overall. Furthermore, a strong relationship between Freeman-Durden volume scattering component, which is related to canopy structure, and phenological growth stages is observed.
A method of operation scheduling based on video transcoding for cluster equipment
NASA Astrophysics Data System (ADS)
Zhou, Haojie; Yan, Chun
2018-04-01
Because of the cluster technology in real-time video transcoding device, the application of facing the massive growth in the number of video assignments and resolution and bit rate of diversity, task scheduling algorithm, and analyze the current mainstream of cluster for real-time video transcoding equipment characteristics of the cluster, combination with the characteristics of the cluster equipment task delay scheduling algorithm is proposed. This algorithm enables the cluster to get better performance in the generation of the job queue and the lower part of the job queue when receiving the operation instruction. In the end, a small real-time video transcode cluster is constructed to analyze the calculation ability, running time, resource occupation and other aspects of various algorithms in operation scheduling. The experimental results show that compared with traditional clustering task scheduling algorithm, task delay scheduling algorithm has more flexible and efficient characteristics.
[Cluster analysis in biomedical researches].
Akopov, A S; Moskovtsev, A A; Dolenko, S A; Savina, G D
2013-01-01
Cluster analysis is one of the most popular methods for the analysis of multi-parameter data. The cluster analysis reveals the internal structure of the data, group the separate observations on the degree of their similarity. The review provides a definition of the basic concepts of cluster analysis, and discusses the most popular clustering algorithms: k-means, hierarchical algorithms, Kohonen networks algorithms. Examples are the use of these algorithms in biomedical research.
Clustering analysis of moving target signatures
NASA Astrophysics Data System (ADS)
Martone, Anthony; Ranney, Kenneth; Innocenti, Roberto
2010-04-01
Previously, we developed a moving target indication (MTI) processing approach to detect and track slow-moving targets inside buildings, which successfully detected moving targets (MTs) from data collected by a low-frequency, ultra-wideband radar. Our MTI algorithms include change detection, automatic target detection (ATD), clustering, and tracking. The MTI algorithms can be implemented in a real-time or near-real-time system; however, a person-in-the-loop is needed to select input parameters for the clustering algorithm. Specifically, the number of clusters to input into the cluster algorithm is unknown and requires manual selection. A critical need exists to automate all aspects of the MTI processing formulation. In this paper, we investigate two techniques that automatically determine the number of clusters: the adaptive knee-point (KP) algorithm and the recursive pixel finding (RPF) algorithm. The KP algorithm is based on a well-known heuristic approach for determining the number of clusters. The RPF algorithm is analogous to the image processing, pixel labeling procedure. Both algorithms are used to analyze the false alarm and detection rates of three operational scenarios of personnel walking inside wood and cinderblock buildings.
Application of hybrid clustering using parallel k-means algorithm and DIANA algorithm
NASA Astrophysics Data System (ADS)
Umam, Khoirul; Bustamam, Alhadi; Lestari, Dian
2017-03-01
DNA is one of the carrier of genetic information of living organisms. Encoding, sequencing, and clustering DNA sequences has become the key jobs and routine in the world of molecular biology, in particular on bioinformatics application. There are two type of clustering, hierarchical clustering and partitioning clustering. In this paper, we combined two type clustering i.e. K-Means (partitioning clustering) and DIANA (hierarchical clustering), therefore it called Hybrid clustering. Application of hybrid clustering using Parallel K-Means algorithm and DIANA algorithm used to clustering DNA sequences of Human Papillomavirus (HPV). The clustering process is started with Collecting DNA sequences of HPV are obtained from NCBI (National Centre for Biotechnology Information), then performing characteristics extraction of DNA sequences. The characteristics extraction result is store in a matrix form, then normalize this matrix using Min-Max normalization and calculate genetic distance using Euclidian Distance. Furthermore, the hybrid clustering is applied by using implementation of Parallel K-Means algorithm and DIANA algorithm. The aim of using Hybrid Clustering is to obtain better clusters result. For validating the resulted clusters, to get optimum number of clusters, we use Davies-Bouldin Index (DBI). In this study, the result of implementation of Parallel K-Means clustering is data clustered become 5 clusters with minimal IDB value is 0.8741, and Hybrid Clustering clustered data become 13 sub-clusters with minimal IDB values = 0.8216, 0.6845, 0.3331, 0.1994 and 0.3952. The IDB value of hybrid clustering less than IBD value of Parallel K-Means clustering only that perform at 1ts stage. Its means clustering using Hybrid Clustering have the better result to clustered DNA sequence of HPV than perform parallel K-Means Clustering only.
Clustering algorithm for determining community structure in large networks
NASA Astrophysics Data System (ADS)
Pujol, Josep M.; Béjar, Javier; Delgado, Jordi
2006-07-01
We propose an algorithm to find the community structure in complex networks based on the combination of spectral analysis and modularity optimization. The clustering produced by our algorithm is as accurate as the best algorithms on the literature of modularity optimization; however, the main asset of the algorithm is its efficiency. The best match for our algorithm is Newman’s fast algorithm, which is the reference algorithm for clustering in large networks due to its efficiency. When both algorithms are compared, our algorithm outperforms the fast algorithm both in efficiency and accuracy of the clustering, in terms of modularity. Thus, the results suggest that the proposed algorithm is a good choice to analyze the community structure of medium and large networks in the range of tens and hundreds of thousand vertices.
Spatio-Temporal Clustering of Monitoring Network
NASA Astrophysics Data System (ADS)
Hussain, I.; Pilz, J.
2009-04-01
Pakistan has much diversity in seasonal variation of different locations. Some areas are in desserts and remain very hot and waterless, for example coastal areas are situated along the Arabian Sea and have very warm season and a little rainfall. Some areas are covered with mountains, have very low temperature and heavy rainfall; for instance Karakoram ranges. The most important variables that have an impact on the climate are temperature, precipitation, humidity, wind speed and elevation. Furthermore, it is hard to find homogeneous regions in Pakistan with respect to climate variation. Identification of homogeneous regions in Pakistan can be useful in many aspects. It can be helpful for prediction of the climate in the sub-regions and for optimizing the number of monitoring sites. In the earlier literature no one tried to identify homogeneous regions of Pakistan with respect to climate variation. There are only a few papers about spatio-temporal clustering of monitoring network. Steinhaus (1956) presented the well-known K-means clustering method. It can identify a predefined number of clusters by iteratively assigning centriods to clusters based. Castro et al. (1997) developed a genetic heuristic algorithm to solve medoids based clustering. Their method is based on genetic recombination upon random assorting recombination. The suggested method is appropriate for clustering the attributes which have genetic characteristics. Sap and Awan (2005) presented a robust weighted kernel K-means algorithm incorporating spatial constraints for clustering climate data. The proposed algorithm can effectively handle noise, outliers and auto-correlation in the spatial data, for effective and efficient data analysis by exploring patterns and structures in the data. Soltani and Modarres (2006) used hierarchical and divisive cluster analysis to categorize patterns of rainfall in Iran. They only considered rainfall at twenty-eight monitoring sites and concluded that eight clusters existed. Soltani and Modarres (2006) classified the sites by using only average rainfall of sites, they did not consider time replications and spatial coordinates. Kerby et.al (2007) purposed spatial clustering method based on likelihood. They took account of the geographic locations through the variance covariance matrix. Their purposed method works like hierarchical clustering methods. Moreovere, it is inappropiriate for time replication data and could not perform well for large number of sites. Tuia.et.al (2008) used scan statistics for identifying spatio-temporal clusters for fire sequences in the Tuscany region in Italy. The scan statistics clustering method was developed by Kulldorff et al. (1997) to detect spatio-temporal clusters in epidemiology and assessing their significance. The purposed scan statistics method is used only for univariate discrete stochastic random variables. In this paper we make use of a very simple approach for spatio-temporal clustering which can create separable and homogeneous clusters. Most of the clustering methods are based on Euclidean distances. It is well known that geographic coordinates are spherical coordinates and estimating Euclidean distances from spherical coordinates is inappropriate. As a transformation from geographic coordinates to rectangular (D-plane) coordinates we use the Lambert projection method. The partition around medoids clustering method is incorporated on the data including D-plane coordinates. Ordinary kriging is taken as validity measure for the precipitation data. The kriging results for clusters are more accurate and have less variation compared to complete monitoring network precipitation data. References Casto.V.E and Murray.A.T (1997). Spatial Clustering with Data Mining with Genetic Algorithms. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.8573 Kaufman.L and Rousseeuw.P.J (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley series of Probability and Mathematical Statistics, New York. Kulldorf.M (1997). A spatial scan statistic. Commun. Stat.-Theor. Math. 26(6), 1481-1496 Kerby. A , Marx. D, Samal. A and Adamchuck. V. (2007). Spatial Clustering Using the Likelihood Function. Seventh IEEE International Conference on Data Mining - Workshops Steinhaus.H (1956). Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci., C1. III vol IV:801- 804 Snyder, J. P. (1987). Map Projection: A Working Manual. U. S. Geological Survey Professional Paper 1395. Washington, DC: U. S. Government Printing Office, pp. 104-110 Sap.M.N and Awan. A.M (2005). Finding Spatio-Temporal Patterns in Climate Data Using Clustering. Proceedings of the International Conference on Cyberworlds (CW'05) Soltani.S and Modarres.R (2006). Classification of Spatio -Temporal Pattern of Rainfall in Iran: Using Hierarchical and Divisive Cluster Analysis. Journal of Spatial Hydrology Vol.6, No.2 Tuia.D, Ratle.F, Lasaponara.R, Telesca.L and Kanevski.M (2008). Scan Statistics Analysis for Forest Fire Clusters. Commun. in Nonlinear science and numerical simulation 13,1689-1694.
Efficient architecture for spike sorting in reconfigurable hardware.
Hwang, Wen-Jyi; Lee, Wei-Hao; Lin, Shiow-Jyu; Lai, Sheng-Ying
2013-11-01
This paper presents a novel hardware architecture for fast spike sorting. The architecture is able to perform both the feature extraction and clustering in hardware. The generalized Hebbian algorithm (GHA) and fuzzy C-means (FCM) algorithm are used for feature extraction and clustering, respectively. The employment of GHA allows efficient computation of principal components for subsequent clustering operations. The FCM is able to achieve near optimal clustering for spike sorting. Its performance is insensitive to the selection of initial cluster centers. The hardware implementations of GHA and FCM feature low area costs and high throughput. In the GHA architecture, the computation of different weight vectors share the same circuit for lowering the area costs. Moreover, in the FCM hardware implementation, the usual iterative operations for updating the membership matrix and cluster centroid are merged into one single updating process to evade the large storage requirement. To show the effectiveness of the circuit, the proposed architecture is physically implemented by field programmable gate array (FPGA). It is embedded in a System-on-Chip (SOC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient spike sorting design for attaining high classification correct rate and high speed computation.
Finding reproducible cluster partitions for the k-means algorithm
2013-01-01
K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well-known, it is common practice to assume that the partition with lowest sum-of-squares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also the closest that k-means can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions. This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset. PMID:23369085
Finding reproducible cluster partitions for the k-means algorithm.
Lisboa, Paulo J G; Etchells, Terence A; Jarman, Ian H; Chambers, Simon J
2013-01-01
K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well-known, it is common practice to assume that the partition with lowest sum-of-squares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also the closest that k-means can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions. This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset.
Problem solving with genetic algorithms and Splicer
NASA Technical Reports Server (NTRS)
Bayer, Steven E.; Wang, Lui
1991-01-01
Genetic algorithms are highly parallel, adaptive search procedures (i.e., problem-solving methods) loosely based on the processes of population genetics and Darwinian survival of the fittest. Genetic algorithms have proven useful in domains where other optimization techniques perform poorly. The main purpose of the paper is to discuss a NASA-sponsored software development project to develop a general-purpose tool for using genetic algorithms. The tool, called Splicer, can be used to solve a wide variety of optimization problems and is currently available from NASA and COSMIC. This discussion is preceded by an introduction to basic genetic algorithm concepts and a discussion of genetic algorithm applications.
The theory of variational hybrid quantum-classical algorithms
NASA Astrophysics Data System (ADS)
McClean, Jarrod R.; Romero, Jonathan; Babbush, Ryan; Aspuru-Guzik, Alán
2016-02-01
Many quantum algorithms have daunting resource requirements when compared to what is available today. To address this discrepancy, a quantum-classical hybrid optimization scheme known as ‘the quantum variational eigensolver’ was developed (Peruzzo et al 2014 Nat. Commun. 5 4213) with the philosophy that even minimal quantum resources could be made useful when used in conjunction with classical routines. In this work we extend the general theory of this algorithm and suggest algorithmic improvements for practical implementations. Specifically, we develop a variational adiabatic ansatz and explore unitary coupled cluster where we establish a connection from second order unitary coupled cluster to universal gate sets through a relaxation of exponential operator splitting. We introduce the concept of quantum variational error suppression that allows some errors to be suppressed naturally in this algorithm on a pre-threshold quantum device. Additionally, we analyze truncation and correlated sampling in Hamiltonian averaging as ways to reduce the cost of this procedure. Finally, we show how the use of modern derivative free optimization techniques can offer dramatic computational savings of up to three orders of magnitude over previously used optimization techniques.
NASA Technical Reports Server (NTRS)
Lennington, R. K.; Johnson, J. K.
1979-01-01
An efficient procedure which clusters data using a completely unsupervised clustering algorithm and then uses labeled pixels to label the resulting clusters or perform a stratified estimate using the clusters as strata is developed. Three clustering algorithms, CLASSY, AMOEBA, and ISOCLS, are compared for efficiency. Three stratified estimation schemes and three labeling schemes are also considered and compared.
Clusternomics: Integrative context-dependent clustering for heterogeneous datasets
Wernisch, Lorenz
2017-01-01
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190
Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.
Gabasova, Evelina; Reid, John; Wernisch, Lorenz
2017-10-01
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Soffientini, Chiara Dolores, E-mail: chiaradolores.soffientini@polimi.it; Baselli, Giuseppe; De Bernardi, Elisabetta
Purpose: Quantitative {sup 18}F-fluorodeoxyglucose positron emission tomography is limited by the uncertainty in lesion delineation due to poor SNR, low resolution, and partial volume effects, subsequently impacting oncological assessment, treatment planning, and follow-up. The present work develops and validates a segmentation algorithm based on statistical clustering. The introduction of constraints based on background features and contiguity priors is expected to improve robustness vs clinical image characteristics such as lesion dimension, noise, and contrast level. Methods: An eight-class Gaussian mixture model (GMM) clustering algorithm was modified by constraining the mean and variance parameters of four background classes according to the previousmore » analysis of a lesion-free background volume of interest (background modeling). Hence, expectation maximization operated only on the four classes dedicated to lesion detection. To favor the segmentation of connected objects, a further variant was introduced by inserting priors relevant to the classification of neighbors. The algorithm was applied to simulated datasets and acquired phantom data. Feasibility and robustness toward initialization were assessed on a clinical dataset manually contoured by two expert clinicians. Comparisons were performed with respect to a standard eight-class GMM algorithm and to four different state-of-the-art methods in terms of volume error (VE), Dice index, classification error (CE), and Hausdorff distance (HD). Results: The proposed GMM segmentation with background modeling outperformed standard GMM and all the other tested methods. Medians of accuracy indexes were VE <3%, Dice >0.88, CE <0.25, and HD <1.2 in simulations; VE <23%, Dice >0.74, CE <0.43, and HD <1.77 in phantom data. Robustness toward image statistic changes (±15%) was shown by the low index changes: <26% for VE, <17% for Dice, and <15% for CE. Finally, robustness toward the user-dependent volume initialization was demonstrated. The inclusion of the spatial prior improved segmentation accuracy only for lesions surrounded by heterogeneous background: in the relevant simulation subset, the median VE significantly decreased from 13% to 7%. Results on clinical data were found in accordance with simulations, with absolute VE <7%, Dice >0.85, CE <0.30, and HD <0.81. Conclusions: The sole introduction of constraints based on background modeling outperformed standard GMM and the other tested algorithms. Insertion of a spatial prior improved the accuracy for realistic cases of objects in heterogeneous backgrounds. Moreover, robustness against initialization supports the applicability in a clinical setting. In conclusion, application-driven constraints can generally improve the capabilities of GMM and statistical clustering algorithms.« less
CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks.
Li, Min; Li, Dongyan; Tang, Yu; Wu, Fangxiang; Wang, Jianxin
2017-08-31
Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a cytoscape plugin integrating six clustering algorithms, HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), and BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster.
CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks
Li, Min; Li, Dongyan; Tang, Yu; Wang, Jianxin
2017-01-01
Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a cytoscape plugin integrating six clustering algorithms, HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), and BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster. PMID:28858211
Multi-Parent Clustering Algorithms from Stochastic Grammar Data Models
NASA Technical Reports Server (NTRS)
Mjoisness, Eric; Castano, Rebecca; Gray, Alexander
1999-01-01
We introduce a statistical data model and an associated optimization-based clustering algorithm which allows data vectors to belong to zero, one or several "parent" clusters. For each data vector the algorithm makes a discrete decision among these alternatives. Thus, a recursive version of this algorithm would place data clusters in a Directed Acyclic Graph rather than a tree. We test the algorithm with synthetic data generated according to the statistical data model. We also illustrate the algorithm using real data from large-scale gene expression assays.
Kaliman, Ilya A; Krylov, Anna I
2017-04-30
A new hardware-agnostic contraction algorithm for tensors of arbitrary symmetry and sparsity is presented. The algorithm is implemented as a stand-alone open-source code libxm. This code is also integrated with general tensor library libtensor and with the Q-Chem quantum-chemistry package. An overview of the algorithm, its implementation, and benchmarks are presented. Similarly to other tensor software, the algorithm exploits efficient matrix multiplication libraries and assumes that tensors are stored in a block-tensor form. The distinguishing features of the algorithm are: (i) efficient repackaging of the individual blocks into large matrices and back, which affords efficient graphics processing unit (GPU)-enabled calculations without modifications of higher-level codes; (ii) fully asynchronous data transfer between disk storage and fast memory. The algorithm enables canonical all-electron coupled-cluster and equation-of-motion coupled-cluster calculations with single and double substitutions (CCSD and EOM-CCSD) with over 1000 basis functions on a single quad-GPU machine. We show that the algorithm exhibits predicted theoretical scaling for canonical CCSD calculations, O(N 6 ), irrespective of the data size on disk. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
Fast detection of the fuzzy communities based on leader-driven algorithm
NASA Astrophysics Data System (ADS)
Fang, Changjian; Mu, Dejun; Deng, Zhenghong; Hu, Jun; Yi, Chen-He
2018-03-01
In this paper, we present the leader-driven algorithm (LDA) for learning community structure in networks. The algorithm allows one to find overlapping clusters in a network, an important aspect of real networks, especially social networks. The algorithm requires no input parameters and learns the number of clusters naturally from the network. It accomplishes this using leadership centrality in a clever manner. It identifies local minima of leadership centrality as followers which belong only to one cluster, and the remaining nodes are leaders which connect clusters. In this way, the number of clusters can be learned using only the network structure. The LDA is also an extremely fast algorithm, having runtime linear in the network size. Thus, this algorithm can be used to efficiently cluster extremely large networks.
Craig, Hugh; Berretta, Regina; Moscato, Pablo
2016-01-01
In this study we propose a novel, unsupervised clustering methodology for analyzing large datasets. This new, efficient methodology converts the general clustering problem into the community detection problem in graph by using the Jensen-Shannon distance, a dissimilarity measure originating in Information Theory. Moreover, we use graph theoretic concepts for the generation and analysis of proximity graphs. Our methodology is based on a newly proposed memetic algorithm (iMA-Net) for discovering clusters of data elements by maximizing the modularity function in proximity graphs of literary works. To test the effectiveness of this general methodology, we apply it to a text corpus dataset, which contains frequencies of approximately 55,114 unique words across all 168 written in the Shakespearean era (16th and 17th centuries), to analyze and detect clusters of similar plays. Experimental results and comparison with state-of-the-art clustering methods demonstrate the remarkable performance of our new method for identifying high quality clusters which reflect the commonalities in the literary style of the plays. PMID:27571416
Research on retailer data clustering algorithm based on Spark
NASA Astrophysics Data System (ADS)
Huang, Qiuman; Zhou, Feng
2017-03-01
Big data analysis is a hot topic in the IT field now. Spark is a high-reliability and high-performance distributed parallel computing framework for big data sets. K-means algorithm is one of the classical partition methods in clustering algorithm. In this paper, we study the k-means clustering algorithm on Spark. Firstly, the principle of the algorithm is analyzed, and then the clustering analysis is carried out on the supermarket customers through the experiment to find out the different shopping patterns. At the same time, this paper proposes the parallelization of k-means algorithm and the distributed computing framework of Spark, and gives the concrete design scheme and implementation scheme. This paper uses the two-year sales data of a supermarket to validate the proposed clustering algorithm and achieve the goal of subdividing customers, and then analyze the clustering results to help enterprises to take different marketing strategies for different customer groups to improve sales performance.
A review on the multivariate statistical methods for dimensional reduction studies
NASA Astrophysics Data System (ADS)
Aik, Lim Eng; Kiang, Lam Chee; Mohamed, Zulkifley Bin; Hong, Tan Wei
2017-05-01
In this research study we have discussed multivariate statistical methods for dimensional reduction, which has been done by various researchers. The reduction of dimensionality is valuable to accelerate algorithm progression, as well as really may offer assistance with the last grouping/clustering precision. A lot of boisterous or even flawed info information regularly prompts a not exactly alluring algorithm progression. Expelling un-useful or dis-instructive information segments may for sure help the algorithm discover more broad grouping locales and principles and generally speaking accomplish better exhibitions on new data set.
Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm
Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong
2016-01-01
In single-particle cryo-electron microscopy (cryo-EM), K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development on clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alterative to the traditional K-means algorithm in single-particle cryo-EM analysis. PMID:27959895
Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm.
Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong
2016-01-01
In single-particle cryo-electron microscopy (cryo-EM), K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development on clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alterative to the traditional K-means algorithm in single-particle cryo-EM analysis.
Survey of adaptive image coding techniques
NASA Technical Reports Server (NTRS)
Habibi, A.
1977-01-01
The general problem of image data compression is discussed briefly with attention given to the use of Karhunen-Loeve transforms, suboptimal systems, and block quantization. A survey is then conducted encompassing the four categories of adaptive systems: (1) adaptive transform coding (adaptive sampling, adaptive quantization, etc.), (2) adaptive predictive coding (adaptive delta modulation, adaptive DPCM encoding, etc.), (3) adaptive cluster coding (blob algorithms and the multispectral cluster coding technique), and (4) adaptive entropy coding.
ERIC Educational Resources Information Center
Dong, Nianbo; Spybrook, Jessaca; Kelcey, Ben
2016-01-01
The purpose of this study is to propose a general framework for power analyses to detect the moderator effects in two- and three-level cluster randomized trials (CRTs). The study specifically aims to: (1) develop the statistical formulations for calculating statistical power, minimum detectable effect size (MDES) and its confidence interval to…
GDPC: Gravitation-based Density Peaks Clustering algorithm
NASA Astrophysics Data System (ADS)
Jiang, Jianhua; Hao, Dehao; Chen, Yujun; Parmar, Milan; Li, Keqin
2018-07-01
The Density Peaks Clustering algorithm, which we refer to as DPC, is a novel and efficient density-based clustering approach, and it is published in Science in 2014. The DPC has advantages of discovering clusters with varying sizes and varying densities, but has some limitations of detecting the number of clusters and identifying anomalies. We develop an enhanced algorithm with an alternative decision graph based on gravitation theory and nearby distance to identify centroids and anomalies accurately. We apply our method to some UCI and synthetic data sets. We report comparative clustering performances using F-Measure and 2-dimensional vision. We also compare our method to other clustering algorithms, such as K-Means, Affinity Propagation (AP) and DPC. We present F-Measure scores and clustering accuracies of our GDPC algorithm compared to K-Means, AP and DPC on different data sets. We show that the GDPC has the superior performance in its capability of: (1) detecting the number of clusters obviously; (2) aggregating clusters with varying sizes, varying densities efficiently; (3) identifying anomalies accurately.
Mining the National Career Assessment Examination Result Using Clustering Algorithm
NASA Astrophysics Data System (ADS)
Pagudpud, M. V.; Palaoag, T. T.; Padirayon, L. M.
2018-03-01
Education is an essential process today which elicits authorities to discover and establish innovative strategies for educational improvement. This study applied data mining using clustering technique for knowledge extraction from the National Career Assessment Examination (NCAE) result in the Division of Quirino. The NCAE is an examination given to all grade 9 students in the Philippines to assess their aptitudes in the different domains. Clustering the students is helpful in identifying students’ learning considerations. With the use of the RapidMiner tool, clustering algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means, k-medoid, expectation maximization clustering, and support vector clustering algorithms were analyzed. The silhouette indexes of the said clustering algorithms were compared, and the result showed that the k-means algorithm with k = 3 and silhouette index equal to 0.196 is the most appropriate clustering algorithm to group the students. Three groups were formed having 477 students in the determined group (cluster 0), 310 proficient students (cluster 1) and 396 developing students (cluster 2). The data mining technique used in this study is essential in extracting useful information from the NCAE result to better understand the abilities of students which in turn is a good basis for adopting teaching strategies.
Enhancing PC Cluster-Based Parallel Branch-and-Bound Algorithms for the Graph Coloring Problem
NASA Astrophysics Data System (ADS)
Taoka, Satoshi; Takafuji, Daisuke; Watanabe, Toshimasa
A branch-and-bound algorithm (BB for short) is the most general technique to deal with various combinatorial optimization problems. Even if it is used, computation time is likely to increase exponentially. So we consider its parallelization to reduce it. It has been reported that the computation time of a parallel BB heavily depends upon node-variable selection strategies. And, in case of a parallel BB, it is also necessary to prevent increase in communication time. So, it is important to pay attention to how many and what kind of nodes are to be transferred (called sending-node selection strategy). In this paper, for the graph coloring problem, we propose some sending-node selection strategies for a parallel BB algorithm by adopting MPI for parallelization and experimentally evaluate how these strategies affect computation time of a parallel BB on a PC cluster network.
Automatic Clustering Using Multi-objective Particle Swarm and Simulated Annealing
Abubaker, Ahmad; Baharum, Adam; Alrefaei, Mahmoud
2015-01-01
This paper puts forward a new automatic clustering algorithm based on Multi-Objective Particle Swarm Optimization and Simulated Annealing, “MOPSOSA”. The proposed algorithm is capable of automatic clustering which is appropriate for partitioning datasets to a suitable number of clusters. MOPSOSA combines the features of the multi-objective based particle swarm optimization (PSO) and the Multi-Objective Simulated Annealing (MOSA). Three cluster validity indices were optimized simultaneously to establish the suitable number of clusters and the appropriate clustering for a dataset. The first cluster validity index is centred on Euclidean distance, the second on the point symmetry distance, and the last cluster validity index is based on short distance. A number of algorithms have been compared with the MOPSOSA algorithm in resolving clustering problems by determining the actual number of clusters and optimal clustering. Computational experiments were carried out to study fourteen artificial and five real life datasets. PMID:26132309
NASA Astrophysics Data System (ADS)
Gong, Lina; Xu, Tao; Zhang, Wei; Li, Xuhong; Wang, Xia; Pan, Wenwen
2017-03-01
The traditional microblog recommendation algorithm has the problems of low efficiency and modest effect in the era of big data. In the aim of solving these issues, this paper proposed a mixed recommendation algorithm with user clustering. This paper first introduced the situation of microblog marketing industry. Then, this paper elaborates the user interest modeling process and detailed advertisement recommendation methods. Finally, this paper compared the mixed recommendation algorithm with the traditional classification algorithm and mixed recommendation algorithm without user clustering. The results show that the mixed recommendation algorithm with user clustering has good accuracy and recall rate in the microblog advertisements promotion.
Procedure of Partitioning Data Into Number of Data Sets or Data Group - A Review
NASA Astrophysics Data System (ADS)
Kim, Tai-Hoon
The goal of clustering is to decompose a dataset into similar groups based on a objective function. Some already well established clustering algorithms are there for data clustering. Objective of these data clustering algorithms are to divide the data points of the feature space into a number of groups (or classes) so that a predefined set of criteria are satisfied. The article considers the comparative study about the effectiveness and efficiency of traditional data clustering algorithms. For evaluating the performance of the clustering algorithms, Minkowski score is used here for different data sets.
Android Malware Classification Using K-Means Clustering Algorithm
NASA Astrophysics Data System (ADS)
Hamid, Isredza Rahmi A.; Syafiqah Khalid, Nur; Azma Abdullah, Nurul; Rahman, Nurul Hidayah Ab; Chai Wen, Chuah
2017-08-01
Malware was designed to gain access or damage a computer system without user notice. Besides, attacker exploits malware to commit crime or fraud. This paper proposed Android malware classification approach based on K-Means clustering algorithm. We evaluate the proposed model in terms of accuracy using machine learning algorithms. Two datasets were selected to demonstrate the practicing of K-Means clustering algorithms that are Virus Total and Malgenome dataset. We classify the Android malware into three clusters which are ransomware, scareware and goodware. Nine features were considered for each types of dataset such as Lock Detected, Text Detected, Text Score, Encryption Detected, Threat, Porn, Law, Copyright and Moneypak. We used IBM SPSS Statistic software for data classification and WEKA tools to evaluate the built cluster. The proposed K-Means clustering algorithm shows promising result with high accuracy when tested using Random Forest algorithm.
An extended affinity propagation clustering method based on different data density types.
Zhao, XiuLi; Xu, WeiXiang
2015-01-01
Affinity propagation (AP) algorithm, as a novel clustering method, does not require the users to specify the initial cluster centers in advance, which regards all data points as potential exemplars (cluster centers) equally and groups the clusters totally by the similar degree among the data points. But in many cases there exist some different intensive areas within the same data set, which means that the data set does not distribute homogeneously. In such situation the AP algorithm cannot group the data points into ideal clusters. In this paper, we proposed an extended AP clustering algorithm to deal with such a problem. There are two steps in our method: firstly the data set is partitioned into several data density types according to the nearest distances of each data point; and then the AP clustering method is, respectively, used to group the data points into clusters in each data density type. Two experiments are carried out to evaluate the performance of our algorithm: one utilizes an artificial data set and the other uses a real seismic data set. The experiment results show that groups are obtained more accurately by our algorithm than OPTICS and AP clustering algorithm itself.
Scalable Parallel Density-based Clustering and Applications
NASA Astrophysics Data System (ADS)
Patwary, Mostofa Ali
2014-04-01
Recently, density-based clustering algorithms (DBSCAN and OPTICS) have gotten significant attention of the scientific community due to their unique capability of discovering arbitrary shaped clusters and eliminating noise data. These algorithms have several applications, which require high performance computing, including finding halos and subhalos (clusters) from massive cosmology data in astrophysics, analyzing satellite images, X-ray crystallography, and anomaly detection. However, parallelization of these algorithms are extremely challenging as they exhibit inherent sequential data access order, unbalanced workload resulting in low parallel efficiency. To break the data access sequentiality and to achieve high parallelism, we develop new parallel algorithms, both for DBSCAN and OPTICS, designed using graph algorithmic techniques. For example, our parallel DBSCAN algorithm exploits the similarities between DBSCAN and computing connected components. Using datasets containing up to a billion floating point numbers, we show that our parallel density-based clustering algorithms significantly outperform the existing algorithms, achieving speedups up to 27.5 on 40 cores on shared memory architecture and speedups up to 5,765 using 8,192 cores on distributed memory architecture. In our experiments, we found that while achieving the scalability, our algorithms produce clustering results with comparable quality to the classical algorithms.
Deep linear autoencoder and patch clustering-based unified one-dimensional coding of image and video
NASA Astrophysics Data System (ADS)
Li, Honggui
2017-09-01
This paper proposes a unified one-dimensional (1-D) coding framework of image and video, which depends on deep learning neural network and image patch clustering. First, an improved K-means clustering algorithm for image patches is employed to obtain the compact inputs of deep artificial neural network. Second, for the purpose of best reconstructing original image patches, deep linear autoencoder (DLA), a linear version of the classical deep nonlinear autoencoder, is introduced to achieve the 1-D representation of image blocks. Under the circumstances of 1-D representation, DLA is capable of attaining zero reconstruction error, which is impossible for the classical nonlinear dimensionality reduction methods. Third, a unified 1-D coding infrastructure for image, intraframe, interframe, multiview video, three-dimensional (3-D) video, and multiview 3-D video is built by incorporating different categories of videos into the inputs of patch clustering algorithm. Finally, it is shown in the results of simulation experiments that the proposed methods can simultaneously gain higher compression ratio and peak signal-to-noise ratio than those of the state-of-the-art methods in the situation of low bitrate transmission.
Energy Aware Clustering Algorithms for Wireless Sensor Networks
NASA Astrophysics Data System (ADS)
Rakhshan, Noushin; Rafsanjani, Marjan Kuchaki; Liu, Chenglian
2011-09-01
The sensor nodes deployed in wireless sensor networks (WSNs) are extremely power constrained, so maximizing the lifetime of the entire networks is mainly considered in the design. In wireless sensor networks, hierarchical network structures have the advantage of providing scalable and energy efficient solutions. In this paper, we investigate different clustering algorithms for WSNs and also compare these clustering algorithms based on metrics such as clustering distribution, cluster's load balancing, Cluster Head's (CH) selection strategy, CH's role rotation, node mobility, clusters overlapping, intra-cluster communications, reliability, security and location awareness.
Removal of impulse noise clusters from color images with local order statistics
NASA Astrophysics Data System (ADS)
Ruchay, Alexey; Kober, Vitaly
2017-09-01
This paper proposes a novel algorithm for restoring images corrupted with clusters of impulse noise. The noise clusters often occur when the probability of impulse noise is very high. The proposed noise removal algorithm consists of detection of bulky impulse noise in three color channels with local order statistics followed by removal of the detected clusters by means of vector median filtering. With the help of computer simulation we show that the proposed algorithm is able to effectively remove clustered impulse noise. The performance of the proposed algorithm is compared in terms of image restoration metrics with that of common successful algorithms.
Study of parameters of the nearest neighbour shared algorithm on clustering documents
NASA Astrophysics Data System (ADS)
Mustika Rukmi, Alvida; Budi Utomo, Daryono; Imro’atus Sholikhah, Neni
2018-03-01
Document clustering is one way of automatically managing documents, extracting of document topics and fastly filtering information. Preprocess of clustering documents processed by textmining consists of: keyword extraction using Rapid Automatic Keyphrase Extraction (RAKE) and making the document as concept vector using Latent Semantic Analysis (LSA). Furthermore, the clustering process is done so that the documents with the similarity of the topic are in the same cluster, based on the preprocesing by textmining performed. Shared Nearest Neighbour (SNN) algorithm is a clustering method based on the number of "nearest neighbors" shared. The parameters in the SNN Algorithm consist of: k nearest neighbor documents, ɛ shared nearest neighbor documents and MinT minimum number of similar documents, which can form a cluster. Characteristics The SNN algorithm is based on shared ‘neighbor’ properties. Each cluster is formed by keywords that are shared by the documents. SNN algorithm allows a cluster can be built more than one keyword, if the value of the frequency of appearing keywords in document is also high. Determination of parameter values on SNN algorithm affects document clustering results. The higher parameter value k, will increase the number of neighbor documents from each document, cause similarity of neighboring documents are lower. The accuracy of each cluster is also low. The higher parameter value ε, caused each document catch only neighbor documents that have a high similarity to build a cluster. It also causes more unclassified documents (noise). The higher the MinT parameter value cause the number of clusters will decrease, since the number of similar documents can not form clusters if less than MinT. Parameter in the SNN Algorithm determine performance of clustering result and the amount of noise (unclustered documents ). The Silhouette coeffisient shows almost the same result in many experiments, above 0.9, which means that SNN algorithm works well with different parameter values.
Algorithms of maximum likelihood data clustering with applications
NASA Astrophysics Data System (ADS)
Giada, Lorenzo; Marsili, Matteo
2002-12-01
We address the problem of data clustering by introducing an unsupervised, parameter-free approach based on maximum likelihood principle. Starting from the observation that data sets belonging to the same cluster share a common information, we construct an expression for the likelihood of any possible cluster structure. The likelihood in turn depends only on the Pearson's coefficient of the data. We discuss clustering algorithms that provide a fast and reliable approximation to maximum likelihood configurations. Compared to standard clustering methods, our approach has the advantages that (i) it is parameter free, (ii) the number of clusters need not be fixed in advance and (iii) the interpretation of the results is transparent. In order to test our approach and compare it with standard clustering algorithms, we analyze two very different data sets: time series of financial market returns and gene expression data. We find that different maximization algorithms produce similar cluster structures whereas the outcome of standard algorithms has a much wider variability.
A new clustering algorithm applicable to multispectral and polarimetric SAR images
NASA Technical Reports Server (NTRS)
Wong, Yiu-Fai; Posner, Edward C.
1993-01-01
We describe an application of a scale-space clustering algorithm to the classification of a multispectral and polarimetric SAR image of an agricultural site. After the initial polarimetric and radiometric calibration and noise cancellation, we extracted a 12-dimensional feature vector for each pixel from the scattering matrix. The clustering algorithm was able to partition a set of unlabeled feature vectors from 13 selected sites, each site corresponding to a distinct crop, into 13 clusters without any supervision. The cluster parameters were then used to classify the whole image. The classification map is much less noisy and more accurate than those obtained by hierarchical rules. Starting with every point as a cluster, the algorithm works by melting the system to produce a tree of clusters in the scale space. It can cluster data in any multidimensional space and is insensitive to variability in cluster densities, sizes and ellipsoidal shapes. This algorithm, more powerful than existing ones, may be useful for remote sensing for land use.
Meng, Qinggang; Deng, Su; Huang, Hongbin; Wu, Yahui; Badii, Atta
2017-01-01
Heterogeneous information networks (e.g. bibliographic networks and social media networks) that consist of multiple interconnected objects are ubiquitous. Clustering analysis is an effective method to understand the semantic information and interpretable structure of the heterogeneous information networks, and it has attracted the attention of many researchers in recent years. However, most studies assume that heterogeneous information networks usually follow some simple schemas, such as bi-typed networks or star network schema, and they can only cluster one type of object in the network each time. In this paper, a novel clustering framework is proposed based on sparse tensor factorization for heterogeneous information networks, which can cluster multiple types of objects simultaneously in a single pass without any network schema information. The types of objects and the relations between them in the heterogeneous information networks are modeled as a sparse tensor. The clustering issue is modeled as an optimization problem, which is similar to the well-known Tucker decomposition. Then, an Alternating Least Squares (ALS) algorithm and a feasible initialization method are proposed to solve the optimization problem. Based on the tensor factorization, we simultaneously partition different types of objects into different clusters. The experimental results on both synthetic and real-world datasets have demonstrated that our proposed clustering framework, STFClus, can model heterogeneous information networks efficiently and can outperform state-of-the-art clustering algorithms as a generally applicable single-pass clustering method for heterogeneous network which is network schema agnostic. PMID:28245222
Wu, Jibing; Meng, Qinggang; Deng, Su; Huang, Hongbin; Wu, Yahui; Badii, Atta
2017-01-01
Heterogeneous information networks (e.g. bibliographic networks and social media networks) that consist of multiple interconnected objects are ubiquitous. Clustering analysis is an effective method to understand the semantic information and interpretable structure of the heterogeneous information networks, and it has attracted the attention of many researchers in recent years. However, most studies assume that heterogeneous information networks usually follow some simple schemas, such as bi-typed networks or star network schema, and they can only cluster one type of object in the network each time. In this paper, a novel clustering framework is proposed based on sparse tensor factorization for heterogeneous information networks, which can cluster multiple types of objects simultaneously in a single pass without any network schema information. The types of objects and the relations between them in the heterogeneous information networks are modeled as a sparse tensor. The clustering issue is modeled as an optimization problem, which is similar to the well-known Tucker decomposition. Then, an Alternating Least Squares (ALS) algorithm and a feasible initialization method are proposed to solve the optimization problem. Based on the tensor factorization, we simultaneously partition different types of objects into different clusters. The experimental results on both synthetic and real-world datasets have demonstrated that our proposed clustering framework, STFClus, can model heterogeneous information networks efficiently and can outperform state-of-the-art clustering algorithms as a generally applicable single-pass clustering method for heterogeneous network which is network schema agnostic.
Incremental fuzzy C medoids clustering of time series data using dynamic time warping distance
Chen, Jingli; Wu, Shuai; Liu, Zhizhong; Chao, Hao
2018-01-01
Clustering time series data is of great significance since it could extract meaningful statistics and other characteristics. Especially in biomedical engineering, outstanding clustering algorithms for time series may help improve the health level of people. Considering data scale and time shifts of time series, in this paper, we introduce two incremental fuzzy clustering algorithms based on a Dynamic Time Warping (DTW) distance. For recruiting Single-Pass and Online patterns, our algorithms could handle large-scale time series data by splitting it into a set of chunks which are processed sequentially. Besides, our algorithms select DTW to measure distance of pair-wise time series and encourage higher clustering accuracy because DTW could determine an optimal match between any two time series by stretching or compressing segments of temporal data. Our new algorithms are compared to some existing prominent incremental fuzzy clustering algorithms on 12 benchmark time series datasets. The experimental results show that the proposed approaches could yield high quality clusters and were better than all the competitors in terms of clustering accuracy. PMID:29795600
Incremental fuzzy C medoids clustering of time series data using dynamic time warping distance.
Liu, Yongli; Chen, Jingli; Wu, Shuai; Liu, Zhizhong; Chao, Hao
2018-01-01
Clustering time series data is of great significance since it could extract meaningful statistics and other characteristics. Especially in biomedical engineering, outstanding clustering algorithms for time series may help improve the health level of people. Considering data scale and time shifts of time series, in this paper, we introduce two incremental fuzzy clustering algorithms based on a Dynamic Time Warping (DTW) distance. For recruiting Single-Pass and Online patterns, our algorithms could handle large-scale time series data by splitting it into a set of chunks which are processed sequentially. Besides, our algorithms select DTW to measure distance of pair-wise time series and encourage higher clustering accuracy because DTW could determine an optimal match between any two time series by stretching or compressing segments of temporal data. Our new algorithms are compared to some existing prominent incremental fuzzy clustering algorithms on 12 benchmark time series datasets. The experimental results show that the proposed approaches could yield high quality clusters and were better than all the competitors in terms of clustering accuracy.
Multi-Optimisation Consensus Clustering
NASA Astrophysics Data System (ADS)
Li, Jian; Swift, Stephen; Liu, Xiaohui
Ensemble Clustering has been developed to provide an alternative way of obtaining more stable and accurate clustering results. It aims to avoid the biases of individual clustering algorithms. However, it is still a challenge to develop an efficient and robust method for Ensemble Clustering. Based on an existing ensemble clustering method, Consensus Clustering (CC), this paper introduces an advanced Consensus Clustering algorithm called Multi-Optimisation Consensus Clustering (MOCC), which utilises an optimised Agreement Separation criterion and a Multi-Optimisation framework to improve the performance of CC. Fifteen different data sets are used for evaluating the performance of MOCC. The results reveal that MOCC can generate more accurate clustering results than the original CC algorithm.
ERIC Educational Resources Information Center
Kitazoe, Noriko; Fujita, Naofumi; Izumoto, Yuji; Terada, Shin-ichi; Hatakenaka, Yuhei
2017-01-01
The purpose of this study was to investigate whether the individuals in the general population with high scores on the Autism Spectrum Quotient constituted a single homogeneous group or not. A cohort of university students (n = 4901) was investigated by cluster analysis based on the original five subscales of the Autism Spectrum Quotient. Based on…
Swarm Intelligence in Text Document Clustering
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cui, Xiaohui; Potok, Thomas E
2008-01-01
Social animals or insects in nature often exhibit a form of emergent collective behavior. The research field that attempts to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies is called Swarm Intelligence. Compared to the traditional algorithms, the swarm algorithms are usually flexible, robust, decentralized and self-organized. These characters make the swarm algorithms suitable for solving complex problems, such as document collection clustering. The major challenge of today's information society is being overwhelmed with information on any topic they are searching for. Fast and high-quality document clustering algorithms play an important role inmore » helping users to effectively navigate, summarize, and organize the overwhelmed information. In this chapter, we introduce three nature inspired swarm intelligence clustering approaches for document clustering analysis. These clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools and ant food forage.« less
Automatic microseismic event picking via unsupervised machine learning
NASA Astrophysics Data System (ADS)
Chen, Yangkang
2018-01-01
Effective and efficient arrival picking plays an important role in microseismic and earthquake data processing and imaging. Widely used short-term-average long-term-average ratio (STA/LTA) based arrival picking algorithms suffer from the sensitivity to moderate-to-strong random ambient noise. To make the state-of-the-art arrival picking approaches effective, microseismic data need to be first pre-processed, for example, removing sufficient amount of noise, and second analysed by arrival pickers. To conquer the noise issue in arrival picking for weak microseismic or earthquake event, I leverage the machine learning techniques to help recognizing seismic waveforms in microseismic or earthquake data. Because of the dependency of supervised machine learning algorithm on large volume of well-designed training data, I utilize an unsupervised machine learning algorithm to help cluster the time samples into two groups, that is, waveform points and non-waveform points. The fuzzy clustering algorithm has been demonstrated to be effective for such purpose. A group of synthetic, real microseismic and earthquake data sets with different levels of complexity show that the proposed method is much more robust than the state-of-the-art STA/LTA method in picking microseismic events, even in the case of moderately strong background noise.
2013-01-01
Background Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly. Results Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies. Conclusions Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies. PMID:24564333
Novel density-based and hierarchical density-based clustering algorithms for uncertain data.
Zhang, Xianchao; Liu, Han; Zhang, Xiaotong
2017-09-01
Uncertain data has posed a great challenge to traditional clustering algorithms. Recently, several algorithms have been proposed for clustering uncertain data, and among them density-based techniques seem promising for handling data uncertainty. However, some issues like losing uncertain information, high time complexity and nonadaptive threshold have not been addressed well in the previous density-based algorithm FDBSCAN and hierarchical density-based algorithm FOPTICS. In this paper, we firstly propose a novel density-based algorithm PDBSCAN, which improves the previous FDBSCAN from the following aspects: (1) it employs a more accurate method to compute the probability that the distance between two uncertain objects is less than or equal to a boundary value, instead of the sampling-based method in FDBSCAN; (2) it introduces new definitions of probability neighborhood, support degree, core object probability, direct reachability probability, thus reducing the complexity and solving the issue of nonadaptive threshold (for core object judgement) in FDBSCAN. Then, we modify the algorithm PDBSCAN to an improved version (PDBSCANi), by using a better cluster assignment strategy to ensure that every object will be assigned to the most appropriate cluster, thus solving the issue of nonadaptive threshold (for direct density reachability judgement) in FDBSCAN. Furthermore, as PDBSCAN and PDBSCANi have difficulties for clustering uncertain data with non-uniform cluster density, we propose a novel hierarchical density-based algorithm POPTICS by extending the definitions of PDBSCAN, adding new definitions of fuzzy core distance and fuzzy reachability distance, and employing a new clustering framework. POPTICS can reveal the cluster structures of the datasets with different local densities in different regions better than PDBSCAN and PDBSCANi, and it addresses the issues in FOPTICS. Experimental results demonstrate the superiority of our proposed algorithms over the existing algorithms in accuracy and efficiency. Copyright © 2017 Elsevier Ltd. All rights reserved.
Gauge-free cluster variational method by maximal messages and moment matching.
Domínguez, Eduardo; Lage-Castellanos, Alejandro; Mulet, Roberto; Ricci-Tersenghi, Federico
2017-04-01
We present an implementation of the cluster variational method (CVM) as a message passing algorithm. The kind of message passing algorithm used for CVM, usually named generalized belief propagation (GBP), is a generalization of the belief propagation algorithm in the same way that CVM is a generalization of the Bethe approximation for estimating the partition function. However, the connection between fixed points of GBP and the extremal points of the CVM free energy is usually not a one-to-one correspondence because of the existence of a gauge transformation involving the GBP messages. Our contribution is twofold. First, we propose a way of defining messages (fields) in a generic CVM approximation, such that messages arrive on a given region from all its ancestors, and not only from its direct parents, as in the standard parent-to-child GBP. We call this approach maximal messages. Second, we focus on the case of binary variables, reinterpreting the messages as fields enforcing the consistency between the moments of the local (marginal) probability distributions. We provide a precise rule to enforce all consistencies, avoiding any redundancy, that would otherwise lead to a gauge transformation on the messages. This moment matching method is gauge free, i.e., it guarantees that the resulting GBP is not gauge invariant. We apply our maximal messages and moment matching GBP to obtain an analytical expression for the critical temperature of the Ising model in general dimensions at the level of plaquette CVM. The values obtained outperform Bethe estimates, and are comparable with loop corrected belief propagation equations. The method allows for a straightforward generalization to disordered systems.
Gauge-free cluster variational method by maximal messages and moment matching
NASA Astrophysics Data System (ADS)
Domínguez, Eduardo; Lage-Castellanos, Alejandro; Mulet, Roberto; Ricci-Tersenghi, Federico
2017-04-01
We present an implementation of the cluster variational method (CVM) as a message passing algorithm. The kind of message passing algorithm used for CVM, usually named generalized belief propagation (GBP), is a generalization of the belief propagation algorithm in the same way that CVM is a generalization of the Bethe approximation for estimating the partition function. However, the connection between fixed points of GBP and the extremal points of the CVM free energy is usually not a one-to-one correspondence because of the existence of a gauge transformation involving the GBP messages. Our contribution is twofold. First, we propose a way of defining messages (fields) in a generic CVM approximation, such that messages arrive on a given region from all its ancestors, and not only from its direct parents, as in the standard parent-to-child GBP. We call this approach maximal messages. Second, we focus on the case of binary variables, reinterpreting the messages as fields enforcing the consistency between the moments of the local (marginal) probability distributions. We provide a precise rule to enforce all consistencies, avoiding any redundancy, that would otherwise lead to a gauge transformation on the messages. This moment matching method is gauge free, i.e., it guarantees that the resulting GBP is not gauge invariant. We apply our maximal messages and moment matching GBP to obtain an analytical expression for the critical temperature of the Ising model in general dimensions at the level of plaquette CVM. The values obtained outperform Bethe estimates, and are comparable with loop corrected belief propagation equations. The method allows for a straightforward generalization to disordered systems.
A special purpose silicon compiler for designing supercomputing VLSI systems
NASA Technical Reports Server (NTRS)
Venkateswaran, N.; Murugavel, P.; Kamakoti, V.; Shankarraman, M. J.; Rangarajan, S.; Mallikarjun, M.; Karthikeyan, B.; Prabhakar, T. S.; Satish, V.; Venkatasubramaniam, P. R.
1991-01-01
Design of general/special purpose supercomputing VLSI systems for numeric algorithm execution involves tackling two important aspects, namely their computational and communication complexities. Development of software tools for designing such systems itself becomes complex. Hence a novel design methodology has to be developed. For designing such complex systems a special purpose silicon compiler is needed in which: the computational and communicational structures of different numeric algorithms should be taken into account to simplify the silicon compiler design, the approach is macrocell based, and the software tools at different levels (algorithm down to the VLSI circuit layout) should get integrated. In this paper a special purpose silicon (SPS) compiler based on PACUBE macrocell VLSI arrays for designing supercomputing VLSI systems is presented. It is shown that turn-around time and silicon real estate get reduced over the silicon compilers based on PLA's, SLA's, and gate arrays. The first two silicon compiler characteristics mentioned above enable the SPS compiler to perform systolic mapping (at the macrocell level) of algorithms whose computational structures are of GIPOP (generalized inner product outer product) form. Direct systolic mapping on PLA's, SLA's, and gate arrays is very difficult as they are micro-cell based. A novel GIPOP processor is under development using this special purpose silicon compiler.
Heterogeneous Tensor Decomposition for Clustering via Manifold Optimization.
Sun, Yanfeng; Gao, Junbin; Hong, Xia; Mishra, Bamdev; Yin, Baocai
2016-03-01
Tensor clustering is an important tool that exploits intrinsically rich structures in real-world multiarray or Tensor datasets. Often in dealing with those datasets, standard practice is to use subspace clustering that is based on vectorizing multiarray data. However, vectorization of tensorial data does not exploit complete structure information. In this paper, we propose a subspace clustering algorithm without adopting any vectorization process. Our approach is based on a novel heterogeneous Tucker decomposition model taking into account cluster membership information. We propose a new clustering algorithm that alternates between different modes of the proposed heterogeneous tensor model. All but the last mode have closed-form updates. Updating the last mode reduces to optimizing over the multinomial manifold for which we investigate second order Riemannian geometry and propose a trust-region algorithm. Numerical experiments show that our proposed algorithm compete effectively with state-of-the-art clustering algorithms that are based on tensor factorization.
Hyper-spectral image segmentation using spectral clustering with covariance descriptors
NASA Astrophysics Data System (ADS)
Kursun, Olcay; Karabiber, Fethullah; Koc, Cemalettin; Bal, Abdullah
2009-02-01
Image segmentation is an important and difficult computer vision problem. Hyper-spectral images pose even more difficulty due to their high-dimensionality. Spectral clustering (SC) is a recently popular clustering/segmentation algorithm. In general, SC lifts the data to a high dimensional space, also known as the kernel trick, then derive eigenvectors in this new space, and finally using these new dimensions partition the data into clusters. We demonstrate that SC works efficiently when combined with covariance descriptors that can be used to assess pixelwise similarities rather than in the high-dimensional Euclidean space. We present the formulations and some preliminary results of the proposed hybrid image segmentation method for hyper-spectral images.
NASA Astrophysics Data System (ADS)
Sun, Jiajia; Li, Yaoguo
2017-02-01
Joint inversion that simultaneously inverts multiple geophysical data sets to recover a common Earth model is increasingly being applied to exploration problems. Petrophysical data can serve as an effective constraint to link different physical property models in such inversions. There are two challenges, among others, associated with the petrophysical approach to joint inversion. One is related to the multimodality of petrophysical data because there often exist more than one relationship between different physical properties in a region of study. The other challenge arises from the fact that petrophysical relationships have different characteristics and can exhibit point, linear, quadratic, or exponential forms in a crossplot. The fuzzy c-means (FCM) clustering technique is effective in tackling the first challenge and has been applied successfully. We focus on the second challenge in this paper and develop a joint inversion method based on variations of the FCM clustering technique. To account for the specific shapes of petrophysical relationships, we introduce several different fuzzy clustering algorithms that are capable of handling different shapes of petrophysical relationships. We present two synthetic and one field data examples and demonstrate that, by choosing appropriate distance measures for the clustering component in the joint inversion algorithm, the proposed joint inversion method provides an effective means of handling common petrophysical situations we encounter in practice. The jointly inverted models have both enhanced structural similarity and increased petrophysical correlation, and better represent the subsurface in the spatial domain and the parameter domain of physical properties.
NASA Astrophysics Data System (ADS)
Abdul-Nasir, Aimi Salihah; Mashor, Mohd Yusoff; Halim, Nurul Hazwani Abd; Mohamed, Zeehaida
2015-05-01
Malaria is a life-threatening parasitic infectious disease that corresponds for nearly one million deaths each year. Due to the requirement of prompt and accurate diagnosis of malaria, the current study has proposed an unsupervised pixel segmentation based on clustering algorithm in order to obtain the fully segmented red blood cells (RBCs) infected with malaria parasites based on the thin blood smear images of P. vivax species. In order to obtain the segmented infected cell, the malaria images are first enhanced by using modified global contrast stretching technique. Then, an unsupervised segmentation technique based on clustering algorithm has been applied on the intensity component of malaria image in order to segment the infected cell from its blood cells background. In this study, cascaded moving k-means (MKM) and fuzzy c-means (FCM) clustering algorithms has been proposed for malaria slide image segmentation. After that, median filter algorithm has been applied to smooth the image as well as to remove any unwanted regions such as small background pixels from the image. Finally, seeded region growing area extraction algorithm has been applied in order to remove large unwanted regions that are still appeared on the image due to their size in which cannot be cleaned by using median filter. The effectiveness of the proposed cascaded MKM and FCM clustering algorithms has been analyzed qualitatively and quantitatively by comparing the proposed cascaded clustering algorithm with MKM and FCM clustering algorithms. Overall, the results indicate that segmentation using the proposed cascaded clustering algorithm has produced the best segmentation performances by achieving acceptable sensitivity as well as high specificity and accuracy values compared to the segmentation results provided by MKM and FCM algorithms.
Fong, Simon; Deb, Suash; Yang, Xin-She; Zhuang, Yan
2014-01-01
Traditional K-means clustering algorithms have the drawback of getting stuck at local optima that depend on the random values of initial centroids. Optimization algorithms have their advantages in guiding iterative computation to search for global optima while avoiding local optima. The algorithms help speed up the clustering process by converging into a global optimum early with multiple search agents in action. Inspired by nature, some contemporary optimization algorithms which include Ant, Bat, Cuckoo, Firefly, and Wolf search algorithms mimic the swarming behavior allowing them to cooperatively steer towards an optimal objective within a reasonable time. It is known that these so-called nature-inspired optimization algorithms have their own characteristics as well as pros and cons in different applications. When these algorithms are combined with K-means clustering mechanism for the sake of enhancing its clustering quality by avoiding local optima and finding global optima, the new hybrids are anticipated to produce unprecedented performance. In this paper, we report the results of our evaluation experiments on the integration of nature-inspired optimization methods into K-means algorithms. In addition to the standard evaluation metrics in evaluating clustering quality, the extended K-means algorithms that are empowered by nature-inspired optimization methods are applied on image segmentation as a case study of application scenario.
Deb, Suash; Yang, Xin-She
2014-01-01
Traditional K-means clustering algorithms have the drawback of getting stuck at local optima that depend on the random values of initial centroids. Optimization algorithms have their advantages in guiding iterative computation to search for global optima while avoiding local optima. The algorithms help speed up the clustering process by converging into a global optimum early with multiple search agents in action. Inspired by nature, some contemporary optimization algorithms which include Ant, Bat, Cuckoo, Firefly, and Wolf search algorithms mimic the swarming behavior allowing them to cooperatively steer towards an optimal objective within a reasonable time. It is known that these so-called nature-inspired optimization algorithms have their own characteristics as well as pros and cons in different applications. When these algorithms are combined with K-means clustering mechanism for the sake of enhancing its clustering quality by avoiding local optima and finding global optima, the new hybrids are anticipated to produce unprecedented performance. In this paper, we report the results of our evaluation experiments on the integration of nature-inspired optimization methods into K-means algorithms. In addition to the standard evaluation metrics in evaluating clustering quality, the extended K-means algorithms that are empowered by nature-inspired optimization methods are applied on image segmentation as a case study of application scenario. PMID:25202730
Lei, Yang; Yu, Dai; Bin, Zhang; Yang, Yang
2017-01-01
Clustering algorithm as a basis of data analysis is widely used in analysis systems. However, as for the high dimensions of the data, the clustering algorithm may overlook the business relation between these dimensions especially in the medical fields. As a result, usually the clustering result may not meet the business goals of the users. Then, in the clustering process, if it can combine the knowledge of the users, that is, the doctor's knowledge or the analysis intent, the clustering result can be more satisfied. In this paper, we propose an interactive K -means clustering method to improve the user's satisfactions towards the result. The core of this method is to get the user's feedback of the clustering result, to optimize the clustering result. Then, a particle swarm optimization algorithm is used in the method to optimize the parameters, especially the weight settings in the clustering algorithm to make it reflect the user's business preference as possible. After that, based on the parameter optimization and adjustment, the clustering result can be closer to the user's requirement. Finally, we take an example in the breast cancer, to testify our method. The experiments show the better performance of our algorithm.
Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic Programming.
Wang, Haizhou; Song, Mingzhou
2011-12-01
The heuristic k -means algorithm, widely used for cluster analysis, does not guarantee optimality. We developed a dynamic programming algorithm for optimal one-dimensional clustering. The algorithm is implemented as an R package called Ckmeans.1d.dp . We demonstrate its advantage in optimality and runtime over the standard iterative k -means algorithm.
Inference from clustering with application to gene-expression microarrays.
Dougherty, Edward R; Barrera, Junior; Brun, Marcel; Kim, Seungchan; Cesar, Roberto M; Chen, Yidong; Bittner, Michael; Trent, Jeffrey M
2002-01-01
There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics. A large amount of generated output is available over the web.
Using Grey Wolf Algorithm to Solve the Capacitated Vehicle Routing Problem
NASA Astrophysics Data System (ADS)
Korayem, L.; Khorsid, M.; Kassem, S. S.
2015-05-01
The capacitated vehicle routing problem (CVRP) is a class of the vehicle routing problems (VRPs). In CVRP a set of identical vehicles having fixed capacities are required to fulfill customers' demands for a single commodity. The main objective is to minimize the total cost or distance traveled by the vehicles while satisfying a number of constraints, such as: the capacity constraint of each vehicle, logical flow constraints, etc. One of the methods employed in solving the CVRP is the cluster-first route-second method. It is a technique based on grouping of customers into a number of clusters, where each cluster is served by one vehicle. Once clusters are formed, a route determining the best sequence to visit customers is established within each cluster. The recently bio-inspired grey wolf optimizer (GWO), introduced in 2014, has proven to be efficient in solving unconstrained, as well as, constrained optimization problems. In the current research, our main contributions are: combining GWO with the traditional K-means clustering algorithm to generate the ‘K-GWO’ algorithm, deriving a capacitated version of the K-GWO algorithm by incorporating a capacity constraint into the aforementioned algorithm, and finally, developing 2 new clustering heuristics. The resulting algorithm is used in the clustering phase of the cluster-first route-second method to solve the CVR problem. The algorithm is tested on a number of benchmark problems with encouraging results.
Chemodynamical Clustering Applied to APOGEE Data: Rediscovering Globular Clusters
NASA Astrophysics Data System (ADS)
Chen, Boquan; D’Onghia, Elena; Pardy, Stephen A.; Pasquali, Anna; Bertelli Motta, Clio; Hanlon, Bret; Grebel, Eva K.
2018-06-01
We have developed a novel technique based on a clustering algorithm that searches for kinematically and chemically clustered stars in the APOGEE DR12 Cannon data. As compared to classical chemical tagging, the kinematic information included in our methodology allows us to identify stars that are members of known globular clusters with greater confidence. We apply our algorithm to the entire APOGEE catalog of 150,615 stars whose chemical abundances are derived by the Cannon. Our methodology found anticorrelations between the elements Al and Mg, Na and O, and C and N previously identified in the optical spectra in globular clusters, even though we omit these elements in our algorithm. Our algorithm identifies globular clusters without a priori knowledge of their locations in the sky. Thus, not only does this technique promise to discover new globular clusters, but it also allows us to identify candidate streams of kinematically and chemically clustered stars in the Milky Way.
Clustering performance comparison using K-means and expectation maximization algorithms.
Jung, Yong Gyu; Kang, Min Soo; Heo, Jun
2014-11-14
Clustering is an important means of data mining based on separating data categories by similar features. Unlike the classification algorithm, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K -means and the expectation maximization (EM) algorithm. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, the logistic regression analysis is applied to EM clusters and the K -means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.
Rideout, Jai Ram; He, Yan; Navas-Molina, Jose A; Walters, William A; Ursell, Luke K; Gibbons, Sean M; Chase, John; McDonald, Daniel; Gonzalez, Antonio; Robbins-Pianka, Adam; Clemente, Jose C; Gilbert, Jack A; Huse, Susan M; Zhou, Hong-Wei; Knight, Rob; Caporaso, J Gregory
2014-01-01
We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to "classic" open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, "classic" open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 the time than would be required of "classic" open reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by "classic" open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME's uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms. Finally, we present a comparison of parameter settings in QIIME's OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
Basic firefly algorithm for document clustering
NASA Astrophysics Data System (ADS)
Mohammed, Athraa Jasim; Yusof, Yuhanis; Husni, Husniza
2015-12-01
The Document clustering plays significant role in Information Retrieval (IR) where it organizes documents prior to the retrieval process. To date, various clustering algorithms have been proposed and this includes the K-means and Particle Swarm Optimization. Even though these algorithms have been widely applied in many disciplines due to its simplicity, such an approach tends to be trapped in a local minimum during its search for an optimal solution. To address the shortcoming, this paper proposes a Basic Firefly (Basic FA) algorithm to cluster text documents. The algorithm employs the Average Distance to Document Centroid (ADDC) as the objective function of the search. Experiments utilizing the proposed algorithm were conducted on the 20Newsgroups benchmark dataset. Results demonstrate that the Basic FA generates a more robust and compact clusters than the ones produced by K-means and Particle Swarm Optimization (PSO).
Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra.
Rieder, Vera; Schork, Karin U; Kerschke, Laura; Blank-Landeshammer, Bernhard; Sickmann, Albert; Rahnenführer, Jörg
2017-11-03
In proteomics, liquid chromatography-tandem mass spectrometry (LC-MS/MS) is established for identifying peptides and proteins. Duplicated spectra, that is, multiple spectra of the same peptide, occur both in single MS/MS runs and in large spectral libraries. Clustering tandem mass spectra is used to find consensus spectra, with manifold applications. First, it speeds up database searches, as performed for instance by Mascot. Second, it helps to identify novel peptides across species. Third, it is used for quality control to detect wrongly annotated spectra. We compare different clustering algorithms based on the cosine distance between spectra. CAST, MS-Cluster, and PRIDE Cluster are popular algorithms to cluster tandem mass spectra. We add well-known algorithms for large data sets, hierarchical clustering, DBSCAN, and connected components of a graph, as well as the new method N-Cluster. All algorithms are evaluated on real data with varied parameter settings. Cluster results are compared with each other and with peptide annotations based on validation measures such as purity. Quality control, regarding the detection of wrongly (un)annotated spectra, is discussed for exemplary resulting clusters. N-Cluster proves to be highly competitive. All clustering results benefit from the so-called DISMS2 filter that integrates additional information, for example, on precursor mass.
Bacciu, Davide; Starita, Antonina
2008-11-01
Determining a compact neural coding for a set of input stimuli is an issue that encompasses several biological memory mechanisms as well as various artificial neural network models. In particular, establishing the optimal network structure is still an open problem when dealing with unsupervised learning models. In this paper, we introduce a novel learning algorithm, named competitive repetition-suppression (CoRe) learning, inspired by a cortical memory mechanism called repetition suppression (RS). We show how such a mechanism is used, at various levels of the cerebral cortex, to generate compact neural representations of the visual stimuli. From the general CoRe learning model, we derive a clustering algorithm, named CoRe clustering, that can automatically estimate the unknown cluster number from the data without using a priori information concerning the input distribution. We illustrate how CoRe clustering, besides its biological plausibility, posses strong theoretical properties in terms of robustness to noise and outliers, and we provide an error function describing CoRe learning dynamics. Such a description is used to analyze CoRe relationships with the state-of-the art clustering models and to highlight CoRe similitude with rival penalized competitive learning (RPCL), showing how CoRe extends such a model by strengthening the rival penalization estimation by means of loss functions from robust statistics.
Korkmaz, Selcuk; Zararsiz, Gokmen; Goksuluk, Dincer
2015-01-01
Virtual screening is an important step in early-phase of drug discovery process. Since there are thousands of compounds, this step should be both fast and effective in order to distinguish drug-like and nondrug-like molecules. Statistical machine learning methods are widely used in drug discovery studies for classification purpose. Here, we aim to develop a new tool, which can classify molecules as drug-like and nondrug-like based on various machine learning methods, including discriminant, tree-based, kernel-based, ensemble and other algorithms. To construct this tool, first, performances of twenty-three different machine learning algorithms are compared by ten different measures, then, ten best performing algorithms have been selected based on principal component and hierarchical cluster analysis results. Besides classification, this application has also ability to create heat map and dendrogram for visual inspection of the molecules through hierarchical cluster analysis. Moreover, users can connect the PubChem database to download molecular information and to create two-dimensional structures of compounds. This application is freely available through www.biosoft.hacettepe.edu.tr/MLViS/. PMID:25928885
Model-based clustering for RNA-seq data.
Si, Yaqing; Liu, Peng; Li, Pinghua; Brutnell, Thomas P
2014-01-15
RNA-seq technology has been widely adopted as an attractive alternative to microarray-based methods to study global gene expression. However, robust statistical tools to analyze these complex datasets are still lacking. By grouping genes with similar expression profiles across treatments, cluster analysis provides insight into gene functions and networks, and hence is an important technique for RNA-seq data analysis. In this manuscript, we derive clustering algorithms based on appropriate probability models for RNA-seq data. An expectation-maximization algorithm and another two stochastic versions of expectation-maximization algorithms are described. In addition, a strategy for initialization based on likelihood is proposed to improve the clustering algorithms. Moreover, we present a model-based hybrid-hierarchical clustering method to generate a tree structure that allows visualization of relationships among clusters as well as flexibility of choosing the number of clusters. Results from both simulation studies and analysis of a maize RNA-seq dataset show that our proposed methods provide better clustering results than alternative methods such as the K-means algorithm and hierarchical clustering methods that are not based on probability models. An R package, MBCluster.Seq, has been developed to implement our proposed algorithms. This R package provides fast computation and is publicly available at http://www.r-project.org
Landscape Analysis and Algorithm Development for Plateau Plagued Search Spaces
2011-02-28
Final Report for AFOSR #FA9550-08-1-0422 Landscape Analysis and Algorithm Development for Plateau Plagued Search Spaces August 1, 2008 to November 30...focused on developing high level general purpose algorithms , such as Tabu Search and Genetic Algorithms . However, understanding of when and why these... algorithms perform well still lags. Our project extended the theory of certain combi- natorial optimization problems to develop analytical
An improved initialization center k-means clustering algorithm based on distance and density
NASA Astrophysics Data System (ADS)
Duan, Yanling; Liu, Qun; Xia, Shuyin
2018-04-01
Aiming at the problem of the random initial clustering center of k means algorithm that the clustering results are influenced by outlier data sample and are unstable in multiple clustering, a method of central point initialization method based on larger distance and higher density is proposed. The reciprocal of the weighted average of distance is used to represent the sample density, and the data sample with the larger distance and the higher density are selected as the initial clustering centers to optimize the clustering results. Then, a clustering evaluation method based on distance and density is designed to verify the feasibility of the algorithm and the practicality, the experimental results on UCI data sets show that the algorithm has a certain stability and practicality.
Diametrical clustering for identifying anti-correlated gene clusters.
Dhillon, Inderjit S; Marcotte, Edward M; Roshan, Usman
2003-09-01
Clustering genes based upon their expression patterns allows us to predict gene function. Most existing clustering algorithms cluster genes together when their expression patterns show high positive correlation. However, it has been observed that genes whose expression patterns are strongly anti-correlated can also be functionally similar. Biologically, this is not unintuitive-genes responding to the same stimuli, regardless of the nature of the response, are more likely to operate in the same pathways. We present a new diametrical clustering algorithm that explicitly identifies anti-correlated clusters of genes. Our algorithm proceeds by iteratively (i). re-partitioning the genes and (ii). computing the dominant singular vector of each gene cluster; each singular vector serving as the prototype of a 'diametric' cluster. We empirically show the effectiveness of the algorithm in identifying diametrical or anti-correlated clusters. Testing the algorithm on yeast cell cycle data, fibroblast gene expression data, and DNA microarray data from yeast mutants reveals that opposed cellular pathways can be discovered with this method. We present systems whose mRNA expression patterns, and likely their functions, oppose the yeast ribosome and proteosome, along with evidence for the inverse transcriptional regulation of a number of cellular systems.
General purpose graphic processing unit implementation of adaptive pulse compression algorithms
NASA Astrophysics Data System (ADS)
Cai, Jingxiao; Zhang, Yan
2017-07-01
This study introduces a practical approach to implement real-time signal processing algorithms for general surveillance radar based on NVIDIA graphical processing units (GPUs). The pulse compression algorithms are implemented using compute unified device architecture (CUDA) libraries such as CUDA basic linear algebra subroutines and CUDA fast Fourier transform library, which are adopted from open source libraries and optimized for the NVIDIA GPUs. For more advanced, adaptive processing algorithms such as adaptive pulse compression, customized kernel optimization is needed and investigated. A statistical optimization approach is developed for this purpose without needing much knowledge of the physical configurations of the kernels. It was found that the kernel optimization approach can significantly improve the performance. Benchmark performance is compared with the CPU performance in terms of processing accelerations. The proposed implementation framework can be used in various radar systems including ground-based phased array radar, airborne sense and avoid radar, and aerospace surveillance radar.
An open source software for fast grid-based data-mining in spatial epidemiology (FGBASE).
Baker, David M; Valleron, Alain-Jacques
2014-10-30
Examining whether disease cases are clustered in space is an important part of epidemiological research. Another important part of spatial epidemiology is testing whether patients suffering from a disease are more, or less, exposed to environmental factors of interest than adequately defined controls. Both approaches involve determining the number of cases and controls (or population at risk) in specific zones. For cluster searches, this often must be done for millions of different zones. Doing this by calculating distances can lead to very lengthy computations. In this work we discuss the computational advantages of geographical grid-based methods, and introduce an open source software (FGBASE) which we have created for this purpose. Geographical grids based on the Lambert Azimuthal Equal Area projection are well suited for spatial epidemiology because they preserve area: each cell of the grid has the same area. We describe how data is projected onto such a grid, as well as grid-based algorithms for spatial epidemiological data-mining. The software program (FGBASE), that we have developed, implements these grid-based methods. The grid based algorithms perform extremely fast. This is particularly the case for cluster searches. When applied to a cohort of French Type 1 Diabetes (T1D) patients, as an example, the grid based algorithms detected potential clusters in a few seconds on a modern laptop. This compares very favorably to an equivalent cluster search using distance calculations instead of a grid, which took over 4 hours on the same computer. In the case study we discovered 4 potential clusters of T1D cases near the cities of Le Havre, Dunkerque, Toulouse and Nantes. One example of environmental analysis with our software was to study whether a significant association could be found between distance to vineyards with heavy pesticide. None was found. In both examples, the software facilitates the rapid testing of hypotheses. Grid-based algorithms for mining spatial epidemiological data provide advantages in terms of computational complexity thus improving the speed of computations. We believe that these methods and this software tool (FGBASE) will lower the computational barriers to entry for those performing epidemiological research.
A novel harmony search-K means hybrid algorithm for clustering gene expression data
Nazeer, KA Abdul; Sebastian, MP; Kumar, SD Madhu
2013-01-01
Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. The DNA microarray technology makes it possible to simultaneously analyze large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns from large quantities of expression data that in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k- ¬means clustering algorithm is widely used for many practical applications. But the original k-¬means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-¬means algorithm. A meta-heuristic optimization algorithm named harmony search helps find out near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering the gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms. PMID:23390351
A novel harmony search-K means hybrid algorithm for clustering gene expression data.
Nazeer, Ka Abdul; Sebastian, Mp; Kumar, Sd Madhu
2013-01-01
Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. The DNA microarray technology makes it possible to simultaneously analyze large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns from large quantities of expression data that in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k- ¬means clustering algorithm is widely used for many practical applications. But the original k-¬means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-¬means algorithm. A meta-heuristic optimization algorithm named harmony search helps find out near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering the gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms.
m-BIRCH: an online clustering approach for computer vision applications
NASA Astrophysics Data System (ADS)
Madan, Siddharth K.; Dana, Kristin J.
2015-03-01
We adapt a classic online clustering algorithm called Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), to incrementally cluster large datasets of features commonly used in multimedia and computer vision. We call the adapted version modified-BIRCH (m-BIRCH). The algorithm uses only a fraction of the dataset memory to perform clustering, and updates the clustering decisions when new data comes in. Modifications made in m-BIRCH enable data driven parameter selection and effectively handle varying density regions in the feature space. Data driven parameter selection automatically controls the level of coarseness of the data summarization. Effective handling of varying density regions is necessary to well represent the different density regions in data summarization. We use m-BIRCH to cluster 840K color SIFT descriptors, and 60K outlier corrupted grayscale patches. We use the algorithm to cluster datasets consisting of challenging non-convex clustering patterns. Our implementation of the algorithm provides an useful clustering tool and is made publicly available.
Impact of missing data imputation methods on gene expression clustering and classification.
de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G
2015-02-26
Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
A Self-Adaptive Fuzzy c-Means Algorithm for Determining the Optimal Number of Clusters
Wang, Zhihao; Yi, Jing
2016-01-01
For the shortcoming of fuzzy c-means algorithm (FCM) needing to know the number of clusters in advance, this paper proposed a new self-adaptive method to determine the optimal number of clusters. Firstly, a density-based algorithm was put forward. The algorithm, according to the characteristics of the dataset, automatically determined the possible maximum number of clusters instead of using the empirical rule n and obtained the optimal initial cluster centroids, improving the limitation of FCM that randomly selected cluster centroids lead the convergence result to the local minimum. Secondly, this paper, by introducing a penalty function, proposed a new fuzzy clustering validity index based on fuzzy compactness and separation, which ensured that when the number of clusters verged on that of objects in the dataset, the value of clustering validity index did not monotonically decrease and was close to zero, so that the optimal number of clusters lost robustness and decision function. Then, based on these studies, a self-adaptive FCM algorithm was put forward to estimate the optimal number of clusters by the iterative trial-and-error process. At last, experiments were done on the UCI, KDD Cup 1999, and synthetic datasets, which showed that the method not only effectively determined the optimal number of clusters, but also reduced the iteration of FCM with the stable clustering result. PMID:28042291
Poole, William; Leinonen, Kalle; Shmulevich, Ilya
2017-01-01
Cancer researchers have long recognized that somatic mutations are not uniformly distributed within genes. However, most approaches for identifying cancer mutations focus on either the entire-gene or single amino-acid level. We have bridged these two methodologies with a multiscale mutation clustering algorithm that identifies variable length mutation clusters in cancer genes. We ran our algorithm on 539 genes using the combined mutation data in 23 cancer types from The Cancer Genome Atlas (TCGA) and identified 1295 mutation clusters. The resulting mutation clusters cover a wide range of scales and often overlap with many kinds of protein features including structured domains, phosphorylation sites, and known single nucleotide variants. We statistically associated these multiscale clusters with gene expression and drug response data to illuminate the functional and clinical consequences of mutations in our clusters. Interestingly, we find multiple clusters within individual genes that have differential functional associations: these include PTEN, FUBP1, and CDH1. This methodology has potential implications in identifying protein regions for drug targets, understanding the biological underpinnings of cancer, and personalizing cancer treatments. Toward this end, we have made the mutation clusters and the clustering algorithm available to the public. Clusters and pathway associations can be interactively browsed at m2c.systemsbiology.net. The multiscale mutation clustering algorithm is available at https://github.com/IlyaLab/M2C. PMID:28170390
Poole, William; Leinonen, Kalle; Shmulevich, Ilya; Knijnenburg, Theo A; Bernard, Brady
2017-02-01
Cancer researchers have long recognized that somatic mutations are not uniformly distributed within genes. However, most approaches for identifying cancer mutations focus on either the entire-gene or single amino-acid level. We have bridged these two methodologies with a multiscale mutation clustering algorithm that identifies variable length mutation clusters in cancer genes. We ran our algorithm on 539 genes using the combined mutation data in 23 cancer types from The Cancer Genome Atlas (TCGA) and identified 1295 mutation clusters. The resulting mutation clusters cover a wide range of scales and often overlap with many kinds of protein features including structured domains, phosphorylation sites, and known single nucleotide variants. We statistically associated these multiscale clusters with gene expression and drug response data to illuminate the functional and clinical consequences of mutations in our clusters. Interestingly, we find multiple clusters within individual genes that have differential functional associations: these include PTEN, FUBP1, and CDH1. This methodology has potential implications in identifying protein regions for drug targets, understanding the biological underpinnings of cancer, and personalizing cancer treatments. Toward this end, we have made the mutation clusters and the clustering algorithm available to the public. Clusters and pathway associations can be interactively browsed at m2c.systemsbiology.net. The multiscale mutation clustering algorithm is available at https://github.com/IlyaLab/M2C.
The Mucciardi-Gose Clustering Algorithm and Its Applications in Automatic Pattern Recognition.
A procedure known as the Mucciardi- Gose clustering algorithm, CLUSTR, for determining the geometrical or statistical relationships among groups of N...discussion of clustering algorithms is given; the particular advantages of the Mucciardi- Gose procedure are described. The mathematical basis for, and the
Security clustering algorithm based on reputation in hierarchical peer-to-peer network
NASA Astrophysics Data System (ADS)
Chen, Mei; Luo, Xin; Wu, Guowen; Tan, Yang; Kita, Kenji
2013-03-01
For the security problems of the hierarchical P2P network (HPN), the paper presents a security clustering algorithm based on reputation (CABR). In the algorithm, we take the reputation mechanism for ensuring the security of transaction and use cluster for managing the reputation mechanism. In order to improve security, reduce cost of network brought by management of reputation and enhance stability of cluster, we select reputation, the historical average online time, and the network bandwidth as the basic factors of the comprehensive performance of node. Simulation results showed that the proposed algorithm improved the security, reduced the network overhead, and enhanced stability of cluster.
Shah, Sohil Atul
2017-01-01
Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank. PMID:28851838
Determining open cluster membership. A Bayesian framework for quantitative member classification
NASA Astrophysics Data System (ADS)
Stott, Jonathan J.
2018-01-01
Aims: My goal is to develop a quantitative algorithm for assessing open cluster membership probabilities. The algorithm is designed to work with single-epoch observations. In its simplest form, only one set of program images and one set of reference images are required. Methods: The algorithm is based on a two-stage joint astrometric and photometric assessment of cluster membership probabilities. The probabilities were computed within a Bayesian framework using any available prior information. Where possible, the algorithm emphasizes simplicity over mathematical sophistication. Results: The algorithm was implemented and tested against three observational fields using published survey data. M 67 and NGC 654 were selected as cluster examples while a third, cluster-free, field was used for the final test data set. The algorithm shows good quantitative agreement with the existing surveys and has a false-positive rate significantly lower than the astrometric or photometric methods used individually.
Random Walk Quantum Clustering Algorithm Based on Space
NASA Astrophysics Data System (ADS)
Xiao, Shufen; Dong, Yumin; Ma, Hongyang
2018-01-01
In the random quantum walk, which is a quantum simulation of the classical walk, data points interacted when selecting the appropriate walk strategy by taking advantage of quantum-entanglement features; thus, the results obtained when the quantum walk is used are different from those when the classical walk is adopted. A new quantum walk clustering algorithm based on space is proposed by applying the quantum walk to clustering analysis. In this algorithm, data points are viewed as walking participants, and similar data points are clustered using the walk function in the pay-off matrix according to a certain rule. The walk process is simplified by implementing a space-combining rule. The proposed algorithm is validated by a simulation test and is proved superior to existing clustering algorithms, namely, Kmeans, PCA + Kmeans, and LDA-Km. The effects of some of the parameters in the proposed algorithm on its performance are also analyzed and discussed. Specific suggestions are provided.
A highly efficient multi-core algorithm for clustering extremely large datasets
2010-01-01
Background In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorial SNP data. Our new shared memory parallel algorithms show to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hui, C; Suh, Y; Robertson, D
Purpose: To develop a novel algorithm to generate internal respiratory signals for sorting of four-dimensional (4D) computed tomography (CT) images. Methods: The proposed algorithm extracted multiple time resolved features as potential respiratory signals. These features were taken from the 4D CT images and its Fourier transformed space. Several low-frequency locations in the Fourier space and selected anatomical features from the images were used as potential respiratory signals. A clustering algorithm was then used to search for the group of appropriate potential respiratory signals. The chosen signals were then normalized and averaged to form the final internal respiratory signal. Performance ofmore » the algorithm was tested in 50 4D CT data sets and results were compared with external signals from the real-time position management (RPM) system. Results: In almost all cases, the proposed algorithm generated internal respiratory signals that visibly matched the external respiratory signals from the RPM system. On average, the end inspiration times calculated by the proposed algorithm were within 0.1 s of those given by the RPM system. Less than 3% of the calculated end inspiration times were more than one time frame away from those given by the RPM system. In 3 out of the 50 cases, the proposed algorithm generated internal respiratory signals that were significantly smoother than the RPM signals. In these cases, images sorted using the internal respiratory signals showed fewer artifacts in locations corresponding to the discrepancy in the internal and external respiratory signals. Conclusion: We developed a robust algorithm that generates internal respiratory signals from 4D CT images. In some cases, it even showed the potential to outperform the RPM system. The proposed algorithm is completely automatic and generally takes less than 2 min to process. It can be easily implemented into the clinic and can potentially replace the use of external surrogates.« less
Contributions to "k"-Means Clustering and Regression via Classification Algorithms
ERIC Educational Resources Information Center
Salman, Raied
2012-01-01
The dissertation deals with clustering algorithms and transforming regression problems into classification problems. The main contributions of the dissertation are twofold; first, to improve (speed up) the clustering algorithms and second, to develop a strict learning environment for solving regression problems as classification tasks by using…
Multi scales based sparse matrix spectral clustering image segmentation
NASA Astrophysics Data System (ADS)
Liu, Zhongmin; Chen, Zhicai; Li, Zhanming; Hu, Wenjin
2018-04-01
In image segmentation, spectral clustering algorithms have to adopt the appropriate scaling parameter to calculate the similarity matrix between the pixels, which may have a great impact on the clustering result. Moreover, when the number of data instance is large, computational complexity and memory use of the algorithm will greatly increase. To solve these two problems, we proposed a new spectral clustering image segmentation algorithm based on multi scales and sparse matrix. We devised a new feature extraction method at first, then extracted the features of image on different scales, at last, using the feature information to construct sparse similarity matrix which can improve the operation efficiency. Compared with traditional spectral clustering algorithm, image segmentation experimental results show our algorithm have better degree of accuracy and robustness.
An AK-LDMeans algorithm based on image clustering
NASA Astrophysics Data System (ADS)
Chen, Huimin; Li, Xingwei; Zhang, Yongbin; Chen, Nan
2018-03-01
Clustering is an effective analytical technique for handling unmarked data for value mining. Its ultimate goal is to mark unclassified data quickly and correctly. We use the roadmap for the current image processing as the experimental background. In this paper, we propose an AK-LDMeans algorithm to automatically lock the K value by designing the Kcost fold line, and then use the long-distance high-density method to select the clustering centers to further replace the traditional initial clustering center selection method, which further improves the efficiency and accuracy of the traditional K-Means Algorithm. And the experimental results are compared with the current clustering algorithm and the results are obtained. The algorithm can provide effective reference value in the fields of image processing, machine vision and data mining.
Optimal Alignment of Structures for Finite and Periodic Systems.
Griffiths, Matthew; Niblett, Samuel P; Wales, David J
2017-10-10
Finding the optimal alignment between two structures is important for identifying the minimum root-mean-square distance (RMSD) between them and as a starting point for calculating pathways. Most current algorithms for aligning structures are stochastic, scale exponentially with the size of structure, and the performance can be unreliable. We present two complementary methods for aligning structures corresponding to isolated clusters of atoms and to condensed matter described by a periodic cubic supercell. The first method (Go-PERMDIST), a branch and bound algorithm, locates the global minimum RMSD deterministically in polynomial time. The run time increases for larger RMSDs. The second method (FASTOVERLAP) is a heuristic algorithm that aligns structures by finding the global maximum kernel correlation between them using fast Fourier transforms (FFTs) and fast SO(3) transforms (SOFTs). For periodic systems, FASTOVERLAP scales with the square of the number of identical atoms in the system, reliably finds the best alignment between structures that are not too distant, and shows significantly better performance than existing algorithms. The expected run time for Go-PERMDIST is longer than FASTOVERLAP for periodic systems. For finite clusters, the FASTOVERLAP algorithm is competitive with existing algorithms. The expected run time for Go-PERMDIST to find the global RMSD between two structures deterministically is generally longer than for existing stochastic algorithms. However, with an earlier exit condition, Go-PERMDIST exhibits similar or better performance.
Hierarchical trie packet classification algorithm based on expectation-maximization clustering.
Bi, Xia-An; Zhao, Junxia
2017-01-01
With the development of computer network bandwidth, packet classification algorithms which are able to deal with large-scale rule sets are in urgent need. Among the existing algorithms, researches on packet classification algorithms based on hierarchical trie have become an important packet classification research branch because of their widely practical use. Although hierarchical trie is beneficial to save large storage space, it has several shortcomings such as the existence of backtracking and empty nodes. This paper proposes a new packet classification algorithm, Hierarchical Trie Algorithm Based on Expectation-Maximization Clustering (HTEMC). Firstly, this paper uses the formalization method to deal with the packet classification problem by means of mapping the rules and data packets into a two-dimensional space. Secondly, this paper uses expectation-maximization algorithm to cluster the rules based on their aggregate characteristics, and thereby diversified clusters are formed. Thirdly, this paper proposes a hierarchical trie based on the results of expectation-maximization clustering. Finally, this paper respectively conducts simulation experiments and real-environment experiments to compare the performances of our algorithm with other typical algorithms, and analyzes the results of the experiments. The hierarchical trie structure in our algorithm not only adopts trie path compression to eliminate backtracking, but also solves the problem of low efficiency of trie updates, which greatly improves the performance of the algorithm.
Esteban, Santiago; Rodríguez Tablado, Manuel; Peper, Francisco; Mahumud, Yamila S; Ricci, Ricardo I; Kopitowski, Karin; Terrasa, Sergio
2017-01-01
Precision medicine requires extremely large samples. Electronic health records (EHR) are thought to be a cost-effective source of data for that purpose. Phenotyping algorithms help reduce classification errors, making EHR a more reliable source of information for research. Four algorithm development strategies for classifying patients according to their diabetes status (diabetics; non-diabetics; inconclusive) were tested (one codes-only algorithm; one boolean algorithm, four statistical learning algorithms and six stacked generalization meta-learners). The best performing algorithms within each strategy were tested on the validation set. The stacked generalization algorithm yielded the highest Kappa coefficient value in the validation set (0.95 95% CI 0.91, 0.98). The implementation of these algorithms allows for the exploitation of data from thousands of patients accurately, greatly reducing the costs of constructing retrospective cohorts for research.
Energy Aware Cluster-Based Routing in Flying Ad-Hoc Networks.
Aadil, Farhan; Raza, Ali; Khan, Muhammad Fahad; Maqsood, Muazzam; Mehmood, Irfan; Rho, Seungmin
2018-05-03
Flying ad-hoc networks (FANETs) are a very vibrant research area nowadays. They have many military and civil applications. Limited battery energy and the high mobility of micro unmanned aerial vehicles (UAVs) represent their two main problems, i.e., short flight time and inefficient routing. In this paper, we try to address both of these problems by means of efficient clustering. First, we adjust the transmission power of the UAVs by anticipating their operational requirements. Optimal transmission range will have minimum packet loss ratio (PLR) and better link quality, which ultimately save the energy consumed during communication. Second, we use a variant of the K-Means Density clustering algorithm for selection of cluster heads. Optimal cluster heads enhance the cluster lifetime and reduce the routing overhead. The proposed model outperforms the state of the art artificial intelligence techniques such as Ant Colony Optimization-based clustering algorithm and Grey Wolf Optimization-based clustering algorithm. The performance of the proposed algorithm is evaluated in term of number of clusters, cluster building time, cluster lifetime and energy consumption.
SU-F-T-268: A Feasibility Study of Independent Dose Verification for Vero4DRT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yamashita, M; Kokubo, M; Institute of Biomedical Research and Innovation, Kobe, Hyogo
2016-06-15
Purpose: Vero4DRT (Mitsubishi Heavy Industries Ltd.) has been released for a few years. The treatment planning system (TPS) of Vero4DRT is dedicated, so the measurement is the only method of dose verification. There have been no reports of independent dose verification using Clarksonbased algorithm for Vero4DRT. An independent dose verification software program of the general-purpose linac using a modified Clarkson-based algorithm was modified for Vero4DRT. In this study, we evaluated the accuracy of independent dose verification program and the feasibility of the secondary check for Vero4DRT. Methods: iPlan (Brainlab AG) was used as the TPS. PencilBeam Convolution was used formore » dose calculation algorithm of IMRT and X-ray Voxel Monte Carlo was used for the others. Simple MU Analysis (SMU, Triangle Products, Japan) was used as the independent dose verification software program in which CT-based dose calculation was performed using a modified Clarkson-based algorithm. In this study, 120 patients’ treatment plans were collected in our institute. The treatments were performed using the conventional irradiation for lung and prostate, SBRT for lung and Step and shoot IMRT for prostate. Comparison in dose between the TPS and the SMU was done and confidence limits (CLs, Mean ± 2SD %) were compared to those from the general-purpose linac. Results: As the results of the CLs, the conventional irradiation (lung, prostate), SBRT (lung) and IMRT (prostate) show 2.2 ± 3.5% (CL of the general-purpose linac: 2.4 ± 5.3%), 1.1 ± 1.7% (−0.3 ± 2.0%), 4.8 ± 3.7% (5.4 ± 5.3%) and −0.5 ± 2.5% (−0.1 ± 3.6%), respectively. The CLs for Vero4DRT show similar results to that for the general-purpose linac. Conclusion: The independent dose verification for the new linac is clinically available as a secondary check and we performed the check with the similar tolerance level of the general-purpose linac. This research is partially supported by Japan Agency for Medical Research and Development (AMED)« less
A fuzzy clustering algorithm to detect planar and quadric shapes
NASA Technical Reports Server (NTRS)
Krishnapuram, Raghu; Frigui, Hichem; Nasraoui, Olfa
1992-01-01
In this paper, we introduce a new fuzzy clustering algorithm to detect an unknown number of planar and quadric shapes in noisy data. The proposed algorithm is computationally and implementationally simple, and it overcomes many of the drawbacks of the existing algorithms that have been proposed for similar tasks. Since the clustering is performed in the original image space, and since no features need to be computed, this approach is particularly suited for sparse data. The algorithm may also be used in pattern recognition applications.
A Fast Density-Based Clustering Algorithm for Real-Time Internet of Things Stream
Ying Wah, Teh
2014-01-01
Data streams are continuously generated over time from Internet of Things (IoT) devices. The faster all of this data is analyzed, its hidden trends and patterns discovered, and new strategies created, the faster action can be taken, creating greater value for organizations. Density-based method is a prominent class in clustering data streams. It has the ability to detect arbitrary shape clusters, to handle outlier, and it does not need the number of clusters in advance. Therefore, density-based clustering algorithm is a proper choice for clustering IoT streams. Recently, several density-based algorithms have been proposed for clustering data streams. However, density-based clustering in limited time is still a challenging issue. In this paper, we propose a density-based clustering algorithm for IoT streams. The method has fast processing time to be applicable in real-time application of IoT devices. Experimental results show that the proposed approach obtains high quality results with low computation time on real and synthetic datasets. PMID:25110753
A fast density-based clustering algorithm for real-time Internet of Things stream.
Amini, Amineh; Saboohi, Hadi; Wah, Teh Ying; Herawan, Tutut
2014-01-01
Data streams are continuously generated over time from Internet of Things (IoT) devices. The faster all of this data is analyzed, its hidden trends and patterns discovered, and new strategies created, the faster action can be taken, creating greater value for organizations. Density-based method is a prominent class in clustering data streams. It has the ability to detect arbitrary shape clusters, to handle outlier, and it does not need the number of clusters in advance. Therefore, density-based clustering algorithm is a proper choice for clustering IoT streams. Recently, several density-based algorithms have been proposed for clustering data streams. However, density-based clustering in limited time is still a challenging issue. In this paper, we propose a density-based clustering algorithm for IoT streams. The method has fast processing time to be applicable in real-time application of IoT devices. Experimental results show that the proposed approach obtains high quality results with low computation time on real and synthetic datasets.
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation.
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most of popular clustering methods typically have some strong assumptions of the dataset. For example, the k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problem caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method to separate naturally isolated clusters but also can identify the clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behaviors information. The results show that LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it.
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most of popular clustering methods typically have some strong assumptions of the dataset. For example, the k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problem caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method to separate naturally isolated clusters but also can identify the clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behaviors information. The results show that LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it. PMID:26221133
A clustering algorithm for determining community structure in complex networks
NASA Astrophysics Data System (ADS)
Jin, Hong; Yu, Wei; Li, ShiJun
2018-02-01
Clustering algorithms are attractive for the task of community detection in complex networks. DENCLUE is a representative density based clustering algorithm which has a firm mathematical basis and good clustering properties allowing for arbitrarily shaped clusters in high dimensional datasets. However, this method cannot be directly applied to community discovering due to its inability to deal with network data. Moreover, it requires a careful selection of the density parameter and the noise threshold. To solve these issues, a new community detection method is proposed in this paper. First, we use a spectral analysis technique to map the network data into a low dimensional Euclidean Space which can preserve node structural characteristics. Then, DENCLUE is applied to detect the communities in the network. A mathematical method named Sheather-Jones plug-in is chosen to select the density parameter which can describe the intrinsic clustering structure accurately. Moreover, every node on the network is meaningful so there were no noise nodes as a result the noise threshold can be ignored. We test our algorithm on both benchmark and real-life networks, and the results demonstrate the effectiveness of our algorithm over other popularity density based clustering algorithms adopted to community detection.
Collaborative filtering recommendation model based on fuzzy clustering algorithm
NASA Astrophysics Data System (ADS)
Yang, Ye; Zhang, Yunhua
2018-05-01
As one of the most widely used algorithms in recommender systems, collaborative filtering algorithm faces two serious problems, which are the sparsity of data and poor recommendation effect in big data environment. In traditional clustering analysis, the object is strictly divided into several classes and the boundary of this division is very clear. However, for most objects in real life, there is no strict definition of their forms and attributes of their class. Concerning the problems above, this paper proposes to improve the traditional collaborative filtering model through the hybrid optimization of implicit semantic algorithm and fuzzy clustering algorithm, meanwhile, cooperating with collaborative filtering algorithm. In this paper, the fuzzy clustering algorithm is introduced to fuzzy clustering the information of project attribute, which makes the project belong to different project categories with different membership degrees, and increases the density of data, effectively reduces the sparsity of data, and solves the problem of low accuracy which is resulted from the inaccuracy of similarity calculation. Finally, this paper carries out empirical analysis on the MovieLens dataset, and compares it with the traditional user-based collaborative filtering algorithm. The proposed algorithm has greatly improved the recommendation accuracy.
NASA Astrophysics Data System (ADS)
Feng, Jian-xin; Tang, Jia-fu; Wang, Guang-xing
2007-04-01
On the basis of the analysis of clustering algorithm that had been proposed for MANET, a novel clustering strategy was proposed in this paper. With the trust defined by statistical hypothesis in probability theory and the cluster head selected by node trust and node mobility, this strategy can realize the function of the malicious nodes detection which was neglected by other clustering algorithms and overcome the deficiency of being incapable of implementing the relative mobility metric of corresponding nodes in the MOBIC algorithm caused by the fact that the receiving power of two consecutive HELLO packet cannot be measured. It's an effective solution to cluster MANET securely.
Parallel Clustering Algorithm for Large-Scale Biological Data Sets
Wang, Minchao; Zhang, Wu; Ding, Wang; Dai, Dongbo; Zhang, Huiran; Xie, Hao; Chen, Luonan; Guo, Yike; Xie, Jiang
2014-01-01
Backgrounds Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Methods Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. Result A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies. PMID:24705246
Comparison Between Supervised and Unsupervised Classifications of Neuronal Cell Types: A Case Study
Guerra, Luis; McGarry, Laura M; Robles, Víctor; Bielza, Concha; Larrañaga, Pedro; Yuste, Rafael
2011-01-01
In the study of neural circuits, it becomes essential to discern the different neuronal cell types that build the circuit. Traditionally, neuronal cell types have been classified using qualitative descriptors. More recently, several attempts have been made to classify neurons quantitatively, using unsupervised clustering methods. While useful, these algorithms do not take advantage of previous information known to the investigator, which could improve the classification task. For neocortical GABAergic interneurons, the problem to discern among different cell types is particularly difficult and better methods are needed to perform objective classifications. Here we explore the use of supervised classification algorithms to classify neurons based on their morphological features, using a database of 128 pyramidal cells and 199 interneurons from mouse neocortex. To evaluate the performance of different algorithms we used, as a “benchmark,” the test to automatically distinguish between pyramidal cells and interneurons, defining “ground truth” by the presence or absence of an apical dendrite. We compared hierarchical clustering with a battery of different supervised classification algorithms, finding that supervised classifications outperformed hierarchical clustering. In addition, the selection of subsets of distinguishing features enhanced the classification accuracy for both sets of algorithms. The analysis of selected variables indicates that dendritic features were most useful to distinguish pyramidal cells from interneurons when compared with somatic and axonal morphological variables. We conclude that supervised classification algorithms are better matched to the general problem of distinguishing neuronal cell types when some information on these cell groups, in our case being pyramidal or interneuron, is known a priori. As a spin-off of this methodological study, we provide several methods to automatically distinguish neocortical pyramidal cells from interneurons, based on their morphologies. © 2010 Wiley Periodicals, Inc. Develop Neurobiol 71: 71–82, 2011 PMID:21154911
Measuring Constraint-Set Utility for Partitional Clustering Algorithms
NASA Technical Reports Server (NTRS)
Davidson, Ian; Wagstaff, Kiri L.; Basu, Sugato
2006-01-01
Clustering with constraints is an active area of machine learning and data mining research. Previous empirical work has convincingly shown that adding constraints to clustering improves the performance of a variety of algorithms. However, in most of these experiments, results are averaged over different randomly chosen constraint sets from a given set of labels, thereby masking interesting properties of individual sets. We demonstrate that constraint sets vary significantly in how useful they are for constrained clustering; some constraint sets can actually decrease algorithm performance. We create two quantitative measures, informativeness and coherence, that can be used to identify useful constraint sets. We show that these measures can also help explain differences in performance for four particular constrained clustering algorithms.
An Improved Clustering Algorithm of Tunnel Monitoring Data for Cloud Computing
Zhong, Luo; Tang, KunHao; Li, Lin; Yang, Guang; Ye, JingJing
2014-01-01
With the rapid development of urban construction, the number of urban tunnels is increasing and the data they produce become more and more complex. It results in the fact that the traditional clustering algorithm cannot handle the mass data of the tunnel. To solve this problem, an improved parallel clustering algorithm based on k-means has been proposed. It is a clustering algorithm using the MapReduce within cloud computing that deals with data. It not only has the advantage of being used to deal with mass data but also is more efficient. Moreover, it is able to compute the average dissimilarity degree of each cluster in order to clean the abnormal data. PMID:24982971
Efficient Record Linkage Algorithms Using Complete Linkage Clustering.
Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar
2016-01-01
Data from different agencies share data of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. A large number of available algorithms for record linkage are prone to either time inefficiency or low-accuracy in finding matches and non-matches among the records. In this paper we propose efficient as well as reliable sequential and parallel algorithms for the record linkage problem employing hierarchical clustering methods. We employ complete linkage hierarchical clustering algorithms to address this problem. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a sub-routine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy consuming reasonable run times.
Efficient Record Linkage Algorithms Using Complete Linkage Clustering
Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar
2016-01-01
Data from different agencies share data of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. A large number of available algorithms for record linkage are prone to either time inefficiency or low-accuracy in finding matches and non-matches among the records. In this paper we propose efficient as well as reliable sequential and parallel algorithms for the record linkage problem employing hierarchical clustering methods. We employ complete linkage hierarchical clustering algorithms to address this problem. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a sub-routine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy consuming reasonable run times. PMID:27124604
Automatic Spike Sorting Using Tuning Information
Ventura, Valérie
2011-01-01
Current spike sorting methods focus on clustering neurons’ characteristic spike waveforms. The resulting spike-sorted data are typically used to estimate how covariates of interest modulate the firing rates of neurons. However, when these covariates do modulate the firing rates, they provide information about spikes’ identities, which thus far have been ignored for the purpose of spike sorting. This letter describes a novel approach to spike sorting, which incorporates both waveform information and tuning information obtained from the modulation of firing rates. Because it efficiently uses all the available information, this spike sorter yields lower spike misclassification rates than traditional automatic spike sorters. This theoretical result is verified empirically on several examples. The proposed method does not require additional assumptions; only its implementation is different. It essentially consists of performing spike sorting and tuning estimation simultaneously rather than sequentially, as is currently done. We used an expectation-maximization maximum likelihood algorithm to implement the new spike sorter. We present the general form of this algorithm and provide a detailed implementable version under the assumptions that neurons are independent and spike according to Poisson processes. Finally, we uncover a systematic flaw of spike sorting based on waveform information only. PMID:19548802
Automatic spike sorting using tuning information.
Ventura, Valérie
2009-09-01
Current spike sorting methods focus on clustering neurons' characteristic spike waveforms. The resulting spike-sorted data are typically used to estimate how covariates of interest modulate the firing rates of neurons. However, when these covariates do modulate the firing rates, they provide information about spikes' identities, which thus far have been ignored for the purpose of spike sorting. This letter describes a novel approach to spike sorting, which incorporates both waveform information and tuning information obtained from the modulation of firing rates. Because it efficiently uses all the available information, this spike sorter yields lower spike misclassification rates than traditional automatic spike sorters. This theoretical result is verified empirically on several examples. The proposed method does not require additional assumptions; only its implementation is different. It essentially consists of performing spike sorting and tuning estimation simultaneously rather than sequentially, as is currently done. We used an expectation-maximization maximum likelihood algorithm to implement the new spike sorter. We present the general form of this algorithm and provide a detailed implementable version under the assumptions that neurons are independent and spike according to Poisson processes. Finally, we uncover a systematic flaw of spike sorting based on waveform information only.
Dynamics of fragment formation in neutron-rich matter
NASA Astrophysics Data System (ADS)
Alcain, P. N.; Dorso, C. O.
2018-01-01
Background: Neutron stars are astronomical systems with nucleons subjected to extreme conditions. Due to the longer range Coulomb repulsion between protons, the system has structural inhomogeneities. Several interactions tailored to reproduce nuclear matter plus a screened Coulomb term reproduce these inhomogeneities known as nuclear pasta. These structural inhomogeneities, located in the crusts of neutron stars, can also arise in expanding systems depending on the thermodynamic conditions (temperature, proton fraction, etc.) and the expansion velocity. Purpose: We aim to find the dynamics of the fragment formation for expanding systems simulated according to the little big bang model. This expansion resembles the evolution of merging neutron stars. Method: We study the dynamics of the nucleons with semiclassical molecular dynamics models. Starting with an equilibrium configuration, we expand the system homogeneously until we arrive at an asymptotic configuration (i.e., very low final densities). We study, with four different cluster recognition algorithms, the fragment distribution throughout this expansion and the dynamics of the cluster formation. Results: Studying the topology of the equilibrium states, before the expansion, we reproduced the known pasta phases plus a novel phase we called pregnocchi, consisting of proton aggregates embedded in a neutron sea. We have identified different fragmentation regimes, depending on the initial temperature and fragment velocity. In particular, for the already mentioned pregnocchi, a neutron cloud surrounds the clusters during the early stages of the expansion, resulting in systems that give rise to configurations compatible with the emergence of the r process. Conclusions: We showed that a proper identification of the cluster distribution is highly dependent on the cluster recognition algorithm chosen, and found that the early cluster recognition algorithm (ECRA) was the most stable one. This approach allowed us to identify the dynamics of the fragment formation. These calculations pave the way to a comparison between Earth experiments and neutron star studies.
GraphTeams: a method for discovering spatial gene clusters in Hi-C sequencing data.
Schulz, Tizian; Stoye, Jens; Doerr, Daniel
2018-05-08
Hi-C sequencing offers novel, cost-effective means to study the spatial conformation of chromosomes. We use data obtained from Hi-C experiments to provide new evidence for the existence of spatial gene clusters. These are sets of genes with associated functionality that exhibit close proximity to each other in the spatial conformation of chromosomes across several related species. We present the first gene cluster model capable of handling spatial data. Our model generalizes a popular computational model for gene cluster prediction, called δ-teams, from sequences to graphs. Following previous lines of research, we subsequently extend our model to allow for several vertices being associated with the same label. The model, called δ-teams with families, is particular suitable for our application as it enables handling of gene duplicates. We develop algorithmic solutions for both models. We implemented the algorithm for discovering δ-teams with families and integrated it into a fully automated workflow for discovering gene clusters in Hi-C data, called GraphTeams. We applied it to human and mouse data to find intra- and interchromosomal gene cluster candidates. The results include intrachromosomal clusters that seem to exhibit a closer proximity in space than on their chromosomal DNA sequence. We further discovered interchromosomal gene clusters that contain genes from different chromosomes within the human genome, but are located on a single chromosome in mouse. By identifying δ-teams with families, we provide a flexible model to discover gene cluster candidates in Hi-C data. Our analysis of Hi-C data from human and mouse reveals several known gene clusters (thus validating our approach), but also few sparsely studied or possibly unknown gene cluster candidates that could be the source of further experimental investigations.
Tang, Jiqiang; Yang, Wu; Zhu, Lingyun; Wang, Dong; Feng, Xin
2017-04-26
In recent years, Wireless Sensor Networks with a Mobile Sink (WSN-MS) have been an active research topic due to the widespread use of mobile devices. However, how to get the balance between data delivery latency and energy consumption becomes a key issue of WSN-MS. In this paper, we study the clustering approach by jointly considering the Route planning for mobile sink and Clustering Problem (RCP) for static sensor nodes. We solve the RCP problem by using the minimum travel route clustering approach, which applies the minimum travel route of the mobile sink to guide the clustering process. We formulate the RCP problem as an Integer Non-Linear Programming (INLP) problem to shorten the travel route of the mobile sink under three constraints: the communication hops constraint, the travel route constraint and the loop avoidance constraint. We then propose an Imprecise Induction Algorithm (IIA) based on the property that the solution with a small hop count is more feasible than that with a large hop count. The IIA algorithm includes three processes: initializing travel route planning with a Traveling Salesman Problem (TSP) algorithm, transforming the cluster head to a cluster member and transforming the cluster member to a cluster head. Extensive experimental results show that the IIA algorithm could automatically adjust cluster heads according to the maximum hops parameter and plan a shorter travel route for the mobile sink. Compared with the Shortest Path Tree-based Data-Gathering Algorithm (SPT-DGA), the IIA algorithm has the characteristics of shorter route length, smaller cluster head count and faster convergence rate.
Testing the accuracy of redshift-space group-finding algorithms
NASA Astrophysics Data System (ADS)
Frederic, James J.
1995-04-01
Using simulated redshift surveys generated from a high-resolution N-body cosmological structure simulation, we study algorithms used to identify groups of galaxies in redshift space. Two algorithms are investigated; both are friends-of-friends schemes with variable linking lengths in the radial and transverse dimenisons. The chief difference between the algorithms is in the redshift linking length. The algorithm proposed by Huchra & Geller (1982) uses a generous linking length designed to find 'fingers of god,' while that of Nolthenius & White (1987) uses a smaller linking length to minimize contamination by projection. We find that neither of the algorithms studied is intrinsically superior to the other; rather, the ideal algorithm as well as the ideal algorithm parameters depends on the purpose for which groups are to be studied. The Huchra & Geller algorithm misses few real groups, at the cost of including some spurious groups and members, while the Nolthenius & White algorithm misses high velocity dispersion groups and members but is less likely to include interlopers in its group assignments. Adjusting the parameters of either algorithm results in a trade-off between group accuracy and completeness. In a companion paper we investigate the accuracy of virial mass estimates and clustering properties of groups identified using these algorithms.
Efficient implementation of parallel three-dimensional FFT on clusters of PCs
NASA Astrophysics Data System (ADS)
Takahashi, Daisuke
2003-05-01
In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of PCs. The three-dimensional FFT algorithm can be altered into a block three-dimensional FFT algorithm to reduce the number of cache misses. We show that the block three-dimensional FFT algorithm improves performance by utilizing the cache memory effectively. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT algorithm. We succeeded in obtaining performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz PC SMP cluster.
Blessy, S A Praylin Selva; Sulochana, C Helen
2015-01-01
Segmentation of brain tumor from Magnetic Resonance Imaging (MRI) becomes very complicated due to the structural complexities of human brain and the presence of intensity inhomogeneities. To propose a method that effectively segments brain tumor from MR images and to evaluate the performance of unsupervised optimal fuzzy clustering (UOFC) algorithm for segmentation of brain tumor from MR images. Segmentation is done by preprocessing the MR image to standardize intensity inhomogeneities followed by feature extraction, feature fusion and clustering. Different validation measures are used to evaluate the performance of the proposed method using different clustering algorithms. The proposed method using UOFC algorithm produces high sensitivity (96%) and low specificity (4%) compared to other clustering methods. Validation results clearly show that the proposed method with UOFC algorithm effectively segments brain tumor from MR images.
Adaptive density trajectory cluster based on time and space distance
NASA Astrophysics Data System (ADS)
Liu, Fagui; Zhang, Zhijie
2017-10-01
There are some hotspot problems remaining in trajectory cluster for discovering mobile behavior regularity, such as the computation of distance between sub trajectories, the setting of parameter values in cluster algorithm and the uncertainty/boundary problem of data set. As a result, based on the time and space, this paper tries to define the calculation method of distance between sub trajectories. The significance of distance calculation for sub trajectories is to clearly reveal the differences in moving trajectories and to promote the accuracy of cluster algorithm. Besides, a novel adaptive density trajectory cluster algorithm is proposed, in which cluster radius is computed through using the density of data distribution. In addition, cluster centers and number are selected by a certain strategy automatically, and uncertainty/boundary problem of data set is solved by designed weighted rough c-means. Experimental results demonstrate that the proposed algorithm can perform the fuzzy trajectory cluster effectively on the basis of the time and space distance, and obtain the optimal cluster centers and rich cluster results information adaptably for excavating the features of mobile behavior in mobile and sociology network.
An incremental DPMM-based method for trajectory clustering, modeling, and retrieval.
Hu, Weiming; Li, Xi; Tian, Guodong; Maybank, Stephen; Zhang, Zhongfei
2013-05-01
Trajectory analysis is the basis for many applications, such as indexing of motion events in videos, activity recognition, and surveillance. In this paper, the Dirichlet process mixture model (DPMM) is applied to trajectory clustering, modeling, and retrieval. We propose an incremental version of a DPMM-based clustering algorithm and apply it to cluster trajectories. An appropriate number of trajectory clusters is determined automatically. When trajectories belonging to new clusters arrive, the new clusters can be identified online and added to the model without any retraining using the previous data. A time-sensitive Dirichlet process mixture model (tDPMM) is applied to each trajectory cluster for learning the trajectory pattern which represents the time-series characteristics of the trajectories in the cluster. Then, a parameterized index is constructed for each cluster. A novel likelihood estimation algorithm for the tDPMM is proposed, and a trajectory-based video retrieval model is developed. The tDPMM-based probabilistic matching method and the DPMM-based model growing method are combined to make the retrieval model scalable and adaptable. Experimental comparisons with state-of-the-art algorithms demonstrate the effectiveness of our algorithm.
NASA Astrophysics Data System (ADS)
Di, Nur Faraidah Muhammad; Satari, Siti Zanariah
2017-05-01
Outlier detection in linear data sets has been done vigorously but only a small amount of work has been done for outlier detection in circular data. In this study, we proposed multiple outliers detection in circular regression models based on the clustering algorithm. Clustering technique basically utilizes distance measure to define distance between various data points. Here, we introduce the similarity distance based on Euclidean distance for circular model and obtain a cluster tree using the single linkage clustering algorithm. Then, a stopping rule for the cluster tree based on the mean direction and circular standard deviation of the tree height is proposed. We classify the cluster group that exceeds the stopping rule as potential outlier. Our aim is to demonstrate the effectiveness of proposed algorithms with the similarity distances in detecting the outliers. It is found that the proposed methods are performed well and applicable for circular regression model.
Design of the SLAC RCE Platform: A General Purpose ATCA Based Data Acquisition System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Herbst, R.; Claus, R.; Freytag, M.
2015-01-23
The SLAC RCE platform is a general purpose clustered data acquisition system implemented on a custom ATCA compliant blade, called the Cluster On Board (COB). The core of the system is the Reconfigurable Cluster Element (RCE), which is a system-on-chip design based upon the Xilinx Zynq family of FPGAs, mounted on custom COB daughter-boards. The Zynq architecture couples a dual core ARM Cortex A9 based processor with a high performance 28nm FPGA. The RCE has 12 external general purpose bi-directional high speed links, each supporting serial rates of up to 12Gbps. 8 RCE nodes are included on a COB, eachmore » with a 10Gbps connection to an on-board 24-port Ethernet switch integrated circuit. The COB is designed to be used with a standard full-mesh ATCA backplane allowing multiple RCE nodes to be tightly interconnected with minimal interconnect latency. Multiple shelves can be clustered using the front panel 10-gbps connections. The COB also supports local and inter-blade timing and trigger distribution. An experiment specific Rear Transition Module adapts the 96 high speed serial links to specific experiments and allows an experiment-specific timing and busy feedback connection. This coupling of processors with a high performance FPGA fabric in a low latency, multiple node cluster allows high speed data processing that can be easily adapted to any physics experiment. RTEMS and Linux are both ported to the module. The RCE has been used or is the baseline for several current and proposed experiments (LCLS, HPS, LSST, ATLAS-CSC, LBNE, DarkSide, ILC-SiD, etc).« less
Reducing Earth Topography Resolution for SMAP Mission Ground Tracks Using K-Means Clustering
NASA Technical Reports Server (NTRS)
Rizvi, Farheen
2013-01-01
The K-means clustering algorithm is used to reduce Earth topography resolution for the SMAP mission ground tracks. As SMAP propagates in orbit, knowledge of the radar antenna footprints on Earth is required for the antenna misalignment calibration. Each antenna footprint contains a latitude and longitude location pair on the Earth surface. There are 400 pairs in one data set for the calibration model. It is computationally expensive to calculate corresponding Earth elevation for these data pairs. Thus, the antenna footprint resolution is reduced. Similar topographical data pairs are grouped together with the K-means clustering algorithm. The resolution is reduced to the mean of each topographical cluster called the cluster centroid. The corresponding Earth elevation for each cluster centroid is assigned to the entire group. Results show that 400 data points are reduced to 60 while still maintaining algorithm performance and computational efficiency. In this work, sensitivity analysis is also performed to show a trade-off between algorithm performance versus computational efficiency as the number of cluster centroids and algorithm iterations are increased.
Accurate Grid-based Clustering Algorithm with Diagonal Grid Searching and Merging
NASA Astrophysics Data System (ADS)
Liu, Feng; Ye, Chengcheng; Zhu, Erzhou
2017-09-01
Due to the advent of big data, data mining technology has attracted more and more attentions. As an important data analysis method, grid clustering algorithm is fast but with relatively lower accuracy. This paper presents an improved clustering algorithm combined with grid and density parameters. The algorithm first divides the data space into the valid meshes and invalid meshes through grid parameters. Secondly, from the starting point located at the first point of the diagonal of the grids, the algorithm takes the direction of “horizontal right, vertical down” to merge the valid meshes. Furthermore, by the boundary grid processing, the invalid grids are searched and merged when the adjacent left, above, and diagonal-direction grids are all the valid ones. By doing this, the accuracy of clustering is improved. The experimental results have shown that the proposed algorithm is accuracy and relatively faster when compared with some popularly used algorithms.
NASA Astrophysics Data System (ADS)
Chuan, Zun Liang; Ismail, Noriszura; Shinyie, Wendy Ling; Lit Ken, Tan; Fam, Soo-Fen; Senawi, Azlyna; Yusoff, Wan Nur Syahidah Wan
2018-04-01
Due to the limited of historical precipitation records, agglomerative hierarchical clustering algorithms widely used to extrapolate information from gauged to ungauged precipitation catchments in yielding a more reliable projection of extreme hydro-meteorological events such as extreme precipitation events. However, identifying the optimum number of homogeneous precipitation catchments accurately based on the dendrogram resulted using agglomerative hierarchical algorithms are very subjective. The main objective of this study is to propose an efficient regionalized algorithm to identify the homogeneous precipitation catchments for non-stationary precipitation time series. The homogeneous precipitation catchments are identified using average linkage hierarchical clustering algorithm associated multi-scale bootstrap resampling, while uncentered correlation coefficient as the similarity measure. The regionalized homogeneous precipitation is consolidated using K-sample Anderson Darling non-parametric test. The analysis result shows the proposed regionalized algorithm performed more better compared to the proposed agglomerative hierarchical clustering algorithm in previous studies.
Robust MST-Based Clustering Algorithm.
Liu, Qidong; Zhang, Ruisheng; Zhao, Zhili; Wang, Zhenghai; Jiao, Mengyao; Wang, Guangjing
2018-06-01
Minimax similarity stresses the connectedness of points via mediating elements rather than favoring high mutual similarity. The grouping principle yields superior clustering results when mining arbitrarily-shaped clusters in data. However, it is not robust against noises and outliers in the data. There are two main problems with the grouping principle: first, a single object that is far away from all other objects defines a separate cluster, and second, two connected clusters would be regarded as two parts of one cluster. In order to solve such problems, we propose robust minimum spanning tree (MST)-based clustering algorithm in this letter. First, we separate the connected objects by applying a density-based coarsening phase, resulting in a low-rank matrix in which the element denotes the supernode by combining a set of nodes. Then a greedy method is presented to partition those supernodes through working on the low-rank matrix. Instead of removing the longest edges from MST, our algorithm groups the data set based on the minimax similarity. Finally, the assignment of all data points can be achieved through their corresponding supernodes. Experimental results on many synthetic and real-world data sets show that our algorithm consistently outperforms compared clustering algorithms.
NASA Astrophysics Data System (ADS)
Brenden, T. O.; Clark, R. D.; Wiley, M. J.; Seelbach, P. W.; Wang, L.
2005-05-01
Remote sensing and geographic information systems have made it possible to attribute variables for streams at increasingly detailed resolutions (e.g., individual river reaches). Nevertheless, management decisions still must be made at large scales because land and stream managers typically lack sufficient resources to manage on an individual reach basis. Managers thus require a method for identifying stream management units that are ecologically similar and that can be expected to respond similarly to management decisions. We have developed a spatially-constrained clustering algorithm that can merge neighboring river reaches with similar ecological characteristics into larger management units. The clustering algorithm is based on the Cluster Affinity Search Technique (CAST), which was developed for clustering gene expression data. Inputs to the clustering algorithm are the neighbor relationships of the reaches that comprise the digital river network, the ecological attributes of the reaches, and an affinity value, which identifies the minimum similarity for merging river reaches. In this presentation, we describe the clustering algorithm in greater detail and contrast its use with other methods (expert opinion, classification approach, regular clustering) for identifying management units using several Michigan watersheds as a backdrop.
Hierarchical trie packet classification algorithm based on expectation-maximization clustering
Bi, Xia-an; Zhao, Junxia
2017-01-01
With the development of computer network bandwidth, packet classification algorithms which are able to deal with large-scale rule sets are in urgent need. Among the existing algorithms, researches on packet classification algorithms based on hierarchical trie have become an important packet classification research branch because of their widely practical use. Although hierarchical trie is beneficial to save large storage space, it has several shortcomings such as the existence of backtracking and empty nodes. This paper proposes a new packet classification algorithm, Hierarchical Trie Algorithm Based on Expectation-Maximization Clustering (HTEMC). Firstly, this paper uses the formalization method to deal with the packet classification problem by means of mapping the rules and data packets into a two-dimensional space. Secondly, this paper uses expectation-maximization algorithm to cluster the rules based on their aggregate characteristics, and thereby diversified clusters are formed. Thirdly, this paper proposes a hierarchical trie based on the results of expectation-maximization clustering. Finally, this paper respectively conducts simulation experiments and real-environment experiments to compare the performances of our algorithm with other typical algorithms, and analyzes the results of the experiments. The hierarchical trie structure in our algorithm not only adopts trie path compression to eliminate backtracking, but also solves the problem of low efficiency of trie updates, which greatly improves the performance of the algorithm. PMID:28704476
Using Agent Base Models to Optimize Large Scale Network for Large System Inventories
NASA Technical Reports Server (NTRS)
Shameldin, Ramez Ahmed; Bowling, Shannon R.
2010-01-01
The aim of this paper is to use Agent Base Models (ABM) to optimize large scale network handling capabilities for large system inventories and to implement strategies for the purpose of reducing capital expenses. The models used in this paper either use computational algorithms or procedure implementations developed by Matlab to simulate agent based models in a principal programming language and mathematical theory using clusters, these clusters work as a high performance computational performance to run the program in parallel computational. In both cases, a model is defined as compilation of a set of structures and processes assumed to underlie the behavior of a network system.
A fast parallel clustering algorithm for molecular simulation trajectories.
Zhao, Yutong; Sheong, Fu Kit; Sun, Jian; Sander, Pedro; Huang, Xuhui
2013-01-15
We implemented a GPU-powered parallel k-centers algorithm to perform clustering on the conformations of molecular dynamics (MD) simulations. The algorithm is up to two orders of magnitude faster than the CPU implementation. We tested our algorithm on four protein MD simulation datasets ranging from the small Alanine Dipeptide to a 370-residue Maltose Binding Protein (MBP). It is capable of grouping 250,000 conformations of the MBP into 4000 clusters within 40 seconds. To achieve this, we effectively parallelized the code on the GPU and utilize the triangle inequality of metric spaces. Furthermore, the algorithm's running time is linear with respect to the number of cluster centers. In addition, we found the triangle inequality to be less effective in higher dimensions and provide a mathematical rationale. Finally, using Alanine Dipeptide as an example, we show a strong correlation between cluster populations resulting from the k-centers algorithm and the underlying density. © 2012 Wiley Periodicals, Inc. Copyright © 2012 Wiley Periodicals, Inc.
Tchagang, Alain B; Phan, Sieu; Famili, Fazel; Shearer, Heather; Fobert, Pierre; Huang, Yi; Zou, Jitao; Huang, Daiqing; Cutler, Adrian; Liu, Ziying; Pan, Youlian
2012-04-04
Nowadays, it is possible to collect expression levels of a set of genes from a set of biological samples during a series of time points. Such data have three dimensions: gene-sample-time (GST). Thus they are called 3D microarray gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, novel subspace clustering algorithms have to be developed to effectively address the biological problem in the corresponding space. We developed a subspace clustering algorithm called Order Preserving Triclustering (OPTricluster), for 3D short time-series data mining. OPTricluster is able to identify 3D clusters with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension, and the order preserving (OP) concept on the time dimension. The fusion of the two methodologies allows one to study similarities and differences between samples in terms of their temporal expression profile. OPTricluster has been successfully applied to four case studies: immune response in mice infected by malaria (Plasmodium chabaudi), systemic acquired resistance in Arabidopsis thaliana, similarities and differences between inner and outer cotyledon in Brassica napus during seed development, and to Brassica napus whole seed development. These studies showed that OPTricluster is robust to noise and is able to detect the similarities and differences between biological samples. Our analysis showed that OPTricluster generally outperforms other well known clustering algorithms such as the TRICLUSTER, gTRICLUSTER and K-means; it is robust to noise and can effectively mine the biological knowledge hidden in the 3D short time-series gene expression data.
A Class of Manifold Regularized Multiplicative Update Algorithms for Image Clustering.
Yang, Shangming; Yi, Zhang; He, Xiaofei; Li, Xuelong
2015-12-01
Multiplicative update algorithms are important tools for information retrieval, image processing, and pattern recognition. However, when the graph regularization is added to the cost function, different classes of sample data may be mapped to the same subspace, which leads to the increase of data clustering error rate. In this paper, an improved nonnegative matrix factorization (NMF) cost function is introduced. Based on the cost function, a class of novel graph regularized NMF algorithms is developed, which results in a class of extended multiplicative update algorithms with manifold structure regularization. Analysis shows that in the learning, the proposed algorithms can efficiently minimize the rank of the data representation matrix. Theoretical results presented in this paper are confirmed by simulations. For different initializations and data sets, variation curves of cost functions and decomposition data are presented to show the convergence features of the proposed update rules. Basis images, reconstructed images, and clustering results are utilized to present the efficiency of the new algorithms. Last, the clustering accuracies of different algorithms are also investigated, which shows that the proposed algorithms can achieve state-of-the-art performance in applications of image clustering.
Identifying conserved gene clusters in the presence of homology families.
He, Xin; Goldwasser, Michael H
2005-01-01
The study of conserved gene clusters is important for understanding the forces behind genome organization and evolution, as well as the function of individual genes or gene groups. In this paper, we present a new model and algorithm for identifying conserved gene clusters from pairwise genome comparison. This generalizes a recent model called "gene teams." A gene team is a set of genes that appear homologously in two or more species, possibly in a different order yet with the distance of adjacent genes in the team for each chromosome always no more than a certain threshold. We remove the constraint in the original model that each gene must have a unique occurrence in each chromosome and thus allow the analysis on complex prokaryotic or eukaryotic genomes with extensive paralogs. Our algorithm analyzes a pair of chromosomes in O(mn) time and uses O(m+n) space, where m and n are the number of genes in the respective chromosomes. We demonstrate the utility of our methods by studying two bacterial genomes, E. coli K-12 and B. subtilis. Many of the teams identified by our algorithm correlate with documented E. coli operons, while several others match predicted operons, previously suggested by computational techniques. Our implementation and data are publicly available at euler.slu.edu/ approximately goldwasser/homologyteams/.
A Multiple Sphere T-Matrix Fortran Code for Use on Parallel Computer Clusters
NASA Technical Reports Server (NTRS)
Mackowski, D. W.; Mishchenko, M. I.
2011-01-01
A general-purpose Fortran-90 code for calculation of the electromagnetic scattering and absorption properties of multiple sphere clusters is described. The code can calculate the efficiency factors and scattering matrix elements of the cluster for either fixed or random orientation with respect to the incident beam and for plane wave or localized- approximation Gaussian incident fields. In addition, the code can calculate maps of the electric field both interior and exterior to the spheres.The code is written with message passing interface instructions to enable the use on distributed memory compute clusters, and for such platforms the code can make feasible the calculation of absorption, scattering, and general EM characteristics of systems containing several thousand spheres.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hui, Cheukkai; Suh, Yelin; Robertson, Daniel
Purpose: The purpose of this study was to develop a novel algorithm to create a robust internal respiratory signal (IRS) for retrospective sorting of four-dimensional (4D) computed tomography (CT) images. Methods: The proposed algorithm combines information from the Fourier transform of the CT images and from internal anatomical features to form the IRS. The algorithm first extracts potential respiratory signals from low-frequency components in the Fourier space and selected anatomical features in the image space. A clustering algorithm then constructs groups of potential respiratory signals with similar temporal oscillation patterns. The clustered group with the largest number of similar signalsmore » is chosen to form the final IRS. To evaluate the performance of the proposed algorithm, the IRS was computed and compared with the external respiratory signal from the real-time position management (RPM) system on 80 patients. Results: In 72 (90%) of the 4D CT data sets tested, the IRS computed by the authors’ proposed algorithm matched with the RPM signal based on their normalized cross correlation. For these data sets with matching respiratory signals, the average difference between the end inspiration times (Δt{sub ins}) in the IRS and RPM signal was 0.11 s, and only 2.1% of Δt{sub ins} were more than 0.5 s apart. In the eight (10%) 4D CT data sets in which the IRS and the RPM signal did not match, the average Δt{sub ins} was 0.73 s in the nonmatching couch positions, and 35.4% of them had a Δt{sub ins} greater than 0.5 s. At couch positions in which IRS did not match the RPM signal, a correlation-based metric indicated poorer matching of neighboring couch positions in the RPM-sorted images. This implied that, when IRS did not match the RPM signal, the images sorted using the IRS showed fewer artifacts than the clinical images sorted using the RPM signal. Conclusions: The authors’ proposed algorithm can generate robust IRSs that can be used for retrospective sorting of 4D CT data. The algorithm is completely automatic and requires very little processing time. The algorithm is cost efficient and can be easily adopted for everyday clinical use.« less
NASA Astrophysics Data System (ADS)
Fan, Tian-E.; Shao, Gui-Fang; Ji, Qing-Shuang; Zheng, Ji-Wen; Liu, Tun-dong; Wen, Yu-Hua
2016-11-01
Theoretically, the determination of the structure of a cluster is to search the global minimum on its potential energy surface. The global minimization problem is often nondeterministic-polynomial-time (NP) hard and the number of local minima grows exponentially with the cluster size. In this article, a multi-populations multi-strategies differential evolution algorithm has been proposed to search the globally stable structure of Fe and Cr nanoclusters. The algorithm combines a multi-populations differential evolution with an elite pool scheme to keep the diversity of the solutions and avoid prematurely trapping into local optima. Moreover, multi-strategies such as growing method in initialization and three differential strategies in mutation are introduced to improve the convergence speed and lower the computational cost. The accuracy and effectiveness of our algorithm have been verified by comparing the results of Fe clusters with Cambridge Cluster Database. Meanwhile, the performance of our algorithm has been analyzed by comparing the convergence rate and energy evaluations with the classical DE algorithm. The multi-populations, multi-strategies mutation and growing method in initialization in our algorithm have been considered respectively. Furthermore, the structural growth pattern of Cr clusters has been predicted by this algorithm. The results show that the lowest-energy structure of Cr clusters contains many icosahedra, and the number of the icosahedral rings rises with increasing size.
Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra".
Griss, Johannes; Perez-Riverol, Yasset; The, Matthew; Käll, Lukas; Vizcaíno, Juan Antonio
2018-05-04
In the recent benchmarking article entitled "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra", Rieder et al. compared several different approaches to cluster MS/MS spectra. While we certainly recognize the value of the manuscript, here, we report some shortcomings detected in the original analyses. For most analyses, the authors clustered only single MS/MS runs. In one of the reported analyses, three MS/MS runs were processed together, which already led to computational performance issues in many of the tested approaches. This fact highlights the difficulties of using many of the tested algorithms on the nowadays produced average proteomics data sets. Second, the authors only processed identified spectra when merging MS runs. Thereby, all unidentified spectra that are of lower quality were already removed from the data set and could not influence the clustering results. Next, we found that the authors did not analyze the effect of chimeric spectra on the clustering results. In our analysis, we found that 3% of the spectra in the used data sets were chimeric, and this had marked effects on the behavior of the different clustering algorithms tested. Finally, the authors' choice to evaluate the MS-Cluster and spectra-cluster algorithms using a precursor tolerance of 5 Da for high-resolution Orbitrap data only was, in our opinion, not adequate to assess the performance of MS/MS clustering approaches.
A Novel Artificial Bee Colony Based Clustering Algorithm for Categorical Data
2015-01-01
Data with categorical attributes are ubiquitous in the real world. However, existing partitional clustering algorithms for categorical data are prone to fall into local optima. To address this issue, in this paper we propose a novel clustering algorithm, ABC-K-Modes (Artificial Bee Colony clustering based on K-Modes), based on the traditional k-modes clustering algorithm and the artificial bee colony approach. In our approach, we first introduce a one-step k-modes procedure, and then integrate this procedure with the artificial bee colony approach to deal with categorical data. In the search process performed by scout bees, we adopt the multi-source search inspired by the idea of batch processing to accelerate the convergence of ABC-K-Modes. The performance of ABC-K-Modes is evaluated by a series of experiments in comparison with that of the other popular algorithms for categorical data. PMID:25993469
A dynamic scheduling algorithm for singe-arm two-cluster tools with flexible processing times
NASA Astrophysics Data System (ADS)
Li, Xin; Fung, Richard Y. K.
2018-02-01
This article presents a dynamic algorithm for job scheduling in two-cluster tools producing multi-type wafers with flexible processing times. Flexible processing times mean that the actual times for processing wafers should be within given time intervals. The objective of the work is to minimize the completion time of the newly inserted wafer. To deal with this issue, a two-cluster tool is decomposed into three reduced single-cluster tools (RCTs) in a series based on a decomposition approach proposed in this article. For each single-cluster tool, a dynamic scheduling algorithm based on temporal constraints is developed to schedule the newly inserted wafer. Three experiments have been carried out to test the dynamic scheduling algorithm proposed, comparing with the results the 'earliest starting time' heuristic (EST) adopted in previous literature. The results show that the dynamic algorithm proposed in this article is effective and practical.
A novel artificial bee colony based clustering algorithm for categorical data.
Ji, Jinchao; Pang, Wei; Zheng, Yanlin; Wang, Zhe; Ma, Zhiqiang
2015-01-01
Data with categorical attributes are ubiquitous in the real world. However, existing partitional clustering algorithms for categorical data are prone to fall into local optima. To address this issue, in this paper we propose a novel clustering algorithm, ABC-K-Modes (Artificial Bee Colony clustering based on K-Modes), based on the traditional k-modes clustering algorithm and the artificial bee colony approach. In our approach, we first introduce a one-step k-modes procedure, and then integrate this procedure with the artificial bee colony approach to deal with categorical data. In the search process performed by scout bees, we adopt the multi-source search inspired by the idea of batch processing to accelerate the convergence of ABC-K-Modes. The performance of ABC-K-Modes is evaluated by a series of experiments in comparison with that of the other popular algorithms for categorical data.
Zhu, Bohui; Ding, Yongsheng; Hao, Kuangrong
2013-01-01
This paper presents a novel maximum margin clustering method with immune evolution (IEMMC) for automatic diagnosis of electrocardiogram (ECG) arrhythmias. This diagnostic system consists of signal processing, feature extraction, and the IEMMC algorithm for clustering of ECG arrhythmias. First, raw ECG signal is processed by an adaptive ECG filter based on wavelet transforms, and waveform of the ECG signal is detected; then, features are extracted from ECG signal to cluster different types of arrhythmias by the IEMMC algorithm. Three types of performance evaluation indicators are used to assess the effect of the IEMMC method for ECG arrhythmias, such as sensitivity, specificity, and accuracy. Compared with K-means and iterSVR algorithms, the IEMMC algorithm reflects better performance not only in clustering result but also in terms of global search ability and convergence ability, which proves its effectiveness for the detection of ECG arrhythmias. PMID:23690875
The Ordered Clustered Travelling Salesman Problem: A Hybrid Genetic Algorithm
Ahmed, Zakir Hussain
2014-01-01
The ordered clustered travelling salesman problem is a variation of the usual travelling salesman problem in which a set of vertices (except the starting vertex) of the network is divided into some prespecified clusters. The objective is to find the least cost Hamiltonian tour in which vertices of any cluster are visited contiguously and the clusters are visited in the prespecified order. The problem is NP-hard, and it arises in practical transportation and sequencing problems. This paper develops a hybrid genetic algorithm using sequential constructive crossover, 2-opt search, and a local search for obtaining heuristic solution to the problem. The efficiency of the algorithm has been examined against two existing algorithms for some asymmetric and symmetric TSPLIB instances of various sizes. The computational results show that the proposed algorithm is very effective in terms of solution quality and computational time. Finally, we present solution to some more symmetric TSPLIB instances. PMID:24701148
A genetic graph-based approach for partitional clustering.
Menéndez, Héctor D; Barrero, David F; Camacho, David
2014-05-01
Clustering is one of the most versatile tools for data analysis. In the recent years, clustering that seeks the continuity of data (in opposition to classical centroid-based approaches) has attracted an increasing research interest. It is a challenging problem with a remarkable practical interest. The most popular continuity clustering method is the spectral clustering (SC) algorithm, which is based on graph cut: It initially generates a similarity graph using a distance measure and then studies its graph spectrum to find the best cut. This approach is sensitive to the parameters of the metric, and a correct parameter choice is critical to the quality of the cluster. This work proposes a new algorithm, inspired by SC, that reduces the parameter dependency while maintaining the quality of the solution. The new algorithm, named genetic graph-based clustering (GGC), takes an evolutionary approach introducing a genetic algorithm (GA) to cluster the similarity graph. The experimental validation shows that GGC increases robustness of SC and has competitive performance in comparison with classical clustering methods, at least, in the synthetic and real dataset used in the experiments.
Exploratory Item Classification Via Spectral Graph Clustering
Chen, Yunxiao; Li, Xiaoou; Liu, Jingchen; Xu, Gongjun; Ying, Zhiliang
2017-01-01
Large-scale assessments are supported by a large item pool. An important task in test development is to assign items into scales that measure different characteristics of individuals, and a popular approach is cluster analysis of items. Classical methods in cluster analysis, such as the hierarchical clustering, K-means method, and latent-class analysis, often induce a high computational overhead and have difficulty handling missing data, especially in the presence of high-dimensional responses. In this article, the authors propose a spectral clustering algorithm for exploratory item cluster analysis. The method is computationally efficient, effective for data with missing or incomplete responses, easy to implement, and often outperforms traditional clustering algorithms in the context of high dimensionality. The spectral clustering algorithm is based on graph theory, a branch of mathematics that studies the properties of graphs. The algorithm first constructs a graph of items, characterizing the similarity structure among items. It then extracts item clusters based on the graphical structure, grouping similar items together. The proposed method is evaluated through simulations and an application to the revised Eysenck Personality Questionnaire. PMID:29033476
Symmetric nonnegative matrix factorization: algorithms and applications to probabilistic clustering.
He, Zhaoshui; Xie, Shengli; Zdunek, Rafal; Zhou, Guoxu; Cichocki, Andrzej
2011-12-01
Nonnegative matrix factorization (NMF) is an unsupervised learning method useful in various applications including image processing and semantic analysis of documents. This paper focuses on symmetric NMF (SNMF), which is a special case of NMF decomposition. Three parallel multiplicative update algorithms using level 3 basic linear algebra subprograms directly are developed for this problem. First, by minimizing the Euclidean distance, a multiplicative update algorithm is proposed, and its convergence under mild conditions is proved. Based on it, we further propose another two fast parallel methods: α-SNMF and β -SNMF algorithms. All of them are easy to implement. These algorithms are applied to probabilistic clustering. We demonstrate their effectiveness for facial image clustering, document categorization, and pattern clustering in gene expression.
Tang, Jiqiang; Yang, Wu; Zhu, Lingyun; Wang, Dong; Feng, Xin
2017-01-01
In recent years, Wireless Sensor Networks with a Mobile Sink (WSN-MS) have been an active research topic due to the widespread use of mobile devices. However, how to get the balance between data delivery latency and energy consumption becomes a key issue of WSN-MS. In this paper, we study the clustering approach by jointly considering the Route planning for mobile sink and Clustering Problem (RCP) for static sensor nodes. We solve the RCP problem by using the minimum travel route clustering approach, which applies the minimum travel route of the mobile sink to guide the clustering process. We formulate the RCP problem as an Integer Non-Linear Programming (INLP) problem to shorten the travel route of the mobile sink under three constraints: the communication hops constraint, the travel route constraint and the loop avoidance constraint. We then propose an Imprecise Induction Algorithm (IIA) based on the property that the solution with a small hop count is more feasible than that with a large hop count. The IIA algorithm includes three processes: initializing travel route planning with a Traveling Salesman Problem (TSP) algorithm, transforming the cluster head to a cluster member and transforming the cluster member to a cluster head. Extensive experimental results show that the IIA algorithm could automatically adjust cluster heads according to the maximum hops parameter and plan a shorter travel route for the mobile sink. Compared with the Shortest Path Tree-based Data-Gathering Algorithm (SPT-DGA), the IIA algorithm has the characteristics of shorter route length, smaller cluster head count and faster convergence rate. PMID:28445434
Nidheesh, N; Abdul Nazeer, K A; Ameer, P M
2017-12-01
Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids. We propose an improved, density based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select data points which belong to dense regions and which are adequately separated in feature space as the initial centroids. We compared the proposed algorithm to a set of eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm which is being used for cancer data classification, based on the performances on a set of datasets comprising ten cancer gene expression datasets. The proposed algorithm has shown better overall performance than the others. There is a pressing need in the Biomedical domain for simple, easy-to-use and more accurate Machine Learning tools for cancer subtype prediction. The proposed algorithm is simple, easy-to-use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data. Copyright © 2017 Elsevier Ltd. All rights reserved.
Determining the Number of Clusters in a Data Set Without Graphical Interpretation
NASA Technical Reports Server (NTRS)
Aguirre, Nathan S.; Davies, Misty D.
2011-01-01
Cluster analysis is a data mining technique that is meant ot simplify the process of classifying data points. The basic clustering process requires an input of data points and the number of clusters wanted. The clustering algorithm will then pick starting C points for the clusters, which can be either random spatial points or random data points. It then assigns each data point to the nearest C point where "nearest usually means Euclidean distance, but some algorithms use another criterion. The next step is determining whether the clustering arrangement this found is within a certain tolerance. If it falls within this tolerance, the process ends. Otherwise the C points are adjusted based on how many data points are in each cluster, and the steps repeat until the algorithm converges,
Bhattacharya, Anindya; De, Rajat K
2010-08-01
Distance based clustering algorithms can group genes that show similar expression values under multiple experimental conditions. They are unable to identify a group of genes that have similar pattern of variation in their expression values. Previously we developed an algorithm called divisive correlation clustering algorithm (DCCA) to tackle this situation, which is based on the concept of correlation clustering. But this algorithm may also fail for certain cases. In order to overcome these situations, we propose a new clustering algorithm, called average correlation clustering algorithm (ACCA), which is able to produce better clustering solution than that produced by some others. ACCA is able to find groups of genes having more common transcription factors and similar pattern of variation in their expression values. Moreover, ACCA is more efficient than DCCA with respect to the time of execution. Like DCCA, we use the concept of correlation clustering concept introduced by Bansal et al. ACCA uses the correlation matrix in such a way that all genes in a cluster have the highest average correlation values with the genes in that cluster. We have applied ACCA and some well-known conventional methods including DCCA to two artificial and nine gene expression datasets, and compared the performance of the algorithms. The clustering results of ACCA are found to be more significantly relevant to the biological annotations than those of the other methods. Analysis of the results show the superiority of ACCA over some others in determining a group of genes having more common transcription factors and with similar pattern of variation in their expression profiles. Availability of the software: The software has been developed using C and Visual Basic languages, and can be executed on the Microsoft Windows platforms. The software may be downloaded as a zip file from http://www.isical.ac.in/~rajat. Then it needs to be installed. Two word files (included in the zip file) need to be consulted before installation and execution of the software. Copyright 2010 Elsevier Inc. All rights reserved.
A Linked List-Based Algorithm for Blob Detection on Embedded Vision-Based Sensors.
Acevedo-Avila, Ricardo; Gonzalez-Mendoza, Miguel; Garcia-Garcia, Andres
2016-05-28
Blob detection is a common task in vision-based applications. Most existing algorithms are aimed at execution on general purpose computers; while very few can be adapted to the computing restrictions present in embedded platforms. This paper focuses on the design of an algorithm capable of real-time blob detection that minimizes system memory consumption. The proposed algorithm detects objects in one image scan; it is based on a linked-list data structure tree used to label blobs depending on their shape and node information. An example application showing the results of a blob detection co-processor has been built on a low-powered field programmable gate array hardware as a step towards developing a smart video surveillance system. The detection method is intended for general purpose application. As such, several test cases focused on character recognition are also examined. The results obtained present a fair trade-off between accuracy and memory requirements; and prove the validity of the proposed approach for real-time implementation on resource-constrained computing platforms.
Reducing the time requirement of k-means algorithm.
Osamor, Victor Chukwudi; Adebiyi, Ezekiel Femi; Oyelade, Jelilli Olarenwaju; Doumbia, Seydou
2012-01-01
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R(d) and an integer k. The problem is to determine a set of k points in R(d), called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARI(HA)). We found that when k is close to d, the quality is good (ARI(HA)>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI(HA)>0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data.
Reducing the Time Requirement of k-Means Algorithm
Osamor, Victor Chukwudi; Adebiyi, Ezekiel Femi; Oyelade, Jelilli Olarenwaju; Doumbia, Seydou
2012-01-01
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space Rd and an integer k. The problem is to determine a set of k points in Rd, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARIHA). We found that when k is close to d, the quality is good (ARIHA>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARIHA>0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data. PMID:23239974
NASA Technical Reports Server (NTRS)
Eigen, D. J.; Fromm, F. R.; Northouse, R. A.
1974-01-01
A new clustering algorithm is presented that is based on dimensional information. The algorithm includes an inherent feature selection criterion, which is discussed. Further, a heuristic method for choosing the proper number of intervals for a frequency distribution histogram, a feature necessary for the algorithm, is presented. The algorithm, although usable as a stand-alone clustering technique, is then utilized as a global approximator. Local clustering techniques and configuration of a global-local scheme are discussed, and finally the complete global-local and feature selector configuration is shown in application to a real-time adaptive classification scheme for the analysis of remote sensed multispectral scanner data.
Block clustering based on difference of convex functions (DC) programming and DC algorithms.
Le, Hoai Minh; Le Thi, Hoai An; Dinh, Tao Pham; Huynh, Van Ngai
2013-10-01
We investigate difference of convex functions (DC) programming and the DC algorithm (DCA) to solve the block clustering problem in the continuous framework, which traditionally requires solving a hard combinatorial optimization problem. DC reformulation techniques and exact penalty in DC programming are developed to build an appropriate equivalent DC program of the block clustering problem. They lead to an elegant and explicit DCA scheme for the resulting DC program. Computational experiments show the robustness and efficiency of the proposed algorithm and its superiority over standard algorithms such as two-mode K-means, two-mode fuzzy clustering, and block classification EM.
Online clustering algorithms for radar emitter classification.
Liu, Jun; Lee, Jim P Y; Senior; Li, Lingjie; Luo, Zhi-Quan; Wong, K Max
2005-08-01
Radar emitter classification is a special application of data clustering for classifying unknown radar emitters from received radar pulse samples. The main challenges of this task are the high dimensionality of radar pulse samples, small sample group size, and closely located radar pulse clusters. In this paper, two new online clustering algorithms are developed for radar emitter classification: One is model-based using the Minimum Description Length (MDL) criterion and the other is based on competitive learning. Computational complexity is analyzed for each algorithm and then compared. Simulation results show the superior performance of the model-based algorithm over competitive learning in terms of better classification accuracy, flexibility, and stability.
CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms.
Kohlhoff, Kai J; Sosnick, Marc H; Hsu, William T; Pande, Vijay S; Altman, Russ B
2011-08-15
Data clustering techniques are an essential component of a good data analysis toolbox. Many current bioinformatics applications are inherently compute-intense and work with very large datasets. Sequential algorithms are inadequate for providing the necessary performance. For this reason, we have created Clustering Algorithms for Massively Parallel Architectures, Including GPU Nodes (CAMPAIGN), a central resource for data clustering algorithms and tools that are implemented specifically for execution on massively parallel processing architectures. CAMPAIGN is a library of data clustering algorithms and tools, written in 'C for CUDA' for Nvidia GPUs. The library provides up to two orders of magnitude speed-up over respective CPU-based clustering algorithms and is intended as an open-source resource. New modules from the community will be accepted into the library and the layout of it is such that it can easily be extended to promising future platforms such as OpenCL. Releases of the CAMPAIGN library are freely available for download under the LGPL from https://simtk.org/home/campaign. Source code can also be obtained through anonymous subversion access as described on https://simtk.org/scm/?group_id=453. kjk33@cantab.net.
Research on the precise positioning of customers in large data environment
NASA Astrophysics Data System (ADS)
Zhou, Xu; He, Lili
2018-04-01
Customer positioning has always been a problem that enterprises focus on. In this paper, FCM clustering algorithm is used to cluster customer groups. However, due to the traditional FCM clustering algorithm, which is susceptible to the influence of the initial clustering center and easy to fall into the local optimal problem, the short board of FCM is solved by the gray optimization algorithm (GWO) to achieve efficient and accurate handling of a large number of retailer data.
An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China.
Zou, Hui; Zou, Zhihong; Wang, Xiaojing
2015-11-12
The increase and the complexity of data caused by the uncertain environment is today's reality. In order to identify water quality effectively and reliably, this paper presents a modified fast clustering algorithm for water quality analysis. The algorithm has adopted a varying weights K-means cluster algorithm to analyze water monitoring data. The varying weights scheme was the best weighting indicator selected by a modified indicator weight self-adjustment algorithm based on K-means, which is named MIWAS-K-means. The new clustering algorithm avoids the margin of the iteration not being calculated in some cases. With the fast clustering analysis, we can identify the quality of water samples. The algorithm is applied in water quality analysis of the Haihe River (China) data obtained by the monitoring network over a period of eight years (2006-2013) with four indicators at seven different sites (2078 samples). Both the theoretical and simulated results demonstrate that the algorithm is efficient and reliable for water quality analysis of the Haihe River. In addition, the algorithm can be applied to more complex data matrices with high dimensionality.
Cooperative network clustering and task allocation for heterogeneous small satellite network
NASA Astrophysics Data System (ADS)
Qin, Jing
The research of small satellite has emerged as a hot topic in recent years because of its economical prospects and convenience in launching and design. Due to the size and energy constraints of small satellites, forming a small satellite network(SSN) in which all the satellites cooperate with each other to finish tasks is an efficient and effective way to utilize them. In this dissertation, I designed and evaluated a weight based dominating set clustering algorithm, which efficiently organizes the satellites into stable clusters. The traditional clustering algorithms of large monolithic satellite networks, such as formation flying and satellite swarm, are often limited on automatic formation of clusters. Therefore, a novel Distributed Weight based Dominating Set(DWDS) clustering algorithm is designed to address the clustering problems in the stochastically deployed SSNs. Considering the unique features of small satellites, this algorithm is able to form the clusters efficiently and stably. In this algorithm, satellites are separated into different groups according to their spatial characteristics. A minimum dominating set is chosen as the candidate cluster head set based on their weights, which is a weighted combination of residual energy and connection degree. Then the cluster heads admit new neighbors that accept their invitations into the cluster, until the maximum cluster size is reached. Evaluated by the simulation results, in a SSN with 200 to 800 nodes, the algorithm is able to efficiently cluster more than 90% of nodes in 3 seconds. The Deadline Based Resource Balancing (DBRB) task allocation algorithm is designed for efficient task allocations in heterogeneous LEO small satellite networks. In the task allocation process, the dispatcher needs to consider the deadlines of the tasks as well as the residue energy of different resources for best energy utilization. We assume the tasks adopt a Map-Reduce framework, in which a task can consist of multiple subtasks. The DBRB algorithm is deployed on the head node of a cluster. It gathers the status from each cluster member and calculates their Node Importance Factors (NIFs) from the carried resources, residue power and compute capacity. The algorithm calculates the number of concurrent subtasks based on the deadlines, and allocates the subtasks to the nodes according to their NIF values. The simulation results show that when cluster members carry multiple resources, resource are more balanced and rare resources serve longer in DBRB than in the Earliest Deadline First algorithm. We also show that the algorithm performs well in service isolation by serving multiple tasks with different deadlines. Moreover, the average task response time with various cluster size settings is well controlled within deadlines as well. Except non-realtime tasks, small satellites may execute realtime tasks as well. The location-dependent tasks, such as image capturing, data transmission and remote sensing tasks are realtime tasks that are required to be started / finished on specific time. The resource energy balancing algorithm for realtime and non-realtime mixed workload is developed to efficiently schedule the tasks for best system performance. It calculates the residue energy for each resource type and tries to preserve resources and node availability when distributing tasks. Non-realtime tasks can be preempted by realtime tasks to provide better QoS to realtime tasks. I compared the performance of proposed algorithm with a random-priority scheduling algorithm, with only realtime tasks, non-realtime tasks and mixed tasks. It shows the resource energy reservation algorithm outperforms the latter one with both balanced and imbalanced workloads. Although the resource energy balancing task allocation algorithm for mixed workload provides preemption mechanism for realtime tasks, realtime tasks can still fail due to resource exhaustion. For LEO small satellite flies around the earth on stable orbits, the location-dependent realtime tasks can be considered as periodical tasks. Therefore, it is possible to reserve energy for these realtime tasks. The resource energy reservation algorithm preserves energy for the realtime tasks when the execution routine of periodical realtime tasks is known. In order to reserve energy for tasks starting very early in each period that the node does not have enough energy charged, an energy wrapping mechanism is also designed to calculate the residue energy from the previous period. The simulation results show that without energy reservation, realtime task failure rate can reach more than 60% when the workload is highly imbalanced. In contrast, the resource energy reservation produces zero RT task failures and leads to equal or better aggregate system throughput than the non-reservation algorithm. The proposed algorithm also preserves more energy because it avoids task preemption. (Abstract shortened by ProQuest.).
Population-based metaheuristic optimization in neutron optics and shielding design
NASA Astrophysics Data System (ADS)
DiJulio, D. D.; Björgvinsdóttir, H.; Zendler, C.; Bentley, P. M.
2016-11-01
Population-based metaheuristic algorithms are powerful tools in the design of neutron scattering instruments and the use of these types of algorithms for this purpose is becoming more and more commonplace. Today there exists a wide range of algorithms to choose from when designing an instrument and it is not always initially clear which may provide the best performance. Furthermore, due to the nature of these types of algorithms, the final solution found for a specific design scenario cannot always be guaranteed to be the global optimum. Therefore, to explore the potential benefits and differences between the varieties of these algorithms available, when applied to such design scenarios, we have carried out a detailed study of some commonly used algorithms. For this purpose, we have developed a new general optimization software package which combines a number of common metaheuristic algorithms within a single user interface and is designed specifically with neutronic calculations in mind. The algorithms included in the software are implementations of Particle-Swarm Optimization (PSO), Differential Evolution (DE), Artificial Bee Colony (ABC), and a Genetic Algorithm (GA). The software has been used to optimize the design of several problems in neutron optics and shielding, coupled with Monte-Carlo simulations, in order to evaluate the performance of the various algorithms. Generally, the performance of the algorithms depended on the specific scenarios, however it was found that DE provided the best average solutions in all scenarios investigated in this work.
Ghane, Narjes; Vard, Alireza; Talebi, Ardeshir; Nematollahy, Pardis
2017-01-01
Recognition of white blood cells (WBCs) is the first step to diagnose some particular diseases such as acquired immune deficiency syndrome, leukemia, and other blood-related diseases that are usually done by pathologists using an optical microscope. This process is time-consuming, extremely tedious, and expensive and needs experienced experts in this field. Thus, a computer-aided diagnosis system that assists pathologists in the diagnostic process can be so effective. Segmentation of WBCs is usually a first step in developing a computer-aided diagnosis system. The main purpose of this paper is to segment WBCs from microscopic images. For this purpose, we present a novel combination of thresholding, k-means clustering, and modified watershed algorithms in three stages including (1) segmentation of WBCs from a microscopic image, (2) extraction of nuclei from cell's image, and (3) separation of overlapping cells and nuclei. The evaluation results of the proposed method show that similarity measures, precision, and sensitivity respectively were 92.07, 96.07, and 94.30% for nucleus segmentation and 92.93, 97.41, and 93.78% for cell segmentation. In addition, statistical analysis presents high similarity between manual segmentation and the results obtained by the proposed method.
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale
Kobourov, Stephen; Gallant, Mike; Börner, Katy
2016-01-01
Overview Notions of community quality underlie the clustering of networks. While studies surrounding network clustering are increasingly common, a precise understanding of the realtionship between different cluster quality metrics is unknown. In this paper, we examine the relationship between stand-alone cluster quality metrics and information recovery metrics through a rigorous analysis of four widely-used network clustering algorithms—Louvain, Infomap, label propagation, and smart local moving. We consider the stand-alone quality metrics of modularity, conductance, and coverage, and we consider the information recovery metrics of adjusted Rand score, normalized mutual information, and a variant of normalized mutual information used in previous work. Our study includes both synthetic graphs and empirical data sets of sizes varying from 1,000 to 1,000,000 nodes. Cluster Quality Metrics We find significant differences among the results of the different cluster quality metrics. For example, clustering algorithms can return a value of 0.4 out of 1 on modularity but score 0 out of 1 on information recovery. We find conductance, though imperfect, to be the stand-alone quality metric that best indicates performance on the information recovery metrics. Additionally, our study shows that the variant of normalized mutual information used in previous work cannot be assumed to differ only slightly from traditional normalized mutual information. Network Clustering Algorithms Smart local moving is the overall best performing algorithm in our study, but discrepancies between cluster evaluation metrics prevent us from declaring it an absolutely superior algorithm. Interestingly, Louvain performed better than Infomap in nearly all the tests in our study, contradicting the results of previous work in which Infomap was superior to Louvain. We find that although label propagation performs poorly when clusters are less clearly defined, it scales efficiently and accurately to large graphs with well-defined clusters. PMID:27391786
Key-Node-Separated Graph Clustering and Layouts for Human Relationship Graph Visualization.
Itoh, Takayuki; Klein, Karsten
2015-01-01
Many graph-drawing methods apply node-clustering techniques based on the density of edges to find tightly connected subgraphs and then hierarchically visualize the clustered graphs. However, users may want to focus on important nodes and their connections to groups of other nodes for some applications. For this purpose, it is effective to separately visualize the key nodes detected based on adjacency and attributes of the nodes. This article presents a graph visualization technique for attribute-embedded graphs that applies a graph-clustering algorithm that accounts for the combination of connections and attributes. The graph clustering step divides the nodes according to the commonality of connected nodes and similarity of feature value vectors. It then calculates the distances between arbitrary pairs of clusters according to the number of connecting edges and the similarity of feature value vectors and finally places the clusters based on the distances. Consequently, the technique separates important nodes that have connections to multiple large clusters and improves the visibility of such nodes' connections. To test this technique, this article presents examples with human relationship graph datasets, including a coauthorship and Twitter communication network dataset.
Luo, Ze; Baoping, Yan; Takekawa, John Y.; Prosser, Diann J.
2012-01-01
We propose a new method to help ornithologists and ecologists discover shared segments on the migratory pathway of the bar-headed geese by time-based plane-sweeping trajectory clustering. We present a density-based time parameterized line segment clustering algorithm, which extends traditional comparable clustering algorithms from temporal and spatial dimensions. We present a time-based plane-sweeping trajectory clustering algorithm to reveal the dynamic evolution of spatial-temporal object clusters and discover common motion patterns of bar-headed geese in the process of migration. Experiments are performed on GPS-based satellite telemetry data from bar-headed geese and results demonstrate our algorithms can correctly discover shared segments of the bar-headed geese migratory pathway. We also present findings on the migratory behavior of bar-headed geese determined from this new analytical approach.
Computational gene expression profiling under salt stress reveals patterns of co-expression
Sanchita; Sharma, Ashok
2016-01-01
Plants respond differently to environmental conditions. Among various abiotic stresses, salt stress is a condition where excess salt in soil causes inhibition of plant growth. To understand the response of plants to the stress conditions, identification of the responsible genes is required. Clustering is a data mining technique used to group the genes with similar expression. The genes of a cluster show similar expression and function. We applied clustering algorithms on gene expression data of Solanum tuberosum showing differential expression in Capsicum annuum under salt stress. The clusters, which were common in multiple algorithms were taken further for analysis. Principal component analysis (PCA) further validated the findings of other cluster algorithms by visualizing their clusters in three-dimensional space. Functional annotation results revealed that most of the genes were involved in stress related responses. Our findings suggest that these algorithms may be helpful in the prediction of the function of co-expressed genes. PMID:26981411
A scalable and practical one-pass clustering algorithm for recommender system
NASA Astrophysics Data System (ADS)
Khalid, Asra; Ghazanfar, Mustansar Ali; Azam, Awais; Alahmari, Saad Ali
2015-12-01
KMeans clustering-based recommendation algorithms have been proposed claiming to increase the scalability of recommender systems. One potential drawback of these algorithms is that they perform training offline and hence cannot accommodate the incremental updates with the arrival of new data, making them unsuitable for the dynamic environments. From this line of research, a new clustering algorithm called One-Pass is proposed, which is a simple, fast, and accurate. We show empirically that the proposed algorithm outperforms K-Means in terms of recommendation and training time while maintaining a good level of accuracy.
Li, Xiaofang; Xu, Lizhong; Wang, Huibin; Song, Jie; Yang, Simon X.
2010-01-01
The traditional Low Energy Adaptive Cluster Hierarchy (LEACH) routing protocol is a clustering-based protocol. The uneven selection of cluster heads results in premature death of cluster heads and premature blind nodes inside the clusters, thus reducing the overall lifetime of the network. With a full consideration of information on energy and distance distribution of neighboring nodes inside the clusters, this paper proposes a new routing algorithm based on differential evolution (DE) to improve the LEACH routing protocol. To meet the requirements of monitoring applications in outdoor environments such as the meteorological, hydrological and wetland ecological environments, the proposed algorithm uses the simple and fast search features of DE to optimize the multi-objective selection of cluster heads and prevent blind nodes for improved energy efficiency and system stability. Simulation results show that the proposed new LEACH routing algorithm has better performance, effectively extends the working lifetime of the system, and improves the quality of the wireless sensor networks. PMID:22219670
Kernel spectral clustering with memory effect
NASA Astrophysics Data System (ADS)
Langone, Rocco; Alzate, Carlos; Suykens, Johan A. K.
2013-05-01
Evolving graphs describe many natural phenomena changing over time, such as social relationships, trade markets, metabolic networks etc. In this framework, performing community detection and analyzing the cluster evolution represents a critical task. Here we propose a new model for this purpose, where the smoothness of the clustering results over time can be considered as a valid prior knowledge. It is based on a constrained optimization formulation typical of Least Squares Support Vector Machines (LS-SVM), where the objective function is designed to explicitly incorporate temporal smoothness. The latter allows the model to cluster the current data well and to be consistent with the recent history. We also propose new model selection criteria in order to carefully choose the hyper-parameters of our model, which is a crucial issue to achieve good performances. We successfully test the model on four toy problems and on a real world network. We also compare our model with Evolutionary Spectral Clustering, which is a state-of-the-art algorithm for community detection of evolving networks, illustrating that the kernel spectral clustering with memory effect can achieve better or equal performances.
Gao, Ying; Wkram, Chris Hadri; Duan, Jiajie; Chou, Jarong
2015-01-01
In order to prolong the network lifetime, energy-efficient protocols adapted to the features of wireless sensor networks should be used. This paper explores in depth the nature of heterogeneous wireless sensor networks, and finally proposes an algorithm to address the problem of finding an effective pathway for heterogeneous clustering energy. The proposed algorithm implements cluster head selection according to the degree of energy attenuation during the network’s running and the degree of candidate nodes’ effective coverage on the whole network, so as to obtain an even energy consumption over the whole network for the situation with high degree of coverage. Simulation results show that the proposed clustering protocol has better adaptability to heterogeneous environments than existing clustering algorithms in prolonging the network lifetime. PMID:26690440
Ju, Chunhua; Xu, Chonghuan
2013-01-01
Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods.
Ju, Chunhua
2013-01-01
Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods. PMID:24381525
Convalescing Cluster Configuration Using a Superlative Framework
Sabitha, R.; Karthik, S.
2015-01-01
Competent data mining methods are vital to discover knowledge from databases which are built as a result of enormous growth of data. Various techniques of data mining are applied to obtain knowledge from these databases. Data clustering is one such descriptive data mining technique which guides in partitioning data objects into disjoint segments. K-means algorithm is a versatile algorithm among the various approaches used in data clustering. The algorithm and its diverse adaptation methods suffer certain problems in their performance. To overcome these issues a superlative algorithm has been proposed in this paper to perform data clustering. The specific feature of the proposed algorithm is discretizing the dataset, thereby improving the accuracy of clustering, and also adopting the binary search initialization method to generate cluster centroids. The generated centroids are fed as input to K-means approach which iteratively segments the data objects into respective clusters. The clustered results are measured for accuracy and validity. Experiments conducted by testing the approach on datasets from the UC Irvine Machine Learning Repository evidently show that the accuracy and validity measure is higher than the other two approaches, namely, simple K-means and Binary Search method. Thus, the proposed approach proves that discretization process will improve the efficacy of descriptive data mining tasks. PMID:26543895
A curvature-based weighted fuzzy c-means algorithm for point clouds de-noising
NASA Astrophysics Data System (ADS)
Cui, Xin; Li, Shipeng; Yan, Xiutian; He, Xinhua
2018-04-01
In order to remove the noise of three-dimensional scattered point cloud and smooth the data without damnify the sharp geometric feature simultaneity, a novel algorithm is proposed in this paper. The feature-preserving weight is added to fuzzy c-means algorithm which invented a curvature weighted fuzzy c-means clustering algorithm. Firstly, the large-scale outliers are removed by the statistics of r radius neighboring points. Then, the algorithm estimates the curvature of the point cloud data by using conicoid parabolic fitting method and calculates the curvature feature value. Finally, the proposed clustering algorithm is adapted to calculate the weighted cluster centers. The cluster centers are regarded as the new points. The experimental results show that this approach is efficient to different scale and intensities of noise in point cloud with a high precision, and perform a feature-preserving nature at the same time. Also it is robust enough to different noise model.
Liu, L L; Liu, M J; Ma, M
2015-09-28
The central task of this study was to mine the gene-to-medium relationship. Adequate knowledge of this relationship could potentially improve the accuracy of differentially expressed gene mining. One of the approaches to differentially expressed gene mining uses conventional clustering algorithms to identify the gene-to-medium relationship. Compared to conventional clustering algorithms, self-organization maps (SOMs) identify the nonlinear aspects of the gene-to-medium relationships by mapping the input space into another higher dimensional feature space. However, SOMs are not suitable for huge datasets consisting of millions of samples. Therefore, a new computational model, the Function Clustering Self-Organization Maps (FCSOMs), was developed. FCSOMs take advantage of the theory of granular computing as well as advanced statistical learning methodologies, and are built specifically for each information granule (a function cluster of genes), which are intelligently partitioned by the clustering algorithm provided by the DAVID_6.7 software platform. However, only the gene functions, and not their expression values, are considered in the fuzzy clustering algorithm of DAVID. Compared to the clustering algorithm of DAVID, these experimental results show a marked improvement in the accuracy of classification with the application of FCSOMs. FCSOMs can handle huge datasets and their complex classification problems, as each FCSOM (modeled for each function cluster) can be easily parallelized.
Yelland, Lisa N; Salter, Amy B; Ryan, Philip
2011-10-15
Modified Poisson regression, which combines a log Poisson regression model with robust variance estimation, is a useful alternative to log binomial regression for estimating relative risks. Previous studies have shown both analytically and by simulation that modified Poisson regression is appropriate for independent prospective data. This method is often applied to clustered prospective data, despite a lack of evidence to support its use in this setting. The purpose of this article is to evaluate the performance of the modified Poisson regression approach for estimating relative risks from clustered prospective data, by using generalized estimating equations to account for clustering. A simulation study is conducted to compare log binomial regression and modified Poisson regression for analyzing clustered data from intervention and observational studies. Both methods generally perform well in terms of bias, type I error, and coverage. Unlike log binomial regression, modified Poisson regression is not prone to convergence problems. The methods are contrasted by using example data sets from 2 large studies. The results presented in this article support the use of modified Poisson regression as an alternative to log binomial regression for analyzing clustered prospective data when clustering is taken into account by using generalized estimating equations.
Chromium: A Stress-Processing Framework for Interactive Rendering on Clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Humphreys, G,; Houston, M.; Ng, Y.-R.
2002-01-11
We describe Chromium, a system for manipulating streams of graphics API commands on clusters of workstations. Chromium's stream filters can be arranged to create sort-first and sort-last parallel graphics architectures that, in many cases, support the same applications while using only commodity graphics accelerators. In addition, these stream filters can be extended programmatically, allowing the user to customize the stream transformations performed by nodes in a cluster. Because our stream processing mechanism is completely general, any cluster-parallel rendering algorithm can be either implemented on top of or embedded in Chromium. In this paper, we give examples of real-world applications thatmore » use Chromium to achieve good scalability on clusters of workstations, and describe other potential uses of this stream processing technology. By completely abstracting the underlying graphics architecture, network topology, and API command processing semantics, we allow a variety of applications to run in different environments.« less
A Self-Organizing Spatial Clustering Approach to Support Large-Scale Network RTK Systems.
Shen, Lili; Guo, Jiming; Wang, Lei
2018-06-06
The network real-time kinematic (RTK) technique can provide centimeter-level real time positioning solutions and play a key role in geo-spatial infrastructure. With ever-increasing popularity, network RTK systems will face issues in the support of large numbers of concurrent users. In the past, high-precision positioning services were oriented towards professionals and only supported a few concurrent users. Currently, precise positioning provides a spatial foundation for artificial intelligence (AI), and countless smart devices (autonomous cars, unmanned aerial-vehicles (UAVs), robotic equipment, etc.) require precise positioning services. Therefore, the development of approaches to support large-scale network RTK systems is urgent. In this study, we proposed a self-organizing spatial clustering (SOSC) approach which automatically clusters online users to reduce the computational load on the network RTK system server side. The experimental results indicate that both the SOSC algorithm and the grid algorithm can reduce the computational load efficiently, while the SOSC algorithm gives a more elastic and adaptive clustering solution with different datasets. The SOSC algorithm determines the cluster number and the mean distance to cluster center (MDTCC) according to the data set, while the grid approaches are all predefined. The side-effects of clustering algorithms on the user side are analyzed with real global navigation satellite system (GNSS) data sets. The experimental results indicate that 10 km can be safely used as the cluster radius threshold for the SOSC algorithm without significantly reducing the positioning precision and reliability on the user side.
NASA Astrophysics Data System (ADS)
Zhang, Tianzhen; Wang, Xiumei; Gao, Xinbo
2018-04-01
Nowadays, several datasets are demonstrated by multi-view, which usually include shared and complementary information. Multi-view clustering methods integrate the information of multi-view to obtain better clustering results. Nonnegative matrix factorization has become an essential and popular tool in clustering methods because of its interpretation. However, existing nonnegative matrix factorization based multi-view clustering algorithms do not consider the disagreement between views and neglects the fact that different views will have different contributions to the data distribution. In this paper, we propose a new multi-view clustering method, named adaptive multi-view clustering based on nonnegative matrix factorization and pairwise co-regularization. The proposed algorithm can obtain the parts-based representation of multi-view data by nonnegative matrix factorization. Then, pairwise co-regularization is used to measure the disagreement between views. There is only one parameter to auto learning the weight values according to the contribution of each view to data distribution. Experimental results show that the proposed algorithm outperforms several state-of-the-arts algorithms for multi-view clustering.
The applicability and effectiveness of cluster analysis
NASA Technical Reports Server (NTRS)
Ingram, D. S.; Actkinson, A. L.
1973-01-01
An insight into the characteristics which determine the performance of a clustering algorithm is presented. In order for the techniques which are examined to accurately cluster data, two conditions must be simultaneously satisfied. First the data must have a particular structure, and second the parameters chosen for the clustering algorithm must be correct. By examining the structure of the data from the Cl flight line, it is clear that no single set of parameters can be used to accurately cluster all the different crops. The effectiveness of either a noniterative or iterative clustering algorithm to accurately cluster data representative of the Cl flight line is questionable. Thus extensive a prior knowledge is required in order to use cluster analysis in its present form for applications like assisting in the definition of field boundaries and evaluating the homogeneity of a field. New or modified techniques are necessary for clustering to be a reliable tool.
Internal Cluster Validation on Earthquake Data in the Province of Bengkulu
NASA Astrophysics Data System (ADS)
Rini, D. S.; Novianti, P.; Fransiska, H.
2018-04-01
K-means method is an algorithm for cluster n object based on attribute to k partition, where k < n. There is a deficiency of algorithms that is before the algorithm is executed, k points are initialized randomly so that the resulting data clustering can be different. If the random value for initialization is not good, the clustering becomes less optimum. Cluster validation is a technique to determine the optimum cluster without knowing prior information from data. There are two types of cluster validation, which are internal cluster validation and external cluster validation. This study aims to examine and apply some internal cluster validation, including the Calinski-Harabasz (CH) Index, Sillhouette (S) Index, Davies-Bouldin (DB) Index, Dunn Index (D), and S-Dbw Index on earthquake data in the Bengkulu Province. The calculation result of optimum cluster based on internal cluster validation is CH index, S index, and S-Dbw index yield k = 2, DB Index with k = 6 and Index D with k = 15. Optimum cluster (k = 6) based on DB Index gives good results for clustering earthquake in the Bengkulu Province.
NASA Technical Reports Server (NTRS)
Mach, Douglas M.; Christian, Hugh J.; Blakeslee, Richard; Boccippio, Dennis J.; Goodman, Steve J.; Boeck, William
2006-01-01
We describe the clustering algorithm used by the Lightning Imaging Sensor (LIS) and the Optical Transient Detector (OTD) for combining the lightning pulse data into events, groups, flashes, and areas. Events are single pixels that exceed the LIS/OTD background level during a single frame (2 ms). Groups are clusters of events that occur within the same frame and in adjacent pixels. Flashes are clusters of groups that occur within 330 ms and either 5.5 km (for LIS) or 16.5 km (for OTD) of each other. Areas are clusters of flashes that occur within 16.5 km of each other. Many investigators are utilizing the LIS/OTD flash data; therefore, we test how variations in the algorithms for the event group and group-flash clustering affect the flash count for a subset of the LIS data. We divided the subset into areas with low (1-3), medium (4-15), high (16-63), and very high (64+) flashes to see how changes in the clustering parameters affect the flash rates in these different sizes of areas. We found that as long as the cluster parameters are within about a factor of two of the current values, the flash counts do not change by more than about 20%. Therefore, the flash clustering algorithm used by the LIS and OTD sensors create flash rates that are relatively insensitive to reasonable variations in the clustering algorithms.
NASA Astrophysics Data System (ADS)
Rahman, Md. Habibur; Matin, M. A.; Salma, Umma
2017-12-01
The precipitation patterns of seventeen locations in Bangladesh from 1961 to 2014 were studied using a cluster analysis and metric multidimensional scaling. In doing so, the current research applies four major hierarchical clustering methods to precipitation in conjunction with different dissimilarity measures and metric multidimensional scaling. A variety of clustering algorithms were used to provide multiple clustering dendrograms for a mixture of distance measures. The dendrogram of pre-monsoon rainfall for the seventeen locations formed five clusters. The pre-monsoon precipitation data for the areas of Srimangal and Sylhet were located in two clusters across the combination of five dissimilarity measures and four hierarchical clustering algorithms. The single linkage algorithm with Euclidian and Manhattan distances, the average linkage algorithm with the Minkowski distance, and Ward's linkage algorithm provided similar results with regard to monsoon precipitation. The results of the post-monsoon and winter precipitation data are shown in different types of dendrograms with disparate combinations of sub-clusters. The schematic geometrical representations of the precipitation data using metric multidimensional scaling showed that the post-monsoon rainfall of Cox's Bazar was located far from those of the other locations. The results of a box-and-whisker plot, different clustering techniques, and metric multidimensional scaling indicated that the precipitation behaviour of Srimangal and Sylhet during the pre-monsoon season, Cox's Bazar and Sylhet during the monsoon season, Maijdi Court and Cox's Bazar during the post-monsoon season, and Cox's Bazar and Khulna during the winter differed from those at other locations in Bangladesh.
An adaptive clustering algorithm for image matching based on corner feature
NASA Astrophysics Data System (ADS)
Wang, Zhe; Dong, Min; Mu, Xiaomin; Wang, Song
2018-04-01
The traditional image matching algorithm always can not balance the real-time and accuracy better, to solve the problem, an adaptive clustering algorithm for image matching based on corner feature is proposed in this paper. The method is based on the similarity of the matching pairs of vector pairs, and the adaptive clustering is performed on the matching point pairs. Harris corner detection is carried out first, the feature points of the reference image and the perceived image are extracted, and the feature points of the two images are first matched by Normalized Cross Correlation (NCC) function. Then, using the improved algorithm proposed in this paper, the matching results are clustered to reduce the ineffective operation and improve the matching speed and robustness. Finally, the Random Sample Consensus (RANSAC) algorithm is used to match the matching points after clustering. The experimental results show that the proposed algorithm can effectively eliminate the most wrong matching points while the correct matching points are retained, and improve the accuracy of RANSAC matching, reduce the computation load of whole matching process at the same time.
Service-Aware Clustering: An Energy-Efficient Model for the Internet-of-Things
Bagula, Antoine; Abidoye, Ademola Philip; Zodi, Guy-Alain Lusilao
2015-01-01
Current generation wireless sensor routing algorithms and protocols have been designed based on a myopic routing approach, where the motes are assumed to have the same sensing and communication capabilities. Myopic routing is not a natural fit for the IoT, as it may lead to energy imbalance and subsequent short-lived sensor networks, routing the sensor readings over the most service-intensive sensor nodes, while leaving the least active nodes idle. This paper revisits the issue of energy efficiency in sensor networks to propose a clustering model where sensor devices’ service delivery is mapped into an energy awareness model, used to design a clustering algorithm that finds service-aware clustering (SAC) configurations in IoT settings. The performance evaluation reveals the relative energy efficiency of the proposed SAC algorithm compared to related routing algorithms in terms of energy consumption, the sensor nodes’ life span and its traffic engineering efficiency in terms of throughput and delay. These include the well-known low energy adaptive clustering hierarchy (LEACH) and LEACH-centralized (LEACH-C) algorithms, as well as the most recent algorithms, such as DECSA and MOCRN. PMID:26703619
Service-Aware Clustering: An Energy-Efficient Model for the Internet-of-Things.
Bagula, Antoine; Abidoye, Ademola Philip; Zodi, Guy-Alain Lusilao
2015-12-23
Current generation wireless sensor routing algorithms and protocols have been designed based on a myopic routing approach, where the motes are assumed to have the same sensing and communication capabilities. Myopic routing is not a natural fit for the IoT, as it may lead to energy imbalance and subsequent short-lived sensor networks, routing the sensor readings over the most service-intensive sensor nodes, while leaving the least active nodes idle. This paper revisits the issue of energy efficiency in sensor networks to propose a clustering model where sensor devices' service delivery is mapped into an energy awareness model, used to design a clustering algorithm that finds service-aware clustering (SAC) configurations in IoT settings. The performance evaluation reveals the relative energy efficiency of the proposed SAC algorithm compared to related routing algorithms in terms of energy consumption, the sensor nodes' life span and its traffic engineering efficiency in terms of throughput and delay. These include the well-known low energy adaptive clustering hierarchy (LEACH) and LEACH-centralized (LEACH-C) algorithms, as well as the most recent algorithms, such as DECSA and MOCRN.
NASA Astrophysics Data System (ADS)
García-Flores, Agustín.; Paz-Gallardo, Abel; Plaza, Antonio; Li, Jun
2016-10-01
This paper describes a new web platform dedicated to the classification of satellite images called Hypergim. The current implementation of this platform enables users to perform classification of satellite images from any part of the world thanks to the worldwide maps provided by Google Maps. To perform this classification, Hypergim uses unsupervised algorithms like Isodata and K-means. Here, we present an extension of the original platform in which we adapt Hypergim in order to use supervised algorithms to improve the classification results. This involves a significant modification of the user interface, providing the user with a way to obtain samples of classes present in the images to use in the training phase of the classification process. Another main goal of this development is to improve the runtime of the image classification process. To achieve this goal, we use a parallel implementation of the Random Forest classification algorithm. This implementation is a modification of the well-known CURFIL software package. The use of this type of algorithms to perform image classification is widespread today thanks to its precision and ease of training. The actual implementation of Random Forest was developed using CUDA platform, which enables us to exploit the potential of several models of NVIDIA graphics processing units using them to execute general purpose computing tasks as image classification algorithms. As well as CUDA, we use other parallel libraries as Intel Boost, taking advantage of the multithreading capabilities of modern CPUs. To ensure the best possible results, the platform is deployed in a cluster of commodity graphics processing units (GPUs), so that multiple users can use the tool in a concurrent way. The experimental results indicate that this new algorithm widely outperform the previous unsupervised algorithms implemented in Hypergim, both in runtime as well as precision of the actual classification of the images.
Convex Clustering: An Attractive Alternative to Hierarchical Clustering
Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth
2015-01-01
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340
Convex clustering: an attractive alternative to hierarchical clustering.
Chen, Gary K; Chi, Eric C; Ranola, John Michael O; Lange, Kenneth
2015-05-01
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/.
A cost-function approach to rival penalized competitive learning (RPCL).
Ma, Jinwen; Wang, Taijun
2006-08-01
Rival penalized competitive learning (RPCL) has been shown to be a useful tool for clustering on a set of sample data in which the number of clusters is unknown. However, the RPCL algorithm was proposed heuristically and is still in lack of a mathematical theory to describe its convergence behavior. In order to solve the convergence problem, we investigate it via a cost-function approach. By theoretical analysis, we prove that a general form of RPCL, called distance-sensitive RPCL (DSRPCL), is associated with the minimization of a cost function on the weight vectors of a competitive learning network. As a DSRPCL process decreases the cost to a local minimum, a number of weight vectors eventually fall into a hypersphere surrounding the sample data, while the other weight vectors diverge to infinity. Moreover, it is shown by the theoretical analysis and simulation experiments that if the cost reduces into the global minimum, a correct number of weight vectors is automatically selected and located around the centers of the actual clusters, respectively. Finally, we apply the DSRPCL algorithms to unsupervised color image segmentation and classification of the wine data.
Efficient heuristics for maximum common substructure search.
Englert, Péter; Kovács, Péter
2015-05-26
Maximum common substructure search is a computationally hard optimization problem with diverse applications in the field of cheminformatics, including similarity search, lead optimization, molecule alignment, and clustering. Most of these applications have strict constraints on running time, so heuristic methods are often preferred. However, the development of an algorithm that is both fast enough and accurate enough for most practical purposes is still a challenge. Moreover, in some applications, the quality of a common substructure depends not only on its size but also on various topological features of the one-to-one atom correspondence it defines. Two state-of-the-art heuristic algorithms for finding maximum common substructures have been implemented at ChemAxon Ltd., and effective heuristics have been developed to improve both their efficiency and the relevance of the atom mappings they provide. The implementations have been thoroughly evaluated and compared with existing solutions (KCOMBU and Indigo). The heuristics have been found to greatly improve the performance and applicability of the algorithms. The purpose of this paper is to introduce the applied methods and present the experimental results.
Mustapha, Ibrahim; Ali, Borhanuddin Mohd; Rasid, Mohd Fadlee A.; Sali, Aduwati; Mohamad, Hafizal
2015-01-01
It is well-known that clustering partitions network into logical groups of nodes in order to achieve energy efficiency and to enhance dynamic channel access in cognitive radio through cooperative sensing. While the topic of energy efficiency has been well investigated in conventional wireless sensor networks, the latter has not been extensively explored. In this paper, we propose a reinforcement learning-based spectrum-aware clustering algorithm that allows a member node to learn the energy and cooperative sensing costs for neighboring clusters to achieve an optimal solution. Each member node selects an optimal cluster that satisfies pairwise constraints, minimizes network energy consumption and enhances channel sensing performance through an exploration technique. We first model the network energy consumption and then determine the optimal number of clusters for the network. The problem of selecting an optimal cluster is formulated as a Markov Decision Process (MDP) in the algorithm and the obtained simulation results show convergence, learning and adaptability of the algorithm to dynamic environment towards achieving an optimal solution. Performance comparisons of our algorithm with the Groupwise Spectrum Aware (GWSA)-based algorithm in terms of Sum of Square Error (SSE), complexity, network energy consumption and probability of detection indicate improved performance from the proposed approach. The results further reveal that an energy savings of 9% and a significant Primary User (PU) detection improvement can be achieved with the proposed approach. PMID:26287191
Mustapha, Ibrahim; Mohd Ali, Borhanuddin; Rasid, Mohd Fadlee A; Sali, Aduwati; Mohamad, Hafizal
2015-08-13
It is well-known that clustering partitions network into logical groups of nodes in order to achieve energy efficiency and to enhance dynamic channel access in cognitive radio through cooperative sensing. While the topic of energy efficiency has been well investigated in conventional wireless sensor networks, the latter has not been extensively explored. In this paper, we propose a reinforcement learning-based spectrum-aware clustering algorithm that allows a member node to learn the energy and cooperative sensing costs for neighboring clusters to achieve an optimal solution. Each member node selects an optimal cluster that satisfies pairwise constraints, minimizes network energy consumption and enhances channel sensing performance through an exploration technique. We first model the network energy consumption and then determine the optimal number of clusters for the network. The problem of selecting an optimal cluster is formulated as a Markov Decision Process (MDP) in the algorithm and the obtained simulation results show convergence, learning and adaptability of the algorithm to dynamic environment towards achieving an optimal solution. Performance comparisons of our algorithm with the Groupwise Spectrum Aware (GWSA)-based algorithm in terms of Sum of Square Error (SSE), complexity, network energy consumption and probability of detection indicate improved performance from the proposed approach. The results further reveal that an energy savings of 9% and a significant Primary User (PU) detection improvement can be achieved with the proposed approach.
Gatos, Ilias; Tsantis, Stavros; Spiliopoulos, Stavros; Karnabatidis, Dimitris; Theotokas, Ioannis; Zoumpoulis, Pavlos; Loupas, Thanasis; Hazle, John D; Kagadis, George C
2017-09-01
The purpose of the present study was to employ a computer-aided diagnosis system that classifies chronic liver disease (CLD) using ultrasound shear wave elastography (SWE) imaging, with a stiffness value-clustering and machine-learning algorithm. A clinical data set of 126 patients (56 healthy controls, 70 with CLD) was analyzed. First, an RGB-to-stiffness inverse mapping technique was employed. A five-cluster segmentation was then performed associating corresponding different-color regions with certain stiffness value ranges acquired from the SWE manufacturer-provided color bar. Subsequently, 35 features (7 for each cluster), indicative of physical characteristics existing within the SWE image, were extracted. A stepwise regression analysis toward feature reduction was used to derive a reduced feature subset that was fed into the support vector machine classification algorithm to classify CLD from healthy cases. The highest accuracy in classification of healthy to CLD subject discrimination from the support vector machine model was 87.3% with sensitivity and specificity values of 93.5% and 81.2%, respectively. Receiver operating characteristic curve analysis gave an area under the curve value of 0.87 (confidence interval: 0.77-0.92). A machine-learning algorithm that quantifies color information in terms of stiffness values from SWE images and discriminates CLD from healthy cases is introduced. New objective parameters and criteria for CLD diagnosis employing SWE images provided by the present study can be considered an important step toward color-based interpretation, and could assist radiologists' diagnostic performance on a daily basis after being installed in a PC and employed retrospectively, immediately after the examination. Copyright © 2017 World Federation for Ultrasound in Medicine & Biology. Published by Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Ruske, Simon; Topping, David O.; Foot, Virginia E.; Kaye, Paul H.; Stanley, Warren R.; Crawford, Ian; Morse, Andrew P.; Gallagher, Martin W.
2017-03-01
Characterisation of bioaerosols has important implications within environment and public health sectors. Recent developments in ultraviolet light-induced fluorescence (UV-LIF) detectors such as the Wideband Integrated Bioaerosol Spectrometer (WIBS) and the newly introduced Multiparameter Bioaerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen.This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification.For unsupervised learning we tested hierarchical agglomerative clustering with various different linkages. For supervised learning, 11 methods were tested, including decision trees, ensemble methods (random forests, gradient boosting and AdaBoost), two implementations for support vector machines (libsvm and liblinear) and Gaussian methods (Gaussian naïve Bayesian, quadratic and linear discriminant analysis, the k-nearest neighbours algorithm and artificial neural networks).The methods were applied to two different data sets produced using the new MBS, which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. The first data set contained mixed PSLs and the second contained a variety of laboratory-generated aerosol.Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 67. 6 and 91. 1 % for the two data sets respectively. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 82. 8 and 98. 27 % of the testing data, respectively, across the two data sets.A possible alternative to gradient boosting is neural networks. We do however note that this method requires much more user input than the other methods, and we suggest that further research should be conducted using this method, especially using parallelised hardware such as the GPU, which would allow for larger networks to be trained, which could possibly yield better results.We also saw that some methods, such as clustering, failed to utilise the additional shape information provided by the instrument, whilst for others, such as the decision trees, ensemble methods and neural networks, improved performance could be attained with the inclusion of such information.
Kim, Hyoungrae; Jang, Cheongyun; Yadav, Dharmendra K; Kim, Mi-Hyun
2017-03-23
The accuracy of any 3D-QSAR, Pharmacophore and 3D-similarity based chemometric target fishing models are highly dependent on a reasonable sample of active conformations. Since a number of diverse conformational sampling algorithm exist, which exhaustively generate enough conformers, however model building methods relies on explicit number of common conformers. In this work, we have attempted to make clustering algorithms, which could find reasonable number of representative conformer ensembles automatically with asymmetric dissimilarity matrix generated from openeye tool kit. RMSD was the important descriptor (variable) of each column of the N × N matrix considered as N variables describing the relationship (network) between the conformer (in a row) and the other N conformers. This approach used to evaluate the performance of the well-known clustering algorithms by comparison in terms of generating representative conformer ensembles and test them over different matrix transformation functions considering the stability. In the network, the representative conformer group could be resampled for four kinds of algorithms with implicit parameters. The directed dissimilarity matrix becomes the only input to the clustering algorithms. Dunn index, Davies-Bouldin index, Eta-squared values and omega-squared values were used to evaluate the clustering algorithms with respect to the compactness and the explanatory power. The evaluation includes the reduction (abstraction) rate of the data, correlation between the sizes of the population and the samples, the computational complexity and the memory usage as well. Every algorithm could find representative conformers automatically without any user intervention, and they reduced the data to 14-19% of the original values within 1.13 s per sample at the most. The clustering methods are simple and practical as they are fast and do not ask for any explicit parameters. RCDTC presented the maximum Dunn and omega-squared values of the four algorithms in addition to consistent reduction rate between the population size and the sample size. The performance of the clustering algorithms was consistent over different transformation functions. Moreover, the clustering method can also be applied to molecular dynamics sampling simulation results.
A Modified MinMax k-Means Algorithm Based on PSO.
Wang, Xiaoyan; Bai, Yanping
The MinMax k -means algorithm is widely used to tackle the effect of bad initialization by minimizing the maximum intraclustering errors. Two parameters, including the exponent parameter and memory parameter, are involved in the executive process. Since different parameters have different clustering errors, it is crucial to choose appropriate parameters. In the original algorithm, a practical framework is given. Such framework extends the MinMax k -means to automatically adapt the exponent parameter to the data set. It has been believed that if the maximum exponent parameter has been set, then the programme can reach the lowest intraclustering errors. However, our experiments show that this is not always correct. In this paper, we modified the MinMax k -means algorithm by PSO to determine the proper values of parameters which can subject the algorithm to attain the lowest clustering errors. The proposed clustering method is tested on some favorite data sets in several different initial situations and is compared to the k -means algorithm and the original MinMax k -means algorithm. The experimental results indicate that our proposed algorithm can reach the lowest clustering errors automatically.
NASA Astrophysics Data System (ADS)
Farsadnia, Farhad; Ghahreman, Bijan
2016-04-01
Hydrologic homogeneous group identification is considered both fundamental and applied research in hydrology. Clustering methods are among conventional methods to assess the hydrological homogeneous regions. Recently, Self-Organizing feature Map (SOM) method has been applied in some studies. However, the main problem of this method is the interpretation on the output map of this approach. Therefore, SOM is used as input to other clustering algorithms. The aim of this study is to apply a two-level Self-Organizing feature map and Ward hierarchical clustering method to determine the hydrologic homogenous regions in North and Razavi Khorasan provinces. At first by principal component analysis, we reduced SOM input matrix dimension, then the SOM was used to form a two-dimensional features map. To determine homogeneous regions for flood frequency analysis, SOM output nodes were used as input into the Ward method. Generally, the regions identified by the clustering algorithms are not statistically homogeneous. Consequently, they have to be adjusted to improve their homogeneity. After adjustment of the homogeneity regions by L-moment tests, five hydrologic homogeneous regions were identified. Finally, adjusted regions were created by a two-level SOM and then the best regional distribution function and associated parameters were selected by the L-moment approach. The results showed that the combination of self-organizing maps and Ward hierarchical clustering by principal components as input is more effective than the hierarchical method, by principal components or standardized inputs to achieve hydrologic homogeneous regions.
A Parallel Processing Algorithm for Remote Sensing Classification
NASA Technical Reports Server (NTRS)
Gualtieri, J. Anthony
2005-01-01
A current thread in parallel computation is the use of cluster computers created by networking a few to thousands of commodity general-purpose workstation-level commuters using the Linux operating system. For example on the Medusa cluster at NASA/GSFC, this provides for super computing performance, 130 G(sub flops) (Linpack Benchmark) at moderate cost, $370K. However, to be useful for scientific computing in the area of Earth science, issues of ease of programming, access to existing scientific libraries, and portability of existing code need to be considered. In this paper, I address these issues in the context of tools for rendering earth science remote sensing data into useful products. In particular, I focus on a problem that can be decomposed into a set of independent tasks, which on a serial computer would be performed sequentially, but with a cluster computer can be performed in parallel, giving an obvious speedup. To make the ideas concrete, I consider the problem of classifying hyperspectral imagery where some ground truth is available to train the classifier. In particular I will use the Support Vector Machine (SVM) approach as applied to hyperspectral imagery. The approach will be to introduce notions about parallel computation and then to restrict the development to the SVM problem. Pseudocode (an outline of the computation) will be described and then details specific to the implementation will be given. Then timing results will be reported to show what speedups are possible using parallel computation. The paper will close with a discussion of the results.
Cluster analysis of Southeastern U.S. climate stations
NASA Astrophysics Data System (ADS)
Stooksbury, D. E.; Michaels, P. J.
1991-09-01
A two-step cluster analysis of 449 Southeastern climate stations is used to objectively determine general climate clusters (groups of climate stations) for eight southeastern states. The purpose is objectively to define regions of climatic homogeneity that should perform more robustly in subsequent climatic impact models. This type of analysis has been successfully used in many related climate research problems including the determination of corn/climate districts in Iowa (Ortiz-Valdez, 1985) and the classification of synoptic climate types (Davis, 1988). These general climate clusters may be more appropriate for climate research than the standard climate divisions (CD) groupings of climate stations, which are modifications of the agro-economic United States Department of Agriculture crop reporting districts. Unlike the CD's, these objectively determined climate clusters are not restricted by state borders and thus have reduced multicollinearity which makes them more appropriate for the study of the impact of climate and climatic change.
Learner Typologies Development Using OIndex and Data Mining Based Clustering Techniques
ERIC Educational Resources Information Center
Luan, Jing
2004-01-01
This explorative data mining project used distance based clustering algorithm to study 3 indicators, called OIndex, of student behavioral data and stabilized at a 6-cluster scenario following an exhaustive explorative study of 4, 5, and 6 cluster scenarios produced by K-Means and TwoStep algorithms. Using principles in data mining, the study…
Self-organization and clustering algorithms
NASA Technical Reports Server (NTRS)
Bezdek, James C.
1991-01-01
Kohonen's feature maps approach to clustering is often likened to the k or c-means clustering algorithms. Here, the author identifies some similarities and differences between the hard and fuzzy c-Means (HCM/FCM) or ISODATA algorithms and Kohonen's self-organizing approach. The author concludes that some differences are significant, but at the same time there may be some important unknown relationships between the two methodologies. Several avenues of research are proposed.
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale.
Emmons, Scott; Kobourov, Stephen; Gallant, Mike; Börner, Katy
2016-01-01
Notions of community quality underlie the clustering of networks. While studies surrounding network clustering are increasingly common, a precise understanding of the realtionship between different cluster quality metrics is unknown. In this paper, we examine the relationship between stand-alone cluster quality metrics and information recovery metrics through a rigorous analysis of four widely-used network clustering algorithms-Louvain, Infomap, label propagation, and smart local moving. We consider the stand-alone quality metrics of modularity, conductance, and coverage, and we consider the information recovery metrics of adjusted Rand score, normalized mutual information, and a variant of normalized mutual information used in previous work. Our study includes both synthetic graphs and empirical data sets of sizes varying from 1,000 to 1,000,000 nodes. We find significant differences among the results of the different cluster quality metrics. For example, clustering algorithms can return a value of 0.4 out of 1 on modularity but score 0 out of 1 on information recovery. We find conductance, though imperfect, to be the stand-alone quality metric that best indicates performance on the information recovery metrics. Additionally, our study shows that the variant of normalized mutual information used in previous work cannot be assumed to differ only slightly from traditional normalized mutual information. Smart local moving is the overall best performing algorithm in our study, but discrepancies between cluster evaluation metrics prevent us from declaring it an absolutely superior algorithm. Interestingly, Louvain performed better than Infomap in nearly all the tests in our study, contradicting the results of previous work in which Infomap was superior to Louvain. We find that although label propagation performs poorly when clusters are less clearly defined, it scales efficiently and accurately to large graphs with well-defined clusters.
Reconstruction of a digital core containing clay minerals based on a clustering algorithm.
He, Yanlong; Pu, Chunsheng; Jing, Cheng; Gu, Xiaoyu; Chen, Qingdong; Liu, Hongzhi; Khan, Nasir; Dong, Qiaoling
2017-10-01
It is difficult to obtain a core sample and information for digital core reconstruction of mature sandstone reservoirs around the world, especially for an unconsolidated sandstone reservoir. Meanwhile, reconstruction and division of clay minerals play a vital role in the reconstruction of the digital cores, although the two-dimensional data-based reconstruction methods are specifically applicable as the microstructure reservoir simulation methods for the sandstone reservoir. However, reconstruction of clay minerals is still challenging from a research viewpoint for the better reconstruction of various clay minerals in the digital cores. In the present work, the content of clay minerals was considered on the basis of two-dimensional information about the reservoir. After application of the hybrid method, and compared with the model reconstructed by the process-based method, the digital core containing clay clusters without the labels of the clusters' number, size, and texture were the output. The statistics and geometry of the reconstruction model were similar to the reference model. In addition, the Hoshen-Kopelman algorithm was used to label various connected unclassified clay clusters in the initial model and then the number and size of clay clusters were recorded. At the same time, the K-means clustering algorithm was applied to divide the labeled, large connecting clusters into smaller clusters on the basis of difference in the clusters' characteristics. According to the clay minerals' characteristics, such as types, textures, and distributions, the digital core containing clay minerals was reconstructed by means of the clustering algorithm and the clay clusters' structure judgment. The distributions and textures of the clay minerals of the digital core were reasonable. The clustering algorithm improved the digital core reconstruction and provided an alternative method for the simulation of different clay minerals in the digital cores.
Validating clustering of molecular dynamics simulations using polymer models.
Phillips, Joshua L; Colvin, Michael E; Newsam, Shawn
2011-11-14
Molecular dynamics (MD) simulation is a powerful technique for sampling the meta-stable and transitional conformations of proteins and other biomolecules. Computational data clustering has emerged as a useful, automated technique for extracting conformational states from MD simulation data. Despite extensive application, relatively little work has been done to determine if the clustering algorithms are actually extracting useful information. A primary goal of this paper therefore is to provide such an understanding through a detailed analysis of data clustering applied to a series of increasingly complex biopolymer models. We develop a novel series of models using basic polymer theory that have intuitive, clearly-defined dynamics and exhibit the essential properties that we are seeking to identify in MD simulations of real biomolecules. We then apply spectral clustering, an algorithm particularly well-suited for clustering polymer structures, to our models and MD simulations of several intrinsically disordered proteins. Clustering results for the polymer models provide clear evidence that the meta-stable and transitional conformations are detected by the algorithm. The results for the polymer models also help guide the analysis of the disordered protein simulations by comparing and contrasting the statistical properties of the extracted clusters. We have developed a framework for validating the performance and utility of clustering algorithms for studying molecular biopolymer simulations that utilizes several analytic and dynamic polymer models which exhibit well-behaved dynamics including: meta-stable states, transition states, helical structures, and stochastic dynamics. We show that spectral clustering is robust to anomalies introduced by structural alignment and that different structural classes of intrinsically disordered proteins can be reliably discriminated from the clustering results. To our knowledge, our framework is the first to utilize model polymers to rigorously test the utility of clustering algorithms for studying biopolymers.
Validating clustering of molecular dynamics simulations using polymer models
2011-01-01
Background Molecular dynamics (MD) simulation is a powerful technique for sampling the meta-stable and transitional conformations of proteins and other biomolecules. Computational data clustering has emerged as a useful, automated technique for extracting conformational states from MD simulation data. Despite extensive application, relatively little work has been done to determine if the clustering algorithms are actually extracting useful information. A primary goal of this paper therefore is to provide such an understanding through a detailed analysis of data clustering applied to a series of increasingly complex biopolymer models. Results We develop a novel series of models using basic polymer theory that have intuitive, clearly-defined dynamics and exhibit the essential properties that we are seeking to identify in MD simulations of real biomolecules. We then apply spectral clustering, an algorithm particularly well-suited for clustering polymer structures, to our models and MD simulations of several intrinsically disordered proteins. Clustering results for the polymer models provide clear evidence that the meta-stable and transitional conformations are detected by the algorithm. The results for the polymer models also help guide the analysis of the disordered protein simulations by comparing and contrasting the statistical properties of the extracted clusters. Conclusions We have developed a framework for validating the performance and utility of clustering algorithms for studying molecular biopolymer simulations that utilizes several analytic and dynamic polymer models which exhibit well-behaved dynamics including: meta-stable states, transition states, helical structures, and stochastic dynamics. We show that spectral clustering is robust to anomalies introduced by structural alignment and that different structural classes of intrinsically disordered proteins can be reliably discriminated from the clustering results. To our knowledge, our framework is the first to utilize model polymers to rigorously test the utility of clustering algorithms for studying biopolymers. PMID:22082218
Dong, J; Hayakawa, Y; Kober, C
2014-01-01
When metallic prosthetic appliances and dental fillings exist in the oral cavity, the appearance of metal-induced streak artefacts is not avoidable in CT images. The aim of this study was to develop a method for artefact reduction using the statistical reconstruction on multidetector row CT images. Adjacent CT images often depict similar anatomical structures. Therefore, reconstructed images with weak artefacts were attempted using projection data of an artefact-free image in a neighbouring thin slice. Images with moderate and strong artefacts were continuously processed in sequence by successive iterative restoration where the projection data was generated from the adjacent reconstructed slice. First, the basic maximum likelihood-expectation maximization algorithm was applied. Next, the ordered subset-expectation maximization algorithm was examined. Alternatively, a small region of interest setting was designated. Finally, the general purpose graphic processing unit machine was applied in both situations. The algorithms reduced the metal-induced streak artefacts on multidetector row CT images when the sequential processing method was applied. The ordered subset-expectation maximization and small region of interest reduced the processing duration without apparent detriments. A general-purpose graphic processing unit realized the high performance. A statistical reconstruction method was applied for the streak artefact reduction. The alternative algorithms applied were effective. Both software and hardware tools, such as ordered subset-expectation maximization, small region of interest and general-purpose graphic processing unit achieved fast artefact correction.
An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China
Zou, Hui; Zou, Zhihong; Wang, Xiaojing
2015-01-01
The increase and the complexity of data caused by the uncertain environment is today’s reality. In order to identify water quality effectively and reliably, this paper presents a modified fast clustering algorithm for water quality analysis. The algorithm has adopted a varying weights K-means cluster algorithm to analyze water monitoring data. The varying weights scheme was the best weighting indicator selected by a modified indicator weight self-adjustment algorithm based on K-means, which is named MIWAS-K-means. The new clustering algorithm avoids the margin of the iteration not being calculated in some cases. With the fast clustering analysis, we can identify the quality of water samples. The algorithm is applied in water quality analysis of the Haihe River (China) data obtained by the monitoring network over a period of eight years (2006–2013) with four indicators at seven different sites (2078 samples). Both the theoretical and simulated results demonstrate that the algorithm is efficient and reliable for water quality analysis of the Haihe River. In addition, the algorithm can be applied to more complex data matrices with high dimensionality. PMID:26569283
3D segmentations of neuronal nuclei from confocal microscope image stacks
LaTorre, Antonio; Alonso-Nanclares, Lidia; Muelas, Santiago; Peña, José-María; DeFelipe, Javier
2013-01-01
In this paper, we present an algorithm to create 3D segmentations of neuronal cells from stacks of previously segmented 2D images. The idea behind this proposal is to provide a general method to reconstruct 3D structures from 2D stacks, regardless of how these 2D stacks have been obtained. The algorithm not only reuses the information obtained in the 2D segmentation, but also attempts to correct some typical mistakes made by the 2D segmentation algorithms (for example, under segmentation of tightly-coupled clusters of cells). We have tested our algorithm in a real scenario—the segmentation of the neuronal nuclei in different layers of the rat cerebral cortex. Several representative images from different layers of the cerebral cortex have been considered and several 2D segmentation algorithms have been compared. Furthermore, the algorithm has also been compared with the traditional 3D Watershed algorithm and the results obtained here show better performance in terms of correctly identified neuronal nuclei. PMID:24409123
3D segmentations of neuronal nuclei from confocal microscope image stacks.
Latorre, Antonio; Alonso-Nanclares, Lidia; Muelas, Santiago; Peña, José-María; Defelipe, Javier
2013-01-01
In this paper, we present an algorithm to create 3D segmentations of neuronal cells from stacks of previously segmented 2D images. The idea behind this proposal is to provide a general method to reconstruct 3D structures from 2D stacks, regardless of how these 2D stacks have been obtained. The algorithm not only reuses the information obtained in the 2D segmentation, but also attempts to correct some typical mistakes made by the 2D segmentation algorithms (for example, under segmentation of tightly-coupled clusters of cells). We have tested our algorithm in a real scenario-the segmentation of the neuronal nuclei in different layers of the rat cerebral cortex. Several representative images from different layers of the cerebral cortex have been considered and several 2D segmentation algorithms have been compared. Furthermore, the algorithm has also been compared with the traditional 3D Watershed algorithm and the results obtained here show better performance in terms of correctly identified neuronal nuclei.
Alignment and integration of complex networks by hypergraph-based spectral clustering
NASA Astrophysics Data System (ADS)
Michoel, Tom; Nachtergaele, Bruno
2012-11-01
Complex networks possess a rich, multiscale structure reflecting the dynamical and functional organization of the systems they model. Often there is a need to analyze multiple networks simultaneously, to model a system by more than one type of interaction, or to go beyond simple pairwise interactions, but currently there is a lack of theoretical and computational methods to address these problems. Here we introduce a framework for clustering and community detection in such systems using hypergraph representations. Our main result is a generalization of the Perron-Frobenius theorem from which we derive spectral clustering algorithms for directed and undirected hypergraphs. We illustrate our approach with applications for local and global alignment of protein-protein interaction networks between multiple species, for tripartite community detection in folksonomies, and for detecting clusters of overlapping regulatory pathways in directed networks.
Alignment and integration of complex networks by hypergraph-based spectral clustering.
Michoel, Tom; Nachtergaele, Bruno
2012-11-01
Complex networks possess a rich, multiscale structure reflecting the dynamical and functional organization of the systems they model. Often there is a need to analyze multiple networks simultaneously, to model a system by more than one type of interaction, or to go beyond simple pairwise interactions, but currently there is a lack of theoretical and computational methods to address these problems. Here we introduce a framework for clustering and community detection in such systems using hypergraph representations. Our main result is a generalization of the Perron-Frobenius theorem from which we derive spectral clustering algorithms for directed and undirected hypergraphs. We illustrate our approach with applications for local and global alignment of protein-protein interaction networks between multiple species, for tripartite community detection in folksonomies, and for detecting clusters of overlapping regulatory pathways in directed networks.
Eyler, Lauren; Hubbard, Alan; Juillard, Catherine
2016-10-01
Low and middle-income countries (LMICs) and the world's poor bear a disproportionate share of the global burden of injury. Data regarding disparities in injury are vital to inform injury prevention and trauma systems strengthening interventions targeted towards vulnerable populations, but are limited in LMICs. We aim to facilitate injury disparities research by generating a standardized methodology for assessing economic status in resource-limited country trauma registries where complex metrics such as income, expenditures, and wealth index are infeasible to assess. To address this need, we developed a cluster analysis-based algorithm for generating simple population-specific metrics of economic status using nationally representative Demographic and Health Surveys (DHS) household assets data. For a limited number of variables, g, our algorithm performs weighted k-medoids clustering of the population using all combinations of g asset variables and selects the combination of variables and number of clusters that maximize average silhouette width (ASW). In simulated datasets containing both randomly distributed variables and "true" population clusters defined by correlated categorical variables, the algorithm selected the correct variable combination and appropriate cluster numbers unless variable correlation was very weak. When used with 2011 Cameroonian DHS data, our algorithm identified twenty economic clusters with ASW 0.80, indicating well-defined population clusters. This economic model for assessing health disparities will be used in the new Cameroonian six-hospital centralized trauma registry. By describing our standardized methodology and algorithm for generating economic clustering models, we aim to facilitate measurement of health disparities in other trauma registries in resource-limited countries. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
An effective fuzzy kernel clustering analysis approach for gene expression data.
Sun, Lin; Xu, Jiucheng; Yin, Jiaojiao
2015-01-01
Fuzzy clustering is an important tool for analyzing microarray data. A major problem in applying fuzzy clustering method to microarray gene expression data is the choice of parameters with cluster number and centers. This paper proposes a new approach to fuzzy kernel clustering analysis (FKCA) that identifies desired cluster number and obtains more steady results for gene expression data. First of all, to optimize characteristic differences and estimate optimal cluster number, Gaussian kernel function is introduced to improve spectrum analysis method (SAM). By combining subtractive clustering with max-min distance mean, maximum distance method (MDM) is proposed to determine cluster centers. Then, the corresponding steps of improved SAM (ISAM) and MDM are given respectively, whose superiority and stability are illustrated through performing experimental comparisons on gene expression data. Finally, by introducing ISAM and MDM into FKCA, an effective improved FKCA algorithm is proposed. Experimental results from public gene expression data and UCI database show that the proposed algorithms are feasible for cluster analysis, and the clustering accuracy is higher than the other related clustering algorithms.
NASA Astrophysics Data System (ADS)
Ebrahimi, A.; Pahlavani, P.; Masoumi, Z.
2017-09-01
Traffic monitoring and managing in urban intelligent transportation systems (ITS) can be carried out based on vehicular sensor networks. In a vehicular sensor network, vehicles equipped with sensors such as GPS, can act as mobile sensors for sensing the urban traffic and sending the reports to a traffic monitoring center (TMC) for traffic estimation. The energy consumption by the sensor nodes is a main problem in the wireless sensor networks (WSNs); moreover, it is the most important feature in designing these networks. Clustering the sensor nodes is considered as an effective solution to reduce the energy consumption of WSNs. Each cluster should have a Cluster Head (CH), and a number of nodes located within its supervision area. The cluster heads are responsible for gathering and aggregating the information of clusters. Then, it transmits the information to the data collection center. Hence, the use of clustering decreases the volume of transmitting information, and, consequently, reduces the energy consumption of network. In this paper, Fuzzy C-Means (FCM) and Fuzzy Subtractive algorithms are employed to cluster sensors and investigate their performance on the energy consumption of sensors. It can be seen that the FCM algorithm and Fuzzy Subtractive have been reduced energy consumption of vehicle sensors up to 90.68% and 92.18%, respectively. Comparing the performance of the algorithms implies the 1.5 percent improvement in Fuzzy Subtractive algorithm in comparison.
Dräger, Andreas; Kronfeld, Marcel; Ziller, Michael J; Supper, Jochen; Planatscher, Hannes; Magnus, Jørgen B; Oldiges, Marco; Kohlbacher, Oliver; Zell, Andreas
2009-01-01
Background To understand the dynamic behavior of cellular systems, mathematical modeling is often necessary and comprises three steps: (1) experimental measurement of participating molecules, (2) assignment of rate laws to each reaction, and (3) parameter calibration with respect to the measurements. In each of these steps the modeler is confronted with a plethora of alternative approaches, e. g., the selection of approximative rate laws in step two as specific equations are often unknown, or the choice of an estimation procedure with its specific settings in step three. This overall process with its numerous choices and the mutual influence between them makes it hard to single out the best modeling approach for a given problem. Results We investigate the modeling process using multiple kinetic equations together with various parameter optimization methods for a well-characterized example network, the biosynthesis of valine and leucine in C. glutamicum. For this purpose, we derive seven dynamic models based on generalized mass action, Michaelis-Menten and convenience kinetics as well as the stochastic Langevin equation. In addition, we introduce two modeling approaches for feedback inhibition to the mass action kinetics. The parameters of each model are estimated using eight optimization strategies. To determine the most promising modeling approaches together with the best optimization algorithms, we carry out a two-step benchmark: (1) coarse-grained comparison of the algorithms on all models and (2) fine-grained tuning of the best optimization algorithms and models. To analyze the space of the best parameters found for each model, we apply clustering, variance, and correlation analysis. Conclusion A mixed model based on the convenience rate law and the Michaelis-Menten equation, in which all reactions are assumed to be reversible, is the most suitable deterministic modeling approach followed by a reversible generalized mass action kinetics model. A Langevin model is advisable to take stochastic effects into account. To estimate the model parameters, three algorithms are particularly useful: For first attempts the settings-free Tribes algorithm yields valuable results. Particle swarm optimization and differential evolution provide significantly better results with appropriate settings. PMID:19144170
Optimization of Selected Remote Sensing Algorithms for Embedded NVIDIA Kepler GPU Architecture
NASA Technical Reports Server (NTRS)
Riha, Lubomir; Le Moigne, Jacqueline; El-Ghazawi, Tarek
2015-01-01
This paper evaluates the potential of embedded Graphic Processing Units in the Nvidias Tegra K1 for onboard processing. The performance is compared to a general purpose multi-core CPU and full fledge GPU accelerator. This study uses two algorithms: Wavelet Spectral Dimension Reduction of Hyperspectral Imagery and Automated Cloud-Cover Assessment (ACCA) Algorithm. Tegra K1 achieved 51 for ACCA algorithm and 20 for the dimension reduction algorithm, as compared to the performance of the high-end 8-core server Intel Xeon CPU with 13.5 times higher power consumption.
NASA Astrophysics Data System (ADS)
Arimbi, Mentari Dian; Bustamam, Alhadi; Lestari, Dian
2017-03-01
Data clustering can be executed through partition or hierarchical method for many types of data including DNA sequences. Both clustering methods can be combined by processing partition algorithm in the first level and hierarchical in the second level, called hybrid clustering. In the partition phase some popular methods such as PAM, K-means, or Fuzzy c-means methods could be applied. In this study we selected partitioning around medoids (PAM) in our partition stage. Furthermore, following the partition algorithm, in hierarchical stage we applied divisive analysis algorithm (DIANA) in order to have more specific clusters and sub clusters structures. The number of main clusters is determined using Davies Bouldin Index (DBI) value. We choose the optimal number of clusters if the results minimize the DBI value. In this work, we conduct the clustering on 1252 HPV DNA sequences data from GenBank. The characteristic extraction is initially performed, followed by normalizing and genetic distance calculation using Euclidean distance. In our implementation, we used the hybrid PAM and DIANA using the R open source programming tool. In our results, we obtained 3 main clusters with average DBI value is 0.979, using PAM in the first stage. After executing DIANA in the second stage, we obtained 4 sub clusters for Cluster-1, 9 sub clusters for Cluster-2 and 2 sub clusters in Cluster-3, with the BDI value 0.972, 0.771, and 0.768 for each main cluster respectively. Since the second stage produce lower DBI value compare to the DBI value in the first stage, we conclude that this hybrid approach can improve the accuracy of our clustering results.
Detection of protein complex from protein-protein interaction network using Markov clustering
NASA Astrophysics Data System (ADS)
Ochieng, P. J.; Kusuma, W. A.; Haryanto, T.
2017-05-01
Detection of complexes, or groups of functionally related proteins, is an important challenge while analysing biological networks. However, existing algorithms to identify protein complexes are insufficient when applied to dense networks of experimentally derived interaction data. Therefore, we introduced a graph clustering method based on Markov clustering algorithm to identify protein complex within highly interconnected protein-protein interaction networks. Protein-protein interaction network was first constructed to develop geometrical network, the network was then partitioned using Markov clustering to detect protein complexes. The interest of the proposed method was illustrated by its application to Human Proteins associated to type II diabetes mellitus. Flow simulation of MCL algorithm was initially performed and topological properties of the resultant network were analysed for detection of the protein complex. The results indicated the proposed method successfully detect an overall of 34 complexes with 11 complexes consisting of overlapping modules and 20 non-overlapping modules. The major complex consisted of 102 proteins and 521 interactions with cluster modularity and density of 0.745 and 0.101 respectively. The comparison analysis revealed MCL out perform AP, MCODE and SCPS algorithms with high clustering coefficient (0.751) network density and modularity index (0.630). This demonstrated MCL was the most reliable and efficient graph clustering algorithm for detection of protein complexes from PPI networks.
Adamczak, Rafal; Meller, Jarek
2016-12-28
Advances in computing have enabled current protein and RNA structure prediction and molecular simulation methods to dramatically increase their sampling of conformational spaces. The quickly growing number of experimentally resolved structures, and databases such as the Protein Data Bank, also implies large scale structural similarity analyses to retrieve and classify macromolecular data. Consequently, the computational cost of structure comparison and clustering for large sets of macromolecular structures has become a bottleneck that necessitates further algorithmic improvements and development of efficient software solutions. uQlust is a versatile and easy-to-use tool for ultrafast ranking and clustering of macromolecular structures. uQlust makes use of structural profiles of proteins and nucleic acids, while combining a linear-time algorithm for implicit comparison of all pairs of models with profile hashing to enable efficient clustering of large data sets with a low memory footprint. In addition to ranking and clustering of large sets of models of the same protein or RNA molecule, uQlust can also be used in conjunction with fragment-based profiles in order to cluster structures of arbitrary length. For example, hierarchical clustering of the entire PDB using profile hashing can be performed on a typical laptop, thus opening an avenue for structural explorations previously limited to dedicated resources. The uQlust package is freely available under the GNU General Public License at https://github.com/uQlust . uQlust represents a drastic reduction in the computational complexity and memory requirements with respect to existing clustering and model quality assessment methods for macromolecular structure analysis, while yielding results on par with traditional approaches for both proteins and RNAs.
SLURM: Simple Linux Utility for Resource Management
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jette, M; Dunlap, C; Garlick, J
2002-07-08
Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes. Components include machine status, partition management, job management, scheduling and stream copy modules. The design also includes a scalable, general-purpose communication infrastructure. This paper presents a overview of the SLURM architecture and functionality.
NETRA: A parallel architecture for integrated vision systems. 1: Architecture and organization
NASA Technical Reports Server (NTRS)
Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is considered to be a system that uses vision algorithms from all levels of processing for a high level application (such as object recognition). A model of computation is presented for parallel processing for an IVS. Using the model, desired features and capabilities of a parallel architecture suitable for IVSs are derived. Then a multiprocessor architecture (called NETRA) is presented. This architecture is highly flexible without the use of complex interconnection schemes. The topology of NETRA is recursively defined and hence is easily scalable from small to large systems. Homogeneity of NETRA permits fault tolerance and graceful degradation under faults. It is a recursively defined tree-type hierarchical architecture where each of the leaf nodes consists of a cluster of processors connected with a programmable crossbar with selective broadcast capability to provide for desired flexibility. A qualitative evaluation of NETRA is presented. Then general schemes are described to map parallel algorithms onto NETRA. Algorithms are classified according to their communication requirements for parallel processing. An extensive analysis of inter-cluster communication strategies in NETRA is presented, and parameters affecting performance of parallel algorithms when mapped on NETRA are discussed. Finally, a methodology to evaluate performance of algorithms on NETRA is described.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Solomon, Justin, E-mail: justin.solomon@duke.edu; Samei, Ehsan
2014-09-15
Purpose: Quantum noise properties of CT images are generally assessed using simple geometric phantoms with uniform backgrounds. Such phantoms may be inadequate when assessing nonlinear reconstruction or postprocessing algorithms. The purpose of this study was to design anatomically informed textured phantoms and use the phantoms to assess quantum noise properties across two clinically available reconstruction algorithms, filtered back projection (FBP) and sinogram affirmed iterative reconstruction (SAFIRE). Methods: Two phantoms were designed to represent lung and soft-tissue textures. The lung phantom included intricate vessel-like structures along with embedded nodules (spherical, lobulated, and spiculated). The soft tissue phantom was designed based onmore » a three-dimensional clustered lumpy background with included low-contrast lesions (spherical and anthropomorphic). The phantoms were built using rapid prototyping (3D printing) technology and, along with a uniform phantom of similar size, were imaged on a Siemens SOMATOM Definition Flash CT scanner and reconstructed with FBP and SAFIRE. Fifty repeated acquisitions were acquired for each background type and noise was assessed by estimating pixel-value statistics, such as standard deviation (i.e., noise magnitude), autocorrelation, and noise power spectrum. Noise stationarity was also assessed by examining the spatial distribution of noise magnitude. The noise properties were compared across background types and between the two reconstruction algorithms. Results: In FBP and SAFIRE images, noise was globally nonstationary for all phantoms. In FBP images of all phantoms, and in SAFIRE images of the uniform phantom, noise appeared to be locally stationary (within a reasonably small region of interest). Noise was locally nonstationary in SAFIRE images of the textured phantoms with edge pixels showing higher noise magnitude compared to pixels in more homogenous regions. For pixels in uniform regions, noise magnitude was reduced by an average of 60% in SAFIRE images compared to FBP. However, for edge pixels, noise magnitude ranged from 20% higher to 40% lower in SAFIRE images compared to FBP. SAFIRE images of the lung phantom exhibited distinct regions with varying noise texture (i.e., noise autocorrelation/power spectra). Conclusions: Quantum noise properties observed in uniform phantoms may not be representative of those in actual patients for nonlinear reconstruction algorithms. Anatomical texture should be considered when evaluating the performance of CT systems that use such nonlinear algorithms.« less
NASA Astrophysics Data System (ADS)
Kumar, Rohit; Puri, Rajeev K.
2018-03-01
Employing the quantum molecular dynamics (QMD) approach for nucleus-nucleus collisions, we test the predictive power of the energy-based clusterization algorithm, i.e., the simulating annealing clusterization algorithm (SACA), to describe the experimental data of charge distribution and various event-by-event correlations among fragments. The calculations are constrained into the Fermi-energy domain and/or mildly excited nuclear matter. Our detailed study spans over different system masses, and system-mass asymmetries of colliding partners show the importance of the energy-based clusterization algorithm for understanding multifragmentation. The present calculations are also compared with the other available calculations, which use one-body models, statistical models, and/or hybrid models.
1988-03-31
radar operation and data - collection activities, a large data -analysis effort has been under way in support of automatic wind-shear detection algorithm ...REDUCTION AND ALGORITHM DEVELOPMENT 49 A. General-Purpose Software 49 B. Concurrent Computer Systems 49 C. Sun Workstations 51 D. Radar Data Analysis 52...1. Algorithm Verification 52 2. Other Studies 53 3. Translations 54 4. Outside Distributions 55 E. Mesonet/LLWAS Data Analysis 55 1. 1985 Data 55 2
Optimized data fusion for K-means Laplacian clustering
Yu, Shi; Liu, Xinhai; Tranchevent, Léon-Charles; Glänzel, Wolfgang; Suykens, Johan A. K.; De Moor, Bart; Moreau, Yves
2011-01-01
Motivation: We propose a novel algorithm to combine multiple kernels and Laplacians for clustering analysis. The new algorithm is formulated on a Rayleigh quotient objective function and is solved as a bi-level alternating minimization procedure. Using the proposed algorithm, the coefficients of kernels and Laplacians can be optimized automatically. Results: Three variants of the algorithm are proposed. The performance is systematically validated on two real-life data fusion applications. The proposed Optimized Kernel Laplacian Clustering (OKLC) algorithms perform significantly better than other methods. Moreover, the coefficients of kernels and Laplacians optimized by OKLC show some correlation with the rank of performance of individual data source. Though in our evaluation the K values are predefined, in practical studies, the optimal cluster number can be consistently estimated from the eigenspectrum of the combined kernel Laplacian matrix. Availability: The MATLAB code of algorithms implemented in this paper is downloadable from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/oklc.html. Contact: shiyu@uchicago.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20980271
Improved fuzzy clustering algorithms in segmentation of DC-enhanced breast MRI.
Kannan, S R; Ramathilagam, S; Devi, Pandiyarajan; Sathya, A
2012-02-01
Segmentation of medical images is a difficult and challenging problem due to poor image contrast and artifacts that result in missing or diffuse organ/tissue boundaries. Many researchers have applied various techniques however fuzzy c-means (FCM) based algorithms is more effective compared to other methods. The objective of this work is to develop some robust fuzzy clustering segmentation systems for effective segmentation of DCE - breast MRI. This paper obtains the robust fuzzy clustering algorithms by incorporating kernel methods, penalty terms, tolerance of the neighborhood attraction, additional entropy term and fuzzy parameters. The initial centers are obtained using initialization algorithm to reduce the computation complexity and running time of proposed algorithms. Experimental works on breast images show that the proposed algorithms are effective to improve the similarity measurement, to handle large amount of noise, to have better results in dealing the data corrupted by noise, and other artifacts. The clustering results of proposed methods are validated using Silhouette Method.
A hybrid algorithm for clustering of time series data based on affinity search technique.
Aghabozorgi, Saeed; Ying Wah, Teh; Herawan, Tutut; Jalab, Hamid A; Shaygan, Mohammad Amin; Jalali, Alireza
2014-01-01
Time series clustering is an important solution to various problems in numerous fields of research, including business, medical science, and finance. However, conventional clustering algorithms are not practical for time series data because they are essentially designed for static data. This impracticality results in poor clustering accuracy in several systems. In this paper, a new hybrid clustering algorithm is proposed based on the similarity in shape of time series data. Time series data are first grouped as subclusters based on similarity in time. The subclusters are then merged using the k-Medoids algorithm based on similarity in shape. This model has two contributions: (1) it is more accurate than other conventional and hybrid approaches and (2) it determines the similarity in shape among time series data with a low complexity. To evaluate the accuracy of the proposed model, the model is tested extensively using syntactic and real-world time series datasets.
Zhang, Junfeng; Chen, Wei; Gao, Mingyi; Shen, Gangxiang
2017-10-30
In this work, we proposed two k-means-clustering-based algorithms to mitigate the fiber nonlinearity for 64-quadrature amplitude modulation (64-QAM) signal, the training-sequence assisted k-means algorithm and the blind k-means algorithm. We experimentally demonstrated the proposed k-means-clustering-based fiber nonlinearity mitigation techniques in 75-Gb/s 64-QAM coherent optical communication system. The proposed algorithms have reduced clustering complexity and low data redundancy and they are able to quickly find appropriate initial centroids and select correctly the centroids of the clusters to obtain the global optimal solutions for large k value. We measured the bit-error-ratio (BER) performance of 64-QAM signal with different launched powers into the 50-km single mode fiber and the proposed techniques can greatly mitigate the signal impairments caused by the amplified spontaneous emission noise and the fiber Kerr nonlinearity and improve the BER performance.
A Hybrid Algorithm for Clustering of Time Series Data Based on Affinity Search Technique
Aghabozorgi, Saeed; Ying Wah, Teh; Herawan, Tutut; Jalab, Hamid A.; Shaygan, Mohammad Amin; Jalali, Alireza
2014-01-01
Time series clustering is an important solution to various problems in numerous fields of research, including business, medical science, and finance. However, conventional clustering algorithms are not practical for time series data because they are essentially designed for static data. This impracticality results in poor clustering accuracy in several systems. In this paper, a new hybrid clustering algorithm is proposed based on the similarity in shape of time series data. Time series data are first grouped as subclusters based on similarity in time. The subclusters are then merged using the k-Medoids algorithm based on similarity in shape. This model has two contributions: (1) it is more accurate than other conventional and hybrid approaches and (2) it determines the similarity in shape among time series data with a low complexity. To evaluate the accuracy of the proposed model, the model is tested extensively using syntactic and real-world time series datasets. PMID:24982966
An Enhanced PSO-Based Clustering Energy Optimization Algorithm for Wireless Sensor Network.
Vimalarani, C; Subramanian, R; Sivanandam, S N
2016-01-01
Wireless Sensor Network (WSN) is a network which formed with a maximum number of sensor nodes which are positioned in an application environment to monitor the physical entities in a target area, for example, temperature monitoring environment, water level, monitoring pressure, and health care, and various military applications. Mostly sensor nodes are equipped with self-supported battery power through which they can perform adequate operations and communication among neighboring nodes. Maximizing the lifetime of the Wireless Sensor networks, energy conservation measures are essential for improving the performance of WSNs. This paper proposes an Enhanced PSO-Based Clustering Energy Optimization (EPSO-CEO) algorithm for Wireless Sensor Network in which clustering and clustering head selection are done by using Particle Swarm Optimization (PSO) algorithm with respect to minimizing the power consumption in WSN. The performance metrics are evaluated and results are compared with competitive clustering algorithm to validate the reduction in energy consumption.
NASA Astrophysics Data System (ADS)
Adya Zizwan, Putra; Zarlis, Muhammad; Budhiarti Nababan, Erna
2017-12-01
The determination of Centroid on K-Means Algorithm directly affects the quality of the clustering results. Determination of centroid by using random numbers has many weaknesses. The GenClust algorithm that combines the use of Genetic Algorithms and K-Means uses a genetic algorithm to determine the centroid of each cluster. The use of the GenClust algorithm uses 50% chromosomes obtained through deterministic calculations and 50% is obtained from the generation of random numbers. This study will modify the use of the GenClust algorithm in which the chromosomes used are 100% obtained through deterministic calculations. The results of this study resulted in performance comparisons expressed in Mean Square Error influenced by centroid determination on K-Means method by using GenClust method, modified GenClust method and also classic K-Means.
Impact of heuristics in clustering large biological networks.
Shafin, Md Kishwar; Kabir, Kazi Lutful; Ridwan, Iffatur; Anannya, Tasmiah Tamzid; Karim, Rashid Saadman; Hoque, Mohammad Mozammel; Rahman, M Sohel
2015-12-01
Traditional clustering algorithms often exhibit poor performance for large networks. On the contrary, greedy algorithms are found to be relatively efficient while uncovering functional modules from large biological networks. The quality of the clusters produced by these greedy techniques largely depends on the underlying heuristics employed. Different heuristics based on different attributes and properties perform differently in terms of the quality of the clusters produced. This motivates us to design new heuristics for clustering large networks. In this paper, we have proposed two new heuristics and analyzed the performance thereof after incorporating those with three different combinations in a recently celebrated greedy clustering algorithm named SPICi. We have extensively analyzed the effectiveness of these new variants. The results are found to be promising. Copyright © 2015 Elsevier Ltd. All rights reserved.
Atom Probe Tomography Analysis of the Distribution of Rhenium in Nickel Alloys
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mottura, A.; Warnken, N; Miller, Michael K
2010-01-01
Atom probe tomography (APT) is used to characterise the distributions of rhenium in a binary Ni-Re alloy and the nickel-based single-crystal CMSX-4 superalloy. A purpose-built algorithm is developed to quantify the size distribution of solute clusters, and applied to the APT datasets to critique the hypothesis that rhenium is prone to the formation of clusters in these systems. No evidence is found to indicate that rhenium forms solute clusters above the level expected from random fluctuations. In CMSX-4, enrichment of Re is detected in the matrix phase close to the matrix/precipitate ({gamma}/{gamma}{prime}) phase boundaries. Phase field modelling indicates that thismore » is due to the migration of the {gamma}/{gamma}{prime} interface during cooling from the temperature of operation. Thus, neither clustering of rhenium nor interface enrichments can be the cause of the enhancement in high temperature mechanical properties conferred by rhenium alloying.« less
Mai, Xiaofeng; Liu, Jie; Wu, Xiong; Zhang, Qun; Guo, Changjian; Yang, Yanfu; Li, Zhaohui
2017-02-06
A Stokes-space modulation format classification (MFC) technique is proposed for coherent optical receivers by using a non-iterative clustering algorithm. In the clustering algorithm, two simple parameters are calculated to help find the density peaks of the data points in Stokes space and no iteration is required. Correct MFC can be realized in numerical simulations among PM-QPSK, PM-8QAM, PM-16QAM, PM-32QAM and PM-64QAM signals within practical optical signal-to-noise ratio (OSNR) ranges. The performance of the proposed MFC algorithm is also compared with those of other schemes based on clustering algorithms. The simulation results show that good classification performance can be achieved using the proposed MFC scheme with moderate time complexity. Proof-of-concept experiments are finally implemented to demonstrate MFC among PM-QPSK/16QAM/64QAM signals, which confirm the feasibility of our proposed MFC scheme.
Optimization of wireless sensor networks based on chicken swarm optimization algorithm
NASA Astrophysics Data System (ADS)
Wang, Qingxi; Zhu, Lihua
2017-05-01
In order to reduce the energy consumption of wireless sensor network and improve the survival time of network, the clustering routing protocol of wireless sensor networks based on chicken swarm optimization algorithm was proposed. On the basis of LEACH agreement, it was improved and perfected that the points on the cluster and the selection of cluster head using the chicken group optimization algorithm, and update the location of chicken which fall into the local optimum by Levy flight, enhance population diversity, ensure the global search capability of the algorithm. The new protocol avoided the die of partial node of intensive using by making balanced use of the network nodes, improved the survival time of wireless sensor network. The simulation experiments proved that the protocol is better than LEACH protocol on energy consumption, also is better than that of clustering routing protocol based on particle swarm optimization algorithm.
Predicting the random drift of MEMS gyroscope based on K-means clustering and OLS RBF Neural Network
NASA Astrophysics Data System (ADS)
Wang, Zhen-yu; Zhang, Li-jie
2017-10-01
Measure error of the sensor can be effectively compensated with prediction. Aiming at large random drift error of MEMS(Micro Electro Mechanical System))gyroscope, an improved learning algorithm of Radial Basis Function(RBF) Neural Network(NN) based on K-means clustering and Orthogonal Least-Squares (OLS) is proposed in this paper. The algorithm selects the typical samples as the initial cluster centers of RBF NN firstly, candidates centers with K-means algorithm secondly, and optimizes the candidate centers with OLS algorithm thirdly, which makes the network structure simpler and makes the prediction performance better. Experimental results show that the proposed K-means clustering OLS learning algorithm can predict the random drift of MEMS gyroscope effectively, the prediction error of which is 9.8019e-007°/s and the prediction time of which is 2.4169e-006s
Composeable Chat over Low-Bandwidth Intermittent Communication Links
2007-04-01
Compression (STC), introduced in this report, is a data compression algorithm intended to compress alphanumeric... Ziv - Lempel coding, the grandfather of most modern general-purpose file compression programs, watches for input symbol sequences that have previously... data . This section applies these techniques to create a new compression algorithm called Small Text Compression . Various sequence compression
A Modified MinMax k-Means Algorithm Based on PSO
2016-01-01
The MinMax k-means algorithm is widely used to tackle the effect of bad initialization by minimizing the maximum intraclustering errors. Two parameters, including the exponent parameter and memory parameter, are involved in the executive process. Since different parameters have different clustering errors, it is crucial to choose appropriate parameters. In the original algorithm, a practical framework is given. Such framework extends the MinMax k-means to automatically adapt the exponent parameter to the data set. It has been believed that if the maximum exponent parameter has been set, then the programme can reach the lowest intraclustering errors. However, our experiments show that this is not always correct. In this paper, we modified the MinMax k-means algorithm by PSO to determine the proper values of parameters which can subject the algorithm to attain the lowest clustering errors. The proposed clustering method is tested on some favorite data sets in several different initial situations and is compared to the k-means algorithm and the original MinMax k-means algorithm. The experimental results indicate that our proposed algorithm can reach the lowest clustering errors automatically. PMID:27656201
A Linked List-Based Algorithm for Blob Detection on Embedded Vision-Based Sensors
Acevedo-Avila, Ricardo; Gonzalez-Mendoza, Miguel; Garcia-Garcia, Andres
2016-01-01
Blob detection is a common task in vision-based applications. Most existing algorithms are aimed at execution on general purpose computers; while very few can be adapted to the computing restrictions present in embedded platforms. This paper focuses on the design of an algorithm capable of real-time blob detection that minimizes system memory consumption. The proposed algorithm detects objects in one image scan; it is based on a linked-list data structure tree used to label blobs depending on their shape and node information. An example application showing the results of a blob detection co-processor has been built on a low-powered field programmable gate array hardware as a step towards developing a smart video surveillance system. The detection method is intended for general purpose application. As such, several test cases focused on character recognition are also examined. The results obtained present a fair trade-off between accuracy and memory requirements; and prove the validity of the proposed approach for real-time implementation on resource-constrained computing platforms. PMID:27240382
NASA Astrophysics Data System (ADS)
Tan, Chao; Chen, Hui; Wang, Chao; Zhu, Wanping; Wu, Tong; Diao, Yuanbo
2013-03-01
Near and mid-infrared (NIR/MIR) spectroscopy techniques have gained great acceptance in the industry due to their multiple applications and versatility. However, a success of application often depends heavily on the construction of accurate and stable calibration models. For this purpose, a simple multi-model fusion strategy is proposed. It is actually the combination of Kohonen self-organizing map (KSOM), mutual information (MI) and partial least squares (PLSs) and therefore named as KMICPLS. It works as follows: First, the original training set is fed into a KSOM for unsupervised clustering of samples, on which a series of training subsets are constructed. Thereafter, on each of the training subsets, a MI spectrum is calculated and only the variables with higher MI values than the mean value are retained, based on which a candidate PLS model is constructed. Finally, a fixed number of PLS models are selected to produce a consensus model. Two NIR/MIR spectral datasets from brewing industry are used for experiments. The results confirms its superior performance to two reference algorithms, i.e., the conventional PLS and genetic algorithm-PLS (GAPLS). It can build more accurate and stable calibration models without increasing the complexity, and can be generalized to other NIR/MIR applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Küchlin, Stephan, E-mail: kuechlin@ifd.mavt.ethz.ch; Jenny, Patrick
2017-01-01
A major challenge for the conventional Direct Simulation Monte Carlo (DSMC) technique lies in the fact that its computational cost becomes prohibitive in the near continuum regime, where the Knudsen number (Kn)—characterizing the degree of rarefaction—becomes small. In contrast, the Fokker–Planck (FP) based particle Monte Carlo scheme allows for computationally efficient simulations of rarefied gas flows in the low and intermediate Kn regime. The Fokker–Planck collision operator—instead of performing binary collisions employed by the DSMC method—integrates continuous stochastic processes for the phase space evolution in time. This allows for time step and grid cell sizes larger than the respective collisionalmore » scales required by DSMC. Dynamically switching between the FP and the DSMC collision operators in each computational cell is the basis of the combined FP-DSMC method, which has been proven successful in simulating flows covering the whole Kn range. Until recently, this algorithm had only been applied to two-dimensional test cases. In this contribution, we present the first general purpose implementation of the combined FP-DSMC method. Utilizing both shared- and distributed-memory parallelization, this implementation provides the capability for simulations involving many particles and complex geometries by exploiting state of the art computer cluster technologies.« less
NASA Astrophysics Data System (ADS)
Madsen, Niels Kristian; Godtliebsen, Ian H.; Losilla, Sergio A.; Christiansen, Ove
2018-01-01
A new implementation of vibrational coupled-cluster (VCC) theory is presented, where all amplitude tensors are represented in the canonical polyadic (CP) format. The CP-VCC algorithm solves the non-linear VCC equations without ever constructing the amplitudes or error vectors in full dimension but still formally includes the full parameter space of the VCC[n] model in question resulting in the same vibrational energies as the conventional method. In a previous publication, we have described the non-linear-equation solver for CP-VCC calculations. In this work, we discuss the general algorithm for evaluating VCC error vectors in CP format including the rank-reduction methods used during the summation of the many terms in the VCC amplitude equations. Benchmark calculations for studying the computational scaling and memory usage of the CP-VCC algorithm are performed on a set of molecules including thiadiazole and an array of polycyclic aromatic hydrocarbons. The results show that the reduced scaling and memory requirements of the CP-VCC algorithm allows for performing high-order VCC calculations on systems with up to 66 vibrational modes (anthracene), which indeed are not possible using the conventional VCC method. This paves the way for obtaining highly accurate vibrational spectra and properties of larger molecules.
Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario
2014-01-01
Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of data, type of variables, and purpose of the analysis. Different measurement scales are studied in details and statistical comparison, modeling, and data mining methods are studied based upon using several medical examples. We have presented two ordinal–variables clustering examples, as more challenging variable in analysis, using Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance is granted. Moreover, descriptive and inferential statistics in addition to modeling approach must be selected based on the scale of the variables. PMID:24672565
A Comparative Evaluation of Anomaly Detection Algorithms for Maritime Video Surveillance
2011-01-01
of k-means clustering and the k- NN Localized p-value Estimator ( KNN -LPE). K-means is a popular distance-based clustering algorithm while KNN -LPE...implemented the sparse cluster identification rule we described in Section 3.1. 2. k-NN Localized p-value Estimator ( KNN -LPE): We implemented this using...Average Density ( KNN -NAD): This was implemented as described in Section 3.4. Algorithm Parameter Settings The global and local density-based anomaly
NASA Astrophysics Data System (ADS)
Chang, Bingguo; Chen, Xiaofei
2018-05-01
Ultrasonography is an important examination for the diagnosis of chronic liver disease. The doctor gives the liver indicators and suggests the patient's condition according to the description of ultrasound report. With the rapid increase in the amount of data of ultrasound report, the workload of professional physician to manually distinguish ultrasound results significantly increases. In this paper, we use the spectral clustering method to cluster analysis of the description of the ultrasound report, and automatically generate the ultrasonic diagnostic diagnosis by machine learning. 110 groups ultrasound examination report of chronic liver disease were selected as test samples in this experiment, and the results were validated by spectral clustering and compared with k-means clustering algorithm. The results show that the accuracy of spectral clustering is 92.73%, which is higher than that of k-means clustering algorithm, which provides a powerful ultrasound-assisted diagnosis for patients with chronic liver disease.
Jiang, Peng; Xu, Yiming; Wu, Feng
2016-01-01
Existing move-restricted node self-deployment algorithms are based on a fixed node communication radius, evaluate the performance based on network coverage or the connectivity rate and do not consider the number of nodes near the sink node and the energy consumption distribution of the network topology, thereby degrading network reliability and the energy consumption balance. Therefore, we propose a distributed underwater node self-deployment algorithm. First, each node begins the uneven clustering based on the distance on the water surface. Each cluster head node selects its next-hop node to synchronously construct a connected path to the sink node. Second, the cluster head node adjusts its depth while maintaining the layout formed by the uneven clustering and then adjusts the positions of in-cluster nodes. The algorithm originally considers the network reliability and energy consumption balance during node deployment and considers the coverage redundancy rate of all positions that a node may reach during the node position adjustment. Simulation results show, compared to the connected dominating set (CDS) based depth computation algorithm, that the proposed algorithm can increase the number of the nodes near the sink node and improve network reliability while guaranteeing the network connectivity rate. Moreover, it can balance energy consumption during network operation, further improve network coverage rate and reduce energy consumption. PMID:26784193
Orbit Clustering Based on Transfer Cost
NASA Technical Reports Server (NTRS)
Gustafson, Eric D.; Arrieta-Camacho, Juan J.; Petropoulos, Anastassios E.
2013-01-01
We propose using cluster analysis to perform quick screening for combinatorial global optimization problems. The key missing component currently preventing cluster analysis from use in this context is the lack of a useable metric function that defines the cost to transfer between two orbits. We study several proposed metrics and clustering algorithms, including k-means and the expectation maximization algorithm. We also show that proven heuristic methods such as the Q-law can be modified to work with cluster analysis.
Perualila-Tan, Nolen Joy; Shkedy, Ziv; Talloen, Willem; Göhlmann, Hinrich W H; Moerbeke, Marijke Van; Kasim, Adetayo
2016-08-01
The modern process of discovering candidate molecules in early drug discovery phase includes a wide range of approaches to extract vital information from the intersection of biology and chemistry. A typical strategy in compound selection involves compound clustering based on chemical similarity to obtain representative chemically diverse compounds (not incorporating potency information). In this paper, we propose an integrative clustering approach that makes use of both biological (compound efficacy) and chemical (structural features) data sources for the purpose of discovering a subset of compounds with aligned structural and biological properties. The datasets are integrated at the similarity level by assigning complementary weights to produce a weighted similarity matrix, serving as a generic input in any clustering algorithm. This new analysis work flow is semi-supervised method since, after the determination of clusters, a secondary analysis is performed wherein it finds differentially expressed genes associated to the derived integrated cluster(s) to further explain the compound-induced biological effects inside the cell. In this paper, datasets from two drug development oncology projects are used to illustrate the usefulness of the weighted similarity-based clustering approach to integrate multi-source high-dimensional information to aid drug discovery. Compounds that are structurally and biologically similar to the reference compounds are discovered using this proposed integrative approach.
Clustering by reordering of similarity and Laplacian matrices: Application to galaxy clusters
NASA Astrophysics Data System (ADS)
Mahmoud, E.; Shoukry, A.; Takey, A.
2018-04-01
Similarity metrics, kernels and similarity-based algorithms have gained much attention due to their increasing applications in information retrieval, data mining, pattern recognition and machine learning. Similarity Graphs are often adopted as the underlying representation of similarity matrices and are at the origin of known clustering algorithms such as spectral clustering. Similarity matrices offer the advantage of working in object-object (two-dimensional) space where visualization of clusters similarities is available instead of object-features (multi-dimensional) space. In this paper, sparse ɛ-similarity graphs are constructed and decomposed into strong components using appropriate methods such as Dulmage-Mendelsohn permutation (DMperm) and/or Reverse Cuthill-McKee (RCM) algorithms. The obtained strong components correspond to groups (clusters) in the input (feature) space. Parameter ɛi is estimated locally, at each data point i from a corresponding narrow range of the number of nearest neighbors. Although more advanced clustering techniques are available, our method has the advantages of simplicity, better complexity and direct visualization of the clusters similarities in a two-dimensional space. Also, no prior information about the number of clusters is needed. We conducted our experiments on two and three dimensional, low and high-sized synthetic datasets as well as on an astronomical real-dataset. The results are verified graphically and analyzed using gap statistics over a range of neighbors to verify the robustness of the algorithm and the stability of the results. Combining the proposed algorithm with gap statistics provides a promising tool for solving clustering problems. An astronomical application is conducted for confirming the existence of 45 galaxy clusters around the X-ray positions of galaxy clusters in the redshift range [0.1..0.8]. We re-estimate the photometric redshifts of the identified galaxy clusters and obtain acceptable values compared to published spectroscopic redshifts with a 0.029 standard deviation of their differences.
Improving clustering with metabolic pathway data.
Milone, Diego H; Stegmayer, Georgina; López, Mariana; Kamenetzky, Laura; Carrari, Fernando
2014-04-10
It is a common practice in bioinformatics to validate each group returned by a clustering algorithm through manual analysis, according to a-priori biological knowledge. This procedure helps finding functionally related patterns to propose hypotheses for their behavior and the biological processes involved. Therefore, this knowledge is used only as a second step, after data are just clustered according to their expression patterns. Thus, it could be very useful to be able to improve the clustering of biological data by incorporating prior knowledge into the cluster formation itself, in order to enhance the biological value of the clusters. A novel training algorithm for clustering is presented, which evaluates the biological internal connections of the data points while the clusters are being formed. Within this training algorithm, the calculation of distances among data points and neurons centroids includes a new term based on information from well-known metabolic pathways. The standard self-organizing map (SOM) training versus the biologically-inspired SOM (bSOM) training were tested with two real data sets of transcripts and metabolites from Solanum lycopersicum and Arabidopsis thaliana species. Classical data mining validation measures were used to evaluate the clustering solutions obtained by both algorithms. Moreover, a new measure that takes into account the biological connectivity of the clusters was applied. The results of bSOM show important improvements in the convergence and performance for the proposed clustering method in comparison to standard SOM training, in particular, from the application point of view. Analyses of the clusters obtained with bSOM indicate that including biological information during training can certainly increase the biological value of the clusters found with the proposed method. It is worth to highlight that this fact has effectively improved the results, which can simplify their further analysis.The algorithm is available as a web-demo at http://fich.unl.edu.ar/sinc/web-demo/bsom-lite/. The source code and the data sets supporting the results of this article are available at http://sourceforge.net/projects/sourcesinc/files/bsom.
NASA Astrophysics Data System (ADS)
Chen, Xiao; Li, Yaan; Yu, Jing; Li, Yuxing
2018-01-01
For fast and more effective implementation of tracking multiple targets in a cluttered environment, we propose a multiple targets tracking (MTT) algorithm called maximum entropy fuzzy c-means clustering joint probabilistic data association that combines fuzzy c-means clustering and the joint probabilistic data association (PDA) algorithm. The algorithm uses the membership value to express the probability of the target originating from measurement. The membership value is obtained through fuzzy c-means clustering objective function optimized by the maximum entropy principle. When considering the effect of the public measurement, we use a correction factor to adjust the association probability matrix to estimate the state of the target. As this algorithm avoids confirmation matrix splitting, it can solve the high computational load problem of the joint PDA algorithm. The results of simulations and analysis conducted for tracking neighbor parallel targets and cross targets in a different density cluttered environment show that the proposed algorithm can realize MTT quickly and efficiently in a cluttered environment. Further, the performance of the proposed algorithm remains constant with increasing process noise variance. The proposed algorithm has the advantages of efficiency and low computational load, which can ensure optimum performance when tracking multiple targets in a dense cluttered environment.
A similarity based agglomerative clustering algorithm in networks
NASA Astrophysics Data System (ADS)
Liu, Zhiyuan; Wang, Xiujuan; Ma, Yinghong
2018-04-01
The detection of clusters is benefit for understanding the organizations and functions of networks. Clusters, or communities, are usually groups of nodes densely interconnected but sparsely linked with any other clusters. To identify communities, an efficient and effective community agglomerative algorithm based on node similarity is proposed. The proposed method initially calculates similarities between each pair of nodes, and form pre-partitions according to the principle that each node is in the same community as its most similar neighbor. After that, check each partition whether it satisfies community criterion. For the pre-partitions who do not satisfy, incorporate them with others that having the biggest attraction until there are no changes. To measure the attraction ability of a partition, we propose an attraction index that based on the linked node's importance in networks. Therefore, our proposed method can better exploit the nodes' properties and network's structure. To test the performance of our algorithm, both synthetic and empirical networks ranging in different scales are tested. Simulation results show that the proposed algorithm can obtain superior clustering results compared with six other widely used community detection algorithms.
Lin, Nan; Jiang, Junhai; Guo, Shicheng; Xiong, Momiao
2015-01-01
Due to the advancement in sensor technology, the growing large medical image data have the ability to visualize the anatomical changes in biological tissues. As a consequence, the medical images have the potential to enhance the diagnosis of disease, the prediction of clinical outcomes and the characterization of disease progression. But in the meantime, the growing data dimensions pose great methodological and computational challenges for the representation and selection of features in image cluster analysis. To address these challenges, we first extend the functional principal component analysis (FPCA) from one dimension to two dimensions to fully capture the space variation of image the signals. The image signals contain a large number of redundant features which provide no additional information for clustering analysis. The widely used methods for removing the irrelevant features are sparse clustering algorithms using a lasso-type penalty to select the features. However, the accuracy of clustering using a lasso-type penalty depends on the selection of the penalty parameters and the threshold value. In practice, they are difficult to determine. Recently, randomized algorithms have received a great deal of attentions in big data analysis. This paper presents a randomized algorithm for accurate feature selection in image clustering analysis. The proposed method is applied to both the liver and kidney cancer histology image data from the TCGA database. The results demonstrate that the randomized feature selection method coupled with functional principal component analysis substantially outperforms the current sparse clustering algorithms in image cluster analysis. PMID:26196383
Clustering approaches to identifying gene expression patterns from DNA microarray data.
Do, Jin Hwan; Choi, Dong-Kug
2008-04-30
The analysis of microarray data is essential for large amounts of gene expression data. In this review we focus on clustering techniques. The biological rationale for this approach is the fact that many co-expressed genes are co-regulated, and identifying co-expressed genes could aid in functional annotation of novel genes, de novo identification of transcription factor binding sites and elucidation of complex biological pathways. Co-expressed genes are usually identified in microarray experiments by clustering techniques. There are many such methods, and the results obtained even for the same datasets may vary considerably depending on the algorithms and metrics for dissimilarity measures used, as well as on user-selectable parameters such as desired number of clusters and initial values. Therefore, biologists who want to interpret microarray data should be aware of the weakness and strengths of the clustering methods used. In this review, we survey the basic principles of clustering of DNA microarray data from crisp clustering algorithms such as hierarchical clustering, K-means and self-organizing maps, to complex clustering algorithms like fuzzy clustering.
Identification of chronic rhinosinusitis phenotypes using cluster analysis.
Soler, Zachary M; Hyer, J Madison; Ramakrishnan, Viswanathan; Smith, Timothy L; Mace, Jess; Rudmik, Luke; Schlosser, Rodney J
2015-05-01
Current clinical classifications of chronic rhinosinusitis (CRS) have been largely defined based upon preconceived notions of factors thought to be important, such as polyp or eosinophil status. Unfortunately, these classification systems have little correlation with symptom severity or treatment outcomes. Unsupervised clustering can be used to identify phenotypic subgroups of CRS patients, describe clinical differences in these clusters and define simple algorithms for classification. A multi-institutional, prospective study of 382 patients with CRS who had failed initial medical therapy completed the Sino-Nasal Outcome Test (SNOT-22), Rhinosinusitis Disability Index (RSDI), Medical Outcomes Study Short Form-12 (SF-12), Pittsburgh Sleep Quality Index (PSQI), and Patient Health Questionnaire (PHQ-2). Objective measures of CRS severity included Brief Smell Identification Test (B-SIT), CT, and endoscopy scoring. All variables were reduced and unsupervised hierarchical clustering was performed. After clusters were defined, variations in medication usage were analyzed. Discriminant analysis was performed to develop a simplified, clinically useful algorithm for clustering. Clustering was largely determined by age, severity of patient reported outcome measures, depression, and fibromyalgia. CT and endoscopy varied somewhat among clusters. Traditional clinical measures, including polyp/atopic status, prior surgery, B-SIT and asthma, did not vary among clusters. A simplified algorithm based upon productivity loss, SNOT-22 score, and age predicted clustering with 89% accuracy. Medication usage among clusters did vary significantly. A simplified algorithm based upon hierarchical clustering is able to classify CRS patients and predict medication usage. Further studies are warranted to determine if such clustering predicts treatment outcomes. © 2015 ARS-AAOA, LLC.
Quantum annealing for combinatorial clustering
NASA Astrophysics Data System (ADS)
Kumar, Vaibhaw; Bass, Gideon; Tomlin, Casey; Dulny, Joseph
2018-02-01
Clustering is a powerful machine learning technique that groups "similar" data points based on their characteristics. Many clustering algorithms work by approximating the minimization of an objective function, namely the sum of within-the-cluster distances between points. The straightforward approach involves examining all the possible assignments of points to each of the clusters. This approach guarantees the solution will be a global minimum; however, the number of possible assignments scales quickly with the number of data points and becomes computationally intractable even for very small datasets. In order to circumvent this issue, cost function minima are found using popular local search-based heuristic approaches such as k-means and hierarchical clustering. Due to their greedy nature, such techniques do not guarantee that a global minimum will be found and can lead to sub-optimal clustering assignments. Other classes of global search-based techniques, such as simulated annealing, tabu search, and genetic algorithms, may offer better quality results but can be too time-consuming to implement. In this work, we describe how quantum annealing can be used to carry out clustering. We map the clustering objective to a quadratic binary optimization problem and discuss two clustering algorithms which are then implemented on commercially available quantum annealing hardware, as well as on a purely classical solver "qbsolv." The first algorithm assigns N data points to K clusters, and the second one can be used to perform binary clustering in a hierarchical manner. We present our results in the form of benchmarks against well-known k-means clustering and discuss the advantages and disadvantages of the proposed techniques.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mackey, Lester; Nachman, Benjamin; Schwartzman, Ariel
Collimated streams of particles produced in high energy physics experiments are organized using clustering algorithms to form jets . To construct jets, the experimental collaborations based at the Large Hadron Collider (LHC) primarily use agglomerative hierarchical clustering schemes known as sequential recombination. We propose a new class of algorithms for clustering jets that use infrared and collinear safe mixture models. These new algorithms, known as fuzzy jets , are clustered using maximum likelihood techniques and can dynamically determine various properties of jets like their size. We show that the fuzzy jet size adds additional information to conventional jet tagging variablesmore » in boosted topologies. Furthermore, we study the impact of pileup and show that with some slight modifications to the algorithm, fuzzy jets can be stable up to high pileup interaction multiplicities.« less
Mackey, Lester; Nachman, Benjamin; Schwartzman, Ariel; ...
2016-06-01
Collimated streams of particles produced in high energy physics experiments are organized using clustering algorithms to form jets . To construct jets, the experimental collaborations based at the Large Hadron Collider (LHC) primarily use agglomerative hierarchical clustering schemes known as sequential recombination. We propose a new class of algorithms for clustering jets that use infrared and collinear safe mixture models. These new algorithms, known as fuzzy jets , are clustered using maximum likelihood techniques and can dynamically determine various properties of jets like their size. We show that the fuzzy jet size adds additional information to conventional jet tagging variablesmore » in boosted topologies. Furthermore, we study the impact of pileup and show that with some slight modifications to the algorithm, fuzzy jets can be stable up to high pileup interaction multiplicities.« less
Datta, Somnath; Nevalainen, Jaakko; Oja, Hannu
2012-01-01
SUMMARY Rank based tests are alternatives to likelihood based tests popularized by their relative robustness and underlying elegant mathematical theory. There has been a serge in research activities in this area in recent years since a number of researchers are working to develop and extend rank based procedures to clustered dependent data which include situations with known correlation structures (e.g., as in mixed effects models) as well as more general form of dependence. The purpose of this paper is to test the symmetry of a marginal distribution under clustered data. However, unlike most other papers in the area, we consider the possibility that the cluster size is a random variable whose distribution is dependent on the distribution of the variable of interest within a cluster. This situation typically arises when the clusters are defined in a natural way (e.g., not controlled by the experimenter or statistician) and in which the size of the cluster may carry information about the distribution of data values within a cluster. Under the scenario of an informative cluster size, attempts to use some form of variance adjusted sign or signed rank tests would fail since they would not maintain the correct size under the distribution of marginal symmetry. To overcome this difficulty Datta and Satten (2008; Biometrics, 64, 501–507) proposed a Wilcoxon type signed rank test based on the principle of within cluster resampling. In this paper we study this problem in more generality by introducing a class of valid tests employing a general score function. Asymptotic null distribution of these tests is obtained. A simulation study shows that a more general choice of the score function can sometimes result in greater power than the Datta and Satten test; furthermore, this development offers the user a wider choice. We illustrate our tests using a real data example on spinal cord injury patients. PMID:23074359
Datta, Somnath; Nevalainen, Jaakko; Oja, Hannu
2012-09-01
Rank based tests are alternatives to likelihood based tests popularized by their relative robustness and underlying elegant mathematical theory. There has been a serge in research activities in this area in recent years since a number of researchers are working to develop and extend rank based procedures to clustered dependent data which include situations with known correlation structures (e.g., as in mixed effects models) as well as more general form of dependence.The purpose of this paper is to test the symmetry of a marginal distribution under clustered data. However, unlike most other papers in the area, we consider the possibility that the cluster size is a random variable whose distribution is dependent on the distribution of the variable of interest within a cluster. This situation typically arises when the clusters are defined in a natural way (e.g., not controlled by the experimenter or statistician) and in which the size of the cluster may carry information about the distribution of data values within a cluster.Under the scenario of an informative cluster size, attempts to use some form of variance adjusted sign or signed rank tests would fail since they would not maintain the correct size under the distribution of marginal symmetry. To overcome this difficulty Datta and Satten (2008; Biometrics, 64, 501-507) proposed a Wilcoxon type signed rank test based on the principle of within cluster resampling. In this paper we study this problem in more generality by introducing a class of valid tests employing a general score function. Asymptotic null distribution of these tests is obtained. A simulation study shows that a more general choice of the score function can sometimes result in greater power than the Datta and Satten test; furthermore, this development offers the user a wider choice. We illustrate our tests using a real data example on spinal cord injury patients.
Search automation of the generalized method of device operational characteristics improvement
NASA Astrophysics Data System (ADS)
Petrova, I. Yu; Puchkova, A. A.; Zaripova, V. M.
2017-01-01
The article presents brief results of analysis of existing search methods of the closest patents, which can be applied to determine generalized methods of device operational characteristics improvement. There were observed the most widespread clustering algorithms and metrics for determining the proximity degree between two documents. The article proposes the technique of generalized methods determination; it has two implementation variants and consists of 7 steps. This technique has been implemented in the “Patents search” subsystem of the “Intellect” system. Also the article gives an example of the use of the proposed technique.
Blooming Trees: Substructures and Surrounding Groups of Galaxy Clusters
NASA Astrophysics Data System (ADS)
Yu, Heng; Diaferio, Antonaldo; Serra, Ana Laura; Baldi, Marco
2018-06-01
We develop the Blooming Tree Algorithm, a new technique that uses spectroscopic redshift data alone to identify the substructures and the surrounding groups of galaxy clusters, along with their member galaxies. Based on the estimated binding energy of galaxy pairs, the algorithm builds a binary tree that hierarchically arranges all of the galaxies in the field of view. The algorithm searches for buds, corresponding to gravitational potential minima on the binary tree branches; for each bud, the algorithm combines the number of galaxies, their velocity dispersion, and their average pairwise distance into a parameter that discriminates between the buds that do not correspond to any substructure or group, and thus eventually die, and the buds that correspond to substructures and groups, and thus bloom into the identified structures. We test our new algorithm with a sample of 300 mock redshift surveys of clusters in different dynamical states; the clusters are extracted from a large cosmological N-body simulation of a ΛCDM model. We limit our analysis to substructures and surrounding groups identified in the simulation with mass larger than 1013 h ‑1 M ⊙. With mock redshift surveys with 200 galaxies within 6 h ‑1 Mpc from the cluster center, the technique recovers 80% of the real substructures and 60% of the surrounding groups; in 57% of the identified structures, at least 60% of the member galaxies of the substructures and groups belong to the same real structure. These results improve by roughly a factor of two the performance of the best substructure identification algorithm currently available, the σ plateau algorithm, and suggest that our Blooming Tree Algorithm can be an invaluable tool for detecting substructures of galaxy clusters and investigating their complex dynamics.
Weighted community detection and data clustering using message passing
NASA Astrophysics Data System (ADS)
Shi, Cheng; Liu, Yanchen; Zhang, Pan
2018-03-01
Grouping objects into clusters based on the similarities or weights between them is one of the most important problems in science and engineering. In this work, by extending message-passing algorithms and spectral algorithms proposed for an unweighted community detection problem, we develop a non-parametric method based on statistical physics, by mapping the problem to the Potts model at the critical temperature of spin-glass transition and applying belief propagation to solve the marginals corresponding to the Boltzmann distribution. Our algorithm is robust to over-fitting and gives a principled way to determine whether there are significant clusters in the data and how many clusters there are. We apply our method to different clustering tasks. In the community detection problem in weighted and directed networks, we show that our algorithm significantly outperforms existing algorithms. In the clustering problem, where the data were generated by mixture models in the sparse regime, we show that our method works all the way down to the theoretical limit of detectability and gives accuracy very close to that of the optimal Bayesian inference. In the semi-supervised clustering problem, our method only needs several labels to work perfectly in classic datasets. Finally, we further develop Thouless-Anderson-Palmer equations which heavily reduce the computation complexity in dense networks but give almost the same performance as belief propagation.
NASA Astrophysics Data System (ADS)
Heidari, A. A.; Moayedi, A.; Abbaspour, R. Ali
2017-09-01
Automated fare collection (AFC) systems are regarded as valuable resources for public transport planners. In this paper, the AFC data are utilized to analysis and extract mobility patterns in a public transportation system. For this purpose, the smart card data are inserted into a proposed metaheuristic-based aggregation model and then converted to O-D matrix between stops, since the size of O-D matrices makes it difficult to reproduce the measured passenger flows precisely. The proposed strategy is applied to a case study from Haaglanden, Netherlands. In this research, moth-flame optimizer (MFO) is utilized and evaluated for the first time as a new metaheuristic algorithm (MA) in estimating transit origin-destination matrices. The MFO is a novel, efficient swarm-based MA inspired from the celestial navigation of moth insects in nature. To investigate the capabilities of the proposed MFO-based approach, it is compared to methods that utilize the K-means algorithm, gray wolf optimization algorithm (GWO) and genetic algorithm (GA). The sum of the intra-cluster distances and computational time of operations are considered as the evaluation criteria to assess the efficacy of the optimizers. The optimality of solutions of different algorithms is measured in detail. The traveler's behavior is analyzed to achieve to a smooth and optimized transport system. The results reveal that the proposed MFO-based aggregation strategy can outperform other evaluated approaches in terms of convergence tendency and optimality of the results. The results show that it can be utilized as an efficient approach to estimating the transit O-D matrices.
Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences
NASA Technical Reports Server (NTRS)
Budalakoti, Suratna; Srivastava, Ashok N.; Akella, Ram; Turkov, Eugene
2006-01-01
This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach taken uses unsupervised clustering of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by detailed analysis of outliers to detect anomalies. As the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms, such as the Hunt-Szymanski algorithm, that have low time-complexity. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. The algorithms provide a coherent description to an analyst of the anomalies in the sequence, compared to more normal sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection.
NASA Astrophysics Data System (ADS)
Talingdan, J. A.; Trinidad, J. T., Jr.; Palaoag, T. D.
2018-03-01
Alternative Learning System (ALS) is a subsystem of Depatment of Education (DepEd) that serves as an option of learners who cannot afford to go in a formal education. The research focuses on the data exploration and analysis of ALS accreditation and equivalency test result using data mining. The ALS 2014 to 2016 A & E test results in the secondary level were used as data sets in the study. The A & E test results revealed that the passing rate is doubled per year. The results were clustered using k- means clustering algorithm and they were grouped into good, medium, and low standard learners to identify students need exceptional stuff for enhancement. From the clustered data, it was found out that the strand they are weak in is strand 4 which is the Development of Self and a Sense of Community with a general average of 84.23. It also revealed that the essay type of exam got the lowest score with a general average of 2.14 compared to the multiple type of exam that covers the five learning strands. Furthermore, decision tree and naive bayes were also employed in the study to predict the performance of the learners in the A & E test and determine which is better to use for prediction. It was concluded that naive bayes performs better because the accuracy rate is higher than the decision tree algorithm.
Coburn, T.C.; Freeman, P.A.; Attanasi, E.D.
2012-01-01
The primary objectives of this research were to (1) investigate empirical methods for establishing regional trends in unconventional gas resources as exhibited by historical production data and (2) determine whether or not incorporating additional knowledge of a regional trend in a suite of previously established local nonparametric resource prediction algorithms influences assessment results. Three different trend detection methods were applied to publicly available production data (well EUR aggregated to 80-acre cells) from the Devonian Antrim Shale gas play in the Michigan Basin. This effort led to the identification of a southeast-northwest trend in cell EUR values across the play that, in a very general sense, conforms to the primary fracture and structural orientations of the province. However, including this trend in the resource prediction algorithms did not lead to improved results. Further analysis indicated the existence of clustering among cell EUR values that likely dampens the contribution of the regional trend. The reason for the clustering, a somewhat unexpected result, is not completely understood, although the geological literature provides some possible explanations. With appropriate data, a better understanding of this clustering phenomenon may lead to important information about the factors and their interactions that control Antrim Shale gas production, which may, in turn, help establish a more general protocol for better estimating resources in this and other shale gas plays. ?? 2011 International Association for Mathematical Geology (outside the USA).
Gaur, Pallavi; Chaturvedi, Anoop
2017-07-22
The clustering pattern and motifs give immense information about any biological data. An application of machine learning algorithms for clustering and candidate motif detection in miRNAs derived from exosomes is depicted in this paper. Recent progress in the field of exosome research and more particularly regarding exosomal miRNAs has led much bioinformatic-based research to come into existence. The information on clustering pattern and candidate motifs in miRNAs of exosomal origin would help in analyzing existing, as well as newly discovered miRNAs within exosomes. Along with obtaining clustering pattern and candidate motifs in exosomal miRNAs, this work also elaborates the usefulness of the machine learning algorithms that can be efficiently used and executed on various programming languages/platforms. Data were clustered and sequence candidate motifs were detected successfully. The results were compared and validated with some available web tools such as 'BLASTN' and 'MEME suite'. The machine learning algorithms for aforementioned objectives were applied successfully. This work elaborated utility of machine learning algorithms and language platforms to achieve the tasks of clustering and candidate motif detection in exosomal miRNAs. With the information on mentioned objectives, deeper insight would be gained for analyses of newly discovered miRNAs in exosomes which are considered to be circulating biomarkers. In addition, the execution of machine learning algorithms on various language platforms gives more flexibility to users to try multiple iterations according to their requirements. This approach can be applied to other biological data-mining tasks as well.
Neji, Radhouène; Besbes, Ahmed; Komodakis, Nikos; Deux, Jean-François; Maatouk, Mezri; Rahmouni, Alain; Bassez, Guillaume; Fleury, Gilles; Paragios, Nikos
2009-01-01
In this paper, we present a manifold clustering method fo the classification of fibers obtained from diffusion tensor images (DTI) of the human skeletal muscle. Using a linear programming formulation of prototype-based clustering, we propose a novel fiber classification algorithm over manifolds that circumvents the necessity to embed the data in low dimensional spaces and determines automatically the number of clusters. Furthermore, we propose the use of angular Hilbertian metrics between multivariate normal distributions to define a family of distances between tensors that we generalize to fibers. These metrics are used to approximate the geodesic distances over the fiber manifold. We also discuss the case where only geodesic distances to a reduced set of landmark fibers are available. The experimental validation of the method is done using a manually annotated significant dataset of DTI of the calf muscle for healthy and diseased subjects.
Wehmeyer, Christoph; Falk von Rudorff, Guido; Wolf, Sebastian; Kabbe, Gabriel; Schärf, Daniel; Kühne, Thomas D; Sebastiani, Daniel
2012-11-21
We present a stochastic, swarm intelligence-based optimization algorithm for the prediction of global minima on potential energy surfaces of molecular cluster structures. Our optimization approach is a modification of the artificial bee colony (ABC) algorithm which is inspired by the foraging behavior of honey bees. We apply our modified ABC algorithm to the problem of global geometry optimization of molecular cluster structures and show its performance for clusters with 2-57 particles and different interatomic interaction potentials.
NASA Astrophysics Data System (ADS)
Wehmeyer, Christoph; Falk von Rudorff, Guido; Wolf, Sebastian; Kabbe, Gabriel; Schärf, Daniel; Kühne, Thomas D.; Sebastiani, Daniel
2012-11-01
We present a stochastic, swarm intelligence-based optimization algorithm for the prediction of global minima on potential energy surfaces of molecular cluster structures. Our optimization approach is a modification of the artificial bee colony (ABC) algorithm which is inspired by the foraging behavior of honey bees. We apply our modified ABC algorithm to the problem of global geometry optimization of molecular cluster structures and show its performance for clusters with 2-57 particles and different interatomic interaction potentials.
An adaptive tracker for ShipIR/NTCS
NASA Astrophysics Data System (ADS)
Ramaswamy, Srinivasan; Vaitekunas, David A.
2015-05-01
A key component in any image-based tracking system is the adaptive tracking algorithm used to segment the image into potential targets, rank-and-select the best candidate target, and the gating of the selected target to further improve tracker performance. This paper will describe a new adaptive tracker algorithm added to the naval threat countermeasure simulator (NTCS) of the NATO-standard ship signature model (ShipIR). The new adaptive tracking algorithm is an optional feature used with any of the existing internal NTCS or user-defined seeker algorithms (e.g., binary centroid, intensity centroid, and threshold intensity centroid). The algorithm segments the detected pixels into clusters, and the smallest set of clusters that meet the detection criterion is obtained by using a knapsack algorithm to identify the set of clusters that should not be used. The rectangular area containing the chosen clusters defines an inner boundary, from which a weighted centroid is calculated as the aim-point. A track-gate is then positioned around the clusters, taking into account the rate of change of the bounding area and compensating for any gimbal displacement. A sequence of scenarios is used to test the new tracking algorithm on a generic unclassified DDG ShipIR model, with and without flares, and demonstrate how some of the key seeker signals are impacted by both the ship and flare intrinsic signatures.
Xue, Zhong; Shen, Dinggang; Li, Hai; Wong, Stephen
2010-01-01
The traditional fuzzy clustering algorithm and its extensions have been successfully applied in medical image segmentation. However, because of the variability of tissues and anatomical structures, the clustering results might be biased by the tissue population and intensity differences. For example, clustering-based algorithms tend to over-segment white matter tissues of MR brain images. To solve this problem, we introduce a tissue probability map constrained clustering algorithm and apply it to serial MR brain image segmentation, i.e., a series of 3-D MR brain images of the same subject at different time points. Using the new serial image segmentation algorithm in the framework of the CLASSIC framework, which iteratively segments the images and estimates the longitudinal deformations, we improved both accuracy and robustness for serial image computing, and at the mean time produced longitudinally consistent segmentation and stable measures. In the algorithm, the tissue probability maps consist of both the population-based and subject-specific segmentation priors. Experimental study using both simulated longitudinal MR brain data and the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data confirmed that using both priors more accurate and robust segmentation results can be obtained. The proposed algorithm can be applied in longitudinal follow up studies of MR brain imaging with subtle morphological changes for neurological disorders. PMID:26566399
Mining subspace clusters from DNA microarray data using large itemset techniques.
Chang, Ye-In; Chen, Jiun-Rung; Tsai, Yueh-Chi
2009-05-01
Mining subspace clusters from the DNA microarrays could help researchers identify those genes which commonly contribute to a disease, where a subspace cluster indicates a subset of genes whose expression levels are similar under a subset of conditions. Since in a DNA microarray, the number of genes is far larger than the number of conditions, those previous proposed algorithms which compute the maximum dimension sets (MDSs) for any two genes will take a long time to mine subspace clusters. In this article, we propose the Large Itemset-Based Clustering (LISC) algorithm for mining subspace clusters. Instead of constructing MDSs for any two genes, we construct only MDSs for any two conditions. Then, we transform the task of finding the maximal possible gene sets into the problem of mining large itemsets from the condition-pair MDSs. Since we are only interested in those subspace clusters with gene sets as large as possible, it is desirable to pay attention to those gene sets which have reasonable large support values in the condition-pair MDSs. From our simulation results, we show that the proposed algorithm needs shorter processing time than those previous proposed algorithms which need to construct gene-pair MDSs.
NASA Astrophysics Data System (ADS)
Nguyen, Sy Dzung; Nguyen, Quoc Hung; Choi, Seung-Bok
2015-01-01
This paper presents a new algorithm for building an adaptive neuro-fuzzy inference system (ANFIS) from a training data set called B-ANFIS. In order to increase accuracy of the model, the following issues are executed. Firstly, a data merging rule is proposed to build and perform a data-clustering strategy. Subsequently, a combination of clustering processes in the input data space and in the joint input-output data space is presented. Crucial reason of this task is to overcome problems related to initialization and contradictory fuzzy rules, which usually happen when building ANFIS. The clustering process in the input data space is accomplished based on a proposed merging-possibilistic clustering (MPC) algorithm. The effectiveness of this process is evaluated to resume a clustering process in the joint input-output data space. The optimal parameters obtained after completion of the clustering process are used to build ANFIS. Simulations based on a numerical data, 'Daily Data of Stock A', and measured data sets of a smart damper are performed to analyze and estimate accuracy. In addition, convergence and robustness of the proposed algorithm are investigated based on both theoretical and testing approaches.
Blocked inverted indices for exact clustering of large chemical spaces.
Thiel, Philipp; Sach-Peltason, Lisa; Ottmann, Christian; Kohlbacher, Oliver
2014-09-22
The calculation of pairwise compound similarities based on fingerprints is one of the fundamental tasks in chemoinformatics. Methods for efficient calculation of compound similarities are of the utmost importance for various applications like similarity searching or library clustering. With the increasing size of public compound databases, exact clustering of these databases is desirable, but often computationally prohibitively expensive. We present an optimized inverted index algorithm for the calculation of all pairwise similarities on 2D fingerprints of a given data set. In contrast to other algorithms, it neither requires GPU computing nor yields a stochastic approximation of the clustering. The algorithm has been designed to work well with multicore architectures and shows excellent parallel speedup. As an application example of this algorithm, we implemented a deterministic clustering application, which has been designed to decompose virtual libraries comprising tens of millions of compounds in a short time on current hardware. Our results show that our implementation achieves more than 400 million Tanimoto similarity calculations per second on a common desktop CPU. Deterministic clustering of the available chemical space thus can be done on modern multicore machines within a few days.
Ancient genomic architecture for mammalian olfactory receptor clusters
Aloni, Ronny; Olender, Tsviya; Lancet, Doron
2006-01-01
Background Mammalian olfactory receptor (OR) genes reside in numerous genomic clusters of up to several dozen genes. Whole-genome sequence alignment nets of five mammals allow their comprehensive comparison, aimed at reconstructing the ancestral olfactory subgenome. Results We developed a new and general tool for genome-wide definition of genomic gene clusters conserved in multiple species. Syntenic orthologs, defined as gene pairs showing conservation of both genomic location and coding sequence, were subjected to a graph theory algorithm for discovering CLICs (clusters in conservation). When applied to ORs in five mammals, including the marsupial opossum, more than 90% of the OR genes were found within a framework of 48 multi-species CLICs, invoking a general conservation of gene order and composition. A detailed analysis of individual CLICs revealed multiple differences among species, interpretable through species-specific genomic rearrangements and reflecting complex mammalian evolutionary dynamics. One significant instance involves CLIC #1, which lacks a human member, implying the human-specific deletion of an OR cluster, whose mouse counterpart has been tentatively associated with isovaleric acid odorant detection. Conclusion The identified multi-species CLICs demonstrate that most of the mammalian OR clusters have a common ancestry, preceding the split between marsupials and placental mammals. However, only two of these CLICs were capable of incorporating chicken OR genes, parsimoniously implying that all other CLICs emerged subsequent to the avian-mammalian divergence. PMID:17010214
NASA Astrophysics Data System (ADS)
Choi, Hon-Chit; Wen, Lingfeng; Eberl, Stefan; Feng, Dagan
2006-03-01
Dynamic Single Photon Emission Computed Tomography (SPECT) has the potential to quantitatively estimate physiological parameters by fitting compartment models to the tracer kinetics. The generalized linear least square method (GLLS) is an efficient method to estimate unbiased kinetic parameters and parametric images. However, due to the low sensitivity of SPECT, noisy data can cause voxel-wise parameter estimation by GLLS to fail. Fuzzy C-Mean (FCM) clustering and modified FCM, which also utilizes information from the immediate neighboring voxels, are proposed to improve the voxel-wise parameter estimation of GLLS. Monte Carlo simulations were performed to generate dynamic SPECT data with different noise levels and processed by general and modified FCM clustering. Parametric images were estimated by Logan and Yokoi graphical analysis and GLLS. The influx rate (K I), volume of distribution (V d) were estimated for the cerebellum, thalamus and frontal cortex. Our results show that (1) FCM reduces the bias and improves the reliability of parameter estimates for noisy data, (2) GLLS provides estimates of micro parameters (K I-k 4) as well as macro parameters, such as volume of distribution (Vd) and binding potential (BP I & BP II) and (3) FCM clustering incorporating neighboring voxel information does not improve the parameter estimates, but improves noise in the parametric images. These findings indicated that it is desirable for pre-segmentation with traditional FCM clustering to generate voxel-wise parametric images with GLLS from dynamic SPECT data.
A Game Theory Algorithm for Intra-Cluster Data Aggregation in a Vehicular Ad Hoc Network
Chen, Yuzhong; Weng, Shining; Guo, Wenzhong; Xiong, Naixue
2016-01-01
Vehicular ad hoc networks (VANETs) have an important role in urban management and planning. The effective integration of vehicle information in VANETs is critical to traffic analysis, large-scale vehicle route planning and intelligent transportation scheduling. However, given the limitations in the precision of the output information of a single sensor and the difficulty of information sharing among various sensors in a highly dynamic VANET, effectively performing data aggregation in VANETs remains a challenge. Moreover, current studies have mainly focused on data aggregation in large-scale environments but have rarely discussed the issue of intra-cluster data aggregation in VANETs. In this study, we propose a multi-player game theory algorithm for intra-cluster data aggregation in VANETs by analyzing the competitive and cooperative relationships among sensor nodes. Several sensor-centric metrics are proposed to measure the data redundancy and stability of a cluster. We then study the utility function to achieve efficient intra-cluster data aggregation by considering both data redundancy and cluster stability. In particular, we prove the existence of a unique Nash equilibrium in the game model, and conduct extensive experiments to validate the proposed algorithm. Results demonstrate that the proposed algorithm has advantages over typical data aggregation algorithms in both accuracy and efficiency. PMID:26907272
A Game Theory Algorithm for Intra-Cluster Data Aggregation in a Vehicular Ad Hoc Network.
Chen, Yuzhong; Weng, Shining; Guo, Wenzhong; Xiong, Naixue
2016-02-19
Vehicular ad hoc networks (VANETs) have an important role in urban management and planning. The effective integration of vehicle information in VANETs is critical to traffic analysis, large-scale vehicle route planning and intelligent transportation scheduling. However, given the limitations in the precision of the output information of a single sensor and the difficulty of information sharing among various sensors in a highly dynamic VANET, effectively performing data aggregation in VANETs remains a challenge. Moreover, current studies have mainly focused on data aggregation in large-scale environments but have rarely discussed the issue of intra-cluster data aggregation in VANETs. In this study, we propose a multi-player game theory algorithm for intra-cluster data aggregation in VANETs by analyzing the competitive and cooperative relationships among sensor nodes. Several sensor-centric metrics are proposed to measure the data redundancy and stability of a cluster. We then study the utility function to achieve efficient intra-cluster data aggregation by considering both data redundancy and cluster stability. In particular, we prove the existence of a unique Nash equilibrium in the game model, and conduct extensive experiments to validate the proposed algorithm. Results demonstrate that the proposed algorithm has advantages over typical data aggregation algorithms in both accuracy and efficiency.
Real Time Intelligent Target Detection and Analysis with Machine Vision
NASA Technical Reports Server (NTRS)
Howard, Ayanna; Padgett, Curtis; Brown, Kenneth
2000-01-01
We present an algorithm for detecting a specified set of targets for an Automatic Target Recognition (ATR) application. ATR involves processing images for detecting, classifying, and tracking targets embedded in a background scene. We address the problem of discriminating between targets and nontarget objects in a scene by evaluating 40x40 image blocks belonging to an image. Each image block is first projected onto a set of templates specifically designed to separate images of targets embedded in a typical background scene from those background images without targets. These filters are found using directed principal component analysis which maximally separates the two groups. The projected images are then clustered into one of n classes based on a minimum distance to a set of n cluster prototypes. These cluster prototypes have previously been identified using a modified clustering algorithm based on prior sensed data. Each projected image pattern is then fed into the associated cluster's trained neural network for classification. A detailed description of our algorithm will be given in this paper. We outline our methodology for designing the templates, describe our modified clustering algorithm, and provide details on the neural network classifiers. Evaluation of the overall algorithm demonstrates that our detection rates approach 96% with a false positive rate of less than 0.03%.
Shoemaker, W C; Patil, R; Appel, P L; Kram, H B
1992-11-01
A generalized decision tree or clinical algorithm for treatment of high-risk elective surgical patients was developed from a physiologic model based on empirical data. First, a large data bank was used to do the following: (1) describe temporal hemodynamic and oxygen transport patterns that interrelate cardiac, pulmonary, and tissue perfusion functions in survivors and nonsurvivors; (2) define optimal therapeutic goals based on the supranormal oxygen transport values of high-risk postoperative survivors; (3) compare the relative effectiveness of alternative therapies in a wide variety of clinical and physiologic conditions; and (4) to develop criteria for titration of therapy to the endpoints of the supranormal optimal goals using cardiac index (CI), oxygen delivery (DO2), and oxygen consumption (VO2) as proxy outcome measures. Second, a general purpose algorithm was generated from these data and tested in preoperatively randomized clinical trials of high-risk surgical patients. Improved outcome was demonstrated with this generalized algorithm. The concept that the supranormal values represent compensations that have survival value has been corroborated by several other groups. We now propose a unique approach to refine the generalized algorithm to develop customized algorithms and individualized decision analysis for each patient's unique problems. The present article describes a preliminary evaluation of the feasibility of artificial intelligence techniques to accomplish individualized algorithms that may further improve patient care and outcome.
Adaptive block online learning target tracking based on super pixel segmentation
NASA Astrophysics Data System (ADS)
Cheng, Yue; Li, Jianzeng
2018-04-01
Video target tracking technology under the unremitting exploration of predecessors has made big progress, but there are still lots of problems not solved. This paper proposed a new algorithm of target tracking based on image segmentation technology. Firstly we divide the selected region using simple linear iterative clustering (SLIC) algorithm, after that, we block the area with the improved density-based spatial clustering of applications with noise (DBSCAN) clustering algorithm. Each sub-block independently trained classifier and tracked, then the algorithm ignore the failed tracking sub-block while reintegrate the rest of the sub-blocks into tracking box to complete the target tracking. The experimental results show that our algorithm can work effectively under occlusion interference, rotation change, scale change and many other problems in target tracking compared with the current mainstream algorithms.
NASA Technical Reports Server (NTRS)
Wharton, S. W.
1980-01-01
An Interactive Cluster Analysis Procedure (ICAP) was developed to derive classifier training statistics from remotely sensed data. The algorithm interfaces the rapid numerical processing capacity of a computer with the human ability to integrate qualitative information. Control of the clustering process alternates between the algorithm, which creates new centroids and forms clusters and the analyst, who evaluate and elect to modify the cluster structure. Clusters can be deleted or lumped pairwise, or new centroids can be added. A summary of the cluster statistics can be requested to facilitate cluster manipulation. The ICAP was implemented in APL (A Programming Language), an interactive computer language. The flexibility of the algorithm was evaluated using data from different LANDSAT scenes to simulate two situations: one in which the analyst is assumed to have no prior knowledge about the data and wishes to have the clusters formed more or less automatically; and the other in which the analyst is assumed to have some knowledge about the data structure and wishes to use that information to closely supervise the clustering process. For comparison, an existing clustering method was also applied to the two data sets.
Multimodal Estimation of Distribution Algorithms.
Yang, Qiang; Chen, Wei-Neng; Li, Yun; Chen, C L Philip; Xu, Xiang-Min; Zhang, Jun
2016-02-15
Taking the advantage of estimation of distribution algorithms (EDAs) in preserving high diversity, this paper proposes a multimodal EDA. Integrated with clustering strategies for crowding and speciation, two versions of this algorithm are developed, which operate at the niche level. Then these two algorithms are equipped with three distinctive techniques: 1) a dynamic cluster sizing strategy; 2) an alternative utilization of Gaussian and Cauchy distributions to generate offspring; and 3) an adaptive local search. The dynamic cluster sizing affords a potential balance between exploration and exploitation and reduces the sensitivity to the cluster size in the niching methods. Taking advantages of Gaussian and Cauchy distributions, we generate the offspring at the niche level through alternatively using these two distributions. Such utilization can also potentially offer a balance between exploration and exploitation. Further, solution accuracy is enhanced through a new local search scheme probabilistically conducted around seeds of niches with probabilities determined self-adaptively according to fitness values of these seeds. Extensive experiments conducted on 20 benchmark multimodal problems confirm that both algorithms can achieve competitive performance compared with several state-of-the-art multimodal algorithms, which is supported by nonparametric tests. Especially, the proposed algorithms are very promising for complex problems with many local optima.
NASA Astrophysics Data System (ADS)
Zhou, Shuguang; Zhou, Kefa; Wang, Jinlin; Yang, Genfang; Wang, Shanshan
2017-12-01
Cluster analysis is a well-known technique that is used to analyze various types of data. In this study, cluster analysis is applied to geochemical data that describe 1444 stream sediment samples collected in northwestern Xinjiang with a sample spacing of approximately 2 km. Three algorithms (the hierarchical, k-means, and fuzzy c-means algorithms) and six data transformation methods (the z-score standardization, ZST; the logarithmic transformation, LT; the additive log-ratio transformation, ALT; the centered log-ratio transformation, CLT; the isometric log-ratio transformation, ILT; and no transformation, NT) are compared in terms of their effects on the cluster analysis of the geochemical compositional data. The study shows that, on the one hand, the ZST does not affect the results of column- or variable-based (R-type) cluster analysis, whereas the other methods, including the LT, the ALT, and the CLT, have substantial effects on the results. On the other hand, the results of the row- or observation-based (Q-type) cluster analysis obtained from the geochemical data after applying NT and the ZST are relatively poor. However, we derive some improved results from the geochemical data after applying the CLT, the ILT, the LT, and the ALT. Moreover, the k-means and fuzzy c-means clustering algorithms are more reliable than the hierarchical algorithm when they are used to cluster the geochemical data. We apply cluster analysis to the geochemical data to explore for Au deposits within the study area, and we obtain a good correlation between the results retrieved by combining the CLT or the ILT with the k-means or fuzzy c-means algorithms and the potential zones of Au mineralization. Therefore, we suggest that the combination of the CLT or the ILT with the k-means or fuzzy c-means algorithms is an effective tool to identify potential zones of mineralization from geochemical data.
Abelian non-global logarithms from soft gluon clustering
NASA Astrophysics Data System (ADS)
Kelley, Randall; Walsh, Jonathan R.; Zuberi, Saba
2012-09-01
Most recombination-style jet algorithms cluster soft gluons in a complex way. This leads to previously identified correlations in the soft gluon phase space and introduces logarithmic corrections to jet cross sections, which are known as clustering logarithms. The leading Abelian clustering logarithms occur at least at next-to leading logarithm (NLL) in the exponent of the distribution. Using the framework of Soft Collinear Effective Theory (SCET), we show that new clustering effects contributing at NLL arise at each order. While numerical resummation of clustering logs is possible, it is unlikely that they can be analytically resummed to NLL. Clustering logarithms make the anti-kT algorithm theoretically preferred, for which they are power suppressed. They can arise in Abelian and non-Abelian terms, and we calculate the Abelian clustering logarithms at O ( {α_s^2} ) for the jet mass distribution using the Cambridge/Aachen and kT algorithms, including jet radius dependence, which extends previous results. We find that clustering logarithms can be naturally thought of as a class of non-global logarithms, which have traditionally been tied to non-Abelian correlations in soft gluon emission.
Jing Jin; Dauwels, Justin; Cash, Sydney; Westover, M Brandon
2014-01-01
Detection of interictal discharges is a key element of interpreting EEGs during the diagnosis and management of epilepsy. Because interpretation of clinical EEG data is time-intensive and reliant on experts who are in short supply, there is a great need for automated spike detectors. However, attempts to develop general-purpose spike detectors have so far been severely limited by a lack of expert-annotated data. Huge databases of interictal discharges are therefore in great demand for the development of general-purpose detectors. Detailed manual annotation of interictal discharges is time consuming, which severely limits the willingness of experts to participate. To address such problems, a graphical user interface "SpikeGUI" was developed in our work for the purposes of EEG viewing and rapid interictal discharge annotation. "SpikeGUI" substantially speeds up the task of annotating interictal discharges using a custom-built algorithm based on a combination of template matching and online machine learning techniques. While the algorithm is currently tailored to annotation of interictal epileptiform discharges, it can easily be generalized to other waveforms and signal types.
Jin, Jing; Dauwels, Justin; Cash, Sydney; Westover, M. Brandon
2015-01-01
Detection of interictal discharges is a key element of interpreting EEGs during the diagnosis and management of epilepsy. Because interpretation of clinical EEG data is time-intensive and reliant on experts who are in short supply, there is a great need for automated spike detectors. However, attempts to develop general-purpose spike detectors have so far been severely limited by a lack of expert-annotated data. Huge databases of interictal discharges are therefore in great demand for the development of general-purpose detectors. Detailed manual annotation of interictal discharges is time consuming, which severely limits the willingness of experts to participate. To address such problems, a graphical user interface “SpikeGUI” was developed in our work for the purposes of EEG viewing and rapid interictal discharge annotation. “SpikeGUI” substantially speeds up the task of annotating interictal discharges using a custom-built algorithm based on a combination of template matching and online machine learning techniques. While the algorithm is currently tailored to annotation of interictal epileptiform discharges, it can easily be generalized to other waveforms and signal types. PMID:25570976
Clustering Millions of Faces by Identity.
Otto, Charles; Wang, Dayong; Jain, Anil K
2018-02-01
Given a large collection of unlabeled face images, we address the problem of clustering faces into an unknown number of identities. This problem is of interest in social media, law enforcement, and other applications, where the number of faces can be of the order of hundreds of million, while the number of identities (clusters) can range from a few thousand to millions. To address the challenges of run-time complexity and cluster quality, we present an approximate Rank-Order clustering algorithm that performs better than popular clustering algorithms (k-Means and Spectral). Our experiments include clustering up to 123 million face images into over 10 million clusters. Clustering results are analyzed in terms of external (known face labels) and internal (unknown face labels) quality measures, and run-time. Our algorithm achieves an F-measure of 0.87 on the LFW benchmark (13 K faces of 5,749 individuals), which drops to 0.27 on the largest dataset considered (13 K faces in LFW + 123M distractor images). Additionally, we show that frames in the YouTube benchmark can be clustered with an F-measure of 0.71. An internal per-cluster quality measure is developed to rank individual clusters for manual exploration of high quality clusters that are compact and isolated.
A noniterative greedy algorithm for multiframe point correspondence.
Shafique, Khurram; Shah, Mubarak
2005-01-01
This paper presents a framework for finding point correspondences in monocular image sequences over multiple frames. The general problem of multiframe point correspondence is NP-hard for three or more frames. A polynomial time algorithm for a restriction of this problem is presented and is used as the basis of the proposed greedy algorithm for the general problem. The greedy nature of the proposed algorithm allows it to be used in real-time systems for tracking and surveillance, etc. In addition, the proposed algorithm deals with the problems of occlusion, missed detections, and false positives by using a single noniterative greedy optimization scheme and, hence, reduces the complexity of the overall algorithm as compared to most existing approaches where multiple heuristics are used for the same purpose. While most greedy algorithms for point tracking do not allow for entry and exit of the points from the scene, this is not a limitation for the proposed algorithm. Experiments with real and synthetic data over a wide range of scenarios and system parameters are presented to validate the claims about the performance of the proposed algorithm.
Generalized SMO algorithm for SVM-based multitask learning.
Cai, Feng; Cherkassky, Vladimir
2012-06-01
Exploiting additional information to improve traditional inductive learning is an active research area in machine learning. In many supervised-learning applications, training data can be naturally separated into several groups, and incorporating this group information into learning may improve generalization. Recently, Vapnik proposed a general approach to formalizing such problems, known as "learning with structured data" and its support vector machine (SVM) based optimization formulation called SVM+. Liang and Cherkassky showed the connection between SVM+ and multitask learning (MTL) approaches in machine learning, and proposed an SVM-based formulation for MTL called SVM+MTL for classification. Training the SVM+MTL classifier requires the solution of a large quadratic programming optimization problem which scales as O(n(3)) with sample size n. So there is a need to develop computationally efficient algorithms for implementing SVM+MTL. This brief generalizes Platt's sequential minimal optimization (SMO) algorithm to the SVM+MTL setting. Empirical results show that, for typical SVM+MTL problems, the proposed generalized SMO achieves over 100 times speed-up, in comparison with general-purpose optimization routines.
Application of artificial intelligence to search ground-state geometry of clusters
NASA Astrophysics Data System (ADS)
Lemes, Maurício Ruv; Marim, L. R.; dal Pino, A.
2002-08-01
We introduce a global optimization procedure, the neural-assisted genetic algorithm (NAGA). It combines the power of an artificial neural network (ANN) with the versatility of the genetic algorithm. This method is suitable to solve optimization problems that depend on some kind of heuristics to limit the search space. If a reasonable amount of data is available, the ANN can ``understand'' the problem and provide the genetic algorithm with a selected population of elements that will speed up the search for the optimum solution. We tested the method in a search for the ground-state geometry of silicon clusters. We trained the ANN with information about the geometry and energetics of small silicon clusters. Next, the ANN learned how to restrict the configurational space for larger silicon clusters. For Si10 and Si20, we noticed that the NAGA is at least three times faster than the ``pure'' genetic algorithm. As the size of the cluster increases, it is expected that the gain in terms of time will increase as well.
Macromolecular target prediction by self-organizing feature maps.
Schneider, Gisbert; Schneider, Petra
2017-03-01
Rational drug discovery would greatly benefit from a more nuanced appreciation of the activity of pharmacologically active compounds against a diverse panel of macromolecular targets. Already, computational target-prediction models assist medicinal chemists in library screening, de novo molecular design, optimization of active chemical agents, drug re-purposing, in the spotting of potential undesired off-target activities, and in the 'de-orphaning' of phenotypic screening hits. The self-organizing map (SOM) algorithm has been employed successfully for these and other purposes. Areas covered: The authors recapitulate contemporary artificial neural network methods for macromolecular target prediction, and present the basic SOM algorithm at a conceptual level. Specifically, they highlight consensus target-scoring by the employment of multiple SOMs, and discuss the opportunities and limitations of this technique. Expert opinion: Self-organizing feature maps represent a straightforward approach to ligand clustering and classification. Some of the appeal lies in their conceptual simplicity and broad applicability domain. Despite known algorithmic shortcomings, this computational target prediction concept has been proven to work in prospective settings with high success rates. It represents a prototypic technique for future advances in the in silico identification of the modes of action and macromolecular targets of bioactive molecules.
Satisfiability of logic programming based on radial basis function neural networks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hamadneh, Nawaf; Sathasivam, Saratha; Tilahun, Surafel Luleseged
2014-07-10
In this paper, we propose a new technique to test the Satisfiability of propositional logic programming and quantified Boolean formula problem in radial basis function neural networks. For this purpose, we built radial basis function neural networks to represent the proportional logic which has exactly three variables in each clause. We used the Prey-predator algorithm to calculate the output weights of the neural networks, while the K-means clustering algorithm is used to determine the hidden parameters (the centers and the widths). Mean of the sum squared error function is used to measure the activity of the two algorithms. We appliedmore » the developed technique with the recurrent radial basis function neural networks to represent the quantified Boolean formulas. The new technique can be applied to solve many applications such as electronic circuits and NP-complete problems.« less
NASA Technical Reports Server (NTRS)
Mazzoni, Dominic; Wagstaff, Kiri; Bornstein, Benjamin; Tang, Nghia; Roden, Joseph
2006-01-01
PixelLearn is an integrated user-interface computer program for classifying pixels in scientific images. Heretofore, training a machine-learning algorithm to classify pixels in images has been tedious and difficult. PixelLearn provides a graphical user interface that makes it faster and more intuitive, leading to more interactive exploration of image data sets. PixelLearn also provides image-enhancement controls to make it easier to see subtle details in images. PixelLearn opens images or sets of images in a variety of common scientific file formats and enables the user to interact with several supervised or unsupervised machine-learning pixel-classifying algorithms while the user continues to browse through the images. The machinelearning algorithms in PixelLearn use advanced clustering and classification methods that enable accuracy much higher than is achievable by most other software previously available for this purpose. PixelLearn is written in portable C++ and runs natively on computers running Linux, Windows, or Mac OS X.
NASA Astrophysics Data System (ADS)
Joseph, R.; Courbin, F.; Starck, J.-L.
2016-05-01
We introduce a new algorithm for colour separation and deblending of multi-band astronomical images called MuSCADeT which is based on Morpho-spectral Component Analysis of multi-band images. The MuSCADeT algorithm takes advantage of the sparsity of astronomical objects in morphological dictionaries such as wavelets and their differences in spectral energy distribution (SED) across multi-band observations. This allows us to devise a model independent and automated approach to separate objects with different colours. We show with simulations that we are able to separate highly blended objects and that our algorithm is robust against SED variations of objects across the field of view. To confront our algorithm with real data, we use HST images of the strong lensing galaxy cluster MACS J1149+2223 and we show that MuSCADeT performs better than traditional profile-fitting techniques in deblending the foreground lensing galaxies from background lensed galaxies. Although the main driver for our work is the deblending of strong gravitational lenses, our method is fit to be used for any purpose related to deblending of objects in astronomical images. An example of such an application is the separation of the red and blue stellar populations of a spiral galaxy in the galaxy cluster Abell 2744. We provide a python package along with all simulations and routines used in this paper to contribute to reproducible research efforts. Codes can be found at http://lastro.epfl.ch/page-126973.html
A multimembership catalogue for 1876 open clusters using UCAC4 data
NASA Astrophysics Data System (ADS)
Sampedro, L.; Dias, W. S.; Alfaro, E. J.; Monteiro, H.; Molino, A.
2017-10-01
The main objective of this work is to determine the cluster members of 1876 open clusters, using positions and proper motions of the astrometric fourth United States Naval Observatory (USNO) CCD Astrograph Catalog (UCAC4). For this purpose, we apply three different methods, all based on a Bayesian approach, but with different formulations: a purely parametric method, another completely non-parametric algorithm and a third, recently developed by Sampedro & Alfaro, using both formulations at different steps of the whole process. The first and second statistical moments of the members' phase-space subspace, obtained after applying the three methods, are compared for every cluster. Although, on average, the three methods yield similar results, there are also specific differences between them, as well as for some particular clusters. The comparison with other published catalogues shows good agreement. We have also estimated, for the first time, the mean proper motion for a sample of 18 clusters. The results are organized in a single catalogue formed by two main files, one with the most relevant information for each cluster, partially including that in UCAC4, and the other showing the individual membership probabilities for each star in the cluster area. The final catalogue, with an interface design that enables an easy interaction with the user, is available in electronic format at the Stellar Systems Group (SSG-IAA) web site (http://ssg.iaa.es/en/content/sampedro-cluster-catalog).
Automatic Clustering Using FSDE-Forced Strategy Differential Evolution
NASA Astrophysics Data System (ADS)
Yasid, A.
2018-01-01
Clustering analysis is important in datamining for unsupervised data, cause no adequate prior knowledge. One of the important tasks is defining the number of clusters without user involvement that is known as automatic clustering. This study intends on acquiring cluster number automatically utilizing forced strategy differential evolution (AC-FSDE). Two mutation parameters, namely: constant parameter and variable parameter are employed to boost differential evolution performance. Four well-known benchmark datasets were used to evaluate the algorithm. Moreover, the result is compared with other state of the art automatic clustering methods. The experiment results evidence that AC-FSDE is better or competitive with other existing automatic clustering algorithm.
Hybrid fuzzy cluster ensemble framework for tumor clustering from biomolecular data.
Yu, Zhiwen; Chen, Hantao; You, Jane; Han, Guoqiang; Li, Le
2013-01-01
Cancer class discovery using biomolecular data is one of the most important tasks for cancer diagnosis and treatment. Tumor clustering from gene expression data provides a new way to perform cancer class discovery. Most of the existing research works adopt single-clustering algorithms to perform tumor clustering is from biomolecular data that lack robustness, stability, and accuracy. To further improve the performance of tumor clustering from biomolecular data, we introduce the fuzzy theory into the cluster ensemble framework for tumor clustering from biomolecular data, and propose four kinds of hybrid fuzzy cluster ensemble frameworks (HFCEF), named as HFCEF-I, HFCEF-II, HFCEF-III, and HFCEF-IV, respectively, to identify samples that belong to different types of cancers. The difference between HFCEF-I and HFCEF-II is that they adopt different ensemble generator approaches to generate a set of fuzzy matrices in the ensemble. Specifically, HFCEF-I applies the affinity propagation algorithm (AP) to perform clustering on the sample dimension and generates a set of fuzzy matrices in the ensemble based on the fuzzy membership function and base samples selected by AP. HFCEF-II adopts AP to perform clustering on the attribute dimension, generates a set of subspaces, and obtains a set of fuzzy matrices in the ensemble by performing fuzzy c-means on subspaces. Compared with HFCEF-I and HFCEF-II, HFCEF-III and HFCEF-IV consider the characteristics of HFCEF-I and HFCEF-II. HFCEF-III combines HFCEF-I and HFCEF-II in a serial way, while HFCEF-IV integrates HFCEF-I and HFCEF-II in a concurrent way. HFCEFs adopt suitable consensus functions, such as the fuzzy c-means algorithm or the normalized cut algorithm (Ncut), to summarize generated fuzzy matrices, and obtain the final results. The experiments on real data sets from UCI machine learning repository and cancer gene expression profiles illustrate that 1) the proposed hybrid fuzzy cluster ensemble frameworks work well on real data sets, especially biomolecular data, and 2) the proposed approaches are able to provide more robust, stable, and accurate results when compared with the state-of-the-art single clustering algorithms and traditional cluster ensemble approaches.
ABCluster: the artificial bee colony algorithm for cluster global optimization.
Zhang, Jun; Dolg, Michael
2015-10-07
Global optimization of cluster geometries is of fundamental importance in chemistry and an interesting problem in applied mathematics. In this work, we introduce a relatively new swarm intelligence algorithm, i.e. the artificial bee colony (ABC) algorithm proposed in 2005, to this field. It is inspired by the foraging behavior of a bee colony, and only three parameters are needed to control it. We applied it to several potential functions of quite different nature, i.e., the Coulomb-Born-Mayer, Lennard-Jones, Morse, Z and Gupta potentials. The benchmarks reveal that for long-ranged potentials the ABC algorithm is very efficient in locating the global minimum, while for short-ranged ones it is sometimes trapped into a local minimum funnel on a potential energy surface of large clusters. We have released an efficient, user-friendly, and free program "ABCluster" to realize the ABC algorithm. It is a black-box program for non-experts as well as experts and might become a useful tool for chemists to study clusters.
Burchett, John; Shankar, Mohan; Hamza, A Ben; Guenther, Bob D; Pitsianis, Nikos; Brady, David J
2006-05-01
We use pyroelectric detectors that are differential in nature to detect motion in humans by their heat emissions. Coded Fresnel lens arrays create boundaries that help to localize humans in space as well as to classify the nature of their motion. We design and implement a low-cost biometric tracking system by using off-the-shelf components. We demonstrate two classification methods by using data gathered from sensor clusters of dual-element pyroelectric detectors with coded Fresnel lens arrays. We propose two algorithms for person identification, a more generalized spectral clustering method and a more rigorous example that uses principal component regression to perform a blind classification.
Quantum cluster variational method and message passing algorithms revisited
NASA Astrophysics Data System (ADS)
Domínguez, E.; Mulet, Roberto
2018-02-01
We present a general framework to study quantum disordered systems in the context of the Kikuchi's cluster variational method (CVM). The method relies in the solution of message passing-like equations for single instances or in the iterative solution of complex population dynamic algorithms for an average case scenario. We first show how a standard application of the Kikuchi's CVM can be easily translated to message passing equations for specific instances of the disordered system. We then present an "ad hoc" extension of these equations to a population dynamic algorithm representing an average case scenario. At the Bethe level, these equations are equivalent to the dynamic population equations that can be derived from a proper cavity ansatz. However, at the plaquette approximation, the interpretation is more subtle and we discuss it taking also into account previous results in classical disordered models. Moreover, we develop a formalism to properly deal with the average case scenario using a replica-symmetric ansatz within this CVM for quantum disordered systems. Finally, we present and discuss numerical solutions of the different approximations for the quantum transverse Ising model and the quantum random field Ising model in two-dimensional lattices.
Consensus properties and their large-scale applications for the gene duplication problem.
Moon, Jucheol; Lin, Harris T; Eulenstein, Oliver
2016-06-01
Solving the gene duplication problem is a classical approach for species tree inference from gene trees that are confounded by gene duplications. This problem takes a collection of gene trees and seeks a species tree that implies the minimum number of gene duplications. Wilkinson et al. posed the conjecture that the gene duplication problem satisfies the desirable Pareto property for clusters. That is, for every instance of the problem, all clusters that are commonly present in the input gene trees of this instance, called strict consensus, will also be found in every solution to this instance. We prove that this conjecture does not generally hold. Despite this negative result we show that the gene duplication problem satisfies a weaker version of the Pareto property where the strict consensus is found in at least one solution (rather than all solutions). This weaker property contributes to our design of an efficient scalable algorithm for the gene duplication problem. We demonstrate the performance of our algorithm in analyzing large-scale empirical datasets. Finally, we utilize the algorithm to evaluate the accuracy of standard heuristics for the gene duplication problem using simulated datasets.
A novel framework for feature extraction in multi-sensor action potential sorting.
Wu, Shun-Chi; Swindlehurst, A Lee; Nenadic, Zoran
2015-09-30
Extracellular recordings of multi-unit neural activity have become indispensable in neuroscience research. The analysis of the recordings begins with the detection of the action potentials (APs), followed by a classification step where each AP is associated with a given neural source. A feature extraction step is required prior to classification in order to reduce the dimensionality of the data and the impact of noise, allowing source clustering algorithms to work more efficiently. In this paper, we propose a novel framework for multi-sensor AP feature extraction based on the so-called Matched Subspace Detector (MSD), which is shown to be a natural generalization of standard single-sensor algorithms. Clustering using both simulated data and real AP recordings taken in the locust antennal lobe demonstrates that the proposed approach yields features that are discriminatory and lead to promising results. Unlike existing methods, the proposed algorithm finds joint spatio-temporal feature vectors that match the dominant subspace observed in the two-dimensional data without needs for a forward propagation model and AP templates. The proposed MSD approach provides more discriminatory features for unsupervised AP sorting applications. Copyright © 2015 Elsevier B.V. All rights reserved.
Hybrid approach of selecting hyperparameters of support vector machine for regression.
Jeng, Jin-Tsong
2006-06-01
To select the hyperparameters of the support vector machine for regression (SVR), a hybrid approach is proposed to determine the kernel parameter of the Gaussian kernel function and the epsilon value of Vapnik's epsilon-insensitive loss function. The proposed hybrid approach includes a competitive agglomeration (CA) clustering algorithm and a repeated SVR (RSVR) approach. Since the CA clustering algorithm is used to find the nearly "optimal" number of clusters and the centers of clusters in the clustering process, the CA clustering algorithm is applied to select the Gaussian kernel parameter. Additionally, an RSVR approach that relies on the standard deviation of a training error is proposed to obtain an epsilon in the loss function. Finally, two functions, one real data set (i.e., a time series of quarterly unemployment rate for West Germany) and an identification of nonlinear plant are used to verify the usefulness of the hybrid approach.
Thematic clustering of text documents using an EM-based approach
2012-01-01
Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well. PMID:23046528
NASA Technical Reports Server (NTRS)
Tarabalka, Y.; Tilton, J. C.; Benediktsson, J. A.; Chanussot, J.
2012-01-01
The Hierarchical SEGmentation (HSEG) algorithm, which combines region object finding with region object clustering, has given good performances for multi- and hyperspectral image analysis. This technique produces at its output a hierarchical set of image segmentations. The automated selection of a single segmentation level is often necessary. We propose and investigate the use of automatically selected markers for this purpose. In this paper, a novel Marker-based HSEG (M-HSEG) method for spectral-spatial classification of hyperspectral images is proposed. Two classification-based approaches for automatic marker selection are adapted and compared for this purpose. Then, a novel constrained marker-based HSEG algorithm is applied, resulting in a spectral-spatial classification map. Three different implementations of the M-HSEG method are proposed and their performances in terms of classification accuracies are compared. The experimental results, presented for three hyperspectral airborne images, demonstrate that the proposed approach yields accurate segmentation and classification maps, and thus is attractive for remote sensing image analysis.
2015-01-01
Background Though cluster analysis has become a routine analytic task for bioinformatics research, it is still arduous for researchers to assess the quality of a clustering result. To select the best clustering method and its parameters for a dataset, researchers have to run multiple clustering algorithms and compare them. However, such a comparison task with multiple clustering results is cognitively demanding and laborious. Results In this paper, we present XCluSim, a visual analytics tool that enables users to interactively compare multiple clustering results based on the Visual Information Seeking Mantra. We build a taxonomy for categorizing existing techniques of clustering results visualization in terms of the Gestalt principles of grouping. Using the taxonomy, we choose the most appropriate interactive visualizations for presenting individual clustering results from different types of clustering algorithms. The efficacy of XCluSim is shown through case studies with a bioinformatician. Conclusions Compared to other relevant tools, XCluSim enables users to compare multiple clustering results in a more scalable manner. Moreover, XCluSim supports diverse clustering algorithms and dedicated visualizations and interactions for different types of clustering results, allowing more effective exploration of details on demand. Through case studies with a bioinformatics researcher, we received positive feedback on the functionalities of XCluSim, including its ability to help identify stably clustered items across multiple clustering results. PMID:26328893
Clustering Binary Data in the Presence of Masking Variables
ERIC Educational Resources Information Center
Brusco, Michael J.
2004-01-01
A number of important applications require the clustering of binary data sets. Traditional nonhierarchical cluster analysis techniques, such as the popular K-means algorithm, can often be successfully applied to these data sets. However, the presence of masking variables in a data set can impede the ability of the K-means algorithm to recover the…
A Multilevel Gamma-Clustering Layout Algorithm for Visualization of Biological Networks
Hruz, Tomas; Lucas, Christoph; Laule, Oliver; Zimmermann, Philip
2013-01-01
Visualization of large complex networks has become an indispensable part of systems biology, where organisms need to be considered as one complex system. The visualization of the corresponding network is challenging due to the size and density of edges. In many cases, the use of standard visualization algorithms can lead to high running times and poorly readable visualizations due to many edge crossings. We suggest an approach that analyzes the structure of the graph first and then generates a new graph which contains specific semantic symbols for regular substructures like dense clusters. We propose a multilevel gamma-clustering layout visualization algorithm (MLGA) which proceeds in three subsequent steps: (i) a multilevel γ-clustering is used to identify the structure of the underlying network, (ii) the network is transformed to a tree, and (iii) finally, the resulting tree which shows the network structure is drawn using a variation of a force-directed algorithm. The algorithm has a potential to visualize very large networks because it uses modern clustering heuristics which are optimized for large graphs. Moreover, most of the edges are removed from the visual representation which allows keeping the overview over complex graphs with dense subgraphs. PMID:23864855
Iterative Track Fitting Using Cluster Classification in Multi Wire Proportional Chamber
NASA Astrophysics Data System (ADS)
Primor, David; Mikenberg, Giora; Etzion, Erez; Messer, Hagit
2007-10-01
This paper addresses the problem of track fitting of a charged particle in a multi wire proportional chamber (MWPC) using cathode readout strips. When a charged particle crosses a MWPC, a positive charge is induced on a cluster of adjacent strips. In the presence of high radiation background, the cluster charge measurements may be contaminated due to background particles, leading to less accurate hit position estimation. The least squares method for track fitting assumes the same position error distribution for all hits and thus loses its optimal properties on contaminated data. For this reason, a new robust algorithm is proposed. The algorithm first uses the known spatial charge distribution caused by a single charged particle over the strips, and classifies the clusters into ldquocleanrdquo and ldquodirtyrdquo clusters. Then, using the classification results, it performs an iterative weighted least squares fitting procedure, updating its optimal weights each iteration. The performance of the suggested algorithm is compared to other track fitting techniques using a simulation of tracks with radiation background. It is shown that the algorithm improves the track fitting performance significantly. A practical implementation of the algorithm is presented for muon track fitting in the cathode strip chamber (CSC) of the ATLAS experiment.
Advances in Significance Testing for Cluster Detection
NASA Astrophysics Data System (ADS)
Coleman, Deidra Andrea
Over the past two decades, much attention has been given to data driven project goals such as the Human Genome Project and the development of syndromic surveillance systems. A major component of these types of projects is analyzing the abundance of data. Detecting clusters within the data can be beneficial as it can lead to the identification of specified sequences of DNA nucleotides that are related to important biological functions or the locations of epidemics such as disease outbreaks or bioterrorism attacks. Cluster detection techniques require efficient and accurate hypothesis testing procedures. In this dissertation, we improve upon the hypothesis testing procedures for cluster detection by enhancing distributional theory and providing an alternative method for spatial cluster detection using syndromic surveillance data. In Chapter 2, we provide an efficient method to compute the exact distribution of the number and coverage of h-clumps of a collection of words. This method involves defining a Markov chain using a minimal deterministic automaton to reduce the number of states needed for computation. We allow words of the collection to contain other words of the collection making the method more general. We use our method to compute the distributions of the number and coverage of h-clumps in the Chi motif of H. influenza.. In Chapter 3, we provide an efficient algorithm to compute the exact distribution of multiple window discrete scan statistics for higher-order, multi-state Markovian sequences. This algorithm involves defining a Markov chain to efficiently keep track of probabilities needed to compute p-values of the statistic. We use our algorithm to identify cases where the available approximation does not perform well. We also use our algorithm to detect unusual clusters of made free throw shots by National Basketball Association players during the 2009-2010 regular season. In Chapter 4, we give a procedure to detect outbreaks using syndromic surveillance data while controlling the Bayesian False Discovery Rate (BFDR). The procedure entails choosing an appropriate Bayesian model that captures the spatial dependency inherent in epidemiological data and considers all days of interest, selecting a test statistic based on a chosen measure that provides the magnitude of the maximumal spatial cluster for each day, and identifying a cutoff value that controls the BFDR for rejecting the collective null hypothesis of no outbreak over a collection of days for a specified region.We use our procedure to analyze botulism-like syndrome data collected by the North Carolina Disease Event Tracking and Epidemiologic Collection Tool (NC DETECT).
A Fast Implementation of the ISODATA Clustering Algorithm
NASA Technical Reports Server (NTRS)
Memarsadeghi, Nargess; Mount, David M.; Netanyahu, Nathan S.; LeMoigne, Jacqueline
2005-01-01
Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to ISODATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both theoretical and empirical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic data and remotely sensed Landsat and MODIS images that show that our approach has significantly lower running times.
A Fast Implementation of the Isodata Clustering Algorithm
NASA Technical Reports Server (NTRS)
Memarsadeghi, Nargess; Le Moigne, Jacqueline; Mount, David M.; Netanyahu, Nathan S.
2007-01-01
Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to IsoDATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both theoretical and empirical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic data and remotely sensed Landsat and MODIS images that show that our approach has significantly lower running times.
Chiu, Chia-Yi; Köhn, Hans-Friedrich
2016-09-01
The asymptotic classification theory of cognitive diagnosis (ACTCD) provided the theoretical foundation for using clustering methods that do not rely on a parametric statistical model for assigning examinees to proficiency classes. Like general diagnostic classification models, clustering methods can be useful in situations where the true diagnostic classification model (DCM) underlying the data is unknown and possibly misspecified, or the items of a test conform to a mix of multiple DCMs. Clustering methods can also be an option when fitting advanced and complex DCMs encounters computational difficulties. These can range from the use of excessive CPU times to plain computational infeasibility. However, the propositions of the ACTCD have only been proven for the Deterministic Input Noisy Output "AND" gate (DINA) model and the Deterministic Input Noisy Output "OR" gate (DINO) model. For other DCMs, there does not exist a theoretical justification to use clustering for assigning examinees to proficiency classes. But if clustering is to be used legitimately, then the ACTCD must cover a larger number of DCMs than just the DINA model and the DINO model. Thus, the purpose of this article is to prove the theoretical propositions of the ACTCD for two other important DCMs, the Reduced Reparameterized Unified Model and the General Diagnostic Model.
NASA Astrophysics Data System (ADS)
Closser, Kristina Danielle
This thesis presents new developments in excited state electronic structure theory. Contrasted with the ground state, the electronically excited states of atoms and molecules often are unstable and have short lifetimes, exhibit a greater diversity of character and are generally less well understood. The very unusual excited states of helium clusters motivated much of this work. These clusters consist of large numbers of atoms (experimentally 103--109 atoms) and bands of nearly degenerate excited states. For an isolated atom the lowest energy excitation energies are from 1s → 2s and 1s → 2 p transitions, and in clusters describing the lowest energy band minimally requires four states per atom. In the ground state the clusters are weakly bound by van der Waals interactions, however in the excited state they can form well-defined covalent bonds. The computational cost of quantum chemical calculations rapidly becomes prohibitive as the size of the systems increase. Standard excited-state methods such as configuration interaction singles (CIS) and time-dependent density functional theory (TD-DFT) can be used with ≈100 atoms, and are optimized to treat only a few states. Thus, one of our primary aims is to develop a method which can treat these large systems with large numbers of nearly degenerate excited states. Additionally, excited states are generally formed far from their equilibrium structures. Vertical excitations from the ground state induce dynamics in the excited states. Thus, another focus of this work is to explore the results of these forces and the fate of the excited states. Very little was known about helium cluster excited states when this work began, thus we first investigated the excitations in small helium clusters consisting of 7 or 25 atoms using CIS. The character of these excited states was determined using attachment/detachment density analysis and we found that in the n = 2 manifold the excitations could generally be interpreted as superpositions of atomic states with surface states appearing close to the atomic excitation energies and interior states being blue shifted by up to ≈2 eV. The dynamics resulting from excitation of He_7 were subsequently explored using ab initio molecular dynamics (AIMD). These simulations were performed with classical adiabatic dynamics coupled to a new state-following algorithm on CIS potential energy surfaces. Most clusters were found to completely dissociate and resulted in a single excited atomic state (90%), however, some trajectories formed bound, He*2 (3%), and a few yielded excited trimers (<0.5%). Comparisons were made with available experimental information on much larger clusters. Various applications of this state following algorithm are also presented. In addition to AIMD, these include excited-state geometry optimization and minimal energy path finding via the growing string method. When using state following we demonstrate that more physical results can be obtained with AIMD calculations. Also, the optimized geometries of three excited states of cytosine, two of which were not found without state following, and the minimal energy path between the lowest two singlet excited states of protonated formaldimine are offered as example applications. Finally, to address large clusters, a local variation of CIS was developed. This method exploits the properties of absolutely localized molecular orbitals (ALMOs) to limit the total number of excitations to scaling only linearly with cluster size, which results in formal scaling with the third power of the system size. The derivation of the equations and design of the algorithm are discussed in detail, and computational timings as well as a pilot application to the size dependence of the helium cluster spectrum are presented.
Unsupervised classification of multivariate geostatistical data: Two algorithms
NASA Astrophysics Data System (ADS)
Romary, Thomas; Ors, Fabien; Rivoirard, Jacques; Deraisme, Jacques
2015-12-01
With the increasing development of remote sensing platforms and the evolution of sampling facilities in mining and oil industry, spatial datasets are becoming increasingly large, inform a growing number of variables and cover wider and wider areas. Therefore, it is often necessary to split the domain of study to account for radically different behaviors of the natural phenomenon over the domain and to simplify the subsequent modeling step. The definition of these areas can be seen as a problem of unsupervised classification, or clustering, where we try to divide the domain into homogeneous domains with respect to the values taken by the variables in hand. The application of classical clustering methods, designed for independent observations, does not ensure the spatial coherence of the resulting classes. Image segmentation methods, based on e.g. Markov random fields, are not adapted to irregularly sampled data. Other existing approaches, based on mixtures of Gaussian random functions estimated via the expectation-maximization algorithm, are limited to reasonable sample sizes and a small number of variables. In this work, we propose two algorithms based on adaptations of classical algorithms to multivariate geostatistical data. Both algorithms are model free and can handle large volumes of multivariate, irregularly spaced data. The first one proceeds by agglomerative hierarchical clustering. The spatial coherence is ensured by a proximity condition imposed for two clusters to merge. This proximity condition relies on a graph organizing the data in the coordinates space. The hierarchical algorithm can then be seen as a graph-partitioning algorithm. Following this interpretation, a spatial version of the spectral clustering algorithm is also proposed. The performances of both algorithms are assessed on toy examples and a mining dataset.
A Decentralized Eigenvalue Computation Method for Spectrum Sensing Based on Average Consensus
NASA Astrophysics Data System (ADS)
Mohammadi, Jafar; Limmer, Steffen; Stańczak, Sławomir
2016-07-01
This paper considers eigenvalue estimation for the decentralized inference problem for spectrum sensing. We propose a decentralized eigenvalue computation algorithm based on the power method, which is referred to as generalized power method GPM; it is capable of estimating the eigenvalues of a given covariance matrix under certain conditions. Furthermore, we have developed a decentralized implementation of GPM by splitting the iterative operations into local and global computation tasks. The global tasks require data exchange to be performed among the nodes. For this task, we apply an average consensus algorithm to efficiently perform the global computations. As a special case, we consider a structured graph that is a tree with clusters of nodes at its leaves. For an accelerated distributed implementation, we propose to use computation over multiple access channel (CoMAC) as a building block of the algorithm. Numerical simulations are provided to illustrate the performance of the two algorithms.
Probabilistic cosmological mass mapping from weak lensing shear
Schneider, M. D.; Ng, K. Y.; Dawson, W. A.; ...
2017-04-10
Here, we infer gravitational lensing shear and convergence fields from galaxy ellipticity catalogs under a spatial process prior for the lensing potential. We demonstrate the performance of our algorithm with simulated Gaussian-distributed cosmological lensing shear maps and a reconstruction of the mass distribution of the merging galaxy cluster Abell 781 using galaxy ellipticities measured with the Deep Lens Survey. Given interim posterior samples of lensing shear or convergence fields on the sky, we describe an algorithm to infer cosmological parameters via lens field marginalization. In the most general formulation of our algorithm we make no assumptions about weak shear ormore » Gaussian-distributed shape noise or shears. Because we require solutions and matrix determinants of a linear system of dimension that scales with the number of galaxies, we expect our algorithm to require parallel high-performance computing resources for application to ongoing wide field lensing surveys.« less
Probabilistic Cosmological Mass Mapping from Weak Lensing Shear
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schneider, M. D.; Dawson, W. A.; Ng, K. Y.
2017-04-10
We infer gravitational lensing shear and convergence fields from galaxy ellipticity catalogs under a spatial process prior for the lensing potential. We demonstrate the performance of our algorithm with simulated Gaussian-distributed cosmological lensing shear maps and a reconstruction of the mass distribution of the merging galaxy cluster Abell 781 using galaxy ellipticities measured with the Deep Lens Survey. Given interim posterior samples of lensing shear or convergence fields on the sky, we describe an algorithm to infer cosmological parameters via lens field marginalization. In the most general formulation of our algorithm we make no assumptions about weak shear or Gaussian-distributedmore » shape noise or shears. Because we require solutions and matrix determinants of a linear system of dimension that scales with the number of galaxies, we expect our algorithm to require parallel high-performance computing resources for application to ongoing wide field lensing surveys.« less
Unsupervised feature relevance analysis applied to improve ECG heartbeat clustering.
Rodríguez-Sotelo, J L; Peluffo-Ordoñez, D; Cuesta-Frau, D; Castellanos-Domínguez, G
2012-10-01
The computer-assisted analysis of biomedical records has become an essential tool in clinical settings. However, current devices provide a growing amount of data that often exceeds the processing capacity of normal computers. As this amount of information rises, new demands for more efficient data extracting methods appear. This paper addresses the task of data mining in physiological records using a feature selection scheme. An unsupervised method based on relevance analysis is described. This scheme uses a least-squares optimization of the input feature matrix in a single iteration. The output of the algorithm is a feature weighting vector. The performance of the method was assessed using a heartbeat clustering test on real ECG records. The quantitative cluster validity measures yielded a correctly classified heartbeat rate of 98.69% (specificity), 85.88% (sensitivity) and 95.04% (general clustering performance), which is even higher than the performance achieved by other similar ECG clustering studies. The number of features was reduced on average from 100 to 18, and the temporal cost was a 43% lower than in previous ECG clustering schemes. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
NASA Technical Reports Server (NTRS)
Jain, Abhinandan
2011-01-01
Ndarts software provides algorithms for computing quantities associated with the dynamics of articulated, rigid-link, multibody systems. It is designed as a general-purpose dynamics library that can be used for the modeling of robotic platforms, space vehicles, molecular dynamics, and other such applications. The architecture and algorithms in Ndarts are based on the Spatial Operator Algebra (SOA) theory for computational multibody and robot dynamics developed at JPL. It uses minimal, internal coordinate models. The algorithms are low-order, recursive scatter/ gather algorithms. In comparison with the earlier Darts++ software, this version has a more general and cleaner design needed to support a larger class of computational dynamics needs. It includes a frames infrastructure, allows algorithms to operate on subgraphs of the system, and implements lazy and deferred computation for better efficiency. Dynamics modeling modules such as Ndarts are core building blocks of control and simulation software for space, robotic, mechanism, bio-molecular, and material systems modeling.
Pandini, Alessandro; Fraccalvieri, Domenico; Bonati, Laura
2013-01-01
The biological function of proteins is strictly related to their molecular flexibility and dynamics: enzymatic activity, protein-protein interactions, ligand binding and allosteric regulation are important mechanisms involving protein motions. Computational approaches, such as Molecular Dynamics (MD) simulations, are now routinely used to study the intrinsic dynamics of target proteins as well as to complement molecular docking approaches. These methods have also successfully supported the process of rational design and discovery of new drugs. Identification of functionally relevant conformations is a key step in these studies. This is generally done by cluster analysis of the ensemble of structures in the MD trajectory. Recently Artificial Neural Network (ANN) approaches, in particular methods based on Self-Organising Maps (SOMs), have been reported performing more accurately and providing more consistent results than traditional clustering algorithms in various data-mining problems. In the specific case of conformational analysis, SOMs have been successfully used to compare multiple ensembles of protein conformations demonstrating a potential in efficiently detecting the dynamic signatures central to biological function. Moreover, examples of the use of SOMs to address problems relevant to other stages of the drug-design process, including clustering of docking poses, have been reported. In this contribution we review recent applications of ANN algorithms in analysing conformational and structural ensembles and we discuss their potential in computer-based approaches for medicinal chemistry.
Competitive learning with pairwise constraints.
Covões, Thiago F; Hruschka, Eduardo R; Ghosh, Joydeep
2013-01-01
Constrained clustering has been an active research topic since the last decade. Most studies focus on batch-mode algorithms. This brief introduces two algorithms for on-line constrained learning, named on-line linear constrained vector quantization error (O-LCVQE) and constrained rival penalized competitive learning (C-RPCL). The former is a variant of the LCVQE algorithm for on-line settings, whereas the latter is an adaptation of the (on-line) RPCL algorithm to deal with constrained clustering. The accuracy results--in terms of the normalized mutual information (NMI)--from experiments with nine datasets show that the partitions induced by O-LCVQE are competitive with those found by the (batch-mode) LCVQE. Compared with this formidable baseline algorithm, it is surprising that C-RPCL can provide better partitions (in terms of the NMI) for most of the datasets. Also, experiments on a large dataset show that on-line algorithms for constrained clustering can significantly reduce the computational time.
An Integrated Intrusion Detection Model of Cluster-Based Wireless Sensor Network
Sun, Xuemei; Yan, Bo; Zhang, Xinzhong; Rong, Chuitian
2015-01-01
Considering wireless sensor network characteristics, this paper combines anomaly and mis-use detection and proposes an integrated detection model of cluster-based wireless sensor network, aiming at enhancing detection rate and reducing false rate. Adaboost algorithm with hierarchical structures is used for anomaly detection of sensor nodes, cluster-head nodes and Sink nodes. Cultural-Algorithm and Artificial-Fish–Swarm-Algorithm optimized Back Propagation is applied to mis-use detection of Sink node. Plenty of simulation demonstrates that this integrated model has a strong performance of intrusion detection. PMID:26447696
An Integrated Intrusion Detection Model of Cluster-Based Wireless Sensor Network.
Sun, Xuemei; Yan, Bo; Zhang, Xinzhong; Rong, Chuitian
2015-01-01
Considering wireless sensor network characteristics, this paper combines anomaly and mis-use detection and proposes an integrated detection model of cluster-based wireless sensor network, aiming at enhancing detection rate and reducing false rate. Adaboost algorithm with hierarchical structures is used for anomaly detection of sensor nodes, cluster-head nodes and Sink nodes. Cultural-Algorithm and Artificial-Fish-Swarm-Algorithm optimized Back Propagation is applied to mis-use detection of Sink node. Plenty of simulation demonstrates that this integrated model has a strong performance of intrusion detection.
Automated extraction and classification of time-frequency contours in humpback vocalizations.
Ou, Hui; Au, Whitlow W L; Zurk, Lisa M; Lammers, Marc O
2013-01-01
A time-frequency contour extraction and classification algorithm was created to analyze humpback whale vocalizations. The algorithm automatically extracted contours of whale vocalization units by searching for gray-level discontinuities in the spectrogram images. The unit-to-unit similarity was quantified by cross-correlating the contour lines. A library of distinctive humpback units was then generated by applying an unsupervised, cluster-based learning algorithm. The purpose of this study was to provide a fast and automated feature selection tool to describe the vocal signatures of animal groups. This approach could benefit a variety of applications such as species description, identification, and evolution of song structures. The algorithm was tested on humpback whale song data recorded at various locations in Hawaii from 2002 to 2003. Results presented in this paper showed low probability of false alarm (0%-4%) under noisy environments with small boat vessels and snapping shrimp. The classification algorithm was tested on a controlled set of 30 units forming six unit types, and all the units were correctly classified. In a case study on humpback data collected in the Auau Chanel, Hawaii, in 2002, the algorithm extracted 951 units, which were classified into 12 distinctive types.
Interactive visual exploration and refinement of cluster assignments.
Kern, Michael; Lex, Alexander; Gehlenborg, Nils; Johnson, Chris R
2017-09-12
With ever-increasing amounts of data produced in biology research, scientists are in need of efficient data analysis methods. Cluster analysis, combined with visualization of the results, is one such method that can be used to make sense of large data volumes. At the same time, cluster analysis is known to be imperfect and depends on the choice of algorithms, parameters, and distance measures. Most clustering algorithms don't properly account for ambiguity in the source data, as records are often assigned to discrete clusters, even if an assignment is unclear. While there are metrics and visualization techniques that allow analysts to compare clusterings or to judge cluster quality, there is no comprehensive method that allows analysts to evaluate, compare, and refine cluster assignments based on the source data, derived scores, and contextual data. In this paper, we introduce a method that explicitly visualizes the quality of cluster assignments, allows comparisons of clustering results and enables analysts to manually curate and refine cluster assignments. Our methods are applicable to matrix data clustered with partitional, hierarchical, and fuzzy clustering algorithms. Furthermore, we enable analysts to explore clustering results in context of other data, for example, to observe whether a clustering of genomic data results in a meaningful differentiation in phenotypes. Our methods are integrated into Caleydo StratomeX, a popular, web-based, disease subtype analysis tool. We show in a usage scenario that our approach can reveal ambiguities in cluster assignments and produce improved clusterings that better differentiate genotypes and phenotypes.
MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering
Kim, Eun-Youn; Kim, Seon-Young; Ashlock, Daniel; Nam, Dougu
2009-01-01
Background Uncovering subtypes of disease from microarray samples has important clinical implications such as survival time and sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on clusters with compact shapes and do not reflect the geometric complexity of the high dimensional microarray clusters, which limits their performance. Results We present a cluster-number-based ensemble clustering algorithm, called MULTI-K, for microarray sample classification, which demonstrates remarkable accuracy. The method amalgamates multiple k-means runs by varying the number of clusters and identifies clusters that manifest the most robust co-memberships of elements. In addition to the original algorithm, we newly devised the entropy-plot to control the separation of singletons or small clusters. MULTI-K, unlike the simple k-means or other widely used methods, was able to capture clusters with complex and high-dimensional structures accurately. MULTI-K outperformed other methods including a recently developed ensemble clustering algorithm in tests with five simulated and eight real gene-expression data sets. Conclusion The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering applied to the number of clusters tackles the problem very well. The C++ code and the data sets tested are available from the authors. PMID:19698124