Performance analysis of clustering techniques over microarray data: A case study
NASA Astrophysics Data System (ADS)
Dash, Rasmita; Misra, Bijan Bihari
2018-03-01
Handling big data is one of the major issues in statistical data analysis, and cluster analysis plays a vital role in dealing with such large-scale data. Many clustering techniques exist, each with a different approach to cluster analysis, but it is difficult to predict which approach suits a particular dataset. To address this problem, a grading approach is introduced over many clustering techniques to identify a stable technique. Because the grading depends on the characteristics of the dataset as well as on the validity indices, a two-stage grading approach is implemented. In this study the grading approach is applied to five clustering techniques: hybrid swarm based clustering (HSC), k-means, partitioning around medoids (PAM), vector quantization (VQ) and agglomerative nesting (AGNES). The experiments are conducted over five microarray datasets with seven validity indices. The finding of the grading approach that a clustering technique is significant is also confirmed by the Nemenyi post-hoc hypothesis test.
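As a rough illustration of the kind of comparison such a grading builds on, the sketch below scores a few scikit-learn clustering algorithms on one synthetic dataset with three internal validity indices. It is only a stand-in under assumptions: the paper's HSC, PAM and VQ implementations, its seven indices, the microarray data and the two-stage grading itself are not reproduced.

```python
# Minimal sketch: score several clustering algorithms with internal validity
# indices. Dataset, algorithms and index set are assumptions, not the paper's.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, n_features=50, random_state=0)

algorithms = {
    "k-means": KMeans(n_clusters=4, n_init=10, random_state=0),
    "AGNES (average link)": AgglomerativeClustering(n_clusters=4, linkage="average"),
    "AGNES (ward)": AgglomerativeClustering(n_clusters=4, linkage="ward"),
}

for name, algo in algorithms.items():
    labels = algo.fit_predict(X)
    print(name,
          "silhouette=%.3f" % silhouette_score(X, labels),
          "CH=%.1f" % calinski_harabasz_score(X, labels),
          "DB=%.3f" % davies_bouldin_score(X, labels))

# With such scores collected over several datasets, a Friedman test and then a
# Nemenyi post-hoc comparison (e.g. via the scikit-posthocs package) can check
# whether one technique ranks consistently better than the others.
```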
Statistical Significance for Hierarchical Clustering
Kimes, Patrick K.; Liu, Yufeng; Hayes, D. Neil; Marron, J. S.
2017-01-01
Summary Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this paper, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets. PMID:28099990
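To give a feel for Monte Carlo significance testing of clustering, the toy sketch below tests a single two-way split by comparing an observed cluster index against the same index computed on Gaussian null data. It is a simplified illustration under assumed data and index choices, not the paper's sequential, family-wise-error-controlling procedure for a full dendrogram.

```python
# Toy Monte Carlo test of one split: is the observed 2-cluster index stronger
# than expected under a single Gaussian null? Not the paper's full procedure.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

def two_cluster_index(X):
    """Ratio of within-cluster to total sum of squares for a 2-way Ward cut."""
    labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
    total = ((X - X.mean(axis=0)) ** 2).sum()
    within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                 for k in np.unique(labels))
    return within / total          # small values indicate strong clustering

X = np.vstack([rng.normal(0, 1, (40, 10)), rng.normal(3, 1, (40, 10))])
observed = two_cluster_index(X)

# Null model: Gaussian data with the observed mean and covariance.
mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
null = [two_cluster_index(rng.multivariate_normal(mean, cov, size=len(X)))
        for _ in range(200)]
p = (1 + sum(n <= observed for n in null)) / (1 + len(null))
print("observed index %.3f, Monte Carlo p = %.3f" % (observed, p))
```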
2017-01-30
...dynamic structural time-history response analysis of flexible approach walls founded on clustered pile groups using Impact_Deck. In Preparation, ERDC/ITL TR-16-X. Vicksburg, MS. ...research (Ebeling et al. 2012) has developed simplified analysis procedures for flexible approach wall systems founded on clustered groups of vertical...
Two-Way Regularized Fuzzy Clustering of Multiple Correspondence Analysis.
Kim, Sunmee; Choi, Ji Yeh; Hwang, Heungsun
2017-01-01
Multiple correspondence analysis (MCA) is a useful tool for investigating the interrelationships among dummy-coded categorical variables. MCA has been combined with clustering methods to examine whether there exist heterogeneous subclusters of a population, which exhibit cluster-level heterogeneity. These combined approaches aim to classify either observations only (one-way clustering of MCA) or both observations and variable categories (two-way clustering of MCA). The latter approach is favored because its solutions are easier to interpret by providing explicitly which subgroup of observations is associated with which subset of variable categories. Nonetheless, the two-way approach has been built on hard classification that assumes observations and/or variable categories to belong to only one cluster. To relax this assumption, we propose two-way fuzzy clustering of MCA. Specifically, we combine MCA with fuzzy k-means simultaneously to classify a subgroup of observations and a subset of variable categories into a common cluster, while allowing both observations and variable categories to belong partially to multiple clusters. Importantly, we adopt regularized fuzzy k-means, thereby enabling us to decide the degree of fuzziness in cluster memberships automatically. We evaluate the performance of the proposed approach through the analysis of simulated and real data, in comparison with existing two-way clustering approaches.
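For background on the fuzzy k-means component, the sketch below implements the standard fuzzy c-means updates with a fixed fuzzifier m on plain numeric data; the paper's regularized variant, which tunes the degree of fuzziness automatically and operates jointly with MCA on variable categories, is not reproduced here.

```python
# Plain fuzzy c-means as background only; the fuzzifier m is fixed, whereas the
# paper uses a regularized form that chooses the fuzziness automatically.
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per row
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))     # standard membership update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(mu, 0.5, (50, 2)) for mu in (0.0, 3.0, 6.0)])
centers, U = fuzzy_c_means(X)
print(np.round(U[:3], 2))   # soft memberships of the first few observations
```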
Clustering analysis strategies for electron energy loss spectroscopy (EELS).
Torruella, Pau; Estrader, Marta; López-Ortega, Alberto; Baró, Maria Dolors; Varela, Maria; Peiró, Francesca; Estradé, Sònia
2018-02-01
In this work, the use of cluster analysis algorithms, widely applied in the field of big data, is proposed to explore and analyze electron energy loss spectroscopy (EELS) data sets. Three different data clustering approaches have been tested both with simulated and experimental data from Fe3O4/Mn3O4 core/shell nanoparticles. The first method consists of applying data clustering directly to the acquired spectra. A second approach is to analyze spectral variance with principal component analysis (PCA) within a given data cluster. Lastly, data clustering on PCA score maps is discussed. The advantages and requirements of each approach are studied. Results demonstrate how clustering is able to recover compositional and oxidation state information from EELS data with minimal user input, giving great prospects for its usage in EEL spectroscopy. Copyright © 2017 Elsevier B.V. All rights reserved.
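A minimal sketch of two of the three strategies is shown below: clustering raw spectra directly versus clustering PCA scores, on a synthetic two-phase "spectrum image" that is purely an assumption for illustration.

```python
# Two of the three strategies on synthetic "spectra": k-means on the raw
# spectra vs. k-means on PCA scores. The synthetic core/shell spectra are
# placeholders, not simulated or experimental EELS data from the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
energy = np.linspace(0, 1, 200)
core = np.exp(-(energy - 0.3) ** 2 / 0.005)     # "core" reference spectrum
shell = np.exp(-(energy - 0.6) ** 2 / 0.005)    # "shell" reference spectrum
labels_true = rng.integers(0, 2, size=1000)
spectra = np.where(labels_true[:, None] == 0, core, shell)
spectra = spectra + rng.normal(0, 0.2, spectra.shape)   # add noise

# Strategy 1: cluster the raw spectra directly.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(spectra)

# Strategy 3: cluster PCA score maps (a few components keep the chemical contrast).
scores = PCA(n_components=3).fit_transform(spectra)
pca_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)

print("agreement (ARI) between the two strategies:",
      round(adjusted_rand_score(raw_labels, pca_labels), 3))
```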
Mixture modelling for cluster analysis.
McLachlan, G J; Chang, S U
2004-10-01
Cluster analysis via a finite mixture model approach is considered. With this approach to clustering, the data can be partitioned into a specified number of clusters g by first fitting a mixture model with g components. An outright clustering of the data is then obtained by assigning an observation to the component to which it has the highest estimated posterior probability of belonging; that is, the ith cluster consists of those observations assigned to the ith component (i = 1,..., g). The focus is on the use of mixtures of normal components for the cluster analysis of data that can be regarded as being continuous. But attention is also given to the case of mixed data, where the observations consist of both continuous and discrete variables.
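The sketch below illustrates the basic workflow described above for continuous data: fit a g-component normal mixture and assign each observation to the component with the highest estimated posterior probability. The data and g are assumptions; the mixed continuous/discrete case is not shown.

```python
# Mixture-model clustering with normal components: fit a g-component Gaussian
# mixture, then take the maximum-posterior component as the outright clustering.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
g = 3
gm = GaussianMixture(n_components=g, covariance_type="full",
                     random_state=0).fit(X)
posterior = gm.predict_proba(X)        # n x g matrix of posterior probabilities
labels = posterior.argmax(axis=1)      # ith cluster = observations assigned to component i
print("cluster sizes:", np.bincount(labels), " BIC:", round(gm.bic(X), 1))
```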
An effective fuzzy kernel clustering analysis approach for gene expression data.
Sun, Lin; Xu, Jiucheng; Yin, Jiaojiao
2015-01-01
Fuzzy clustering is an important tool for analyzing microarray data. A major problem in applying fuzzy clustering methods to microarray gene expression data is the choice of parameters, namely the cluster number and centers. This paper proposes a new approach to fuzzy kernel clustering analysis (FKCA) that identifies the desired cluster number and obtains more stable results for gene expression data. First, to optimize characteristic differences and estimate the optimal cluster number, a Gaussian kernel function is introduced to improve the spectrum analysis method (SAM). By combining subtractive clustering with the max-min distance mean, a maximum distance method (MDM) is proposed to determine cluster centers. Then, the corresponding steps of the improved SAM (ISAM) and MDM are given, and their superiority and stability are illustrated through experimental comparisons on gene expression data. Finally, by introducing ISAM and MDM into FKCA, an effective improved FKCA algorithm is proposed. Experimental results from public gene expression data and the UCI database show that the proposed algorithms are feasible for cluster analysis, and that their clustering accuracy is higher than that of other related clustering algorithms.
Cluster Correspondence Analysis.
van de Velden, M; D'Enza, A Iodice; Palumbo, F
2017-03-01
A method is proposed that combines dimension reduction and cluster analysis for categorical data by simultaneously assigning individuals to clusters and optimal scaling values to categories in such a way that a single between variance maximization objective is achieved. In a unified framework, a brief review of alternative methods is provided and we show that the proposed method is equivalent to GROUPALS applied to categorical data. Performance of the methods is appraised by means of a simulation study. The results of the joint dimension reduction and clustering methods are compared with the so-called tandem approach, a sequential analysis of dimension reduction followed by cluster analysis. The tandem approach is conjectured to perform worse when variables are added that are unrelated to the cluster structure. Our simulation study confirms this conjecture. Moreover, the results of the simulation study indicate that the proposed method also consistently outperforms alternative joint dimension reduction and clustering methods.
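For orientation, the sketch below implements the tandem baseline the paper compares against: dummy-code the categorical variables, reduce dimension with a simple SVD step standing in for MCA, then run k-means on the scores. The joint cluster correspondence analysis, which optimizes both steps simultaneously, is not reproduced, and the random data are an assumption.

```python
# Tandem baseline only: dummy coding -> SVD (a crude MCA stand-in) -> k-means.
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "colour": rng.choice(["red", "green", "blue"], 200),
    "shape":  rng.choice(["round", "square"], 200),
    "size":   rng.choice(["small", "medium", "large"], 200),
})
Z = pd.get_dummies(df).to_numpy(dtype=float)   # indicator (dummy) matrix
Z = Z - Z.mean(axis=0)                         # centring; full MCA would also
                                               # weight by category frequencies
scores = TruncatedSVD(n_components=2, random_state=0).fit_transform(Z)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print(np.bincount(labels))
```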
A formal concept analysis approach to consensus clustering of multi-experiment expression data
2014-01-01
Background Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. An approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution which increases the confidence in the common features of all the datasets and reveals the important differences among them. Results We propose a novel generic consensus clustering technique that applies a Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group, resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA, which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach, two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from a multi-experiment study examining the global cell-cycle control of fission yeast. The FCA results derived from both methods demonstrate that, although both algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological signals. Conclusions The proposed FCA-enhanced consensus clustering technique is a general approach to the combination of clustering algorithms with FCA for deriving clustering solutions from multiple gene expression matrices. The experimental results presented herein demonstrate that it is a robust data integration technique able to produce a good quality clustering solution that is representative for the whole set of expression matrices. PMID:24885407
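As a generic illustration of the consensus step (not the FCA-based consolidation proposed in the paper), the sketch below pools several clustering solutions into a co-association matrix and re-clusters it; the synthetic data and the choice of base clusterer are assumptions.

```python
# Generic consensus clustering via a co-association matrix: the fraction of
# runs in which two items share a cluster becomes a similarity, which is then
# re-clustered. The paper's FCA-based analysis is not reproduced here.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=200, centers=3, n_features=20, random_state=1)
runs = [KMeans(n_clusters=3, n_init=5, random_state=s).fit_predict(X)
        for s in range(10)]                       # clustering solutions to combine

n = len(X)
coassoc = np.zeros((n, n))
for labels in runs:
    coassoc += (labels[:, None] == labels[None, :])
coassoc /= len(runs)

# Older scikit-learn versions use affinity="precomputed" instead of metric=.
consensus = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(1.0 - coassoc)                      # distance = 1 - co-association
print(np.bincount(consensus))
```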
ERIC Educational Resources Information Center
Gartstein, Maria A.; Prokasky, Amanda; Bell, Martha Ann; Calkins, Susan; Bridgett, David J.; Braungart-Rieker, Julia; Leerkes, Esther; Cheatham, Carol L.; Eiden, Rina D.; Mize, Krystal D.; Jones, Nancy Aaron; Mireault, Gina; Seamon, Erich
2017-01-01
There is renewed interest in person-centered approaches to understanding the structure of temperament. However, questions concerning temperament types are not frequently framed in a developmental context, especially during infancy. In addition, the most common person-centered techniques, cluster analysis (CA) and latent profile analysis (LPA),…
Missing continuous outcomes under covariate dependent missingness in cluster randomised trials
Diaz-Ordaz, Karla; Bartlett, Jonathan W
2016-01-01
Attrition is a common occurrence in cluster randomised trials, which leads to missing outcome data. Two approaches for analysing such trials are cluster-level analysis and individual-level analysis. This paper compares the performance of unadjusted cluster-level analysis, baseline covariate adjusted cluster-level analysis and linear mixed model analysis, under baseline covariate dependent missingness in continuous outcomes, in terms of bias, average estimated standard error and coverage probability. The methods of complete records analysis and multiple imputation are used to handle the missing outcome data. We considered four scenarios, with the missingness mechanism and baseline covariate effect on outcome either the same or different between intervention groups. We show that both unadjusted cluster-level analysis and baseline covariate adjusted cluster-level analysis give unbiased estimates of the intervention effect only if both intervention groups have the same missingness mechanisms and there is no interaction between baseline covariate and intervention group. Linear mixed model and multiple imputation give unbiased estimates under all four considered scenarios, provided that an interaction of intervention and baseline covariate is included in the model when appropriate. Cluster mean imputation has been proposed as a valid approach for handling missing outcomes in cluster randomised trials. We show that cluster mean imputation only gives unbiased estimates when the missingness mechanism is the same between the intervention groups and there is no interaction between baseline covariate and intervention group. Multiple imputation shows overcoverage for a small number of clusters in each intervention group. PMID:27177885
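A hedged sketch of the two analysis routes on simulated trial data with covariate-dependent missingness is shown below: an unadjusted cluster-level analysis of complete records and a linear mixed model with a random cluster intercept. The simulated effect sizes, missingness rates and model formula are assumptions; multiple imputation and cluster mean imputation are not shown.

```python
# Two analysis routes for a cluster randomised trial with missing outcomes:
# (1) unadjusted cluster-level analysis on complete records; (2) covariate-
# adjusted linear mixed model with a random cluster intercept.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for c in range(20):                       # 20 clusters, 30 individuals each
    arm = c % 2
    u = rng.normal(0, 0.5)                # cluster random effect
    for _ in range(30):
        x = rng.normal()                  # baseline covariate
        y = 1.0 * arm + 0.8 * x + u + rng.normal()
        missing = rng.random() < 0.2 + 0.1 * (x > 0)   # covariate-dependent missingness
        rows.append((c, arm, x, np.nan if missing else y))
df = pd.DataFrame(rows, columns=["cluster", "arm", "x", "y"])
cc = df.dropna()                          # complete records

# Unadjusted cluster-level analysis: t-test on cluster means of the outcome.
means = cc.groupby(["cluster", "arm"])["y"].mean().reset_index()
t, p = stats.ttest_ind(means.loc[means.arm == 1, "y"],
                       means.loc[means.arm == 0, "y"])
print("cluster-level analysis: t=%.2f, p=%.3f" % (t, p))

# Individual-level analysis: mixed model adjusted for the baseline covariate
# (an arm:x interaction would be added when appropriate).
mm = smf.mixedlm("y ~ arm + x", data=cc, groups=cc["cluster"]).fit()
print("mixed model: arm effect %.2f (SE %.2f)" % (mm.params["arm"], mm.bse["arm"]))
```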
Simultaneous Two-Way Clustering of Multiple Correspondence Analysis
ERIC Educational Resources Information Center
Hwang, Heungsun; Dillon, William R.
2010-01-01
A 2-way clustering approach to multiple correspondence analysis is proposed to account for cluster-level heterogeneity of both respondents and variable categories in multivariate categorical data. Specifically, in the proposed method, multiple correspondence analysis is combined with k-means in a unified framework in which "k"-means is…
Phenotypes Determined by Cluster Analysis in Moderate to Severe Bronchial Asthma.
Youroukova, Vania M; Dimitrova, Denitsa G; Valerieva, Anna D; Lesichkova, Spaska S; Velikova, Tsvetelina V; Ivanova-Todorova, Ekaterina I; Tumangelova-Yuzeir, Kalina D
2017-06-01
Bronchial asthma is a heterogeneous disease that includes various subtypes. They may share similar clinical characteristics, but probably have different pathological mechanisms. The aim of this study was to identify phenotypes using cluster analysis in moderate to severe bronchial asthma and to compare differences in clinical, physiological, immunological and inflammatory data between the clusters. Forty adult patients with moderate to severe bronchial asthma out of exacerbation were included. All underwent clinical assessment, anthropometric measurements, skin prick testing, standard spirometry and measurement of the fraction of exhaled nitric oxide. Blood eosinophil count, serum total IgE and periostin levels were determined. A two-step cluster approach, a hierarchical clustering method and k-means analysis were used for identification of the clusters. We have identified four clusters. Cluster 1 (n=14) - late-onset, non-atopic asthma with impaired lung function, Cluster 2 (n=13) - late-onset, atopic asthma, Cluster 3 (n=6) - late-onset, aspirin-sensitive, eosinophilic asthma, and Cluster 4 (n=7) - early-onset, atopic asthma. Our study is the first in Bulgaria in which cluster analysis is applied to asthmatic patients. We identified four clusters. The variables with the greatest force for differentiation in our study were: age of asthma onset, duration of disease, atopy, smoking, blood eosinophils, nonsteroidal anti-inflammatory drug hypersensitivity, baseline FEV1/FVC and symptom severity. Our results support the concept of heterogeneity of bronchial asthma and demonstrate that cluster analysis can be a useful tool for phenotyping of the disease and a personalized approach to the treatment of patients.
High- and low-level hierarchical classification algorithm based on source separation process
NASA Astrophysics Data System (ADS)
Loghmari, Mohamed Anis; Karray, Emna; Naceur, Mohamed Saber
2016-10-01
High-dimensional data applications have earned great attention in recent years. We focus on remote sensing data analysis on high-dimensional space like hyperspectral data. From a methodological viewpoint, remote sensing data analysis is not a trivial task. Its complexity is caused by many factors, such as large spectral or spatial variability as well as the curse of dimensionality. The latter describes the problem of data sparseness. In this particular ill-posed problem, a reliable classification approach requires appropriate modeling of the classification process. The proposed approach is based on a hierarchical clustering algorithm in order to deal with remote sensing data in high-dimensional space. Indeed, one obvious method to perform dimensionality reduction is to use the independent component analysis process as a preprocessing step. The first particularity of our method is the special structure of its cluster tree. Most of the hierarchical algorithms associate leaves with individual clusters, and start from a large number of individual classes equal to the number of pixels; however, in our approach, leaves are associated with the most relevant sources, which are represented according to mutually independent axes to specifically represent some land covers associated with a limited number of clusters. These sources contribute to the refinement of the clustering by providing complementary rather than redundant information. The second particularity of our approach is that at each level of the cluster tree, we combine both a high-level divisive clustering and a low-level agglomerative clustering. This approach reduces the computational cost since the high-level divisive clustering is controlled by a simple Boolean operator, and optimizes the clustering results since the low-level agglomerative clustering is guided by the most relevant independent sources. Then at each new step we obtain a new finer partition that will participate in the clustering process to enhance semantic capabilities and give good identification rates.
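The sketch below only illustrates the preprocessing idea mentioned above: extract mutually independent sources with ICA and then cluster in the reduced source space. The synthetic "pixels x bands" matrix stands in for a hyperspectral image, and the paper's combined divisive/agglomerative cluster tree is not reproduced.

```python
# ICA as dimensionality reduction before hierarchical clustering, on synthetic
# hyperspectral-like data. Source count, noise level and cluster count are
# assumptions for illustration only.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
n_pixels, n_bands, n_sources = 1000, 120, 4
mixing = rng.normal(size=(n_sources, n_bands))          # endmember-like spectra
abundances = rng.dirichlet(np.ones(n_sources) * 0.3, size=n_pixels)
X = abundances @ mixing + rng.normal(0, 0.05, (n_pixels, n_bands))

S = FastICA(n_components=n_sources, random_state=0).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(S)
print(np.bincount(labels))
```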
Boyack, Kevin W.; Newman, David; Duhon, Russell J.; Klavans, Richard; Patek, Michael; Biberstine, Joseph R.; Schijvenaars, Bob; Skupin, André; Ma, Nianli; Börner, Katy
2011-01-01
Background We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. Methodology We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches were comprised of five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models – BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. Conclusions PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts. PMID:21437291
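A small-scale sketch of the tf-idf cosine branch of such a pipeline is given below: build document-document similarities, keep only the top-n per document, and apply average-link clustering. The toy corpus is a placeholder, and the paper's graph-layout step and the other eight similarity approaches are omitted.

```python
# tf-idf cosine similarity -> top-n filtering -> average-link clustering,
# on a tiny placeholder corpus rather than 2.15 million MEDLINE records.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

docs = [
    "gene expression microarray clustering of tumour samples",
    "tumour gene expression profiles and survival analysis",
    "randomised trial of blood pressure medication in adults",
    "hypertension treatment trial outcomes in older adults",
]
sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))

top_n = 2                                      # keep the top-n similarities per document
filtered = np.zeros_like(sim)
for i, row in enumerate(sim):
    keep = np.argsort(row)[::-1][:top_n + 1]   # +1 because self-similarity is 1
    filtered[i, keep] = row[keep]
filtered = np.maximum(filtered, filtered.T)    # keep the matrix symmetric

# Older scikit-learn versions use affinity="precomputed" instead of metric=.
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(1.0 - filtered)
print(labels)
```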
Bayesian network meta-analysis for cluster randomized trials with binary outcomes.
Uhlmann, Lorenz; Jensen, Katrin; Kieser, Meinhard
2017-06-01
Network meta-analysis is becoming a common approach to combine direct and indirect comparisons of several treatment arms. In recent research, there have been various developments and extensions of the standard methodology. Simultaneously, cluster randomized trials are experiencing an increased popularity, especially in the field of health services research, where, for example, medical practices are the units of randomization but the outcome is measured at the patient level. Combination of the results of cluster randomized trials is challenging. In this tutorial, we examine and compare different approaches for the incorporation of cluster randomized trials in a (network) meta-analysis. Furthermore, we provide practical insight on the implementation of the models. In simulation studies, it is shown that some of the examined approaches lead to unsatisfying results. However, there are alternatives which are suitable to combine cluster randomized trials in a network meta-analysis as they are unbiased and reach accurate coverage rates. In conclusion, the methodology can be extended in such a way that an adequate inclusion of the results obtained in cluster randomized trials becomes feasible. Copyright © 2016 John Wiley & Sons, Ltd.
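One simple and commonly used device for feeding a cluster randomized trial into a meta-analysis, shown below purely as background, is to deflate its sample size by the design effect before computing the standard error of the effect; this is not necessarily one of the Bayesian models examined in the tutorial, and all numbers are hypothetical.

```python
# Design-effect adjustment for a cluster randomized trial before pooling:
# effective n = n / (1 + (m - 1) * ICC). All inputs below are hypothetical.
import math

def effective_size(n_individuals, mean_cluster_size, icc):
    design_effect = 1.0 + (mean_cluster_size - 1.0) * icc
    return n_individuals / design_effect

n_eff = effective_size(600, 30, 0.05)      # 600 patients, practices of ~30, ICC 0.05
p_treat, p_ctrl = 0.30, 0.40               # hypothetical event proportions per arm
log_or = math.log((p_treat / (1 - p_treat)) / (p_ctrl / (1 - p_ctrl)))
se = math.sqrt(sum(1.0 / (n_eff * p) for p in
                   (p_treat, 1 - p_treat, p_ctrl, 1 - p_ctrl)))
print("effective n per arm %.0f, log OR %.2f (SE %.2f)" % (n_eff, log_or, se))
```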
Analysis of Tropical Cyclone Tracks in the North Indian Ocean
NASA Astrophysics Data System (ADS)
Patwardhan, A.; Paliwal, M.; Mohapatra, M.
2011-12-01
Cyclones are regarded as one of the most dangerous meteorological phenomena of the tropical region. The probability of landfall of a tropical cyclone depends on its movement (trajectory). Analysis of trajectories of tropical cyclones could be useful for identifying potentially predictable characteristics. There is a long history of analysis of tropical cyclone tracks. A common approach is to use clustering techniques to group cyclone tracks on the basis of certain characteristics. Various clustering methods have been used to study tropical cyclones in different ocean basins such as the western North Pacific Ocean (Elsner and Liu, 2003; Camargo et al., 2007) and the North Atlantic Ocean (Elsner, 2003; Gaffney et al. 2007; Nakamura et al., 2009). In this study, tropical cyclone tracks in the North Indian Ocean basin for the period 1961-2010 have been analyzed and grouped into clusters based on their spatial characteristics. A tropical cyclone trajectory is approximated as an open curve and described by its first two moments. The resulting clusters have different centroid locations and also differently shaped variance ellipses. These track characteristics are then used in standard clustering algorithms, which allows the whole track shape, length, and location to be incorporated into the clustering methodology. The resulting clusters have different genesis locations and trajectory shapes. We have also examined characteristics such as life span, maximum sustained wind speed, landfall and seasonality, many of which are significantly different across the identified clusters. The clustering approach groups cyclones with higher maximum wind speeds and the longest life spans into one cluster. Another cluster includes short duration cyclonic events that are mostly deep depressions and significant for rainfall over Eastern and Central India. The clustering approach is likely to prove useful for analysis of events of significance with regard to impacts.
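The sketch below illustrates the moment-based description: each synthetic track is reduced to its first two mass moments (centroid and position covariance), and these descriptors are fed to k-means. The random-walk tracks and parameter choices are assumptions standing in for best-track data.

```python
# Reduce each track to its first two moments and cluster those descriptors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def synthetic_track(lon0, lat0, drift):
    steps = rng.normal(drift, 0.5, size=(rng.integers(15, 40), 2))
    return np.vstack([[lon0, lat0], [lon0, lat0] + np.cumsum(steps, axis=0)])

tracks = [synthetic_track(88, 12, (-0.4, 0.3)) for _ in range(40)] + \
         [synthetic_track(70, 10, (0.1, 0.5)) for _ in range(40)]

def moments(track):
    centroid = track.mean(axis=0)                 # first moment
    cov = np.cov(track, rowvar=False)             # second (central) moment -> variance ellipse
    return np.r_[centroid, cov[0, 0], cov[1, 1], cov[0, 1]]

features = StandardScaler().fit_transform(np.array([moments(t) for t in tracks]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))
```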
Gorzalczany, Marian B; Rudzinski, Filip
2017-06-07
This paper presents a generalization of self-organizing maps with 1-D neighborhoods (neuron chains) that can be effectively applied to complex cluster analysis problems. The essence of the generalization consists in introducing mechanisms that allow the neuron chain--during learning--to disconnect into subchains, to reconnect some of the subchains again, and to dynamically regulate the overall number of neurons in the system. These features enable the network--working in a fully unsupervised way (i.e., using unlabeled data without a predefined number of clusters)--to automatically generate collections of multiprototypes that are able to represent a broad range of clusters in data sets. First, the operation of the proposed approach is illustrated on some synthetic data sets. Then, this technique is tested using several real-life, complex, and multidimensional benchmark data sets available from the University of California at Irvine (UCI) Machine Learning repository and the Knowledge Extraction based on Evolutionary Learning data set repository. A sensitivity analysis of our approach to changes in control parameters and a comparative analysis with an alternative approach are also performed.
Cluster ensemble based on Random Forests for genetic data.
Alhusain, Luluah; Hafez, Alaaeldin M
2017-01-01
Clustering plays a crucial role in several application domains, such as bioinformatics. In bioinformatics, clustering has been extensively used as an approach for detecting interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have made it possible to obtain genetic datasets of exceptional size. Genetic data usually contain hundreds of thousands of genetic markers genotyped for thousands of individuals, making an efficient means for handling such data desirable. Random Forests (RFs) has emerged as an efficient algorithm capable of handling high-dimensional data. RFs provides a proximity measure that can capture different levels of co-occurring relationships between variables. RFs has been widely considered a supervised learning method, although it can be converted into an unsupervised learning method. Therefore, an RF-derived proximity measure combined with a clustering technique may be well suited for determining the underlying structure of unlabeled data. This paper proposes RFcluE, a cluster ensemble approach for determining the underlying structure of genetic data based on RFs. The approach comprises a cluster ensemble framework to combine multiple runs of RF clustering. Experiments were conducted on a high-dimensional, real genetic dataset to evaluate the proposed approach. The experiments included an examination of the impact of parameter changes, comparing RFcluE performance against other clustering methods, and an assessment of the relationship between the diversity and quality of the ensemble and its effect on RFcluE performance. This paper proposes RFcluE, a cluster ensemble approach based on RF clustering, to address the problem of population structure analysis and demonstrates the effectiveness of the approach. The paper also illustrates that applying a cluster ensemble approach, combining multiple RF clusterings, produces more robust and higher-quality results as a consequence of feeding the ensemble with diverse views of high-dimensional genetic data obtained through bagging and random subspace, the two key features of the RF algorithm.
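The sketch below shows a single unsupervised-RF clustering run of the kind such an ensemble would combine, using the standard real-versus-permuted-data trick to obtain a proximity matrix from shared terminal nodes; the ensemble itself (repeating this with diverse views and combining the runs) and the genetic data are not reproduced.

```python
# One unsupervised Random Forest clustering run: train a forest to separate
# the real data from a column-permuted copy, derive proximities from shared
# leaves, and cluster 1 - proximity. Synthetic data stand in for genotypes.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=200, centers=3, n_features=30, random_state=0)
rng = np.random.default_rng(0)
X_perm = np.column_stack([rng.permutation(col) for col in X.T])  # synthetic class

X_all = np.vstack([X, X_perm])
y_all = np.r_[np.ones(len(X)), np.zeros(len(X_perm))]
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_all, y_all)

leaves = rf.apply(X)                       # terminal node of each sample in each tree
prox = np.zeros((len(X), len(X)))
for t in range(leaves.shape[1]):
    prox += (leaves[:, t][:, None] == leaves[:, t][None, :])
prox /= leaves.shape[1]

# Older scikit-learn versions use affinity="precomputed" instead of metric=.
labels = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(1.0 - prox)
print(np.bincount(labels))
```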
Water quality assessment with hierarchical cluster analysis based on Mahalanobis distance.
Du, Xiangjun; Shao, Fengjing; Wu, Shunyao; Zhang, Hanlin; Xu, Si
2017-07-01
Water quality assessment is crucial for assessment of marine eutrophication, prediction of harmful algal blooms, and environmental protection. Previous studies have developed many numeric modeling methods and data driven approaches for water quality assessment. Cluster analysis, an approach widely used for grouping data, has also been employed. However, there are complex correlations between water quality variables, which play important roles in water quality assessment but have always been overlooked. In this paper, we analyze correlations between water quality variables and propose an alternative method for water quality assessment with hierarchical cluster analysis based on Mahalanobis distance. Further, we cluster water quality data collected from coastal waters of the Bohai Sea and the North Yellow Sea of China, and apply the clustering results to evaluate their water quality. To evaluate the validity, we also cluster the water quality data with cluster analysis based on Euclidean distance, which is widely adopted in previous studies. The results show that our method is more suitable for water quality assessment with many correlated water quality variables. To our knowledge, it is the first attempt to apply Mahalanobis distance for coastal water quality assessment.
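A minimal sketch of the core idea follows: hierarchical clustering on Mahalanobis distances, which accounts for correlations between variables, with the Euclidean version alongside for contrast. Random correlated data stand in for the monitoring measurements.

```python
# Hierarchical (average-linkage) clustering with Mahalanobis vs. Euclidean
# distance on correlated synthetic "water quality" variables.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.3],
                [0.2, 0.3, 1.0]])          # correlated variables
X = rng.multivariate_normal([0, 0, 0], cov, size=100)

VI = np.linalg.inv(np.cov(X, rowvar=False))           # inverse covariance matrix
d_mahal = pdist(X, metric="mahalanobis", VI=VI)
d_eucl = pdist(X, metric="euclidean")

labels_m = fcluster(linkage(d_mahal, method="average"), t=3, criterion="maxclust")
labels_e = fcluster(linkage(d_eucl, method="average"), t=3, criterion="maxclust")
print(np.bincount(labels_m)[1:], np.bincount(labels_e)[1:])
```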
Cluster Analysis to Identify Possible Subgroups in Tinnitus Patients.
van den Berge, Minke J C; Free, Rolien H; Arnold, Rosemarie; de Kleine, Emile; Hofman, Rutger; van Dijk, J Marc C; van Dijk, Pim
2017-01-01
In tinnitus treatment, there is a tendency to shift from a "one size fits all" to a more individual, patient-tailored approach. Insight into the heterogeneity of the tinnitus spectrum might improve the management of tinnitus patients in terms of choice of treatment and identification of patients with severe mental distress. The goal of this study was to identify subgroups in a large group of tinnitus patients. Data were collected from patients with severe tinnitus complaints visiting our tertiary referral tinnitus care group at the University Medical Center Groningen. Patient-reported and physician-reported variables were collected during their visit to our clinic. Cluster analyses were used to characterize subgroups. For the selection of the right variables to enter into the cluster analysis, two approaches were used: (1) variable reduction with principal component analysis and (2) variable selection based on expert opinion. Variables from 1,783 tinnitus patients were included in the analyses. Cluster analysis (1) included 976 patients and resulted in a four-cluster solution. The effect of external influences was the most discriminative between the groups, or clusters, of patients. The "silhouette measure" of the cluster outcome was low (0.2), indicating a "no substantial" cluster structure. Cluster analysis (2) included 761 patients and resulted in a three-cluster solution, comparable to the first analysis. Again, a "no substantial" cluster structure was found (0.2). Two cluster analyses on a large database of tinnitus patients revealed that clusters of patients are distinguished mostly by differences in the effect of external influences on their disease. However, both cluster outcomes based on this dataset showed poor stability, suggesting that our tinnitus population comprises a continuum rather than a number of clearly defined subgroups.
Analysis of cytokine release assay data using machine learning approaches.
Xiong, Feiyu; Janko, Marco; Walker, Mindi; Makropoulos, Dorie; Weinstock, Daniel; Kam, Moshe; Hrebien, Leonid
2014-10-01
The possible onset of Cytokine Release Syndrome (CRS) is an important consideration in the development of monoclonal antibody (mAb) therapeutics. In this study, several machine learning approaches are used to analyze CRS data. The analyzed data come from a human blood in vitro assay which was used to assess the potential of mAb-based therapeutics to produce cytokine release similar to that induced by Anti-CD28 superagonistic (Anti-CD28 SA) mAbs. The data contain 7 mAbs and two negative controls, a total of 423 samples coming from 44 donors. Three (3) machine learning approaches were applied in combination to observations obtained from that assay, namely (i) Hierarchical Cluster Analysis (HCA); (ii) Principal Component Analysis (PCA) followed by K-means clustering; and (iii) Decision Tree Classification (DTC). All three approaches were able to identify the treatment that caused the most severe cytokine response. HCA was able to provide information about the expected number of clusters in the data. PCA coupled with K-means clustering allowed classification of treatments sample by sample, and visualizing clusters of treatments. DTC models showed the relative importance of various cytokines such as IFN-γ, TNF-α and IL-10 to CRS. The use of these approaches in tandem provides better selection of parameters for one method based on outcomes from another, and an overall improved analysis of the data through complementary approaches. Moreover, the DTC analysis showed in addition that IL-17 may be correlated with CRS reactions, although this correlation has not yet been corroborated in the literature. Copyright © 2014 Elsevier B.V. All rights reserved.
Wang, Juan; Nishikawa, Robert M; Yang, Yongyi
2017-04-01
In computerized detection of clustered microcalcifications (MCs) from mammograms, the traditional approach is to apply a pattern detector to locate the presence of individual MCs, which are subsequently grouped into clusters. Such an approach is often susceptible to the occurrence of false positives (FPs) caused by local image patterns that resemble MCs. We investigate the feasibility of a direct detection approach to determining whether an image region contains clustered MCs or not. Toward this goal, we develop a deep convolutional neural network (CNN) as the classifier model to which the input consists of a large image window ([Formula: see text] in size). The multiple layers in the CNN classifier are trained to automatically extract image features relevant to MCs at different spatial scales. In the experiments, we demonstrated this approach on a dataset consisting of both screen-film mammograms and full-field digital mammograms. We evaluated the detection performance both on classifying image regions of clustered MCs using a receiver operating characteristic (ROC) analysis and on detecting clustered MCs from full mammograms by a free-response receiver operating characteristic analysis. For comparison, we also considered a recently developed MC detector with FP suppression. In classifying image regions of clustered MCs, the CNN classifier achieved 0.971 in the area under the ROC curve, compared to 0.944 for the MC detector. In detecting clustered MCs from full mammograms, at 90% sensitivity, the CNN classifier obtained an FP rate of 0.69 clusters/image, compared to 1.17 clusters/image by the MC detector. These results indicate that using global image features can be more effective in discriminating clustered MCs from FPs caused by various sources, such as linear structures, thereby providing a more accurate detection of clustered MCs on mammograms.
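As a rough sketch of the direct, window-level classification idea, the code below defines a small Keras CNN that outputs the probability that a whole image window contains clustered MCs. The window size, depth and filter counts are placeholders and TensorFlow/Keras availability is assumed; this is not the architecture or training setup reported in the paper.

```python
# Minimal CNN that classifies whole image windows (clustered MCs vs. not).
# Window size and architecture are assumptions for illustration only.
import numpy as np
from tensorflow.keras import layers, models

window = 96                                 # assumed window size in pixels
model = models.Sequential([
    layers.Input(shape=(window, window, 1)),
    layers.Conv2D(16, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(window contains clustered MCs)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random arrays stand in for mammogram windows and their labels.
X = np.random.rand(32, window, window, 1).astype("float32")
y = np.random.randint(0, 2, size=32)
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
```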
Perualila-Tan, Nolen Joy; Shkedy, Ziv; Talloen, Willem; Göhlmann, Hinrich W H; Moerbeke, Marijke Van; Kasim, Adetayo
2016-08-01
The modern process of discovering candidate molecules in the early drug discovery phase includes a wide range of approaches to extract vital information from the intersection of biology and chemistry. A typical strategy in compound selection involves compound clustering based on chemical similarity to obtain representative chemically diverse compounds (not incorporating potency information). In this paper, we propose an integrative clustering approach that makes use of both biological (compound efficacy) and chemical (structural features) data sources for the purpose of discovering a subset of compounds with aligned structural and biological properties. The datasets are integrated at the similarity level by assigning complementary weights to produce a weighted similarity matrix, serving as a generic input in any clustering algorithm. This new analysis workflow is a semi-supervised method since, after the determination of clusters, a secondary analysis is performed to find differentially expressed genes associated with the derived integrated cluster(s), further explaining the compound-induced biological effects inside the cell. In this paper, datasets from two drug development oncology projects are used to illustrate the usefulness of the weighted similarity-based clustering approach to integrate multi-source high-dimensional information to aid drug discovery. Compounds that are structurally and biologically similar to the reference compounds are discovered using this proposed integrative approach.
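A sketch of the similarity-level integration is shown below: compute a chemical and a biological similarity matrix, combine them with complementary weights, and cluster the weighted similarity. Random fingerprints and expression profiles stand in for the real compound data, and the downstream gene-level analysis is not shown.

```python
# Weighted combination of a chemical and a biological similarity matrix,
# followed by clustering on the combined similarity.
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
n_compounds = 60
fingerprints = rng.integers(0, 2, size=(n_compounds, 128)).astype(bool)  # structure
bioactivity = rng.normal(size=(n_compounds, 300))                        # expression

sim_chem = 1.0 - pairwise_distances(fingerprints, metric="jaccard")
sim_bio = np.corrcoef(bioactivity)                 # correlation-based similarity

w = 0.5                                            # complementary weights w and 1 - w
sim = w * sim_chem + (1.0 - w) * (sim_bio + 1.0) / 2.0   # rescale sim_bio to [0, 1]

# Older scikit-learn versions use affinity="precomputed" instead of metric=.
labels = AgglomerativeClustering(
    n_clusters=4, metric="precomputed", linkage="average"
).fit_predict(1.0 - sim)
print(np.bincount(labels))
```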
Mapping Informative Clusters in a Hierarchical Framework of fMRI Multivariate Analysis
Xu, Rui; Zhen, Zonglei; Liu, Jia
2010-01-01
Pattern recognition methods, which are powerful in discriminating between multi-voxel patterns of brain activity associated with different mental states, have become increasingly popular in fMRI data analysis. However, when they are used in functional brain mapping, the location of discriminative voxels varies significantly, raising difficulties in interpreting the locus of the effect. Here we propose a hierarchical multivariate framework that maps informative clusters rather than voxels to achieve reliable functional brain mapping without compromising the discriminative power. In particular, we first searched for local homogeneous clusters that consisted of voxels with similar response profiles. Then, a multi-voxel classifier was built for each cluster to extract discriminative information from the multi-voxel patterns. Finally, through multivariate ranking, outputs from the classifiers served as a multi-cluster pattern to identify informative clusters by examining interactions among clusters. Results from both simulated and real fMRI data demonstrated that this hierarchical approach showed better performance in the robustness of functional brain mapping than traditional voxel-based multivariate methods. In addition, the mapped clusters were highly overlapping for two perceptually equivalent object categories, further confirming the validity of our approach. In short, the hierarchical multivariate framework is suitable for both pattern classification and brain mapping in fMRI studies. PMID:21152081
Atlas-guided cluster analysis of large tractography datasets.
Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer
2013-01-01
Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.
Detecting synchronization clusters in multivariate time series via coarse-graining of Markov chains.
Allefeld, Carsten; Bialonski, Stephan
2007-12-01
Synchronization cluster analysis is an approach to the detection of underlying structures in data sets of multivariate time series, starting from a matrix R of bivariate synchronization indices. A previous method utilized the eigenvectors of R for cluster identification, analogous to several recent attempts at group identification using eigenvectors of the correlation matrix. All of these approaches assumed a one-to-one correspondence of dominant eigenvectors and clusters, which has however been shown to be wrong in important cases. We clarify the usefulness of eigenvalue decomposition for synchronization cluster analysis by translating the problem into the language of stochastic processes, and derive an enhanced clustering method harnessing recent insights from the coarse-graining of finite-state Markov processes. We illustrate the operation of our method using a simulated system of coupled Lorenz oscillators, and we demonstrate its superior performance over the previous approach. Finally we investigate the question of robustness of the algorithm against small sample size, which is important with regard to field applications.
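A minimal illustration of the stochastic-process view is sketched below: row-normalise a synchronization matrix R into a transition matrix, take the slowest nontrivial eigenvector, and split channels by its sign. This is only a two-cluster coarse-graining on a made-up block-structured R, not the enhanced method derived in the paper.

```python
# Row-normalise R into a stochastic matrix and coarse-grain by the sign of the
# second (slowest nontrivial) right eigenvector -- a minimal two-cluster split.
import numpy as np

rng = np.random.default_rng(0)
# Block-structured synchronization matrix: two groups of strongly coupled channels.
R = np.full((8, 8), 0.2)
R[:4, :4] = R[4:, 4:] = 0.8
R += rng.normal(0, 0.02, R.shape)
R = (R + R.T) / 2
np.fill_diagonal(R, 1.0)

P = R / R.sum(axis=1, keepdims=True)        # row-stochastic matrix
eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)           # eigenvalue 1 first, then the slow mode
slow = eigvecs[:, order[1]].real
clusters = (slow > 0).astype(int)
print("second eigenvalue %.3f, cluster labels %s" % (eigvals.real[order[1]], clusters))
```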
Kent, Peter; Stochkendahl, Mette Jensen; Christensen, Henrik Wulff; Kongsted, Alice
2015-01-01
Recognition of homogeneous subgroups of patients can usefully improve prediction of their outcomes and the targeting of treatment. There are a number of research approaches that have been used to recognise homogeneity in such subgroups and to test their implications. One approach is to use statistical clustering techniques, such as Cluster Analysis or Latent Class Analysis, to detect latent relationships between patient characteristics. Influential patient characteristics can come from diverse domains of health, such as pain, activity limitation, physical impairment, social role participation, psychological factors, biomarkers and imaging. However, such 'whole person' research may result in data-driven subgroups that are complex, difficult to interpret and challenging to recognise clinically. This paper describes a novel approach to applying statistical clustering techniques that may improve the clinical interpretability of derived subgroups and reduce sample size requirements. This approach involves clustering in two sequential stages. The first stage involves clustering within health domains and therefore requires creating as many clustering models as there are health domains in the available data. This first stage produces scoring patterns within each domain. The second stage involves clustering using the scoring patterns from each health domain (from the first stage) to identify subgroups across all domains. We illustrate this using chest pain data from the baseline presentation of 580 patients. The new two-stage clustering resulted in two subgroups that approximated the classic textbook descriptions of musculoskeletal chest pain and atypical angina chest pain. The traditional single-stage clustering resulted in five clusters that were also clinically recognisable but displayed less distinct differences. In this paper, a new approach to using clustering techniques to identify clinically useful subgroups of patients is suggested. Research designs, statistical methods and outcome metrics suitable for testing this approach are also described. This approach has potential benefits but requires broad testing, in multiple patient samples, to determine its clinical value. The usefulness of the approach is likely to be context-specific, depending on the characteristics of the available data and the research question being asked of it.
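The sketch below illustrates the two-stage idea on random data: cluster within each health domain to obtain domain-level pattern labels, then cluster patients on those labels. The three made-up domains, the use of k-means in place of latent class models, and all cluster counts are assumptions.

```python
# Two-stage clustering: stage 1 clusters within each health domain, stage 2
# clusters patients on the resulting domain pattern labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 300
domains = {                                   # hypothetical health domains
    "pain": rng.normal(size=(n, 5)),
    "activity": rng.normal(size=(n, 4)),
    "psychological": rng.normal(size=(n, 6)),
}

# Stage 1: one clustering model per domain -> a scoring pattern label per patient.
stage1 = np.column_stack([
    KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xd)
    for Xd in domains.values()
])

# Stage 2: cluster patients on their domain pattern labels (one-hot coded).
patterns = OneHotEncoder().fit_transform(stage1).toarray()
subgroups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(patterns)
print(np.bincount(subgroups))
```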
Lee, Alexandra J; Chang, Ivan; Burel, Julie G; Lindestam Arlehamn, Cecilia S; Mandava, Aishwarya; Weiskopf, Daniela; Peters, Bjoern; Sette, Alessandro; Scheuermann, Richard H; Qian, Yu
2018-04-17
Computational methods for identification of cell populations from polychromatic flow cytometry data are changing the paradigm of cytometry bioinformatics. Data clustering is the most common computational approach to unsupervised identification of cell populations from multidimensional cytometry data. However, interpretation of the identified data clusters is labor-intensive. Certain types of user-defined cell populations are also difficult to identify by fully automated data clustering analysis. Both are roadblocks before a cytometry lab can adopt the data clustering approach for cell population identification in routine use. We found that combining recursive data filtering and clustering with constraints converted from the user manual gating strategy can effectively address these two issues. We named this new approach DAFi: Directed Automated Filtering and Identification of cell populations. Design of DAFi preserves the data-driven characteristics of unsupervised clustering for identifying novel cell subsets, but also makes the results interpretable to experimental scientists through mapping and merging the multidimensional data clusters into the user-defined two-dimensional gating hierarchy. The recursive data filtering process in DAFi helped identify small data clusters which are otherwise difficult to resolve by a single run of the data clustering method due to the statistical interference of the irrelevant major clusters. Our experiment results showed that the proportions of the cell populations identified by DAFi, while being consistent with those by expert centralized manual gating, have smaller technical variances across samples than those from individual manual gating analysis and the nonrecursive data clustering analysis. Compared with manual gating segregation, DAFi-identified cell populations avoided the abrupt cut-offs on the boundaries. DAFi has been implemented to be used with multiple data clustering methods including K-means, FLOCK, FlowSOM, and the ClusterR package. For cell population identification, DAFi supports multiple options including clustering, bisecting, slope-based gating, and reversed filtering to meet various autogating needs from different scientific use cases. © 2018 International Society for Advancement of Cytometry.
ERIC Educational Resources Information Center
Brown, S. J.; White, S.; Power, N.
2015-01-01
A cluster analysis data classification technique was used on assessment scores from 157 undergraduate nursing students who passed 2 successive compulsory courses in human anatomy and physiology. Student scores in five summative assessment tasks, taken in each of the courses, were used as inputs for a cluster analysis procedure. We aimed to group…
A roadmap of clustering algorithms: finding a match for a biomedical application.
Andreopoulos, Bill; An, Aijun; Wang, Xiaogang; Schroeder, Michael
2009-05-01
Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.
Aftershock identification problem via the nearest-neighbor analysis for marked point processes
NASA Astrophysics Data System (ADS)
Gabrielov, A.; Zaliapin, I.; Wong, H.; Keilis-Borok, V.
2007-12-01
A century of observations of world seismicity has revealed a wide variety of clustering phenomena that unfold in the space-time-energy domain and provide the most reliable information about earthquake dynamics. However, there is neither a unifying theory nor a convenient statistical apparatus that would naturally account for the different types of seismic clustering. In this talk we present a theoretical framework for nearest-neighbor analysis of marked processes and obtain new results on the hierarchical approach to studying seismic clustering introduced by Baiesi and Paczuski (2004). Recall that under this approach one defines an asymmetric distance D in the space-time-energy domain such that the nearest-neighbor spanning graph with respect to D becomes a time-oriented tree. We demonstrate how this approach can be used to detect earthquake clustering. We apply our analysis to the observed seismicity of California and synthetic catalogs from the ETAS model and show that the earthquake clustering part is statistically different from the homogeneous part. This finding may serve as a basis for an objective aftershock identification procedure.
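A small sketch of the Baiesi-Paczuski-style asymmetric distance and the nearest-neighbour (parent) assignment is given below. The toy catalogue, the b-value and the fractal dimension d_f are assumptions, and the subsequent cluster/background separation is not shown.

```python
# Nearest-neighbour space-time-magnitude distance in the spirit of Baiesi and
# Paczuski (2004): eta_ij = (t_j - t_i) * r_ij**d_f * 10**(-b * m_i), with the
# parent of each event being its earlier nearest neighbour under eta.
import numpy as np

rng = np.random.default_rng(0)
n = 50
t = np.sort(rng.uniform(0, 365, n))            # occurrence times (days)
x, y = rng.uniform(0, 100, n), rng.uniform(0, 100, n)   # epicentres (km)
m = rng.exponential(1.0, n) + 2.0              # magnitudes

b, d_f = 1.0, 1.6                              # assumed b-value and fractal dimension
parent = np.full(n, -1)
eta = np.full(n, np.inf)
for j in range(1, n):
    i = np.arange(j)                           # only earlier events can be parents
    r = np.hypot(x[j] - x[i], y[j] - y[i])
    d = (t[j] - t[i]) * (r ** d_f) * 10.0 ** (-b * m[i])
    parent[j] = i[np.argmin(d)]
    eta[j] = d.min()

# Small eta suggests clustered (aftershock-like) pairs; large eta, background.
print(parent[:10], np.round(np.log10(eta[1:6]), 2))
```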
Supervised group Lasso with applications to microarray data analysis
Ma, Shuangge; Song, Xiao; Huang, Jian
2007-01-01
Background A tremendous amount of effort has been devoted to identifying genes for diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have cluster structure, where the clusters consist of co-regulated genes which tend to have coordinated functions. However, most available statistical methods for gene selection do not take into consideration the cluster structure. Results We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data. Conclusion We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods. PMID:17316436
Liao, Fuyuan; Jan, Yih-Kuen
2012-06-01
This paper presents a recurrence network approach for the analysis of skin blood flow dynamics in response to loading pressure. Recurrence is a fundamental property of many dynamical systems, which can be explored in phase spaces constructed from observational time series. A visualization tool for recurrence analysis called the recurrence plot (RP) has proved to be highly effective for detecting transitions in the dynamics of a system. However, it was found that delay embedding can produce spurious structures in RPs. Network-based concepts have recently been applied to the analysis of nonlinear time series. We demonstrate that time series with different types of dynamics exhibit distinct global clustering coefficients and distributions of local clustering coefficients, and that the global clustering coefficient is robust to the embedding parameters. We applied the approach to study the response of skin blood flow oscillations (BFO) to loading pressure. The results showed that the global clustering coefficients of BFO decreased significantly in response to loading pressure (p<0.01). Moreover, surrogate tests indicated that such a decrease was associated with a loss of nonlinearity of BFO. Our results suggest that the recurrence network approach can practically quantify the nonlinear dynamics of BFO.
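A sketch of the recurrence network construction follows: delay-embed the signal, threshold the pairwise distances into a recurrence matrix, treat it as an adjacency matrix, and compute the global clustering coefficient. The signal and embedding parameters are placeholders, not real blood flow oscillation recordings or the paper's settings.

```python
# Recurrence network from a delay-embedded time series, summarised by the
# global (average) clustering coefficient.
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def recurrence_network_clustering(x, dim=3, delay=2, recurrence_rate=0.1):
    n = len(x) - (dim - 1) * delay                # delay embedding of the signal
    states = np.column_stack([x[i * delay:i * delay + n] for i in range(dim)])
    d = squareform(pdist(states))
    eps = np.quantile(d[np.triu_indices_from(d, k=1)], recurrence_rate)
    A = (d <= eps).astype(int)                    # recurrence matrix = adjacency matrix
    np.fill_diagonal(A, 0)                        # no self-loops
    G = nx.from_numpy_array(A)
    return nx.average_clustering(G)               # global clustering coefficient

t = np.linspace(0, 20 * np.pi, 800)
periodic = np.sin(t)
noisy = np.random.default_rng(0).normal(size=800)
print("periodic %.3f, noise %.3f" %
      (recurrence_network_clustering(periodic),
       recurrence_network_clustering(noisy)))
```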
The application of cluster analysis in the intercomparison of loop structures in RNA.
Huang, Hung-Chung; Nagaswamy, Uma; Fox, George E
2005-04-01
We have developed a computational approach for the comparison and classification of RNA loop structures. Hairpin or interior loops identified in atomic resolution RNA structures were intercompared by conformational matching. The root-mean-square deviation (RMSD) values between all pairs of RNA fragments of interest, even if from different molecules, are calculated. Subsequently, cluster analysis is performed on the resulting matrix of RMSD distances using the unweighted pair group method with arithmetic mean (UPGMA). The cluster analysis objectively reveals groups of folds that resemble one another. To demonstrate the utility of the approach, a comprehensive analysis of all the terminal hairpin tetraloops that have been observed in 15 RNA structures that have been determined by X-ray crystallography was undertaken. The method found major clusters corresponding to the well-known GNRA and UNCG types. In addition, two tetraloops with the unusual primary sequence UMAC (M is A or C) were successfully assigned to the GNRA cluster. Larger loop structures were also examined and the clustering results confirmed the occurrence of variations of the GNRA and UNCG tetraloops in these loops and provided a systematic means for locating them. Nineteen examples of larger loops that closely resemble either the GNRA or UNCG tetraloop were found in the large ribosomal RNAs. When the clustering approach was extended to include all structures in the SCOR database, novel relationships were detected including one between the ANYA motif and a less common folding of the GAAA tetraloop sequence.
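The clustering step described in this abstract maps directly onto average-linkage (UPGMA) hierarchical clustering of a pairwise RMSD matrix. The sketch below assumes the RMSD values have already been computed by conformational matching (only a small made-up matrix is shown) and that an RMSD cutoff is chosen by the analyst.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical symmetric matrix of pairwise RMSD values (angstroms) between
# four loop fragments; in practice these come from structural superposition.
rmsd = np.array([[0.0, 0.8, 2.5, 2.7],
                 [0.8, 0.0, 2.6, 2.4],
                 [2.5, 2.6, 0.0, 0.9],
                 [2.7, 2.4, 0.9, 0.0]])

# UPGMA corresponds to 'average' linkage on the condensed distance matrix.
Z = linkage(squareform(rmsd, checks=False), method="average")

# Cut the tree at an assumed RMSD threshold to obtain groups of similar folds.
groups = fcluster(Z, t=1.5, criterion="distance")
print(groups)   # e.g. two clusters: fragments 1-2 and fragments 3-4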
A Multicriteria Decision Making Approach for Estimating the Number of Clusters in a Data Set
Peng, Yi; Zhang, Yong; Kou, Gang; Shi, Yong
2012-01-01
Determining the number of clusters in a data set is an essential yet difficult step in cluster analysis. Since this task involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem. This paper proposes an MCDM-based approach to estimate the number of clusters for a given data set. In this approach, MCDM methods consider different numbers of clusters as alternatives and the outputs of any clustering algorithm on validity measures as criteria. The proposed method is examined by an experimental study using three MCDM methods, the well-known k-means clustering algorithm, ten relative measures, and fifteen public-domain UCI machine learning data sets. The results show that MCDM methods work fairly well in estimating the number of clusters in the data and outperform the ten relative measures considered in the study. PMID:22870181
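A rough sketch of the idea: candidate numbers of clusters are the alternatives, relative validity measures computed on k-means partitions are the criteria, and a simple MCDM aggregation (equal-weight TOPSIS) ranks the alternatives. The choice of three scikit-learn indices, the equal weighting and TOPSIS itself are assumptions; the paper evaluates different MCDM methods over a larger set of measures.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = list(range(2, 9))                 # alternatives: candidate numbers of clusters
rows = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    rows.append([silhouette_score(X, labels),          # higher is better
                 calinski_harabasz_score(X, labels),   # higher is better
                 -davies_bouldin_score(X, labels)])    # lower is better, so negated
M = np.array(rows)

# Equal-weight TOPSIS: normalize each criterion, then rank alternatives by
# relative closeness to the ideal point.
N = M / np.linalg.norm(M, axis=0)
ideal, anti = N.max(axis=0), N.min(axis=0)
d_plus = np.linalg.norm(N - ideal, axis=1)
d_minus = np.linalg.norm(N - anti, axis=1)
closeness = d_minus / (d_plus + d_minus)

print("estimated number of clusters:", ks[int(np.argmax(closeness))])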
Ankle plantarflexion strength in rearfoot and forefoot runners: a novel cluster-analytic approach.
Liebl, Dominik; Willwacher, Steffen; Hamill, Joseph; Brüggemann, Gert-Peter
2014-06-01
The purpose of the present study was to test for differences in ankle plantarflexion strengths of habitually rearfoot and forefoot runners. In order to approach this issue, we revisit the problem of classifying different footfall patterns in human runners. A dataset of 119 subjects running shod and barefoot (speed 3.5 m/s) was analyzed. The footfall patterns were clustered by a novel statistical approach, which is motivated by advances in the statistical literature on functional data analysis. We explain the novel statistical approach in detail and compare it to the classically used strike index of Cavanagh and Lafortune (1980). The two groups found by the new cluster approach are readily interpretable as forefoot and rearfoot footfall groups. The subsequent comparison study of the clustered subjects reveals that runners with a forefoot footfall pattern are capable of producing significantly higher joint moments in a maximum voluntary contraction (MVC) of their ankle plantarflexor muscle-tendon units; difference in means: 0.28 Nm/kg. This effect remains significant after controlling for an additional gender effect and for differences in training levels. Our analysis confirms the hypothesis that forefoot runners have a higher mean MVC plantarflexion strength than rearfoot runners. Furthermore, we demonstrate that our proposed stochastic cluster analysis provides a robust and useful framework for clustering foot strikes. Copyright © 2014 Elsevier B.V. All rights reserved.
A genetic graph-based approach for partitional clustering.
Menéndez, Héctor D; Barrero, David F; Camacho, David
2014-05-01
Clustering is one of the most versatile tools for data analysis. In recent years, clustering that seeks the continuity of data (as opposed to classical centroid-based approaches) has attracted increasing research interest. It is a challenging problem with a remarkable practical interest. The most popular continuity clustering method is the spectral clustering (SC) algorithm, which is based on graph cut: It initially generates a similarity graph using a distance measure and then studies its graph spectrum to find the best cut. This approach is sensitive to the parameters of the metric, and a correct parameter choice is critical to the quality of the clustering. This work proposes a new algorithm, inspired by SC, that reduces the parameter dependency while maintaining the quality of the solution. The new algorithm, named genetic graph-based clustering (GGC), takes an evolutionary approach introducing a genetic algorithm (GA) to cluster the similarity graph. The experimental validation shows that GGC increases the robustness of SC and has competitive performance in comparison with classical clustering methods, at least on the synthetic and real datasets used in the experiments.
A Constraint-Based Approach to Acquisition of Word-Final Consonant Clusters in Turkish Children
ERIC Educational Resources Information Center
Gokgoz-Kurt, Burcu
2017-01-01
The current study provides a constraint-based analysis of L1 word-final consonant cluster acquisition in Turkish child language, based on the data originally presented by Topbas and Kopkalli-Yavuz (2008). The present analysis was done using [?]+obstruent consonant cluster acquisition. A comparison of Gradual Learning Algorithm (GLA) under…
Combining Multiobjective Optimization and Cluster Analysis to Study Vocal Fold Functional Morphology
Palaparthi, Anil; Riede, Tobias
2017-01-01
Morphological design and the relationship between form and function have great influence on the functionality of a biological organ. However, the simultaneous investigation of morphological diversity and function is difficult in complex natural systems. We have developed a multiobjective optimization (MOO) approach in association with cluster analysis to study the form-function relation in vocal folds. An evolutionary algorithm (NSGA-II) was used to integrate MOO with an existing finite element model of the laryngeal sound source. Vocal fold morphology parameters served as decision variables and acoustic requirements (fundamental frequency, sound pressure level) as objective functions. A two-layer and a three-layer vocal fold configuration were explored to produce the targeted acoustic requirements. The mutation and crossover parameters of the NSGA-II algorithm were chosen to maximize a hypervolume indicator. The results were expressed using cluster analysis and were validated against a brute force method. Results from the MOO and the brute force approaches were comparable. The MOO approach demonstrated greater resolution in the exploration of the morphological space. In association with cluster analysis, MOO can efficiently explore vocal fold functional morphology. PMID:24771563
Tian, Ting; McLachlan, Geoffrey J.; Dieters, Mark J.; Basford, Kaye E.
2015-01-01
It is a common occurrence in plant breeding programs to observe missing values in three-way three-mode multi-environment trial (MET) data. We proposed modifications of models for estimating missing observations for these data arrays, and developed a novel approach in terms of hierarchical clustering. Multiple imputation (MI) was used in four ways: multiple agglomerative hierarchical clustering, a normal distribution model, a normal regression model, and predictive mean matching. The latter three models used both Bayesian analysis and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis had slightly higher accuracy of estimation performance than those using non-Bayesian analysis but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the best overall performance. PMID:26689369
ICAP - An Interactive Cluster Analysis Procedure for analyzing remotely sensed data
NASA Technical Reports Server (NTRS)
Wharton, S. W.; Turner, B. J.
1981-01-01
An Interactive Cluster Analysis Procedure (ICAP) was developed to derive classifier training statistics from remotely sensed data. ICAP differs from conventional clustering algorithms by allowing the analyst to optimize the cluster configuration by inspection, rather than by manipulating process parameters. Control of the clustering process alternates between the algorithm, which creates new centroids and forms clusters, and the analyst, who can evaluate and elect to modify the cluster structure. Clusters can be deleted, or lumped together pairwise, or new centroids can be added. A summary of the cluster statistics can be requested to facilitate cluster manipulation. The principal advantage of this approach is that it allows prior information (when available) to be used directly in the analysis, since the analyst interacts with ICAP in a straightforward manner, using basic terms with which he is more likely to be familiar. Results from testing ICAP showed that an informed use of ICAP can improve classification, as compared to an existing cluster analysis procedure.
Functional clustering of time series gene expression data by Granger causality
2012-01-01
Background A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes. Results In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence. Conclusions This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them. PMID:23107425
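A sketch of one way to turn this idea into a clustering: compute a simple lag-1 Granger F-statistic between every pair of expression time series by ordinary least squares, convert the symmetrized causality strengths into distances, and apply hierarchical clustering. The lag order, the symmetrization, the distance transform and the choice of average linkage are assumptions made for illustration, not the method of the paper.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def granger_f(x, y, lag=1):
    # F-statistic for "y Granger-causes x" with one lag: compare the restricted
    # model x_t ~ x_{t-1} against the unrestricted model x_t ~ x_{t-1} + y_{t-1}.
    xt, xl, yl = x[lag:], x[:-lag], y[:-lag]
    n = len(xt)
    Xr = np.column_stack([np.ones(n), xl])
    Xu = np.column_stack([np.ones(n), xl, yl])
    rss_r = np.sum((xt - Xr @ np.linalg.lstsq(Xr, xt, rcond=None)[0]) ** 2)
    rss_u = np.sum((xt - Xu @ np.linalg.lstsq(Xu, xt, rcond=None)[0]) ** 2)
    return max((rss_r - rss_u) * (n - Xu.shape[1]) / rss_u, 0.0)

# Hypothetical expression matrix: genes in rows, time points in columns.
rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 40))
n_genes = expr.shape[0]

# Symmetrized Granger "interaction strength" between every pair of genes.
strength = np.zeros((n_genes, n_genes))
for i in range(n_genes):
    for j in range(i + 1, n_genes):
        f = max(granger_f(expr[i], expr[j]), granger_f(expr[j], expr[i]))
        strength[i, j] = strength[j, i] = f

# Strongly interacting genes become close; cluster the resulting distances.
dist = 1.0 / (1.0 + strength)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
gene_groups = fcluster(Z, t=4, criterion="maxclust")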
Longo, Dario Livio; Dastrù, Walter; Consolino, Lorena; Espak, Miklos; Arigoni, Maddalena; Cavallo, Federica; Aime, Silvio
2015-07-01
The objective of this study was to compare a clustering approach to conventional analysis methods for assessing changes in pharmacokinetic parameters obtained from dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) during antiangiogenic treatment in a breast cancer model. BALB/c mice bearing established transplantable her2+ tumors were treated with a DNA-based antiangiogenic vaccine or with an empty plasmid (untreated group). DCE-MRI was carried out by administering a dose of 0.05 mmol/kg of Gadocoletic acid trisodium salt, a Gd-based blood pool contrast agent (CA) at 1T. Changes in pharmacokinetic estimates (K(trans) and vp) in a nine-day interval were compared between treated and untreated groups on a voxel-by-voxel analysis. The tumor response to therapy was assessed by a clustering approach and compared with conventional summary statistics, with sub-regions analysis and with histogram analysis. Both the K(trans) and vp estimates, following blood-pool CA injection, showed marked and spatial heterogeneous changes with antiangiogenic treatment. Averaged values for the whole tumor region, as well as from the rim/core sub-regions analysis were unable to assess the antiangiogenic response. Histogram analysis resulted in significant changes only in the vp estimates (p<0.05). The proposed clustering approach depicted marked changes in both the K(trans) and vp estimates, with significant spatial heterogeneity in vp maps in response to treatment (p<0.05), provided that DCE-MRI data are properly clustered in three or four sub-regions. This study demonstrated the value of cluster analysis applied to pharmacokinetic DCE-MRI parametric maps for assessing tumor response to antiangiogenic therapy. Copyright © 2015 Elsevier Inc. All rights reserved.
Henry, David; Dymnicki, Allison B.; Mohatt, Nathaniel; Allen, James; Kelly, James G.
2016-01-01
Qualitative methods potentially add depth to prevention research, but can produce large amounts of complex data even with small samples. Studies conducted with culturally distinct samples often produce voluminous qualitative data, but may lack sufficient sample sizes for sophisticated quantitative analysis. Currently lacking in mixed methods research are methods allowing for more fully integrating qualitative and quantitative analysis techniques. Cluster analysis can be applied to coded qualitative data to clarify the findings of prevention studies by aiding efforts to reveal such things as the motives of participants for their actions and the reasons behind counterintuitive findings. By clustering groups of participants with similar profiles of codes in a quantitative analysis, cluster analysis can serve as a key component in mixed methods research. This article reports two studies. In the first study, we conduct simulations to test the accuracy of cluster assignment using three different clustering methods with binary data as produced when coding qualitative interviews. Results indicated that hierarchical clustering, K-Means clustering, and latent class analysis produced similar levels of accuracy with binary data, and that the accuracy of these methods did not decrease with samples as small as 50. Whereas the first study explores the feasibility of using common clustering methods with binary data, the second study provides a “real-world” example using data from a qualitative study of community leadership connected with a drug abuse prevention project. We discuss the implications of this approach for conducting prevention research, especially with small samples and culturally distinct communities. PMID:25946969
A Bayesian cluster analysis method for single-molecule localization microscopy data.
Griffié, Juliette; Shannon, Michael; Bromley, Claire L; Boelen, Lies; Burn, Garth L; Williamson, David J; Heard, Nicholas A; Cope, Andrew P; Owen, Dylan M; Rubin-Delanchy, Patrick
2016-12-01
Cell function is regulated by the spatiotemporal organization of the signaling machinery, and a key facet of this is molecular clustering. Here, we present a protocol for the analysis of clustering in data generated by 2D single-molecule localization microscopy (SMLM), for example, photoactivated localization microscopy (PALM) or stochastic optical reconstruction microscopy (STORM). Three features of such data can cause standard cluster analysis approaches to be ineffective: (i) the data take the form of a list of points rather than a pixel array; (ii) there is a non-negligible unclustered background density of points that must be accounted for; and (iii) each localization has an associated uncertainty in regard to its position. These issues are overcome using a Bayesian, model-based approach. Many possible cluster configurations are proposed and scored against a generative model, which assumes Gaussian clusters overlaid on a completely spatially random (CSR) background, before every point is scrambled by its localization precision. We present the process of generating simulated and experimental data that are suitable to our algorithm, the analysis itself, and the extraction and interpretation of key cluster descriptors such as the number of clusters, cluster radii and the number of localizations per cluster. Variations in these descriptors can be interpreted as arising from changes in the organization of the cellular nanoarchitecture. The protocol requires no specific programming ability, and the processing time for one data set, typically containing 30 regions of interest, is ∼18 h; user input takes ∼1 h.
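The generative model referred to above (Gaussian clusters overlaid on a completely spatially random background, with each localization scrambled by its precision) is easy to simulate; the sketch below produces data of the kind the protocol's simulation step describes. All numerical values (region size, cluster radius, precision distribution) are placeholders, and the Bayesian scoring of cluster proposals is not implemented here.

import numpy as np

rng = np.random.default_rng(42)

def simulate_smlm_roi(width=3000.0, n_clusters=10, locs_per_cluster=30,
                      cluster_radius=50.0, n_background=500, precision_mean=20.0):
    # Localizations (nm) in a square region: Gaussian clusters on a completely
    # spatially random (CSR) background, each point then perturbed by its
    # localization precision.
    background = rng.uniform(0.0, width, size=(n_background, 2))
    centres = rng.uniform(0.0, width, size=(n_clusters, 2))
    clustered = np.vstack([rng.normal(c, cluster_radius, size=(locs_per_cluster, 2))
                           for c in centres])
    points = np.vstack([background, clustered])
    precision = rng.gamma(shape=4.0, scale=precision_mean / 4.0, size=len(points))
    observed = points + rng.normal(0.0, precision[:, None], size=points.shape)
    return observed, precision

localizations, sigma = simulate_smlm_roi()
print(localizations.shape, round(sigma.mean(), 1))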
ERIC Educational Resources Information Center
Ho, Hsuan-Fu; Hung, Chia-Chi
2008-01-01
Purpose: The purpose of this paper is to examine how a graduate institute at National Chiayi University (NCYU), by using a model that integrates analytic hierarchy process, cluster analysis and correspondence analysis, can develop effective marketing strategies. Design/methodology/approach: This is primarily a quantitative study aimed at…
Peeking Network States with Clustered Patterns
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Jinoh; Sim, Alex
2015-10-20
Network traffic monitoring has long been a core element for effective network management and security. However, it is still a challenging task with a high degree of complexity for comprehensive analysis when considering multiple variables and ever-increasing traffic volumes to monitor. For example, one of the widely considered approaches is to scrutinize probabilistic distributions, but it poses a scalability concern and multivariate analysis is not generally supported due to the exponential increase of the complexity. In this work, we propose a novel method for network traffic monitoring based on clustering, one of the powerful deep-learning techniques. We show that the new approach enables us to recognize clustered results as patterns representing the network states, which can then be utilized to evaluate “similarity” of network states over time. In addition, we define a new quantitative measure for the similarity between two compared network states observed in different time windows, as a supportive means for intuitive analysis. Finally, we demonstrate the clustering-based network monitoring with public traffic traces, and show that the proposed approach using the clustering method has a great opportunity for feasible, cost-effective network monitoring.
ERIC Educational Resources Information Center
Xu, Beijie; Recker, Mimi; Qi, Xiaojun; Flann, Nicholas; Ye, Lei
2013-01-01
This article examines clustering as an educational data mining method. In particular, two clustering algorithms, the widely used K-means and the model-based Latent Class Analysis, are compared, using usage data from an educational digital library service, the Instructional Architect (IA.usu.edu). Using a multi-faceted approach and multiple data…
SOMFlow: Guided Exploratory Cluster Analysis with Self-Organizing Maps and Analytic Provenance.
Sacha, Dominik; Kraus, Matthias; Bernard, Jurgen; Behrisch, Michael; Schreck, Tobias; Asano, Yuki; Keim, Daniel A
2018-01-01
Clustering is a core building block for data analysis, aiming to extract otherwise hidden structures and relations from raw datasets, such as particular groups that can be effectively related, compared, and interpreted. A plethora of visual-interactive cluster analysis techniques has been proposed to date; however, arriving at useful clusterings often requires several rounds of user interactions to fine-tune the data preprocessing and algorithms. We present a multi-stage Visual Analytics (VA) approach for iterative cluster refinement together with an implementation (SOMFlow) that uses Self-Organizing Maps (SOM) to analyze time series data. It supports exploration by offering the analyst a visual platform to analyze intermediate results, adapt the underlying computations, iteratively partition the data, and to reflect previous analytical activities. The history of previous decisions is explicitly visualized within a flow graph, allowing the analyst to compare earlier cluster refinements and to explore relations. We further leverage quality and interestingness measures to guide the analyst in the discovery of useful patterns, relations, and data partitions. We conducted two pair analytics experiments together with a subject matter expert in speech intonation research to demonstrate that the approach is effective for interactive data analysis, supporting enhanced understanding of clustering results as well as the interactive process itself.
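As a rough illustration of SOM-based clustering of time series (only the clustering core, not the SOMFlow interaction, provenance graph or guidance measures), the sketch below uses the third-party minisom package; that dependency, the map size and the training parameters are assumptions.

import numpy as np
from minisom import MiniSom   # third-party package; an assumed dependency

# Hypothetical time series data: each row is one series of 50 samples.
rng = np.random.default_rng(0)
data = np.vstack([np.sin(np.linspace(0, 4 * np.pi, 50)) + 0.2 * rng.normal(size=50)
                  for _ in range(100)] +
                 [np.linspace(-1, 1, 50) + 0.2 * rng.normal(size=50)
                  for _ in range(100)])

# Train a small rectangular SOM; each map node acts as a cluster prototype.
som = MiniSom(4, 4, input_len=data.shape[1], sigma=1.0,
              learning_rate=0.5, random_seed=0)
som.train_random(data, num_iteration=2000)

# Assign each series to its best-matching unit (node), i.e. its cluster.
assignments = [som.winner(row) for row in data]

# An iterative refinement step, in the spirit of the workflow above, could
# re-train a new SOM on the subset of series mapped to a single node.
subset = data[[i for i, w in enumerate(assignments) if w == assignments[0]]]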
Principal Component Clustering Approach to Teaching Quality Discriminant Analysis
ERIC Educational Resources Information Center
Xian, Sidong; Xia, Haibo; Yin, Yubo; Zhai, Zhansheng; Shang, Yan
2016-01-01
Teaching quality is the lifeline of higher education. Many universities have made effective achievements in evaluating teaching quality. In this paper, we establish the Students' evaluation of teaching (SET) discriminant analysis model and algorithm based on principal component clustering analysis. Additionally, we classify the SET…
Chaos theory perspective for industry clusters development
NASA Astrophysics Data System (ADS)
Yu, Haiying; Jiang, Minghui; Li, Chengzhang
2016-03-01
Industry clusters have performed strongly in driving economic development in most developing countries. The contributions of industrial clusters are recognized as the promotion of regional business and the alleviation of economic and social costs. There is no doubt that globalization is pushing clusters to accelerate the competitiveness of economic activities. Accordingly, many ideas and concepts have been proposed to illustrate evolution tendencies, stimulate cluster development and, at the same time, avoid industrial cluster recession. Chaos theory is introduced to explain the inherent relationships among features within industry clusters. A life cycle approach is proposed for analysing industrial cluster recession. Lyapunov exponents and the Wolf model are presented for identifying and examining chaotic behaviour. A case study of Tianjin, China, verifies the effectiveness of the model. The investigation indicates that the approach performs well in explaining chaotic properties of industrial clusters, demonstrating how industrial clusters evolve, resolving empirical issues and generating corresponding strategies.
ClusterViz: A Cytoscape APP for Cluster Analysis of Biological Network.
Wang, Jianxin; Zhong, Jiancheng; Chen, Gang; Li, Min; Wu, Fang-xiang; Pan, Yi
2015-01-01
Cluster analysis of biological networks is one of the most important approaches for identifying functional modules and predicting protein functions. Furthermore, visualization of clustering results is crucial to uncover the structure of biological networks. In this paper, ClusterViz, a Cytoscape 3 app for cluster analysis and visualization, has been developed. In order to reduce complexity and enable extendibility for ClusterViz, we designed the architecture of ClusterViz based on the framework of the Open Services Gateway Initiative. According to the architecture, the implementation of ClusterViz is partitioned into three modules: the ClusterViz interface, the clustering algorithms, and visualization and export. ClusterViz facilitates the comparison of the results of different algorithms for further related analysis. Three commonly used clustering algorithms, FAG-EC, EAGLE and MCODE, are included in the current version. Because the clustering-algorithm module adopts an abstract algorithm interface, more clustering algorithms can be included in the future. To illustrate the usability of ClusterViz, we provide three examples with detailed steps taken from important scientific articles, which show that our tool has helped several research teams in their work on the mechanisms of biological networks.
Benefits of off-campus education for students in the health sciences: a text-mining analysis.
Nakagawa, Kazumasa; Asakawa, Yasuyoshi; Yamada, Keiko; Ushikubo, Mitsuko; Yoshida, Tohru; Yamaguchi, Haruyasu
2012-08-28
In Japan, few community-based approaches have been adopted in health-care professional education, and the appropriate content for such approaches has not been clarified. In establishing community-based education for health-care professionals, clarification of its learning effects is required. A community-based educational program was started in 2009 in the health sciences course at Gunma University, and one of the main elements in this program is conducting classes outside school. The purpose of this study was to investigate, using text-analysis methods, how the off-campus program affects students. In all, 116 self-assessment worksheets submitted by students after participating in the off-campus classes were decomposed into words. The extracted words were carefully screened for meaningful content, and the relations among the selected terms were then analyzed by means of cluster analysis. Cluster analysis was used to select and divide 32 extracted words into four clusters: cluster 1-"actually/direct," "learn/watch/hear," "how," "experience/participation," "local residents," "atmosphere in community-based clinical care settings," "favorable," "communication/conversation," and "study"; cluster 2-"work of staff member" and "role"; cluster 3-"interaction/communication," "understanding," "feel," "significant/important/necessity," and "think"; and cluster 4-"community," "confusing," "enjoyable," "proactive," "knowledge," "academic knowledge," and "class." The students who participated in the program achieved different types of learning through the off-campus classes. They also had a positive impression of the community-based experience and interaction with the local residents, which is considered a favorable outcome. Off-campus programs could be a useful educational approach for students in health sciences.
Patterns of Dysmorphic Features in Schizophrenia
Scutt, L.E.; Chow, E.W.C.; Weksberg, R.; Honer, W.G.; Bassett, Anne S.
2011-01-01
Congenital dysmorphic features are prevalent in schizophrenia and may reflect underlying neurodevelopmental abnormalities. A cluster analysis approach delineating patterns of dysmorphic features has been used in genetics to classify individuals into more etiologically homogeneous subgroups. In the present study, this approach was applied to schizophrenia, using a sample with a suspected genetic syndrome as a testable model. Subjects (n = 159) with schizophrenia or schizoaffective disorder were ascertained from chronic patient populations (random, n=123) or referred with possible 22q11 deletion syndrome (referred, n = 36). All subjects were evaluated for presence or absence of 70 reliably assessed dysmorphic features, which were used in a three-step cluster analysis. The analysis produced four major clusters with different patterns of dysmorphic features. Significant between-cluster differences were found for rates of 37 dysmorphic features (P < 0.05), median number of dysmorphic features (P = 0.0001), and validating features not used in the cluster analysis: mild mental retardation (P = 0.001) and congenital heart defects (P = 0.002). Two clusters (1 and 4) appeared to represent more developmental subgroups of schizophrenia with elevated rates of dysmorphic features and validating features. Cluster 1 (n = 27) comprised mostly referred subjects. Cluster 4 (n= 18) had a different pattern of dysmorphic features; one subject had a mosaic Turner syndrome variant. Two other clusters had lower rates and patterns of features consistent with those found in previous studies of schizophrenia. Delineating patterns of dysmorphic features may help identify subgroups that could represent neurodevelopmental forms of schizophrenia with more homogeneous origins. PMID:11803519
Multifractal Approach to Time Clustering of Earthquakes. Application to Mt. Vesuvio Seismicity
NASA Astrophysics Data System (ADS)
Codano, C.; Alonzo, M. L.; Vilardo, G.
The clustering structure of Vesuvian earthquakes is investigated by means of statistical tools: the inter-event time distribution, the running mean and multifractal analysis. The first cannot clearly distinguish between a Poissonian process and a clustered one, owing to the difficulty of separating an exponential distribution from a power-law one. The running mean test reveals the clustering of the earthquakes, but loses information about the structure of the distribution at global scales. The multifractal approach can reveal the clustering at small scales, while the global behaviour remains Poissonian. Subsequently, the clustering of the events is interpreted in terms of diffusive processes of the stress in the Earth's crust.
A Taxonomic Approach to the Gestalt Theory of Perls
ERIC Educational Resources Information Center
Raming, Henry E.; Frey, David H.
1974-01-01
This study applied content analysis and cluster analysis to the ideas of Fritz Perls to develop a taxonomy of Gestalt processes and goals. Summaries of the typal groups or clusters were written and the implications of taxonomic research in counseling discussed. (Author)
The Gap Procedure: for the identification of phylogenetic clusters in HIV-1 sequence data.
Vrbik, Irene; Stephens, David A; Roger, Michel; Brenner, Bluma G
2015-11-04
In the context of infectious disease, sequence clustering can be used to provide important insights into the dynamics of transmission. Cluster analysis is usually performed using a phylogenetic approach whereby clusters are assigned on the basis of sufficiently small genetic distances and high bootstrap support (or posterior probabilities). The computational burden involved in this phylogenetic threshold approach is a major drawback, especially when a large number of sequences are being considered. In addition, this method requires a skilled user to specify the appropriate threshold values, which may vary widely depending on the application. This paper presents the Gap Procedure, a distance-based clustering algorithm for the classification of DNA sequences sampled from individuals infected with the human immunodeficiency virus type 1 (HIV-1). Our heuristic algorithm bypasses the need for phylogenetic reconstruction, thereby supporting the quick analysis of large genetic data sets. Moreover, this fully automated procedure relies on data-driven gaps in sorted pairwise distances to infer clusters, thus no user-specified threshold values are required. The clustering results obtained by the Gap Procedure on both real and simulated data closely agree with those found using the threshold approach, while only requiring a fraction of the time to complete the analysis. Apart from the dramatic gains in computational time, the Gap Procedure is highly effective in finding distinct groups of genetically similar sequences and obviates the need for subjective user-specified values. The clusters of genetically similar sequences returned by this procedure can be used to detect patterns in HIV-1 transmission and thereby aid in the prevention, treatment and containment of the disease.
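The core of the Gap Procedure as described (a data-driven cutoff taken from a gap in the sorted pairwise distances, followed by linking of sequences below the cutoff) can be sketched as below. A generic distance matrix stands in for HIV-1 genetic distances, and a single-largest-gap rule is used as a simplification of the published procedure.

import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def gap_threshold_clusters(D):
    # Cluster items from a pairwise distance matrix D: (i) find the largest gap
    # in the sorted pairwise distances, (ii) link every pair closer than that
    # gap; the connected components are the clusters.
    d = np.sort(squareform(D, checks=False))   # condensed, sorted distances
    gaps = np.diff(d)
    cutoff = d[np.argmax(gaps)]                # distance just below the largest gap
    adjacency = ((D <= cutoff) & (D > 0)).astype(int)
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels

# Hypothetical "genetic distances": three tight groups of points in 2D.
rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(c, 0.05, size=(20, 2)) for c in [(0, 0), (2, 0), (0, 2)]])
D = squareform(pdist(pts))
print(gap_threshold_clusters(D)[0])   # expected: 3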
NASA Astrophysics Data System (ADS)
Crawford, I.; Ruske, S.; Topping, D. O.; Gallagher, M. W.
2015-07-01
In this paper we present improved methods for discriminating and quantifying Primary Biological Aerosol Particles (PBAP) by applying hierarchical agglomerative cluster analysis to multi-parameter ultraviolet light-induced fluorescence (UV-LIF) spectrometer data. The methods employed in this study can be applied to data sets in excess of 1×10^6 points on a desktop computer, allowing for each fluorescent particle in a dataset to be explicitly clustered. This reduces the potential for misattribution found in subsampling and comparative attribution methods used in previous approaches, improving our capacity to discriminate and quantify PBAP meta-classes. We evaluate the performance of several hierarchical agglomerative cluster analysis linkages and data normalisation methods using laboratory samples of known particle types and an ambient dataset. Fluorescent and non-fluorescent polystyrene latex spheres were sampled with a Wideband Integrated Bioaerosol Spectrometer (WIBS-4) where the optical size, asymmetry factor and fluorescence measurements were used as inputs to the analysis package. It was found that the Ward linkage with z-score or range normalisation performed best, correctly attributing 98% and 98.1% of the data points, respectively. The best performing methods were applied to the BEACHON-RoMBAS ambient dataset where it was found that the z-score and range normalisation methods yield similar results, with each method producing clusters representative of fungal spores and bacterial aerosol, consistent with previous results. The z-score result was compared to clusters generated with previous approaches (WIBS AnalysiS Program, WASP) where we observe that the subsampling and comparative attribution method employed by WASP results in the overestimation of the fungal spore concentration by a factor of 1.5 and the underestimation of bacterial aerosol concentration by a factor of 5. We suggest that this is likely due to errors arising from misattribution caused by poor centroid definition and the failure to assign particles to a cluster as a result of the subsampling and comparative attribution method employed by WASP. The methods used here allow for the entire fluorescent population of particles to be analysed, yielding an explicit cluster attribution for each particle, improving cluster centroid definition and our capacity to discriminate and quantify PBAP meta-classes compared to previous approaches.
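The best-performing configuration reported above (z-score normalization followed by Ward-linkage hierarchical agglomerative clustering) is straightforward to reproduce in outline. The feature set and the two simulated particle classes below are placeholders standing in for WIBS-4 optical size, asymmetry factor and fluorescence channels.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

# Hypothetical particle-by-feature matrix: optical size, asymmetry factor
# and three fluorescence channels for each detected fluorescent particle.
rng = np.random.default_rng(0)
spores = rng.normal([3.0, 10.0, 500.0, 200.0, 50.0], [0.5, 2.0, 80.0, 40.0, 10.0], (400, 5))
bacteria = rng.normal([1.0, 25.0, 100.0, 600.0, 30.0], [0.2, 5.0, 20.0, 90.0, 8.0], (400, 5))
X = np.vstack([spores, bacteria])

# z-score normalization puts size and fluorescence on comparable scales,
# then Ward-linkage hierarchical agglomerative clustering groups the particles.
Z = linkage(zscore(X, axis=0), method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])   # sizes of the two recovered particle classes

Cutting the tree into more clusters and inspecting the cluster centroids would correspond to the meta-class attribution step discussed above.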
Friederichs, Stijn Ah; Bolman, Catherine; Oenema, Anke; Lechner, Lilian
2015-01-01
In order to promote physical activity uptake and maintenance in individuals who do not comply with physical activity guidelines, it is important to increase our understanding of physical activity motivation among this group. The present study aimed to examine motivational profiles in a large sample of adults who do not comply with physical activity guidelines. The sample for this study consisted of 2473 individuals (31.4% male; age 44.6 ± 12.9). In order to generate motivational profiles based on motivational regulation, a cluster analysis was conducted. One-way analyses of variance were then used to compare the clusters in terms of demographics, physical activity level, motivation to be active and subjective experience while being active. Three motivational clusters were derived based on motivational regulation scores: a low motivation cluster, a controlled motivation cluster and an autonomous motivation cluster. These clusters differed significantly from each other with respect to physical activity behavior, motivation to be active and subjective experience while being active. Overall, the autonomous motivation cluster displayed more favorable characteristics compared to the other two clusters. The results of this study provide additional support for the importance of autonomous motivation in the context of physical activity behavior. The three derived clusters may be relevant in the context of physical activity interventions as individuals within the different clusters might benefit most from different intervention approaches. In addition, this study shows that cluster analysis is a useful method for differentiating between motivational profiles in large groups of individuals who do not comply with physical activity guidelines.
Ju, Chunhua; Xu, Chonghuan
2013-01-01
Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods.
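A sketch of the recommendation pipeline as described: cluster users on their rating vectors, then predict a rating from similarity-weighted neighbours inside the target user's cluster. Plain K-means replaces the ABC-seeded variant and ordinary cosine similarity replaces the modified version, so this is an assumed simplification rather than the authors' algorithm.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item rating matrix (0 = unrated).
rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(60, 40)).astype(float)
R[rng.random(R.shape) < 0.6] = 0.0

# Step 1: cluster users on their rating vectors.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(R)

def predict(user, item, k_neighbours=5):
    # Predict a rating from the user's cluster neighbours, weighted by cosine similarity.
    members = np.where(labels == labels[user])[0]
    sims = cosine_similarity(R[user:user + 1], R[members])[0]
    order = members[np.argsort(sims)[::-1]]
    neighbours = [u for u in order if u != user and R[u, item] > 0][:k_neighbours]
    if not neighbours:
        return R[R[:, item] > 0, item].mean() if np.any(R[:, item] > 0) else 0.0
    w = cosine_similarity(R[user:user + 1], R[neighbours])[0]
    return float(np.dot(w, R[neighbours, item]) / (w.sum() + 1e-9))

print(predict(user=3, item=7))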
Tashobya, Christine K; Dubourg, Dominique; Ssengooba, Freddie; Speybroeck, Niko; Macq, Jean; Criel, Bart
2016-03-01
In 2003, the Uganda Ministry of Health introduced the district league table for district health system performance assessment. The league table presents district performance against a number of input, process and output indicators and a composite index to rank districts. This study explores the use of hierarchical cluster analysis for analysing and presenting district health systems performance data and compares this approach with the use of the league table in Uganda. Ministry of Health and district plans and reports, and published documents were used to provide information on the development and utilization of the Uganda district league table. Quantitative data were accessed from the Ministry of Health databases. Statistical analysis was performed using SPSS version 20, with hierarchical cluster analysis utilizing Ward's method. The hierarchical cluster analysis was conducted on the basis of seven clusters determined for each year from 2003 to 2010, ranging from a cluster of good through moderate-to-poor performers. The characteristics and membership of clusters varied from year to year and were determined by the identity and magnitude of performance of the individual variables. Criticisms of the league table include: perceived unfairness, as it did not take into consideration district peculiarities; and being oversummarized and not adequately informative. Clustering organizes the many data points into clusters of similar entities according to an agreed set of indicators and can provide the beginning point for identifying factors behind the observed performance of districts. Although league table ranking emphasizes summation and external control, clustering has the potential to encourage a formative, learning approach. More research is required to shed more light on factors behind observed performance of the different clusters. Other countries, especially low-income countries that share many similarities with Uganda, can learn from these experiences. © The Author 2015. Published by Oxford University Press in association with The London School of Hygiene and Tropical Medicine.
A Constrained-Clustering Approach to the Analysis of Remote Sensing Data.
1983-01-01
One old and two new clustering methods were applied to the constrained-clustering problem of separating different agricultural fields based on multispectral remote sensing satellite data. (Constrained-clustering involves double clustering in multispectral measurement similarity and geographical location.) The results of applying the three methods are provided along with a discussion of their relative strengths and weaknesses and a detailed description of their algorithms.
Ye, Weimin; Robbins, R. T.
2004-01-01
Hierarchical cluster analysis based on female morphometric character means including body length, distance from vulva opening to anterior end, head width, odontostyle length, esophagus length, body width, tail length, and tail width was used to examine the morphometric relationships and create dendrograms for (i) 62 populations belonging to 9 Longidorus species from Arkansas, (ii) 137 published Longidorus species, and (iii) 137 published Longidorus species plus 86 populations of 16 Longidorus species from Arkansas and various other locations by using JMP 4.02 software (SAS Institute, Cary, NC). Cluster analysis dendrograms visually illustrated the grouping and morphometric relationships of the species and populations. It provided a computerized statistical approach that helps to identify and distinguish species, indicates morphometric relationships among species, and assists with new species diagnosis. Preliminary species identification can be accomplished by running cluster analysis for unknown species together with the data matrix of known published Longidorus species. PMID:19262809
NASA Astrophysics Data System (ADS)
Ali, A.; de Bie, C. A. J. M.; Scarrott, R. G.; Ha, N. T. T.; Skidmore, A. K.
2012-07-01
Both agricultural area expansion and intensification are necessary to cope with the growing demand for food, and the growing threat of food insecurity, which is rapidly engulfing poor and under-privileged sections of the global population. Therefore, it is of paramount importance to have the ability to accurately estimate crop area and spatial distribution. Remote sensing has become a valuable tool for estimating and mapping cropland areas, useful in food security monitoring. This work contributes to addressing this broad issue, focusing on the comparative performance analysis of two mapping approaches: (i) a hyper-temporal Normalized Difference Vegetation Index (NDVI) analysis approach and (ii) a Landscape-ecological approach. The hyper-temporal NDVI analysis approach utilized SPOT 10-day NDVI imagery from April 1998-December 2008, whilst the Landscape-ecological approach used multitemporal Landsat-7 ETM+ imagery acquired intermittently between 1992 and 2002. Pixels in the time-series NDVI dataset were clustered using an ISODATA clustering algorithm adapted to determine the optimal number of pixel clusters to successfully generalize hyper-temporal datasets. Clusters were then characterized with crop cycle information, and flooding information to produce an NDVI unit map of rice classes with flood regime and NDVI profile information. A Landscape-ecological map was generated using a combination of digitized homogeneous map units in the Landsat-7 ETM+ imagery, a Land use map 2005 of the Mekong delta, and supplementary datasets on the region's terrain, geomorphology and flooding depths. The output maps were validated using reported crop statistics, and regression analyses were used to ascertain the relationship between land use areas estimated from the maps and those reported in district crop statistics. The regression analysis showed that the hyper-temporal NDVI analysis approach explained 74% and 76% of the variability in reported crop statistics in two rice crop and three rice crop land use systems respectively. In contrast, 64% and 63% of the variability was explained respectively by the Landscape-ecological map. Overall, the results indicate the hyper-temporal NDVI analysis approach is more accurate and more useful in exploring when, why and how agricultural land use manifests itself in space and time. Furthermore, the NDVI analysis approach was found to be easier to implement, was more cost effective, and involved less subjective user intervention than the landscape-ecological approach.
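The hyper-temporal clustering idea (treating each pixel's multi-year sequence of 10-day NDVI composites as a feature vector and grouping pixels into NDVI classes) can be sketched as follows. Standard K-means is used here as a stand-in for the adapted ISODATA algorithm, and the choice of the number of classes from the inertia curve is an assumption.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stack of 10-day NDVI composites: one row per pixel.
rng = np.random.default_rng(0)
n_pixels, n_dekads = 4998, 36
season = np.sin(np.linspace(0, 2 * np.pi, n_dekads))
profiles = np.vstack([
    0.5 + 0.3 * np.roll(season, shift) + 0.05 * rng.normal(size=(n_pixels // 3, n_dekads))
    for shift in (0, 6, 12)])            # three different cropping calendars

# Scan candidate class counts; an analyst would pick k where the curve flattens.
for k in (10, 20, 30):
    inertia = KMeans(n_clusters=k, n_init=5, random_state=0).fit(profiles).inertia_
    print(k, round(inertia, 1))

# k = 20 is assumed here; each class can then be characterized with crop-cycle
# and flooding information, as in the NDVI unit map described above.
ndvi_classes = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(profiles)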
Fiero, Mallorie H; Hsu, Chiu-Hsieh; Bell, Melanie L
2017-11-20
We extend the pattern-mixture approach to handle missing continuous outcome data in longitudinal cluster randomized trials, which randomize groups of individuals to treatment arms, rather than the individuals themselves. Individuals who drop out at the same time point are grouped into the same dropout pattern. We approach extrapolation of the pattern-mixture model by applying multilevel multiple imputation, which imputes missing values while appropriately accounting for the hierarchical data structure found in cluster randomized trials. To assess parameters of interest under various missing data assumptions, imputed values are multiplied by a sensitivity parameter, k, which increases or decreases imputed values. Using simulated data, we show that estimates of parameters of interest can vary widely under differing missing data assumptions. We conduct a sensitivity analysis using real data from a cluster randomized trial by increasing k until the treatment effect inference changes. By performing a sensitivity analysis for missing data, researchers can assess whether certain missing data assumptions are reasonable for their cluster randomized trial. Copyright © 2017 John Wiley & Sons, Ltd.
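A minimal sketch of the delta-type sensitivity analysis described above: impute the missing continuous outcomes, multiply the imputed values by a sensitivity parameter k, and re-estimate the treatment effect over a range of k until the inference changes. Simple arm-wise imputation is used instead of multilevel multiple imputation, so the cluster-randomized hierarchy is deliberately not modelled in this sketch.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial outcomes: two arms, some outcomes missing at follow-up.
n = 400
arm = rng.integers(0, 2, size=n)                       # 0 = control, 1 = treatment
y = 1.0 + 0.5 * arm + rng.normal(0, 1.5, size=n)       # true treatment effect = 0.5
missing = rng.random(n) < 0.25
y_obs = np.where(missing, np.nan, y)

def treatment_effect_under_k(k, n_imputations=20):
    # Impute missing outcomes (arm-wise mean plus noise), scale imputed values
    # by the sensitivity parameter k, and pool the arm difference over imputations.
    effects = []
    for _ in range(n_imputations):
        y_imp = y_obs.copy()
        for a in (0, 1):
            observed = y_obs[(arm == a) & ~missing]
            idx = np.where((arm == a) & missing)[0]
            draws = rng.normal(observed.mean(), observed.std(ddof=1), size=idx.size)
            y_imp[idx] = k * draws                      # delta-type adjustment
        effects.append(y_imp[arm == 1].mean() - y_imp[arm == 0].mean())
    return float(np.mean(effects))

# Tipping-point style scan: how far must k move before the estimate changes materially?
for k in (0.6, 0.8, 1.0, 1.2, 1.4):
    print(k, round(treatment_effect_under_k(k), 3))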
Chen, Jin; Roth, Robert E; Naito, Adam T; Lengerich, Eugene J; MacEachren, Alan M
2008-01-01
Background Kulldorff's spatial scan statistic and its software implementation – SaTScan – are widely used for detecting and evaluating geographic clusters. However, two issues make using the method and interpreting its results non-trivial: (1) the method lacks cartographic support for understanding the clusters in geographic context and (2) results from the method are sensitive to parameter choices related to cluster scaling (abbreviated as scaling parameters), but the system provides no direct support for making these choices. We employ both established and novel geovisual analytics methods to address these issues and to enhance the interpretation of SaTScan results. We demonstrate our geovisual analytics approach in a case study analysis of cervical cancer mortality in the U.S. Results We address the first issue by providing an interactive visual interface to support the interpretation of SaTScan results. Our research to address the second issue prompted a broader discussion about the sensitivity of SaTScan results to parameter choices. Sensitivity has two components: (1) the method can identify clusters that, while being statistically significant, have heterogeneous contents comprised of both high-risk and low-risk locations and (2) the method can identify clusters that are unstable in location and size as the spatial scan scaling parameter is varied. To investigate cluster result stability, we conducted multiple SaTScan runs with systematically selected parameters. The results, when scanning a large spatial dataset (e.g., U.S. data aggregated by county), demonstrate that no single spatial scan scaling value is known to be optimal to identify clusters that exist at different scales; instead, multiple scans that vary the parameters are necessary. We introduce a novel method of measuring and visualizing reliability that facilitates identification of homogeneous clusters that are stable across analysis scales. Finally, we propose a logical approach to proceed through the analysis of SaTScan results. Conclusion The geovisual analytics approach described in this manuscript facilitates the interpretation of spatial cluster detection methods by providing cartographic representation of SaTScan results and by providing visualization methods and tools that support selection of SaTScan parameters. Our methods distinguish between heterogeneous and homogeneous clusters and assess the stability of clusters across analytic scales. Method We analyzed the cervical cancer mortality data for the United States aggregated by county between 2000 and 2004. We ran SaTScan on the dataset fifty times with different parameter choices. Our geovisual analytics approach couples SaTScan with our visual analytic platform, allowing users to interactively explore and compare SaTScan results produced by different parameter choices. The Standardized Mortality Ratio and reliability scores are visualized for all the counties to identify stable, homogeneous clusters. We evaluated our analysis result by comparing it to that produced by other independent techniques including the Empirical Bayes Smoothing and Kafadar spatial smoother methods. The geovisual analytics approach introduced here is developed and implemented in our Java-based Visual Inquiry Toolkit. PMID:18992163
Hebels, Dennie G A J; Rasche, Axel; Herwig, Ralf; van Westen, Gerard J P; Jennen, Danyel G J; Kleinjans, Jos C S
2016-01-01
When evaluating compound similarity, addressing multiple sources of information to reach conclusions about common pharmaceutical and/or toxicological mechanisms of action is a crucial strategy. In this chapter, we describe a systems biology approach that incorporates analyses of hepatotoxicant data for 33 compounds from three different sources: a chemical structure similarity analysis based on the 3D Tanimoto coefficient, a chemical structure-based protein target prediction analysis, and a cross-study/cross-platform meta-analysis of in vitro and in vivo human and rat transcriptomics data derived from public resources (i.e., the diXa data warehouse). Hierarchical clustering of the outcome scores of the separate analyses did not result in a satisfactory grouping of compounds considering their known toxic mechanism as described in literature. However, a combined analysis of multiple data types may hypothetically compensate for missing or unreliable information in any of the single data types. We therefore performed an integrated clustering analysis of all three data sets using the R-based tool iClusterPlus. This indeed improved the grouping results. The compound clusters that were formed by means of iClusterPlus represent groups that show similar gene expression while simultaneously integrating a similarity in structure and protein targets, which corresponds much better with the known mechanism of action of these toxicants. Using an integrative systems biology approach may thus overcome the limitations of the separate analyses when grouping liver toxicants sharing a similar mechanism of toxicity.
Snell, Deborah L; Surgenor, Lois J; Hay-Smith, E Jean C; Williman, Jonathan; Siegert, Richard J
2015-01-01
Outcomes after mild traumatic brain injury (MTBI) vary, with slow or incomplete recovery for a significant minority. This study examines whether groups of cases with shared psychological factors but with different injury outcomes could be identified using cluster analysis. This is a prospective observational study following 147 adults presenting to a hospital-based emergency department or concussion services in Christchurch, New Zealand. This study examined associations between baseline demographic, clinical, psychological variables (distress, injury beliefs and symptom burden) and outcome 6 months later. A two-step approach to cluster analysis was applied (Ward's method to identify clusters, K-means to refine results). Three meaningful clusters emerged (high-adapters, medium-adapters, low-adapters). Baseline cluster-group membership was significantly associated with outcomes over time. High-adapters appeared recovered by 6-weeks and medium-adapters revealed improvements by 6-months. The low-adapters continued to endorse many symptoms, negative recovery expectations and distress, being significantly at risk for poor outcome more than 6-months after injury (OR (good outcome) = 0.12; CI = 0.03-0.53; p < 0.01). Cluster analysis supported the notion that groups could be identified early post-injury based on psychological factors, with group membership associated with differing outcomes over time. Implications for clinical care providers regarding therapy targets and cases that may benefit from different intensities of intervention are discussed.
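A minimal sketch of the two-step clustering procedure mentioned above: Ward's hierarchical method proposes an initial partition, and K-means, seeded with the Ward cluster means, refines it. The data here are random stand-ins for the baseline psychological variables, not the study's measures.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(147, 6))          # 147 cases x 6 standardized variables
k = 3                                  # high-, medium-, low-adapters

# Step 1: Ward's method to identify an initial k-cluster partition.
ward_labels = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")
seeds = np.vstack([X[ward_labels == g].mean(axis=0) for g in range(1, k + 1)])

# Step 2: K-means seeded with the Ward cluster means refines the partition.
refined = KMeans(n_clusters=k, init=seeds, n_init=1, random_state=0).fit_predict(X)
print(np.bincount(refined))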
Clustering of Variables for Mixed Data
NASA Astrophysics Data System (ADS)
Saracco, J.; Chavent, M.
2016-05-01
This chapter presents clustering of variables, the aim of which is to lump together strongly related variables. The proposed approach works on a mixed data set, i.e. a data set that contains both numerical and categorical variables. Two algorithms for clustering of variables are described: a hierarchical clustering and a k-means type clustering. A brief description of the PCAmix method (a principal component analysis for mixed data) is provided, since the calculation of the synthetic variables summarizing the obtained clusters of variables is based on this multivariate method. Finally, the R packages ClustOfVar and PCAmixdata are illustrated on real mixed data. The PCAmix and ClustOfVar approaches are first used for dimension reduction (step 1), before a standard clustering method is applied in step 2 to obtain groups of individuals.
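A simplified Python analogue of variable clustering, covering the numerical case only (the R packages above additionally handle categorical variables via PCAmix): variables are grouped hierarchically using 1 - r^2 as the dissimilarity, so strongly related variables end up in the same cluster. Data are simulated.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
# four variables: columns 1-2 track base[:, 0], columns 3-4 track base[:, 1]
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * rng.normal(size=200),
                     base[:, 1], base[:, 1] + 0.1 * rng.normal(size=200)])

corr = np.corrcoef(X, rowvar=False)
dissim = squareform(1.0 - corr ** 2, checks=False)   # condensed dissimilarity matrix
groups = fcluster(linkage(dissim, method="average"), t=2, criterion="maxclust")
print(groups)   # variables 1-2 and 3-4 fall into separate clusters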
Interactive visual exploration and analysis of origin-destination data
NASA Astrophysics Data System (ADS)
Ding, Linfang; Meng, Liqiu; Yang, Jian; Krisp, Jukka M.
2018-05-01
In this paper, we propose a visual analytics approach for the exploration of spatiotemporal interaction patterns of massive origin-destination data. Firstly, we visually query the movement database for data at certain time windows. Secondly, we conduct interactive clustering to allow the users to select input variables/features (e.g., origins, destinations, distance, and duration) and to adjust clustering parameters (e.g. distance threshold). The agglomerative hierarchical clustering method is applied for the multivariate clustering of the origin-destination data. Thirdly, we design a parallel coordinates plot for visualizing the precomputed clusters and for further exploration of interesting clusters. Finally, we propose a gradient line rendering technique to show the spatial and directional distribution of origin-destination clusters on a map view. We implement the visual analytics approach in a web-based interactive environment and apply it to real-world floating car data from Shanghai. The experiment results show the origin/destination hotspots and their spatial interaction patterns. They also demonstrate the effectiveness of our proposed approach.
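The sketch below illustrates the interactive clustering step described above: origin-destination records are clustered agglomeratively on user-selected features with a distance threshold rather than a fixed cluster count. Coordinates and units are illustrative stand-ins, not the Shanghai floating car data.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# columns: origin_x, origin_y, dest_x, dest_y, trip_distance, trip_duration
od = rng.normal(size=(500, 6))

features = StandardScaler().fit_transform(od)     # put features on comparable scales
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="ward"
).fit_predict(features)
print("number of OD clusters:", labels.max() + 1)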
Analysis of radiation-induced small Cu particle cluster formation in aqueous CuCl2
Jayanetti, Sumedha; Mayanovic, Robert A.; Anderson, Alan J.; Bassett, William A.; Chou, I.-Ming
2001-01-01
Radiation-induced small Cu particle cluster formation in aqueous CuCl2 was analyzed. It was noticed that the nearest neighbor distance increased with irradiation time, showing that the clusters approached the lattice dimension of bulk copper. As the average cluster size approached its bulk dimensions, an increase in the nearest neighbor coordination number was found with the decrease in the surface-to-volume ratio. Radiolysis of water by the incident x-ray beam led to the reduction of copper ions in the solution to the metallic state.
ERIC Educational Resources Information Center
Zettergren, Peter
2007-01-01
A modern clustering technique was applied to age-10 and age-13 sociometric data with the purpose of identifying longitudinally stable peer status clusters. The study included 445 girls from a Swedish longitudinal study. The identified temporally stable clusters of rejected, popular, and average girls were essentially larger than corresponding…
Hsu, Arthur L; Tang, Sen-Lin; Halgamuge, Saman K
2003-11-01
Current Self-Organizing Map (SOM) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates the merits of hierarchical clustering with the robustness against noise known from self-organizing approaches. The proposed algorithm, applied to DNA microarray data sets of two types of cancer, has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm, tested on leukemia microarray data containing three leukemia types, was able to determine three major and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low; it is therefore labelled as an uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data automatically derived two clusters, which is consistent with the number of classes in the data (cancerous and normal). Java software for the dynamic SOM tree algorithm is available upon request for academic use. A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf
A Dimensionally Reduced Clustering Methodology for Heterogeneous Occupational Medicine Data Mining.
Saâdaoui, Foued; Bertrand, Pierre R; Boudet, Gil; Rouffiac, Karine; Dutheil, Frédéric; Chamoux, Alain
2015-10-01
Clustering is a set of statistical learning techniques aimed at partitioning heterogeneous data into homogeneous groups called clusters. There are several fields in which clustering has been successfully applied, such as medicine, biology, finance, and economics. In this paper, we introduce the notion of clustering in multifactorial data analysis problems. A case study is conducted for an occupational medicine problem with the purpose of analyzing patterns in a population of 813 individuals. To reduce the dimensionality of the data set, we base our approach on Principal Component Analysis (PCA), the statistical tool most commonly used in factorial analysis. However, problems in nature, especially in medicine, are often based on heterogeneous qualitative-quantitative measurements, whereas PCA only processes quantitative ones. Besides, qualitative data are originally unobservable quantitative responses that are usually binary-coded. Hence, we propose a new set of strategies allowing quantitative and qualitative data to be handled simultaneously. The principle of this approach is to perform a projection of the qualitative variables onto the subspaces spanned by the quantitative ones. Subsequently, an optimal model is allocated to the resulting PCA-regressed subspaces.
A comparison of latent class, K-means, and K-median methods for clustering dichotomous data.
Brusco, Michael J; Shireman, Emilie; Steinley, Douglas
2017-09-01
The problem of partitioning a collection of objects based on their measurements on a set of dichotomous variables is a well-established problem in psychological research, with applications including clinical diagnosis, educational testing, cognitive categorization, and choice analysis. Latent class analysis and K-means clustering are popular methods for partitioning objects based on dichotomous measures in the psychological literature. The K-median clustering method has recently been touted as a potentially useful tool for psychological data and might be preferable to its close neighbor, K-means, when the variable measures are dichotomous. We conducted simulation-based comparisons of the latent class, K-means, and K-median approaches for partitioning dichotomous data. Although all 3 methods proved capable of recovering cluster structure, K-median clustering yielded the best average performance, followed closely by latent class analysis. We also report results for the 3 methods within the context of an application to transitive reasoning data, in which it was found that the 3 approaches can exhibit profound differences when applied to real data. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
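A toy comparison of K-means and a simple K-median (L1) partition on binary data, in the spirit of the study above. The K-median routine is a bare-bones Lloyd-style loop (assign by Manhattan distance, update centers with coordinate-wise medians), not the authors' implementation, and the simulated items are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def k_median(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)   # L1 distances
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = np.median(X[labels == j], axis=0)        # coordinate-wise medians
    return labels

rng = np.random.default_rng(3)
# two latent classes with different item-endorsement probabilities
X = np.vstack([rng.binomial(1, 0.8, size=(100, 10)),
               rng.binomial(1, 0.2, size=(100, 10))])
print(np.bincount(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)))
print(np.bincount(k_median(X, k=2)))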
A cross-species bi-clustering approach to identifying conserved co-regulated genes.
Sun, Jiangwen; Jiang, Zongliang; Tian, Xiuchun; Bi, Jinbo
2016-06-15
A growing number of studies have explored the process of pre-implantation embryonic development of multiple mammalian species. However, the conservation and variation among different species in their developmental programming are poorly defined due to the lack of effective computational methods for detecting co-regulated genes that are conserved across species. The most sophisticated method to date for identifying conserved co-regulated genes is a two-step approach. This approach first identifies gene clusters for each species by a cluster analysis of gene expression data, and subsequently computes the overlaps of clusters identified from different species to reveal common subgroups. This approach is ineffective in dealing with the noise in the expression data introduced by the complicated procedures used to quantify gene expression. Furthermore, due to the sequential nature of the approach, the gene clusters identified in the first step may have little overlap among different species in the second step, making it difficult to detect conserved co-regulated genes. We propose a cross-species bi-clustering approach which first denoises the gene expression data of each species into a data matrix. The rows of the data matrices of different species represent the same set of genes, which are characterized by their expression patterns over the developmental stages of each species as columns. A novel bi-clustering method is then developed to cluster genes into subgroups by a joint sparse rank-one factorization of all the data matrices. This method decomposes a data matrix into a product of a column vector and a row vector, where the column vector is a consistent indicator across the matrices (species) that identifies the same gene cluster, and the row vector specifies, for each species, the developmental stages over which the clustered genes co-regulate. An efficient optimization algorithm has been developed with convergence analysis. This approach was first validated on synthetic data and compared to the two-step method and several recent joint clustering methods. We then applied this approach to two real-world datasets of gene expression during the pre-implantation embryonic development of the human and mouse. Co-regulated genes consistent between the human and mouse were identified, offering insights into conserved functions, as well as similarities and differences in genome activation timing between the human and mouse embryos. The R package containing the implementation of the proposed method in C++ is available at: https://github.com/JavonSun/mvbc.git and also at the R platform https://www.r-project.org/ jinbo@engr.uconn.edu. © The Author 2016. Published by Oxford University Press.
Clustering analysis for muon tomography data elaboration in the Muon Portal project
NASA Astrophysics Data System (ADS)
Bandieramonte, M.; Antonuccio-Delogu, V.; Becciani, U.; Costa, A.; La Rocca, P.; Massimino, P.; Petta, C.; Pistagna, C.; Riggi, F.; Riggi, S.; Sciacca, E.; Vitello, F.
2015-05-01
Clustering analysis is a multivariate data analysis technique that gathers statistical data units into groups in order to minimize the logical distance within each group and to maximize the distance between different groups. In these proceedings, the authors present a novel approach to muon tomography data analysis based on clustering algorithms. As a case study we present the Muon Portal project, which aims to build and operate a dedicated particle detector for the inspection of harbor containers to hinder the smuggling of nuclear materials. Clustering techniques, working directly on scattering points, help to detect the presence of suspicious items inside the container, acting, as will be shown, as a filter for a preliminary analysis of the data.
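An illustrative density-based filter on simulated muon scattering points: a dense group of scattering vertices (a stand-in for a high-Z object inside the container) separates from diffuse background points. DBSCAN is used here as one possible clustering choice, not necessarily the algorithm adopted by the Muon Portal project, and all coordinates are synthetic.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
background = rng.uniform(0, 100, size=(2000, 3))                   # diffuse vertices (cm)
suspicious = rng.normal(loc=(50, 50, 50), scale=1.5, size=(300, 3))  # compact dense region
points = np.vstack([background, suspicious])

labels = DBSCAN(eps=2.0, min_samples=20).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("candidate dense regions:", n_clusters)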
Circulation Clusters--An Empirical Approach to Decentralization of Academic Libraries.
ERIC Educational Resources Information Center
McGrath, William E.
1986-01-01
Discusses the issue of centralization or decentralization of academic library collections, and describes a statistical analysis of book circulation at the University of Southwestern Louisiana that yielded subject area clusters as a compromise solution to the problem. Applications of the cluster model for all types of library catalogs are…
Multilevel Analysis Methods for Partially Nested Cluster Randomized Trials
ERIC Educational Resources Information Center
Sanders, Elizabeth A.
2011-01-01
This paper explores multilevel modeling approaches for 2-group randomized experiments in which a treatment condition involving clusters of individuals is compared to a control condition involving only ungrouped individuals, otherwise known as partially nested cluster randomized designs (PNCRTs). Strategies for comparing groups from a PNCRT in the…
ADHD and Reading Disabilities: A Cluster Analytic Approach for Distinguishing Subgroups.
ERIC Educational Resources Information Center
Bonafina, Marcela A.; Newcorn, Jeffrey H.; McKay, Kathleen E.; Koda, Vivian H.; Halperin, Jeffrey M.
2000-01-01
Using cluster analysis, a study empirically divided 54 children with attention-deficit/hyperactivity disorder (ADHD) based on their Full Scale IQ and reading ability. Clusters had different patterns of cognitive, behavioral, and neurochemical functions, as determined by discrepancies in Verbal-Performance IQ, academic achievement, parent…
Semi-Supervised Clustering for High-Dimensional and Sparse Features
ERIC Educational Resources Information Center
Yan, Su
2010-01-01
Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised where class labels are unknown a priori. In real application domains, however, some "weak" form of side…
Exploratory Item Classification Via Spectral Graph Clustering
Chen, Yunxiao; Li, Xiaoou; Liu, Jingchen; Xu, Gongjun; Ying, Zhiliang
2017-01-01
Large-scale assessments are supported by a large item pool. An important task in test development is to assign items into scales that measure different characteristics of individuals, and a popular approach is cluster analysis of items. Classical methods in cluster analysis, such as the hierarchical clustering, K-means method, and latent-class analysis, often induce a high computational overhead and have difficulty handling missing data, especially in the presence of high-dimensional responses. In this article, the authors propose a spectral clustering algorithm for exploratory item cluster analysis. The method is computationally efficient, effective for data with missing or incomplete responses, easy to implement, and often outperforms traditional clustering algorithms in the context of high dimensionality. The spectral clustering algorithm is based on graph theory, a branch of mathematics that studies the properties of graphs. The algorithm first constructs a graph of items, characterizing the similarity structure among items. It then extracts item clusters based on the graphical structure, grouping similar items together. The proposed method is evaluated through simulations and an application to the revised Eysenck Personality Questionnaire. PMID:29033476
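A sketch of item-level spectral clustering in the spirit of the article above: build an item-by-item similarity graph from pairwise-complete correlations (tolerating missing responses), then partition the graph with spectral clustering. The data are simulated, and the similarity choice (absolute correlation) is one of several reasonable options rather than the authors' exact construction.

import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(5)
n_persons, n_items = 500, 12
traits = rng.normal(size=(n_persons, 2))
loadings = np.zeros((2, n_items))
loadings[0, :6], loadings[1, 6:] = 1.0, 1.0                 # two latent scales
responses = traits @ loadings + rng.normal(scale=0.8, size=(n_persons, n_items))
responses[rng.random(responses.shape) < 0.1] = np.nan       # 10% missing responses

# pairwise-complete item similarities
sim = np.ones((n_items, n_items))
for i in range(n_items):
    for j in range(i + 1, n_items):
        ok = ~np.isnan(responses[:, i]) & ~np.isnan(responses[:, j])
        r = np.corrcoef(responses[ok, i], responses[ok, j])[0, 1]
        sim[i, j] = sim[j, i] = abs(r)

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(sim)
print(labels)   # items 0-5 and 6-11 should separate into two scales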
An improved K-means clustering algorithm in agricultural image segmentation
NASA Astrophysics Data System (ADS)
Cheng, Huifeng; Peng, Hui; Liu, Shanmei
Image segmentation is the first important step in image analysis and image processing. In this paper, according to the characteristics of color crop images, we first transform the color space of the image from RGB to HSI, and then select proper initial clustering centers and the cluster number using a mean-variance approach and rough set theory, followed by the clustering calculation, so as to automatically and rapidly segment color components and accurately extract target objects from the background, which provides a reliable basis for the identification, analysis, and follow-up calculation and processing of crop images. Experimental results demonstrate that the improved k-means clustering algorithm is able to reduce the amount of computation and enhance the precision and accuracy of clustering.
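A minimal segmentation sketch in the spirit of the abstract: convert an RGB crop image to a hue-based color space and cluster the pixels with K-means. HSV (via matplotlib) stands in for HSI here, the image is synthetic, and the paper's rough-set / mean-variance center initialization is replaced by ordinary k-means++ seeding.

import numpy as np
from matplotlib.colors import rgb_to_hsv
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
h, w = 60, 80
img = np.zeros((h, w, 3))
img[:, :40] = [0.1, 0.6, 0.1]                 # green "crop" region
img[:, 40:] = [0.5, 0.35, 0.2]                # brown "soil" region
img = np.clip(img + rng.normal(scale=0.03, size=img.shape), 0, 1)

pixels = rgb_to_hsv(img).reshape(-1, 3)       # hue-based features per pixel
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)
segmentation = labels.reshape(h, w)           # label image of the two regions
print(np.bincount(labels))                    # roughly half the pixels per class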
A time-series approach for clustering farms based on slaughterhouse health aberration data.
Hulsegge, B; de Greef, K H
2018-05-01
A large amount of data is collected routinely in meat inspection in pig slaughterhouses. A time-series clustering approach is presented and applied that groups farms based on similar statistical characteristics of meat inspection data over time. A three-step characteristic-based clustering approach was used, based on the idea that the data contain more information than the incidence figures alone. A stratified subset containing 511,645 pigs was derived as a study set from 3.5 years of meat inspection data. The monthly averages of the incidence of pleuritis and of pneumonia of 44 Dutch farms (delivering 5149 batches to 2 pig slaughterhouses) were subjected to (1) derivation of farm-level data characteristics, (2) factor analysis and (3) clustering into groups of farms. The characteristic-based clustering was able to cluster farms for both lung aberrations. Three groups of data characteristics were informative, describing incidence, time pattern and degree of autocorrelation. The consistency of clustering similar farms was confirmed by repetition of the analysis in a larger dataset. The robustness of the clustering was tested on a substantially extended dataset. This confirmed the earlier results: three data distribution aspects make up the majority of the distinction between groups of farms, and within these groups (clusters) the majority of the farms were allocated comparably to the earlier allocation (75% and 62% for pleuritis and pneumonia, respectively). The difference between pleuritis and pneumonia in their seasonal dependency was confirmed, supporting the biological relevance of the clustering. Comparison of the identified clusters of statistically comparable farms can be used to detect farm-level risk factors causing the health aberrations, beyond comparison on disease incidence and trend alone. Copyright © 2018 Elsevier B.V. All rights reserved.
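A sketch of the three-step characteristic-based clustering: (1) summarize each farm's monthly incidence series by a few statistical characteristics, (2) reduce them with factor analysis, (3) cluster farms on the factor scores. The series and the specific characteristics (level, trend, lag-1 autocorrelation) are illustrative assumptions chosen to mirror the three informative groups mentioned above, not the paper's exact feature set.

import numpy as np
from sklearn.decomposition import FactorAnalysis
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
months = np.arange(42)                                   # 3.5 years of monthly averages
farms = [rng.normal(loc=0.05 + 0.002 * months * (i % 2), scale=0.01, size=42)
         + 0.02 * np.sin(2 * np.pi * months / 12) * (i % 3 == 0)
         for i in range(44)]

def characteristics(series):
    trend = np.polyfit(np.arange(len(series)), series, 1)[0]     # slope over time
    ac1 = np.corrcoef(series[:-1], series[1:])[0, 1]             # lag-1 autocorrelation
    return [series.mean(), trend, ac1]

features = np.array([characteristics(s) for s in farms])         # step 1
scores = FactorAnalysis(n_components=2, random_state=0).fit_transform(features)  # step 2
groups = fcluster(linkage(scores, method="ward"), t=3, criterion="maxclust")     # step 3
print(np.bincount(groups)[1:])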
Potential of SNP markers for the characterization of Brazilian cassava germplasm.
de Oliveira, Eder Jorge; Ferreira, Cláudia Fortes; da Silva Santos, Vanderlei; de Jesus, Onildo Nunes; Oliveira, Gilmara Alvarenga Fachardo; da Silva, Maiane Suzarte
2014-06-01
High-throughput markers, such as SNPs, along with different methodologies were used to evaluate the applicability of the Bayesian approach and multivariate analysis in structuring the genetic diversity in cassava. The objective of the present work was to evaluate the diversity and genetic structure of the largest cassava germplasm bank in Brazil. Complementary methodological approaches such as discriminant analysis of principal components (DAPC), Bayesian analysis and analysis of molecular variance (AMOVA) were used to understand the structure and diversity of 1,280 accessions genotyped using 402 single nucleotide polymorphism markers. The genetic diversity (0.327) and the average observed heterozygosity (0.322) were high considering the bi-allelic markers. In terms of population structure, a complex genetic structure was observed, indicating the formation of 30 clusters by DAPC and 34 clusters by Bayesian analysis. Both methodologies presented difficulties and controversies in terms of the allocation of some accessions to specific clusters. However, the clusters suggested by the DAPC analysis seemed to be more consistent, presenting a higher probability of allocation of the accessions within the clusters. Prior information related to breeding patterns and geographic origins of the accessions was not sufficient to provide clear differentiation between the clusters according to the AMOVA. In contrast, the FST was maximized when considering the clusters suggested by the Bayesian and DAPC analyses. The high frequency of germplasm exchange between producers and the subsequent alteration of the name of the same material may be one of the causes of the low association between genetic diversity and geographic origin. The results of this study may benefit cassava germplasm conservation programs, and contribute to the maximization of genetic gains in breeding programs.
Garcia, Danilo; MacDonald, Shane; Archer, Trevor
2015-01-01
Background. The notion of the affective system as being composed of two dimensions led Archer and colleagues to the development of the affective profiles model. The model consists of four different profiles based on combinations of individuals' experience of high/low positive and negative affect: self-fulfilling, low affective, high affective, and self-destructive. During the past 10 years, an increasing number of studies have used this person-centered model as the backdrop for the investigation of between- and within-individual differences in ill-being and well-being. The most common approach to this profiling is to divide individuals' scores of self-reported affect using the median of the population as the reference for high/low splits. However, scores just above and just below the median might become high and low by arbitrariness, not by reality. Thus, it is plausible to criticize the validity of this variable-oriented approach. Our aim was to compare the median splits approach with a person-oriented approach, namely, cluster analysis. Method. The participants (N = 2,225) were recruited through Amazon's Mechanical Turk and asked to self-report affect using the Positive Affect Negative Affect Schedule. We compared the profiles' homogeneity and Silhouette coefficients to discern differences in homogeneity and heterogeneity between approaches. We also conducted exact cell-wise analyses matching the profiles from both approaches, and matching profiles and gender, to investigate profiling agreement with respect to affectivity levels and affectivity and gender. All analyses were conducted using the ROPstat software. Results. The cluster approach (weighted average of cluster homogeneity coefficients = 0.62, Silhouette coefficient = 0.68) generated profiles that were more homogeneous and more distinct from each other than the median splits approach (weighted average of cluster homogeneity coefficients = 0.75, Silhouette coefficient = 0.59). Most of the participants (n = 1,736, 78.0%) were allocated to the same profile (Rand Index = .83); however, 489 (21.98%) were allocated to different profiles depending on the approach. Both approaches allocated females and males similarly in three of the four profiles. Only the cluster analysis approach classified men significantly more often than chance to a self-fulfilling profile (type) and females less often than chance to this very same profile (antitype). Conclusions. Although the question of whether one approach is more appropriate than the other remains unanswered, the cluster method allocated individuals to profiles that are more in accordance with the conceptual basis of the model and also with expected gender differences. More importantly, regardless of the approach, our findings suggest that the model mirrors a complex and dynamic adaptive system.
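A comparison sketch for the two profiling approaches above: median splits of positive/negative affect versus a four-cluster K-means solution, judged here by the silhouette coefficient. The affect scores are simulated stand-ins for PANAS data, and the silhouette is used as a generic separation measure rather than the study's exact homogeneity coefficients.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(8)
affect = rng.normal(size=(2225, 2))            # columns: positive, negative affect

# variable-oriented approach: high/low splits at the sample medians -> 4 profiles
medians = np.median(affect, axis=0)
median_profiles = (affect[:, 0] > medians[0]).astype(int) * 2 + \
                  (affect[:, 1] > medians[1]).astype(int)

# person-oriented approach: K-means with four clusters
kmeans_profiles = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(affect)

print("median-split silhouette:", round(silhouette_score(affect, median_profiles), 3))
print("k-means silhouette:     ", round(silhouette_score(affect, kmeans_profiles), 3))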
Multiple imputation methods for bivariate outcomes in cluster randomised trials.
DiazOrdaz, K; Kenward, M G; Gomes, M; Grieve, R
2016-09-10
Missing observations are common in cluster randomised trials. The problem is exacerbated when modelling bivariate outcomes jointly, as the proportion of complete cases is often considerably smaller than the proportion having either of the outcomes fully observed. Approaches taken to handling such missing data include the following: complete case analysis, single-level multiple imputation that ignores the clustering, multiple imputation with a fixed effect for each cluster and multilevel multiple imputation. We contrasted the alternative approaches to handling missing data in a cost-effectiveness analysis that uses data from a cluster randomised trial to evaluate an exercise intervention for care home residents. We then conducted a simulation study to assess the performance of these approaches on bivariate continuous outcomes, in terms of confidence interval coverage and empirical bias in the estimated treatment effects. Missing-at-random clustered data scenarios were simulated following a full-factorial design. Across all the missing data mechanisms considered, the multiple imputation methods provided estimators with negligible bias, while complete case analysis resulted in biased treatment effect estimates in scenarios where the randomised treatment arm was associated with missingness. Confidence interval coverage was generally in excess of nominal levels (up to 99.8%) following fixed-effects multiple imputation and too low following single-level multiple imputation. Multilevel multiple imputation led to coverage levels of approximately 95% throughout. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
Clustering Gene Expression Regulators: New Approach to Disease Subtyping
Pyatnitskiy, Mikhail; Mazo, Ilya; Shkrob, Maria; Schwartz, Elena; Kotelnikova, Ekaterina
2014-01-01
One of the main challenges in modern medicine is to stratify different patient groups in terms of underlying disease molecular mechanisms so as to develop a more personalized approach to therapy. Here we propose a novel method for disease subtyping based on the analysis of activated expression regulators on a sample-by-sample basis. Our approach relies on the Sub-Network Enrichment Analysis algorithm (SNEA), which identifies gene subnetworks with significant concordant changes in expression between two conditions. A subnetwork consists of a central regulator and downstream genes connected by relations extracted from a global literature-extracted regulation database. Regulators found in each patient separately are clustered together and assigned activity scores, which are used for the final grouping of patients. We show that our approach performs well compared to other related methods and at the same time provides researchers with a complementary level of understanding of the pathway-level biology behind a disease through the identification of significant expression regulators. We observed a reasonable grouping of neuromuscular disorders (triggered by structural damage vs. triggered by unknown mechanisms) that was not revealed using standard expression profile clustering. In another experiment we were able to suggest the clusters of regulators responsible for discriminating colorectal carcinoma from adenoma, and to identify frequently genetically changed regulators that could be of specific importance for the individual characteristics of cancer development. The proposed approach can be regarded as biologically meaningful feature selection, reducing tens of thousands of genes down to dozens of clusters of regulators. The obtained clusters of regulators make it possible to generate valuable biological hypotheses about molecular mechanisms related to a clinical outcome for an individual patient. PMID:24416320
A Study of Pupil Control Ideology: A Person-Oriented Approach to Data Analysis
ERIC Educational Resources Information Center
Adwere-Boamah, Joseph
2010-01-01
Responses of urban school teachers to the Pupil Control Ideology questionnaire were studied using Latent Class Analysis. The results of the analysis suggest that the best fitting model to the data is a two-cluster solution. In particular, the pupil control ideology of the sample delineates into two clusters of teachers, those with humanistic and…
Spatiotemporal Analysis of the Ebola Hemorrhagic Fever in West Africa in 2014
NASA Astrophysics Data System (ADS)
Xu, M.; Cao, C. X.; Guo, H. F.
2017-09-01
Ebola hemorrhagic fever (EHF) is an acute hemorrhagic disease caused by the Ebola virus, which is highly contagious. This paper aims to explore the possible gathering areas of EHF cases in West Africa in 2014 and to identify endemic areas and their tendency by means of space-time analysis. We mapped the distribution of EHF incidences and explored statistically significant spatial, temporal and space-time disease clusters. We utilized hotspot analysis to find the spatial clustering pattern on the basis of the actual outbreak cases. Spatial-temporal cluster analysis is used to analyze the spatial or temporal distribution of agglomerated disease and to examine whether its distribution is statistically significant. Local clusters were investigated using Kulldorff's scan statistic approach. The results reveal that the epidemic mainly gathered in the western part of Africa near the North Atlantic, with an obvious regional distribution. For the current epidemic, we found areas with a high incidence of EVD by means of spatial cluster analysis.
Clustering performance comparison using K-means and expectation maximization algorithms.
Jung, Yong Gyu; Kang, Min Soo; Heo, Jun
2014-11-14
Clustering is an important means of data mining based on separating data categories by similar features. Unlike classification algorithms, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K-means and the expectation maximization (EM) algorithms. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, logistic regression analysis is applied to EM clusters and to the K-means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.
Complex networks as a unified framework for descriptive analysis and predictive modeling in climate
DOE Office of Scientific and Technical Information (OSTI.GOV)
Steinhaeuser, Karsten J K; Chawla, Nitesh; Ganguly, Auroop R
The analysis of climate data has relied heavily on hypothesis-driven statistical methods, while projections of future climate are based primarily on physics-based computational models. However, in recent years a wealth of new datasets has become available. Therefore, we take a more data-centric approach and propose a unified framework for studying climate, with an aim towards characterizing observed phenomena as well as discovering new knowledge in the climate domain. Specifically, we posit that complex networks are well-suited for both descriptive analysis and predictive modeling tasks. We show that the structural properties of climate networks have useful interpretation within the domain. Further, we extract clusters from these networks and demonstrate their predictive power as climate indices. Our experimental results establish that the network clusters are statistically significantly better predictors than clusters derived using a more traditional clustering approach. Using complex networks as data representation thus enables the unique opportunity for descriptive and predictive modeling to inform each other.
Retrospective Benefit-Cost Evaluation of DOE Investment in Photovoltaic Energy Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
O'Connor, Alan C.; Loomis, Ross J.; Braun, Fern M.
2010-08-01
This study is a retrospective analysis of net benefits accruing from DOE's investment in photovoltaic (PV) technology development. The study employed a technology cluster approach. That is, benefits measured for a subset of technologies in a meaningful cluster, or portfolio, of technologies were compared to the total investment in the cluster to provide a lower bound measure of return for the entire cluster.
NASA Astrophysics Data System (ADS)
Crawford, I.; Ruske, S.; Topping, D. O.; Gallagher, M. W.
2015-11-01
In this paper we present improved methods for discriminating and quantifying primary biological aerosol particles (PBAPs) by applying hierarchical agglomerative cluster analysis to multi-parameter ultraviolet-light-induced fluorescence (UV-LIF) spectrometer data. The methods employed in this study can be applied to data sets in excess of 1 × 10^6 points on a desktop computer, allowing each fluorescent particle in a data set to be explicitly clustered. This reduces the potential for misattribution found in the subsampling and comparative attribution methods used in previous approaches, improving our capacity to discriminate and quantify PBAP meta-classes. We evaluate the performance of several hierarchical agglomerative cluster analysis linkages and data normalisation methods using laboratory samples of known particle types and an ambient data set. Fluorescent and non-fluorescent polystyrene latex spheres were sampled with a Wideband Integrated Bioaerosol Spectrometer (WIBS-4), where the optical size, asymmetry factor and fluorescence measurements were used as inputs to the analysis package. It was found that the Ward linkage with z-score or range normalisation performed best, correctly attributing 98% and 98.1% of the data points, respectively. The best-performing methods were applied to the BEACHON-RoMBAS (Bio-hydro-atmosphere interactions of Energy, Aerosols, Carbon, H2O, Organics and Nitrogen-Rocky Mountain Biogenic Aerosol Study) ambient data set, where it was found that the z-score and range normalisation methods yield similar results, with each method producing clusters representative of fungal spores and bacterial aerosol, consistent with previous results. The z-score result was compared to clusters generated with previous approaches (WIBS AnalysiS Program, WASP), where we observe that the subsampling and comparative attribution method employed by WASP results in the overestimation of the fungal spore concentration by a factor of 1.5 and the underestimation of the bacterial aerosol concentration by a factor of 5. We suggest that this is likely due to errors arising from misattribution due to poor centroid definition and failure to assign particles to a cluster as a result of the subsampling and comparative attribution method employed by WASP. The methods used here allow the entire fluorescent population of particles to be analysed, yielding an explicit cluster attribution for each particle and improving cluster centroid definition and our capacity to discriminate and quantify PBAP meta-classes compared to previous approaches.
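A minimal sketch of the best-performing combination reported above: z-score normalization followed by Ward-linkage hierarchical clustering. The particle measurements (size, asymmetry factor, three fluorescence channels) and their two populations are simulated stand-ins for WIBS-4 data.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
spores = rng.normal(loc=[3.0, 10, 200, 50, 400], scale=[0.5, 2, 30, 10, 60], size=(1500, 5))
bacteria = rng.normal(loc=[1.0, 5, 60, 300, 80], scale=[0.2, 1, 10, 40, 15], size=(1500, 5))
particles = np.vstack([spores, bacteria])

z = StandardScaler().fit_transform(particles)           # z-score normalization
labels = fcluster(linkage(z, method="ward"), t=2, criterion="maxclust")
print(np.bincount(labels)[1:])                          # two PBAP-like clusters recovered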
On selecting a prior for the precision parameter of Dirichlet process mixture models
Dorazio, R.M.
2009-01-01
In hierarchical mixture models the Dirichlet process is used to specify latent patterns of heterogeneity, particularly when the distribution of latent parameters is thought to be clustered (multimodal). The parameters of a Dirichlet process include a precision parameter α and a base probability measure G0. In problems where α is unknown and must be estimated, inferences about the level of clustering can be sensitive to the choice of prior assumed for α. In this paper an approach is developed for computing a prior for the precision parameter α that can be used in the presence or absence of prior information about the level of clustering. This approach is illustrated in an analysis of counts of stream fishes. The results of this fully Bayesian analysis are compared with an empirical Bayes analysis of the same data and with a Bayesian analysis based on an alternative commonly used prior.
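A standard Dirichlet process identity, stated here as a reminder rather than as part of the paper's derivation, links the precision parameter to the prior expected number of distinct clusters among n observations and is often useful when eliciting a prior for α:

E[K_n \mid \alpha] = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1} \approx \alpha \log\!\left(1 + \frac{n}{\alpha}\right)

so any prior placed on α implies a corresponding prior on the expected level of clustering.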
Dorfman, David M; LaPlante, Charlotte D; Pozdnyakova, Olga; Li, Betty
2015-11-01
In our high-sensitivity flow cytometric approach for systemic mastocytosis (SM), we identified mast cell event clustering as a new diagnostic criterion for the disease. To objectively characterize mast cell gated event distributions, we performed cluster analysis using FLOCK, a computational approach to identify cell subsets in multidimensional flow cytometry data in an unbiased, automated fashion. FLOCK identified discrete mast cell populations in most cases of SM (56/75 [75%]) but only a minority of non-SM cases (17/124 [14%]). FLOCK-identified mast cell populations accounted for 2.46% of total cells on average in SM cases and 0.09% of total cells on average in non-SM cases (P < .0001) and were predictive of SM, with a sensitivity of 75%, a specificity of 86%, a positive predictive value of 76%, and a negative predictive value of 85%. FLOCK analysis provides useful diagnostic information for evaluating patients with suspected SM, and may be useful for the analysis of other hematopoietic neoplasms. Copyright© by the American Society for Clinical Pathology.
Cluster analysis in phenotyping a Portuguese population.
Loureiro, C C; Sa-Couto, P; Todo-Bom, A; Bousquet, J
2015-09-03
Unbiased cluster analysis using clinical parameters has identified asthma phenotypes. Adding inflammatory biomarkers to this analysis provided a better insight into the disease mechanisms. This approach has not yet been applied to asthmatic Portuguese patients. To identify phenotypes of asthma using cluster analysis in a Portuguese asthmatic population treated in secondary medical care. Consecutive patients with asthma were recruited from the outpatient clinic. Patients were optimally treated according to GINA guidelines and enrolled in the study. Procedures were performed according to a standard evaluation of asthma. Phenotypes were identified by cluster analysis using Ward's clustering method. Of the 72 patients enrolled, 57 had full data and were included for cluster analysis. Distribution was set in 5 clusters described as follows: cluster (C) 1, early onset mild allergic asthma; C2, moderate allergic asthma, with long evolution, female prevalence and mixed inflammation; C3, allergic brittle asthma in young females with early disease onset and no evidence of inflammation; C4, severe asthma in obese females with late disease onset, highly symptomatic despite low Th2 inflammation; C5, severe asthma with chronic airflow obstruction, late disease onset and eosinophilic inflammation. In our study population, the identified clusters were mainly coincident with other larger-scale cluster analysis. Variables such as age at disease onset, obesity, lung function, FeNO (Th2 biomarker) and disease severity were important for cluster distinction. Copyright © 2015. Published by Elsevier España, S.L.U.
Multivariate Statistical Analysis of MSL APXS Bulk Geochemical Data
NASA Astrophysics Data System (ADS)
Hamilton, V. E.; Edwards, C. S.; Thompson, L. M.; Schmidt, M. E.
2014-12-01
We apply cluster and factor analyses to bulk chemical data of 130 soil and rock samples measured by the Alpha Particle X-ray Spectrometer (APXS) on the Mars Science Laboratory (MSL) rover Curiosity through sol 650. Multivariate approaches such as principal components analysis (PCA), cluster analysis, and factor analysis complement more traditional approaches (e.g., Harker diagrams), with the advantage of simultaneously examining the relationships between multiple variables for large numbers of samples. Principal components analysis has been applied with success to APXS, Pancam, and Mössbauer data from the Mars Exploration Rovers. Factor analysis and cluster analysis have been applied with success to thermal infrared (TIR) spectral data of Mars. Cluster analyses group the input data by similarity, where there are a number of different methods for defining similarity (hierarchical, density, distribution, etc.). For example, without any assumptions about the chemical contributions of surface dust, preliminary hierarchical and K-means cluster analyses clearly distinguish the physically adjacent rock targets Windjana and Stephen as being distinctly different from lithologies observed prior to Curiosity's arrival at The Kimberley. In addition, they are separated from each other, consistent with chemical trends observed in variation diagrams but without requiring assumptions about chemical relationships. We will discuss the variation in cluster analysis results as a function of clustering method and pre-processing (e.g., log transformation, correction for dust cover) and implications for interpreting chemical data. Factor analysis shares some similarities with PCA, and examines the variability among observed components of a dataset so as to reveal variations attributable to unobserved components. Factor analysis has been used to extract the TIR spectra of components that are typically observed in mixtures and only rarely in isolation; there is the potential for similar results with data from APXS. These techniques offer new ways to understand the chemical relationships between the materials interrogated by Curiosity, and potentially their relation to materials observed by APXS instruments on other landed missions.
Chen, Lin-Yuan; Tang, Ping-Han; Wu, Ten-Ming
2016-07-14
In terms of the local bond-orientational order (LBOO) parameters, a cluster approach to analyzing the local structures of simple liquids was developed. In this approach, a cluster is defined as a combination of neighboring seeds having at least n_b local-orientational bonds and their nearest neighbors, and a cluster ensemble is a collection of clusters with a specified n_b and number of seeds n_s. This cluster analysis was applied to investigate the microscopic structures of liquid Ga at ambient pressure (AP). The liquid structures studied were generated through ab initio molecular dynamics simulations. By scrutinizing the static structure factors (SSFs) of cluster ensembles with different combinations of n_b and n_s, we found that liquid Ga at AP contained two types of cluster structures, one characterized by sixfold orientational symmetry and the other showing fourfold orientational symmetry. The SSFs of cluster structures with sixfold orientational symmetry were akin to the SSF of a hard-sphere fluid. On the contrary, the SSFs of cluster structures showing fourfold orientational symmetry behaved similarly to the anomalous SSF of liquid Ga at AP, which is well known for exhibiting a high-q shoulder. The local structures of a highly LBOO cluster whose SSF displayed a high-q shoulder were found to be more similar to the structure of β-Ga than to those of other solid phases of Ga. More generally, the cluster structures showing fourfold orientational symmetry tend to resemble β-Ga.
Dinov, Martin; Leech, Robert
2017-01-01
Part of the process of EEG microstate estimation involves clustering EEG channel data at the global field power (GFP) maxima, very commonly using a modified K-means approach. Clustering has also been done deterministically, despite there being uncertainties in multiple stages of the microstate analysis, including the GFP peak definition, the clustering itself and in the post-clustering assignment of microstates back onto the EEG timecourse of interest. We perform a fully probabilistic microstate clustering and labeling, to account for these sources of uncertainty using the closest probabilistic analog to KM called Fuzzy C-means (FCM). We train softmax multi-layer perceptrons (MLPs) using the KM and FCM-inferred cluster assignments as target labels, to then allow for probabilistic labeling of the full EEG data instead of the usual correlation-based deterministic microstate label assignment typically used. We assess the merits of the probabilistic analysis vs. the deterministic approaches in EEG data recorded while participants perform real or imagined motor movements from a publicly available data set of 109 subjects. Though FCM group template maps that are almost topographically identical to KM were found, there is considerable uncertainty in the subsequent assignment of microstate labels. In general, imagined motor movements are less predictable on a time point-by-time point basis, possibly reflecting the more exploratory nature of the brain state during imagined, compared to during real motor movements. We find that some relationships may be more evident using FCM than using KM and propose that future microstate analysis should preferably be performed probabilistically rather than deterministically, especially in situations such as with brain computer interfaces, where both training and applying models of microstates need to account for uncertainty. Probabilistic neural network-driven microstate assignment has a number of advantages that we have discussed, which are likely to be further developed and exploited in future studies. In conclusion, probabilistic clustering and a probabilistic neural network-driven approach to microstate analysis is likely to better model and reveal details and the variability hidden in current deterministic and binarized microstate assignment and analyses. PMID:29163110
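A bare-bones fuzzy C-means loop on simulated GFP-peak topographies, illustrating the probabilistic microstate clustering discussed above. This is a generic FCM implementation, not the authors' pipeline, and the channel data are random stand-ins for real EEG maps.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                 # membership rows sum to one
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # fuzzy-weighted centroids
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))                # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

rng = np.random.default_rng(10)
# four simulated "map types" sampled at 64 channels
topographies = np.vstack([rng.normal(loc=mu, scale=0.3, size=(200, 64))
                          for mu in (-1.0, 0.0, 1.0, 2.0)])
U, centers = fuzzy_c_means(topographies, c=4)
print(U.max(axis=1).mean())   # average confidence of the most likely microstate label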
Clustering analysis of proteins from microbial genomes at multiple levels of resolution.
Zaslavsky, Leonid; Ciufo, Stacy; Fedorov, Boris; Tatusova, Tatiana
2016-08-31
Microbial genomes at the National Center for Biotechnology Information (NCBI) represent a large collection of more than 35,000 assemblies. There are several complexities associated with the data: a great variation in sampling density, since human pathogens are densely sampled while other bacteria are less represented; different protein families occur in annotations with different frequencies; and the quality of genome annotation varies greatly. In order to extract useful information from these sophisticated data, the analysis needs to be performed at multiple levels of phylogenomic resolution and protein similarity, with an adequate sampling strategy. Protein clustering is used to construct meaningful and stable groups of similar proteins to be used for analysis and functional annotation. Our approach is to create protein clusters at three levels. First, tight clusters in groups of closely related genomes (species-level clades) are constructed using a combined approach that takes into account both sequence similarity and genome context. Second, clustroids of conservative in-clade clusters are organized into seed global clusters. Finally, global protein clusters are built around the seed clusters. We propose filtering strategies that allow the protein set included in global clustering to be limited. The in-clade clustering procedure, subsequent selection of clustroids and organization into seed global clusters provide a robust representation and a high rate of compression. Seed protein clusters are further extended by adding related proteins. Extended seed clusters include a significant part of the data and represent all major known cell machinery. The remaining part, coming from either non-conservative (unique) or rapidly evolving proteins, from rare genomes, or resulting from low-quality annotation, does not group together well. Processing these proteins requires significant computational resources and results in a large number of questionable clusters. The developed filtering strategies allow such peripheral proteins to be identified and excluded, limiting the protein dataset in global clustering. Overall, the proposed methodology allows the relevant data to be obtained at different levels of detail and data redundancy to be eliminated while keeping biologically interesting variations.
NASA Astrophysics Data System (ADS)
Hozé, Nathanaël; Holcman, David
2012-01-01
We develop a coagulation-fragmentation model to study a system composed of a small number of stochastic objects moving in a confined domain, that can aggregate upon binding to form local clusters of arbitrary sizes. A cluster can also dissociate into two subclusters with a uniform probability. To study the statistics of clusters, we combine a Markov chain analysis with a partition number approach. Interestingly, we obtain explicit formulas for the size and the number of clusters in terms of hypergeometric functions. Finally, we apply our analysis to study the statistical physics of telomeres (ends of chromosomes) clustering in the yeast nucleus and show that the diffusion-coagulation-fragmentation process can predict the organization of telomeres.
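A toy simulation of the coagulation-fragmentation dynamics described above: at each step two randomly chosen clusters merge, or a randomly chosen cluster splits into two subclusters at a uniformly chosen point. The event probability and initial cluster count are arbitrary illustrative values, not the paper's telomere parameters or its analytical Markov chain treatment.

import numpy as np

rng = np.random.default_rng(11)
clusters = [1] * 32            # start from 32 single particles/telomeres
p_merge = 0.6                  # probability of a coagulation event per step

for _ in range(5000):
    if rng.random() < p_merge and len(clusters) > 1:
        i, j = rng.choice(len(clusters), size=2, replace=False)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    else:
        i = rng.integers(len(clusters))
        if clusters[i] > 1:
            cut = rng.integers(1, clusters[i])       # uniform split point
            clusters = clusters[:i] + clusters[i + 1:] + [cut, clusters[i] - cut]

sizes, counts = np.unique(clusters, return_counts=True)
print(dict(zip(sizes.tolist(), counts.tolist())))    # empirical cluster-size distribution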
Interactive visual exploration and refinement of cluster assignments.
Kern, Michael; Lex, Alexander; Gehlenborg, Nils; Johnson, Chris R
2017-09-12
With ever-increasing amounts of data produced in biology research, scientists are in need of efficient data analysis methods. Cluster analysis, combined with visualization of the results, is one such method that can be used to make sense of large data volumes. At the same time, cluster analysis is known to be imperfect and depends on the choice of algorithms, parameters, and distance measures. Most clustering algorithms don't properly account for ambiguity in the source data, as records are often assigned to discrete clusters, even if an assignment is unclear. While there are metrics and visualization techniques that allow analysts to compare clusterings or to judge cluster quality, there is no comprehensive method that allows analysts to evaluate, compare, and refine cluster assignments based on the source data, derived scores, and contextual data. In this paper, we introduce a method that explicitly visualizes the quality of cluster assignments, allows comparisons of clustering results and enables analysts to manually curate and refine cluster assignments. Our methods are applicable to matrix data clustered with partitional, hierarchical, and fuzzy clustering algorithms. Furthermore, we enable analysts to explore clustering results in context of other data, for example, to observe whether a clustering of genomic data results in a meaningful differentiation in phenotypes. Our methods are integrated into Caleydo StratomeX, a popular, web-based, disease subtype analysis tool. We show in a usage scenario that our approach can reveal ambiguities in cluster assignments and produce improved clusterings that better differentiate genotypes and phenotypes.
Generalized Analysis Tools for Multi-Spacecraft Missions
NASA Astrophysics Data System (ADS)
Chanteur, G. M.
2011-12-01
Analysis tools for multi-spacecraft missions like CLUSTER or MMS have been designed since the end of the 1990s to estimate gradients of fields or to characterize discontinuities crossed by a cluster of spacecraft. Different approaches have been presented and discussed in the book "Analysis Methods for Multi-Spacecraft Data" published as Scientific Report 001 of the International Space Science Institute in Bern, Switzerland (G. Paschmann and P. Daly Eds., 1998). On one hand, the least-squares approach has the advantage of applying to any number of spacecraft [1] but is not convenient for analytical computation, especially when considering the error analysis. On the other hand, the barycentric approach is powerful as it provides simple analytical formulas involving the reciprocal vectors of the tetrahedron [2], but appears limited to clusters of four spacecraft. Moreover, the barycentric approach allows theoretical formulas to be derived for the errors affecting the estimators built from the reciprocal vectors [2,3,4]. Following a first generalization of reciprocal vectors proposed by Vogt et al. [4], and despite the present lack of projects with more than four spacecraft, we present generalized reciprocal vectors for a cluster made of any number of spacecraft: each spacecraft is given a positive or null weight. The non-coplanarity of at least four spacecraft with strictly positive weights is a necessary and sufficient condition for this analysis to be enabled. Weights given to spacecraft make it possible to minimize the influence of a spacecraft whose location or data quality is not appropriate, or simply to extract subsets of spacecraft from the cluster. The estimators presented in [2] are generalized within this new framework, except for the error analysis which is still under investigation. References [1] Harvey, C. C.: Spatial Gradients and the Volumetric Tensor, in: Analysis Methods for Multi-Spacecraft Data, G. Paschmann and P. Daly (eds.), pp. 307-322, ISSI SR-001, 1998. [2] Chanteur, G.: Spatial Interpolation for Four Spacecraft: Theory, in: Analysis Methods for Multi-Spacecraft Data, G. Paschmann and P. Daly (eds.), pp. 371-393, ISSI SR-001, 1998. [3] Chanteur, G.: Accuracy of field gradient estimations by Cluster: Explanation of its dependency upon elongation and planarity of the tetrahedron, pp. 265-268, ESA SP-449, 2000. [4] Vogt, J., Paschmann, G., and Chanteur, G.: Reciprocal Vectors, pp. 33-46, ISSI SR-008, 2008.
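As a rough illustration of the weighted construction, the sketch below builds reciprocal-like vectors from a weighted least-squares fit of a linear field model around the weighted mesocentre; for four equally weighted, non-coplanar spacecraft this reduces to the standard tetrahedron reciprocal vectors. The exact normalization and naming conventions of the paper are not reproduced, and the example positions and field values are invented.

```python
import numpy as np

def reciprocal_vectors(positions, weights=None):
    """Weighted least-squares reciprocal vectors for N spacecraft (sketch).

    positions : (N, 3) array of spacecraft positions
    weights   : (N,) array of non-negative weights (default: equal)
    Returns q : (N, 3) array satisfying sum_i q_i = 0 and
                sum_i q_i (r_i - r_b)^T = identity, provided at least
                four spacecraft with strictly positive weights are non-coplanar.
    """
    r = np.asarray(positions, dtype=float)
    w = np.ones(len(r)) if weights is None else np.asarray(weights, dtype=float)
    r_b = np.average(r, axis=0, weights=w)        # weighted mesocentre
    dr = r - r_b
    R = (w[:, None] * dr).T @ dr                  # weighted position tensor
    q = w[:, None] * dr @ np.linalg.inv(R)        # generalized reciprocal vectors
    return q

def linear_gradient(q, field_values):
    """Least-squares estimate of the spatial gradient of a scalar field."""
    return q.T @ np.asarray(field_values, dtype=float)

# Example: four spacecraft, with one down-weighted because of poorer data quality
pos = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
q = reciprocal_vectors(pos, weights=[1, 1, 1, 0.5])
grad_n = linear_gradient(q, [5.0, 5.3, 4.9, 5.1])   # e.g. plasma density samples
```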
NASA Astrophysics Data System (ADS)
Farsadnia, F.; Rostami Kamrood, M.; Moghaddam Nia, A.; Modarres, R.; Bray, M. T.; Han, D.; Sadatinejad, J.
2014-02-01
One of the several methods for estimating flood quantiles in ungauged or data-scarce watersheds is regional frequency analysis. Amongst the approaches to regional frequency analysis, different clustering techniques have been proposed in the literature to determine hydrologically homogeneous regions. Recently, the Self-Organizing Feature Map (SOM), a modern hydroinformatic tool, has been applied in several studies for clustering watersheds. However, further studies are still needed on the interpretation of the SOM output map for identifying hydrologically homogeneous regions. In this study, a two-level SOM and three clustering methods (fuzzy c-means, k-means, and Ward's agglomerative hierarchical clustering) are applied in an effort to identify hydrologically homogeneous regions in Mazandaran province watersheds in the north of Iran, and their results are compared with each other. First, the SOM is used to form a two-dimensional feature map. Next, the output nodes of the SOM are clustered by using the unified distance matrix algorithm and the three clustering methods to form regions for flood frequency analysis. The heterogeneity test indicates that the four regions obtained by the two-level SOM and Ward approach are, after adjustments, sufficiently homogeneous. The results suggest that the combination of SOM and Ward is much better than the combination of either SOM and FCM or SOM and k-means.
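A minimal sketch of the two-level idea (SOM prototypes clustered by Ward's method) is given below. It assumes the third-party minisom package for the SOM; the grid size, number of regions, and the placeholder catchment-attribute matrix are all illustrative assumptions rather than the study's settings.

```python
import numpy as np
from minisom import MiniSom                        # third-party SOM package (assumed available)
from scipy.cluster.hierarchy import linkage, fcluster

# X: (n_watersheds, n_attributes) matrix of standardized catchment attributes
X = np.random.default_rng(1).normal(size=(40, 7))  # placeholder data

# Level 1: train a small SOM to obtain prototype (codebook) vectors
som = MiniSom(6, 6, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=1)
som.random_weights_init(X)
som.train_random(X, 5000)
codebook = som.get_weights().reshape(-1, X.shape[1])   # 36 prototype vectors

# Level 2: Ward's hierarchical clustering of the SOM prototypes
Z = linkage(codebook, method="ward")
proto_labels = fcluster(Z, t=4, criterion="maxclust")  # e.g. 4 candidate regions

# Map each watershed to the cluster of its best-matching SOM node
bmus = np.array([som.winner(x) for x in X])
node_index = bmus[:, 0] * 6 + bmus[:, 1]
region = proto_labels[node_index]
```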
NeCamp, Timothy; Kilbourne, Amy; Almirall, Daniel
2017-08-01
Cluster-level dynamic treatment regimens can be used to guide sequential treatment decision-making at the cluster level in order to improve outcomes at the individual or patient-level. In a cluster-level dynamic treatment regimen, the treatment is potentially adapted and re-adapted over time based on changes in the cluster that could be impacted by prior intervention, including aggregate measures of the individuals or patients that compose it. Cluster-randomized sequential multiple assignment randomized trials can be used to answer multiple open questions preventing scientists from developing high-quality cluster-level dynamic treatment regimens. In a cluster-randomized sequential multiple assignment randomized trial, sequential randomizations occur at the cluster level and outcomes are observed at the individual level. This manuscript makes two contributions to the design and analysis of cluster-randomized sequential multiple assignment randomized trials. First, a weighted least squares regression approach is proposed for comparing the mean of a patient-level outcome between the cluster-level dynamic treatment regimens embedded in a sequential multiple assignment randomized trial. The regression approach facilitates the use of baseline covariates which is often critical in the analysis of cluster-level trials. Second, sample size calculators are derived for two common cluster-randomized sequential multiple assignment randomized trial designs for use when the primary aim is a between-dynamic treatment regimen comparison of the mean of a continuous patient-level outcome. The methods are motivated by the Adaptive Implementation of Effective Programs Trial which is, to our knowledge, the first-ever cluster-randomized sequential multiple assignment randomized trial in psychiatry.
Borri, Marco; Schmidt, Maria A; Powell, Ceri; Koh, Dow-Mu; Riddell, Angela M; Partridge, Mike; Bhide, Shreerang A; Nutting, Christopher M; Harrington, Kevin J; Newbold, Katie L; Leach, Martin O
2015-01-01
To describe a methodology, based on cluster analysis, to partition multi-parametric functional imaging data into groups (or clusters) of similar functional characteristics, with the aim of characterizing functional heterogeneity within head and neck tumour volumes. To evaluate the performance of the proposed approach on a set of longitudinal MRI data, analysing the evolution of the obtained sub-sets with treatment. The cluster analysis workflow was applied to a combination of dynamic contrast-enhanced and diffusion-weighted imaging MRI data from a cohort of squamous cell carcinoma of the head and neck patients. Cumulative distributions of voxels, containing pre and post-treatment data and including both primary tumours and lymph nodes, were partitioned into k clusters (k = 2, 3 or 4). Principal component analysis and cluster validation were employed to investigate data composition and to independently determine the optimal number of clusters. The evolution of the resulting sub-regions with induction chemotherapy treatment was assessed relative to the number of clusters. The clustering algorithm was able to separate clusters which significantly reduced in voxel number following induction chemotherapy from clusters with a non-significant reduction. Partitioning with the optimal number of clusters (k = 4), determined with cluster validation, produced the best separation between reducing and non-reducing clusters. The proposed methodology was able to identify tumour sub-regions with distinct functional properties, independently separating clusters which were affected differently by treatment. This work demonstrates that unsupervised cluster analysis, with no prior knowledge of the data, can be employed to provide a multi-parametric characterization of functional heterogeneity within tumour volumes.
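The sketch below mirrors the workflow described above at a very high level: pool voxel-wise functional parameters, inspect the data composition with PCA, partition with k = 2, 3, 4 and pick k with a validation index. K-means is used here as a stand-in for whichever partitioning algorithm the authors used, the silhouette score stands in for their cluster-validation criterion, and the voxel matrix is a placeholder.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# voxels: (n_voxels, n_parameters) matrix pooling DCE and DWI parameters
# (e.g. Ktrans, ve, ADC) from all lesions, pre- and post-treatment (placeholder)
voxels = np.random.default_rng(2).normal(size=(5000, 4))

X = StandardScaler().fit_transform(voxels)
print(PCA().fit(X).explained_variance_ratio_)    # inspect data composition

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, sample_size=2000, random_state=0)
best_k = max(scores, key=scores.get)             # independent choice of k
```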
Pfeiffenberger, Erik; Chaleil, Raphael A.G.; Moal, Iain H.
2017-01-01
Reliable identification of near-native poses of docked protein-protein complexes is still an unsolved problem. The intrinsic heterogeneity of protein-protein interactions is challenging for traditional biophysical or knowledge-based potentials, and the identification of many false positive binding sites is not unusual. Often, ranking protocols are based on initial clustering of docked poses followed by the application of an energy function to rank each cluster according to its lowest energy member. Here, we present a cluster-ranking approach that relies not on a single molecular descriptor (e.g., an energy function) but on a large number of descriptors integrated in a machine learning model: an extremely randomized tree classifier trained on 109 molecular descriptors. The protocol first locally enriches clusters with additional poses; the clusters are then characterized using features describing the distribution of molecular descriptors within the cluster, which are combined into a pairwise cluster comparison model to discriminate near-native from incorrect clusters. The results show that our approach is able to identify clusters containing near-native protein-protein complexes. In addition, we present an analysis of the descriptors with respect to their power to discriminate near-native from incorrect clusters and how data transformations and recursive feature elimination can improve the ranking performance. Proteins 2017; 85:528-543. © 2016 Wiley Periodicals, Inc. PMID:27935158
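For orientation, extremely randomized trees are available in scikit-learn as ExtraTreesClassifier; a minimal, hedged sketch of training a pairwise cluster-comparison model is shown below. The way pairwise features and labels are built here is schematic placeholder data, not the paper's actual descriptor pipeline.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Each row summarizes one *pair* of clusters (e.g. differences of per-cluster
# descriptor statistics); the label marks whether the first cluster of the pair
# is the near-native one. Both arrays below are placeholders.
rng = np.random.default_rng(3)
pair_features = rng.normal(size=(1000, 109))
labels = rng.integers(0, 2, size=1000)

clf = ExtraTreesClassifier(n_estimators=500, random_state=0, n_jobs=-1)
print(cross_val_score(clf, pair_features, labels, cv=5).mean())

# Feature importances support a descriptor analysis / recursive elimination step
clf.fit(pair_features, labels)
top_descriptors = np.argsort(clf.feature_importances_)[::-1][:10]
```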
Cluster analysis of polymers using laser-induced breakdown spectroscopy with K-means
NASA Astrophysics Data System (ADS)
Guo, Yangmin; Tang, Yun; Du, Yu; Tang, Shisong; Guo, Lianbo; Li, Xiangyou; Lu, Yongfeng; Zeng, Xiaoyan
2018-06-01
Laser-induced breakdown spectroscopy (LIBS) combined with the K-means algorithm was employed to automatically differentiate industrial polymers under atmospheric conditions. The unsupervised learning algorithm K-means was utilized for the clustering of the LIBS dataset measured from twenty kinds of industrial polymers. To prevent interference from metallic elements, three atomic emission lines (C I 247.86 nm, H I 656.3 nm, and O I 777.3 nm) and one molecular line, C–N (0, 0) 388.3 nm, were used. The cluster analysis results were obtained through an iterative process. The Davies–Bouldin index was employed to determine the initial number of clusters. The average relative standard deviation values of characteristic spectral lines were used as the iterative criterion. With the proposed approach, the classification accuracy for twenty kinds of industrial polymers reached 99.6%. The results demonstrated that this approach has great potential for industrial polymer recycling by LIBS.
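A minimal sketch of the core step, K-means on line intensities with the Davies-Bouldin index used to pick the initial number of clusters, is shown below. The intensity matrix is a placeholder and the paper's iterative relative-standard-deviation refinement is not reproduced.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# intensities: one row per LIBS spectrum, columns = C I 247.86 nm, H I 656.3 nm,
# O I 777.3 nm and the C-N (0, 0) 388.3 nm band (placeholder values here)
intensities = np.abs(np.random.default_rng(4).normal(size=(400, 4)))
X = StandardScaler().fit_transform(intensities)

# Davies-Bouldin index over candidate cluster counts (lower is better)
db = {k: davies_bouldin_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=0).fit_predict(X))
      for k in range(2, 25)}
k0 = min(db, key=db.get)                       # initial number of clusters
labels = KMeans(n_clusters=k0, n_init=10, random_state=0).fit_predict(X)
```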
Tweets clustering using latent semantic analysis
NASA Astrophysics Data System (ADS)
Rasidi, Norsuhaili Mahamed; Bakar, Sakhinah Abu; Razak, Fatimah Abdul
2017-04-01
Social media are becoming overloaded with information due to the increasing number of information feeds. Unlike other social media, Twitter users are allowed to broadcast a short message called as `tweet". In this study, we extract tweets related to MH370 for certain of time. In this paper, we present overview of our approach for tweets clustering to analyze the users' responses toward tragedy of MH370. The tweets were clustered based on the frequency of terms obtained from the classification process. The method we used for the text classification is Latent Semantic Analysis. As a result, there are two types of tweets that response to MH370 tragedy which is emotional and non-emotional. We show some of our initial results to demonstrate the effectiveness of our approach.
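A common way to implement an LSA-plus-clustering pipeline is TF-IDF weighting followed by truncated SVD and K-means; the sketch below follows that pattern under the assumption that this matches the authors' intent. The example tweets are invented and the number of components is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

tweets = [  # hypothetical corpus standing in for the collected MH370 tweets
    "praying for the families of mh370",
    "so sad about mh370 my heart breaks",
    "mh370 search area shifted to southern indian ocean",
    "new satellite data released in mh370 investigation",
]

# Latent Semantic Analysis: TF-IDF term weighting followed by truncated SVD
lsa = make_pipeline(TfidfVectorizer(stop_words="english"),
                    TruncatedSVD(n_components=2, random_state=0),
                    Normalizer(copy=False))
X = lsa.fit_transform(tweets)

# Group tweets into two response types (e.g. emotional vs non-emotional)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```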
Multiscale visual quality assessment for cluster analysis with self-organizing maps
NASA Astrophysics Data System (ADS)
Bernard, Jürgen; von Landesberger, Tatiana; Bremm, Sebastian; Schreck, Tobias
2011-01-01
Cluster analysis is an important data mining technique for analyzing large amounts of data, reducing many objects to a limited number of clusters. Cluster visualization techniques aim at supporting the user in better understanding the characteristics and relationships among the found clusters. While promising approaches to visual cluster analysis already exist, these usually fall short of incorporating the quality of the obtained clustering results. However, due to the nature of the clustering process, quality plays an important role, as for most practical data sets many different clusterings are typically possible. Being aware of clustering quality is important to judge the expressiveness of a given cluster visualization, or to adjust the clustering process with refined parameters, among other tasks. In this work, we present an encompassing suite of visual tools for quality assessment of an important visual clustering algorithm, namely the Self-Organizing Map (SOM) technique. We define, measure, and visualize the notion of SOM cluster quality along a hierarchy of cluster abstractions. The quality abstractions range from simple scalar-valued quality scores up to the structural comparison of a given SOM clustering with the output of additional supportive clustering methods. The suite of methods allows the user to assess the SOM quality on the appropriate abstraction level, and arrive at improved clustering results. We implement our tools in an integrated system, apply it to experimental data sets, and show its applicability.
ERIC Educational Resources Information Center
Kung, Hsiang-Te; And Others
1993-01-01
In spite of rapid progress achieved in the methodological research underlying environmental impact assessment (EIA), the problem of weighting various parameters has not yet been solved. This paper presents a new approach, fuzzy clustering analysis, which is illustrated with an EIA case study on Baoshan-Wusong District in Shanghai, China. (Author)
Cluster Approach to Network Interaction in Pedagogical University
ERIC Educational Resources Information Center
Chekaleva, Nadezhda V.; Makarova, Natalia S.; Drobotenko, Yulia B.
2016-01-01
The study presented in the article is devoted to the analysis of theory and practice of network interaction within the framework of education clusters. Education clusters are considered to be a novel form of network interaction in pedagogical education in Russia. The aim of the article is to show the advantages and disadvantages of the cluster…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sanfilippo, Antonio P.; Chikkagoudar, Satish
We describe an approach to analyzing trade data which uses clustering to detect similarities across shipping manifest records, classification to evaluate clustering results and categorize new unseen shipping data records, and visual analytics to support situation awareness in dynamic decision making, helping to monitor and warn against the movement of radiological threat materials through search, analysis and forecasting capabilities. The evaluation of clustering results through classification and systematic inspection of the clusters shows that the clusters have strong semantic cohesion and offers novel ways to detect transactions related to nuclear smuggling.
Chen, Ling; Feng, Yanqin; Sun, Jianguo
2017-10-01
This paper discusses regression analysis of clustered failure time data, which occur when the failure times of interest are collected from clusters. In particular, we consider the situation where the correlated failure times of interest may be related to cluster sizes. For inference, we present two estimation procedures, the weighted estimating equation-based method and the within-cluster resampling-based method, when the correlated failure times of interest arise from a class of additive transformation models. The former makes use of the inverse of cluster sizes as weights in the estimating equations, while the latter can be easily implemented by using the existing software packages for right-censored failure time data. An extensive simulation study is conducted and indicates that the proposed approaches work well in both the situations with and without informative cluster size. They are applied to a dental study that motivated this study.
NASA Astrophysics Data System (ADS)
Keshtkaran, Mohammad Reza; Yang, Zhi
2017-06-01
Objective. Spike sorting is a fundamental preprocessing step for many neuroscience studies which rely on the analysis of spike trains. Most of the feature extraction and dimensionality reduction techniques that have been used for spike sorting give a projection subspace which is not necessarily the most discriminative one. Therefore, the clusters which appear inherently separable in some discriminative subspace may overlap if projected using conventional feature extraction approaches, leading to poor sorting accuracy, especially when the noise level is high. In this paper, we propose a noise-robust and unsupervised spike sorting algorithm based on learning discriminative spike features for clustering. Approach. The proposed algorithm uses discriminative subspace learning to extract low-dimensional and most discriminative features from the spike waveforms and performs clustering with automatic detection of the number of clusters. The core part of the algorithm involves iterative subspace selection using linear discriminant analysis and clustering using a Gaussian mixture model with outlier detection. A statistical test in the discriminative subspace is proposed to automatically detect the number of clusters. Main results. Comparative results on publicly available simulated and real in vivo datasets demonstrate that our algorithm achieves substantially improved cluster distinction, leading to higher sorting accuracy and more reliable detection of clusters which are highly overlapping and not detectable using conventional feature extraction techniques such as principal component analysis or wavelets. Significance. By providing more accurate information about the activity of a larger number of individual neurons, with high robustness to neural noise and outliers, the proposed unsupervised spike sorting algorithm facilitates more detailed and accurate analysis of single- and multi-unit activities in neuroscience and brain machine interface studies.
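The core loop, alternating an LDA projection with Gaussian mixture clustering, can be sketched as below. This is only an illustration of the iterative idea: the number of units is fixed here, whereas the paper detects it automatically with a statistical test and handles outliers explicitly; the placeholder waveforms and parameter choices are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.mixture import GaussianMixture

def sort_spikes(waveforms, n_units=3, n_dims=2, n_iter=10, seed=0):
    """Rough sketch of iterative discriminative-subspace spike sorting.

    waveforms: (n_spikes, n_samples) array of aligned spike waveforms.
    Alternates an LDA projection (using current labels) with GMM clustering.
    """
    X0 = PCA(n_components=5).fit_transform(waveforms)                     # crude start
    labels = GaussianMixture(n_units, random_state=seed).fit_predict(X0)  # initial guess
    for _ in range(n_iter):
        lda = LinearDiscriminantAnalysis(n_components=min(n_dims, n_units - 1))
        Y = lda.fit_transform(waveforms, labels)          # discriminative subspace
        labels = GaussianMixture(n_units, random_state=seed).fit_predict(Y)
    return labels, Y

waveforms = np.random.default_rng(0).normal(size=(600, 48))   # placeholder spikes
labels, features = sort_spikes(waveforms)
```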
NASA Astrophysics Data System (ADS)
Hasan, Noor Haliza; Abdullah, M. T.
2008-01-01
The aim of the study is to use cluster analysis on morphometric parameters within the genus Kerivoula to produce a dendrogram and to determine the suitability of this method to describe the relationship among species within this genus. A total of 15 adult male individuals from genus Kerivoula taken from sampling trips around Borneo and specimens kept at the zoological museum of Universiti Malaysia Sarawak were examined. A total of 27 characters using dental, skull and external body measurements were recorded. Clustering analysis illustrated the grouping and morphometric relationships between the species of this genus. It has clearly separated each species from each other despite the overlapping of measurements of some species within the genus. Cluster analysis provides an alternative approach to make a preliminary identification of a species.
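A compact sketch of the dendrogram step on standardized morphometric measurements is given below; the measurement matrix and species labels are placeholders, and average linkage is just one common choice of clustering criterion.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.stats import zscore

# rows = individual bats, columns = 27 dental, skull and external measurements
measurements = np.random.default_rng(5).normal(size=(15, 27))   # placeholder data
individuals = [f"Kerivoula specimen {i}" for i in range(15)]     # hypothetical labels

Z = linkage(zscore(measurements), method="average", metric="euclidean")
dendrogram(Z, labels=individuals)
plt.tight_layout()
plt.show()
```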
Ghazizadeh, Mahtab; McDonald, Anthony D; Lee, John D
2014-09-01
This study applies text mining to extract clusters of vehicle problems and associated trends from free-response data in the National Highway Traffic Safety Administration's vehicle owner's complaint database. As the automotive industry adopts new technologies, it is important to systematically assess the effect of these changes on traffic safety. Driving simulators, naturalistic driving data, and crash databases all contribute to a better understanding of how drivers respond to changing vehicle technology, but other approaches, such as automated analysis of incident reports, are needed. Free-response data from incidents representing two severity levels (fatal incidents and incidents involving injury) were analyzed using a text mining approach: latent semantic analysis (LSA). LSA and hierarchical clustering identified clusters of complaints for each severity level, which were compared and analyzed across time. Cluster analysis identified eight clusters of fatal incidents and six clusters of incidents involving injury. Comparisons showed that although the airbag clusters across the two severity levels have the same most frequent terms, the circumstances around the incidents differ. The time trends show clear increases in complaints surrounding the Ford/Firestone tire recall and the Toyota unintended acceleration recall. Increases in complaints may be partially driven by these recall announcements and the associated media attention. Text mining can reveal useful information from free-response databases that would otherwise be prohibitively time-consuming and difficult to summarize manually. Text mining can extend human analysis capabilities for large free-response databases to support earlier detection of problems and more timely safety interventions.
Barker, Daniel; D'Este, Catherine; Campbell, Michael J; McElduff, Patrick
2017-03-09
Stepped wedge cluster randomised trials frequently involve a relatively small number of clusters. The most common frameworks used to analyse data from these types of trials are generalised estimating equations and generalised linear mixed models. A topic of much research into these methods has been their application to cluster randomised trial data and, in particular, the number of clusters required to make reasonable inferences about the intervention effect. However, for stepped wedge trials, which have been claimed by many researchers to have a statistical power advantage over the parallel cluster randomised trial, the minimum number of clusters required has not been investigated. We conducted a simulation study where we considered the most commonly used methods suggested in the literature to analyse cross-sectional stepped wedge cluster randomised trial data. We compared the per cent bias, the type I error rate and power of these methods in a stepped wedge trial setting with a binary outcome, where there are few clusters available and when the appropriate adjustment for a time trend is made, which by design may be confounding the intervention effect. We found that the generalised linear mixed modelling approach is the most consistent when few clusters are available. We also found that none of the common analysis methods for stepped wedge trials were both unbiased and maintained a 5% type I error rate when there were only three clusters. Of the commonly used analysis approaches, we recommend the generalised linear mixed model for small stepped wedge trials with binary outcomes. We also suggest that in a stepped wedge design with three steps, at least two clusters be randomised at each step, to ensure that the intervention effect estimator maintains the nominal 5% significance level and is also reasonably unbiased.
Clustering approaches to identifying gene expression patterns from DNA microarray data.
Do, Jin Hwan; Choi, Dong-Kug
2008-04-30
Efficient analysis methods are essential for handling the large amounts of gene expression data generated by microarrays. In this review we focus on clustering techniques. The biological rationale for this approach is the fact that many co-expressed genes are co-regulated, and identifying co-expressed genes could aid in functional annotation of novel genes, de novo identification of transcription factor binding sites and elucidation of complex biological pathways. Co-expressed genes are usually identified in microarray experiments by clustering techniques. There are many such methods, and the results obtained even for the same datasets may vary considerably depending on the algorithms and metrics for dissimilarity measures used, as well as on user-selectable parameters such as desired number of clusters and initial values. Therefore, biologists who want to interpret microarray data should be aware of the weaknesses and strengths of the clustering methods used. In this review, we survey the basic principles of clustering of DNA microarray data, from crisp clustering algorithms such as hierarchical clustering, K-means and self-organizing maps, to complex clustering algorithms like fuzzy clustering.
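The review's point that results depend on the algorithm and dissimilarity measure can be illustrated by comparing two partitions of the same expression matrix; the sketch below does this with hierarchical clustering (correlation distance) versus K-means and an adjusted Rand index. The expression matrix and cluster count are placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# expr: genes x samples expression matrix; here we cluster the genes (placeholder)
expr = np.random.default_rng(6).normal(size=(500, 20))

k = 6
hier = fcluster(linkage(expr, method="average", metric="correlation"),
                t=k, criterion="maxclust")
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(expr)

# Agreement between the two partitions; low values illustrate how strongly
# results can depend on the algorithm and dissimilarity measure chosen
print(adjusted_rand_score(hier, km))
```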
TWave: High-Order Analysis of Functional MRI
Barnathan, Michael; Megalooikonomou, Vasileios; Faloutsos, Christos; Faro, Scott; Mohamed, Feroze B.
2011-01-01
The traditional approach to functional image analysis models images as matrices of raw voxel intensity values. Although such a representation is widely utilized and heavily entrenched both within neuroimaging and in the wider data mining community, the strong interactions among space, time, and categorical modes such as subject and experimental task inherent in functional imaging yield a dataset with “high-order” structure, which matrix models are incapable of exploiting. Reasoning across all of these modes of data concurrently requires a high-order model capable of representing relationships between all modes of the data in tandem. We thus propose to model functional MRI data using tensors, which are high-order generalizations of matrices equivalent to multidimensional arrays or data cubes. However, several unique challenges exist in the high-order analysis of functional medical data: naïve tensor models are incapable of exploiting spatiotemporal locality patterns, standard tensor analysis techniques exhibit poor efficiency, and mixtures of numeric and categorical modes of data are very often present in neuroimaging experiments. Formulating the problem of image clustering as a form of Latent Semantic Analysis and using the WaveCluster algorithm as a baseline, we propose a comprehensive hybrid tensor and wavelet framework for clustering, concept discovery, and compression of functional medical images which successfully addresses these challenges. Our approach reduced runtime and dataset size on a 9.3 GB finger opposition motor task fMRI dataset by up to 98% while exhibiting improved spatiotemporal coherence relative to standard tensor, wavelet, and voxel-based approaches. Our clustering technique was capable of automatically differentiating between the frontal areas of the brain responsible for task-related habituation and the motor regions responsible for executing the motor task, in contrast to a widely used fMRI analysis program, SPM, which only detected the latter region. Furthermore, our approach discovered latent concepts suggestive of subject handedness nearly 100x faster than standard approaches. These results suggest that a high-order model is an integral component to accurate scalable functional neuroimaging. PMID:21729758
Improved Test Planning and Analysis Through the Use of Advanced Statistical Methods
NASA Technical Reports Server (NTRS)
Green, Lawrence L.; Maxwell, Katherine A.; Glass, David E.; Vaughn, Wallace L.; Barger, Weston; Cook, Mylan
2016-01-01
The goal of this work is, through computational simulations, to provide statistically-based evidence to convince the testing community that a distributed testing approach is superior to a clustered testing approach for most situations. For clustered testing, numerous, repeated test points are acquired at a limited number of test conditions. For distributed testing, only one or a few test points are requested at many different conditions. The statistical techniques of Analysis of Variance (ANOVA), Design of Experiments (DOE) and Response Surface Methods (RSM) are applied to enable distributed test planning, data analysis and test augmentation. The D-Optimal class of DOE is used to plan an optimally efficient single- and multi-factor test. The resulting simulated test data are analyzed via ANOVA and a parametric model is constructed using RSM. Finally, ANOVA can be used to plan a second round of testing to augment the existing data set with new data points. The use of these techniques is demonstrated through several illustrative examples. To date, many thousands of comparisons have been performed and the results strongly support the conclusion that the distributed testing approach outperforms the clustered testing approach.
Unsupervised analysis of small animal dynamic Cerenkov luminescence imaging
NASA Astrophysics Data System (ADS)
Spinelli, Antonello E.; Boschi, Federico
2011-12-01
Clustering analysis (CA) and principal component analysis (PCA) were applied to dynamic Cerenkov luminescence images (dCLI). In order to investigate the performance of the proposed approaches, two distinct dynamic data sets obtained by injecting mice with 32P-ATP and 18F-FDG were acquired using the IVIS 200 optical imager. The k-means clustering algorithm was applied to dCLI and implemented using Interactive Data Language (IDL) 8.1. We show that cluster analysis allows us to obtain good agreement between the clustered regions and the corresponding emission regions such as the bladder, the liver, and the tumor. We also show a good correspondence between the time activity curves of the different regions obtained by using CA and manual region of interest analysis on dCLI and PCA images. We conclude that CA provides an automatic unsupervised method for the analysis of preclinical dynamic Cerenkov luminescence image data.
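The authors implemented the analysis in IDL; a Python equivalent of the basic idea, k-means on per-pixel time-activity curves followed by cluster-averaged curves, is sketched below. The image stack and the number of clusters are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# frames: (n_time, ny, nx) stack of dynamic Cerenkov luminescence images (placeholder)
frames = np.random.default_rng(7).random((30, 64, 64))

n_t, ny, nx = frames.shape
curves = frames.reshape(n_t, -1).T                # one time-activity curve per pixel

k = 4                                             # e.g. background, bladder, liver, tumour
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(curves)
cluster_map = labels.reshape(ny, nx)              # spatial cluster image

# Mean time-activity curve of each cluster, analogous to ROI-based analysis
tacs = np.array([curves[labels == c].mean(axis=0) for c in range(k)])
```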
A scoping review of spatial cluster analysis techniques for point-event data.
Fritz, Charles E; Schuurman, Nadine; Robertson, Colin; Lear, Scott
2013-05-01
Spatial cluster analysis is a uniquely interdisciplinary endeavour, and so it is important to communicate and disseminate ideas, innovations, best practices and challenges across practitioners, applied epidemiology researchers and spatial statisticians. In this research we conducted a scoping review to systematically search peer-reviewed journal databases for research that has employed spatial cluster analysis methods on individual-level, address location, or x and y coordinate derived data. To illustrate the thematic issues raised by our results, methods were tested using a dataset where known clusters existed. Point pattern methods, spatial clustering and cluster detection tests, and a locally weighted spatial regression model were most commonly used for individual-level, address location data (n = 29). The spatial scan statistic was the most popular method for address location data (n = 19). Six themes were identified relating to the application of spatial cluster analysis methods and subsequent analyses, which we recommend researchers to consider: exploratory analysis, visualization, spatial resolution, aetiology, scale and spatial weights. It is our intention that researchers seeking direction for using spatial cluster analysis methods consider the caveats and strengths of each approach, but also explore the numerous other methods available for this type of analysis. Applied spatial epidemiology researchers and practitioners should give special consideration to applying multiple tests to a dataset. Future research should focus on developing frameworks for selecting appropriate methods and the corresponding spatial weighting schemes.
NASA Astrophysics Data System (ADS)
Bellón, Beatriz; Bégué, Agnès; Lo Seen, Danny; Lebourgeois, Valentine; Evangelista, Balbino Antônio; Simões, Margareth; Demonte Ferraz, Rodrigo Peçanha
2018-06-01
Cropping systems' maps at fine scale over large areas provide key information for further agricultural production and environmental impact assessments, and thus represent a valuable tool for effective land-use planning. There is, therefore, a growing interest in mapping cropping systems in an operational manner over large areas, and remote sensing approaches based on vegetation index time series analysis have proven to be an efficient tool. However, supervised pixel-based approaches are commonly adopted, requiring resource-consuming field campaigns to gather training data. In this paper, we present a new object-based unsupervised classification approach tested on an annual MODIS 16-day composite Normalized Difference Vegetation Index time series and a Landsat 8 mosaic of the State of Tocantins, Brazil, for the 2014-2015 growing season. Two variants of the approach are compared: a hyperclustering approach, and a landscape-clustering approach involving a previous stratification of the study area into landscape units on which the clustering is then performed. The main cropping systems of Tocantins, characterized by the crop types and cropping patterns, were efficiently mapped with the landscape-clustering approach. Results show that stratification prior to clustering significantly improves the classification accuracies for underrepresented and sparsely distributed cropping systems. This study illustrates the potential of unsupervised classification for large area cropping systems' mapping and contributes to the development of generic tools for supporting large-scale agricultural monitoring across regions.
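The landscape-clustering variant amounts to clustering NDVI time series separately within each landscape stratum; a minimal sketch of that structure is shown below. The NDVI matrix, landscape labels and the number of clusters per unit are placeholder assumptions, not the study's parameters.

```python
import numpy as np
from sklearn.cluster import KMeans

# ndvi: (n_objects, n_dates) annual NDVI time series per image object (placeholder)
# landscape: (n_objects,) landscape-unit label from a prior stratification (placeholder)
rng = np.random.default_rng(8)
ndvi = rng.random((2000, 23))
landscape = rng.integers(0, 5, size=2000)

labels = np.empty(len(ndvi), dtype=int)
offset = 0
for unit in np.unique(landscape):                 # cluster each landscape unit separately
    idx = np.where(landscape == unit)[0]
    k = 4                                         # clusters per unit (assumed)
    labels[idx] = offset + KMeans(n_clusters=k, n_init=10,
                                  random_state=0).fit_predict(ndvi[idx])
    offset += k
# 'labels' now enumerates candidate cropping-system classes per landscape unit
```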
Ocké, Marga C
2013-05-01
This paper aims to describe different approaches for studying the overall diet with advantages and limitations. Studies of the overall diet have emerged because the relationship between dietary intake and health is very complex with all kinds of interactions. These cannot be captured well by studying single dietary components. Three main approaches to study the overall diet can be distinguished. The first method is researcher-defined scores or indices of diet quality. These are usually based on guidelines for a healthy diet or on diets known to be healthy. The second approach, using principal component or cluster analysis, is driven by the underlying dietary data. In principal component analysis, scales are derived based on the underlying relationships between food groups, whereas in cluster analysis, subgroups of the population are created with people that cluster together based on their dietary intake. A third approach includes methods that are driven by a combination of biological pathways and the underlying dietary data. Reduced rank regression defines linear combinations of food intakes that maximally explain nutrient intakes or intermediate markers of disease. Decision tree analysis identifies subgroups of a population whose members share dietary characteristics that influence (intermediate markers of) disease. It is concluded that all approaches have advantages and limitations and essentially answer different questions. The third approach is still more in an exploration phase, but seems to have great potential with complementary value. More insight into the utility of conducting studies on the overall diet can be gained if more attention is given to methodological issues.
Malakooti, Behnam; Yang, Ziyong
2004-02-01
In many real-world problems, different alternatives have considerably different ranges of consequences. In addition, sometimes, selection of a group of alternatives (instead of only one best alternative) is necessary. Traditional decision making approaches treat the set of alternatives with the same method of analysis and selection. In this paper, we propose clustering alternatives into different groups so that different methods of analysis, selection, and implementation for each group can be applied. As an example, consider the selection of a group of functions (or tasks) to be processed by a group of processors. The set of tasks can be grouped according to similar criteria, with each cluster of tasks then processed by a single processor. The selection of the best alternative for each clustered group can be performed using existing methods; however, the process of selecting groups is different from the process of selecting alternatives within a group. We develop theories and procedures for clustering discrete multiple criteria alternatives. We also demonstrate how the set of alternatives is clustered into mutually exclusive groups based on 1) similar features among alternatives; 2) ideal (or most representative) alternatives given by the decision maker; and 3) other preferential information of the decision maker. The clustering of multiple criteria alternatives also has the following advantages. 1) It decreases the set of alternatives to be considered by the decision maker (for example, different decision makers are assigned to different groups of alternatives). 2) It decreases the number of criteria. 3) It may provide a different approach for analyzing problems with multiple decision makers. Each decision maker may cluster alternatives differently, and hence, clustering of alternatives may provide a basis for negotiation. The developed approach is applicable for solving a class of telecommunication network problems where a set of objects (such as routers, processors, or intelligent autonomous vehicles) are to be clustered into similar groups. Objects are clustered based on several criteria and the decision maker's preferences.
A Framework for Designing Cluster Randomized Trials with Binary Outcomes
ERIC Educational Resources Information Center
Spybrook, Jessaca; Martinez, Andres
2011-01-01
The purpose of this paper is to provide a framework for approaching a power analysis for a CRT (cluster randomized trial) with a binary outcome. The authors suggest a framework in the context of a simple CRT and then extend it to a blocked design, or a multi-site cluster randomized trial (MSCRT). The framework is based on proportions, an…
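For context, a textbook-style calculation for a simple two-arm CRT with a binary outcome inflates the individually randomized sample size by the design effect; the sketch below shows that standard approximation, not the authors' specific framework, and all input values are illustrative.

```python
from math import ceil
from scipy.stats import norm

def clusters_per_arm(p1, p2, m, icc, alpha=0.05, power=0.80):
    """Standard two-proportion sample size inflated by the design effect.

    p1, p2 : expected outcome proportions in the two arms
    m      : cluster size (individuals per cluster)
    icc    : intracluster correlation coefficient
    """
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    n_ind = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
    deff = 1 + (m - 1) * icc                 # variance inflation due to clustering
    return ceil(n_ind * deff / m)

print(clusters_per_arm(p1=0.30, p2=0.20, m=50, icc=0.02))
```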
A Cluster-Analytical Approach towards Physical Activity and Eating Habits among 10-Year-Old Children
ERIC Educational Resources Information Center
Sabbe, Dieter; De Bourdeaudhuij, I.; Legiest, E.; Maes, L.
2008-01-01
The purpose was to investigate whether clusters--based on physical activity (PA) and eating habits--can be found among children, and to explore subgroups' characteristics. A total of 1725 10-year olds completed a self-administered questionnaire. K-means cluster analysis was based on the weekly quantity of vigorous and moderate PA, the excess index…
Anders, Katherine L; Cutcher, Zoe; Kleinschmidt, Immo; Donnelly, Christl A; Ferguson, Neil M; Indriani, Citra; O'Neill, Scott L; Jewell, Nicholas P; Simmons, Cameron P
2018-05-07
Cluster randomized trials are the gold standard for assessing efficacy of community-level interventions, such as vector control strategies against dengue. We describe a novel cluster randomized trial methodology with a test-negative design, which offers advantages over traditional approaches. It utilizes outcome-based sampling of patients presenting with a syndrome consistent with the disease of interest, who are subsequently classified as test-positive cases or test-negative controls on the basis of diagnostic testing. We use simulations of a cluster trial to demonstrate validity of efficacy estimates under the test-negative approach. This demonstrates that, provided study arms are balanced for both test-negative and test-positive illness at baseline and that other test-negative design assumptions are met, the efficacy estimates closely match true efficacy. We also briefly discuss analytical considerations for an odds ratio-based effect estimate arising from clustered data, and outline potential approaches to analysis. We conclude that application of the test-negative design to certain cluster randomized trials could increase their efficiency and ease of implementation.
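One of the analysis options alluded to above is an odds-ratio-based effect estimate that accounts for clustering; the sketch below uses a GEE logistic regression with an exchangeable working correlation, purely as an illustration. The simulated enrolment data, cluster assignment and case probabilities are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per enrolled febrile patient: cluster id, trial arm (1 = intervention),
# test result (1 = test-positive dengue case, 0 = test-negative control); toy data.
rng = np.random.default_rng(9)
df = pd.DataFrame({"cluster": rng.integers(0, 24, 3000)})
df["arm"] = (df["cluster"] % 2).astype(int)                 # toy cluster randomization
p = np.where(df["arm"] == 1, 0.15, 0.30)                    # toy case probabilities
df["case"] = (rng.random(len(df)) < p).astype(int)

# Logistic model with an exchangeable working correlation within clusters (GEE);
# one of several reasonable analyses for clustered test-negative data
model = smf.gee("case ~ arm", groups="cluster", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable()).fit()
odds_ratio = np.exp(model.params["arm"])
efficacy = 1 - odds_ratio                                   # intervention efficacy
```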
CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks.
Li, Min; Li, Dongyan; Tang, Yu; Wu, Fangxiang; Wang, Jianxin
2017-08-31
Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a cytoscape plugin integrating six clustering algorithms, HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), and BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster.
Identification and characterization of near-fatal asthma phenotypes by cluster analysis.
Serrano-Pariente, J; Rodrigo, G; Fiz, J A; Crespo, A; Plaza, V
2015-09-01
Near-fatal asthma (NFA) is a heterogeneous clinical entity and several profiles of patients have been described according to different clinical, pathophysiological and histological features. However, there are no previous studies that identify different phenotypes of NFA in an unbiased way, using statistical methods such as cluster analysis. Therefore, the aim of the present study was to identify and characterize phenotypes of near-fatal asthma using a cluster analysis. Over a period of 2 years, 33 Spanish hospitals enrolled 179 asthmatics admitted for an episode of NFA. A cluster analysis using a two-step algorithm was performed on data from 84 of these cases. The analysis defined three clusters of patients with NFA: cluster 1, the largest, including older patients with clinical and therapeutic criteria of severe asthma; cluster 2, with a high proportion of respiratory arrest (68%), impaired consciousness level (82%) and mechanical ventilation (93%); and cluster 3, which included younger patients, characterized by insufficient anti-inflammatory treatment and frequent sensitization to Alternaria alternata and soybean. These results identify specific asthma phenotypes involved in NFA, confirming in part previous findings observed in studies with a clinical approach. The identification of patients with a specific NFA phenotype could suggest interventions to prevent future severe asthma exacerbations. © 2015 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra".
Griss, Johannes; Perez-Riverol, Yasset; The, Matthew; Käll, Lukas; Vizcaíno, Juan Antonio
2018-05-04
In the recent benchmarking article entitled "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra", Rieder et al. compared several different approaches to cluster MS/MS spectra. While we certainly recognize the value of the manuscript, here we report some shortcomings detected in the original analyses. For most analyses, the authors clustered only single MS/MS runs. In one of the reported analyses, three MS/MS runs were processed together, which already led to computational performance issues in many of the tested approaches. This fact highlights the difficulty of using many of the tested algorithms on the average proteomics data sets produced nowadays. Second, the authors only processed identified spectra when merging MS runs. As a result, all unidentified spectra, which are of lower quality, had already been removed from the data set and could not influence the clustering results. Next, we found that the authors did not analyze the effect of chimeric spectra on the clustering results. In our analysis, we found that 3% of the spectra in the used data sets were chimeric, and this had marked effects on the behavior of the different clustering algorithms tested. Finally, the authors' choice to evaluate the MS-Cluster and spectra-cluster algorithms using a precursor tolerance of 5 Da for high-resolution Orbitrap data only was, in our opinion, not adequate to assess the performance of MS/MS clustering approaches.
Tremblay, Marlène; Hess, Justin P; Christenson, Brock M; McIntyre, Kolby K; Smink, Ben; van der Kamp, Arjen J; de Jong, Lisanne G; Döpfer, Dörte
2016-07-01
Automatic milking systems (AMS) are implemented in a variety of situations and environments. Consequently, there is a need to characterize individual farming practices and regional challenges to streamline management advice and objectives for producers. Benchmarking is often used in the dairy industry to compare farms by computing percentile ranks of the production values of groups of farms. Grouping for conventional benchmarking is commonly limited to the use of a few factors such as farms' geographic region or breed of cattle. We hypothesized that herds' production data and management information could be clustered in a meaningful way using cluster analysis and that this clustering approach would yield better peer groups of farms than benchmarking methods based on criteria such as country, region, breed, or breed and region. By applying mixed latent-class model-based cluster analysis to 529 North American AMS dairy farms with respect to 18 significant risk factors, 6 clusters were identified. Each cluster (i.e., peer group) represented unique management styles, challenges, and production patterns. When compared with peer groups based on criteria similar to the conventional benchmarking standards, the 6 clusters better predicted milk produced (kilograms) per robot per day. Each cluster represented a unique management and production pattern that requires specialized advice. For example, cluster 1 farms were those that recently installed AMS robots, whereas cluster 3 farms (the most northern farms) fed high amounts of concentrates through the robot to compensate for low-energy feed in the bunk. In addition to general recommendations for farms within a cluster, individual farms can generate their own specific goals by comparing themselves to farms within their cluster. This is very comparable to benchmarking but adds the specific characteristics of the peer group, resulting in better farm management advice. The improvement that cluster analysis allows for is characterized by the multivariable approach and the fact that comparisons between production units can be accomplished within a cluster and between clusters as a choice. Copyright © 2016 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Pattern Activity Clustering and Evaluation (PACE)
NASA Astrophysics Data System (ADS)
Blasch, Erik; Banas, Christopher; Paul, Michael; Bussjager, Becky; Seetharaman, Guna
2012-06-01
With the vast amount of network information available on activities of people (e.g., motions, transportation routes, and site visits), there is a need to explore the salient properties of data that detect and discriminate the behavior of individuals. Recent machine learning approaches include methods of data mining, statistical analysis, clustering, and estimation that support activity-based intelligence. We seek to explore contemporary methods in activity analysis using machine learning techniques that discover and characterize behaviors that enable grouping, anomaly detection, and adversarial intent prediction. To evaluate these methods, we describe the mathematics and potential information theory metrics to characterize behavior. A scenario is presented to demonstrate the concept and metrics that could be useful for layered sensing behavior pattern learning and analysis. We leverage work on group tracking, learning, and clustering approaches, as well as information-theoretic metrics for classification, behavioral and event pattern recognition, and activity and entity analysis. The performance evaluation of activity analysis supports high-level information fusion of user alerts, data queries and sensor management for data extraction, relations discovery, and situation analysis of existing data.
Roushangar, Kiyoumars; Alizadeh, Farhad; Adamowski, Jan
2018-08-01
Understanding precipitation on a regional basis is an important component of water resources planning and management. The present study outlines a methodology based on continuous wavelet transform (CWT) and multiscale entropy (CWME), combined with self-organizing map (SOM) and k-means clustering techniques, to measure and analyze the complexity of precipitation. Historical monthly precipitation data from 1960 to 2010 at 31 rain gauges across Iran were preprocessed by CWT. The multi-resolution CWT approach segregated the major features of the original precipitation series by unfolding the structure of the time series, which was often ambiguous. The entropy concept was then applied to the components obtained from CWT to measure dispersion, uncertainty, disorder, and diversification of subcomponents. Based on different validity indices, k-means clustering captured homogeneous areas more accurately, and additional analysis was performed based on the outcome of this approach. The 31 rain gauges in this study were clustered into 6 groups, each one having a unique CWME pattern across different time scales. The results of clustering showed that hydrologic similarity (multiscale variation of precipitation) was not based on geographic contiguity. According to the pattern of entropy across the scales, each cluster was assigned an entropy signature that provided an estimation of the entropy pattern of precipitation data in each cluster. Based on the pattern of mean CWME for each cluster, a characteristic signature was assigned, which provided an estimation of the CWME of a cluster across scales of 1-2, 3-8, and 9-13 months relative to other stations. The validity of the homogeneous clusters demonstrated the usefulness of the proposed approach to regionalize precipitation. Further analysis based on wavelet coherence (WTC) was performed by selecting central rain gauges in each cluster and analyzing them against temperature, wind, Multivariate ENSO Index (MEI), East Atlantic (EA) and North Atlantic Oscillation (NAO) indices. The results revealed that all climatic features except NAO influenced precipitation in Iran during the 1960-2010 period. Copyright © 2018 Elsevier Inc. All rights reserved.
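A simplified stand-in for the wavelet-entropy-plus-clustering workflow is sketched below: each station is summarized by the Shannon entropy of its wavelet energy at a few decomposition levels and the resulting signatures are grouped with k-means. It assumes the PyWavelets package and uses a discrete wavelet decomposition and placeholder precipitation data rather than the paper's CWT-based multiscale entropy.

```python
import numpy as np
import pywt                                    # PyWavelets (assumed available)
from sklearn.cluster import KMeans

def wavelet_entropy_signature(series, wavelet="db4", level=4):
    """Shannon entropy of the wavelet energy at each decomposition level."""
    coeffs = pywt.wavedec(series, wavelet, level=level)
    ent = []
    for c in coeffs:
        e = c ** 2
        p = e / e.sum() if e.sum() > 0 else np.full_like(e, 1 / len(e))
        ent.append(-np.sum(p * np.log(p + 1e-12)))
    return np.array(ent)

# precip: (n_stations, n_months) monthly precipitation records (placeholder)
precip = np.random.default_rng(10).gamma(2.0, 30.0, size=(31, 612))
signatures = np.array([wavelet_entropy_signature(s) for s in precip])

regions = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(signatures)
```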
Clustering of samples and variables with mixed-type data
Edelmann, Dominic; Kopp-Schneider, Annette
2017-01-01
Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix. PMID:29182671
Clustering P-Wave Receiver Functions To Constrain Subsurface Seismic Structure
NASA Astrophysics Data System (ADS)
Chai, C.; Larmat, C. S.; Maceira, M.; Ammon, C. J.; He, R.; Zhang, H.
2017-12-01
The acquisition of high-quality data from permanent and temporary dense seismic networks provides the opportunity to apply statistical and machine learning techniques to a broad range of geophysical observations. Lekic and Romanowicz (2011) used clustering analysis on tomographic velocity models of the western United States to perform tectonic regionalization, and the velocity-profile clusters agree well with known geomorphic provinces. A complementary and somewhat less restrictive approach is to apply cluster analysis directly to geophysical observations. In this presentation, we apply clustering analysis to teleseismic P-wave receiver functions (RFs), continuing the efforts of Larmat et al. (2015) and Maceira et al. (2015). These earlier studies validated the approach with surface waves and stacked EARS RFs from the USArray stations. In this study, we experiment with both the K-means and hierarchical clustering algorithms. We also test different distance metrics defined in the vector space of RFs, following Lekic and Romanowicz (2011). We cluster data from two distinct data sets. The first, corresponding to the western US, was produced by smoothing/interpolation of the receiver-function wavefield (Chai et al. 2015). Spatial coherence and agreement with geologic region increase with this simpler, spatially smoothed set of observations. The second data set is composed of RFs for more than 800 stations of the China Digital Seismic Network (CSN). Preliminary results show a first-order agreement between clusters and tectonic regions, and each regional cluster includes a distinct Ps arrival, which probably reflects differences in crustal thickness. Regionalization remains an important step in characterizing a model prior to the application of full waveform and/or stochastic imaging techniques because of the computational expense of these types of studies. Machine learning techniques can provide valuable information that can be used to design and characterize formal geophysical inversion, providing information on spatial variability in the subsurface geology.
The cosmological analysis of X-ray cluster surveys - I. A new method for interpreting number counts
NASA Astrophysics Data System (ADS)
Clerc, N.; Pierre, M.; Pacaud, F.; Sadibekova, T.
2012-07-01
We present a new method aimed at simplifying the cosmological analysis of X-ray cluster surveys. It is based on purely instrumental observable quantities considered in a two-dimensional X-ray colour-magnitude diagram (hardness ratio versus count rate). The basic principle is that even in rather shallow surveys, substantial information on cluster redshift and temperature is present in the raw X-ray data and can be statistically extracted; in parallel, such diagrams can be readily predicted from an ab initio cosmological modelling. We illustrate the methodology for the case of a 100 deg² XMM survey having a sensitivity of ∼10⁻¹⁴ erg s⁻¹ cm⁻² and fit, at the same time, the survey selection function, the cluster evolutionary scaling relations and the cosmology; our sole assumption - driven by the limited size of the sample considered in the case study - is that the local cluster scaling relations are known. We devote special attention to the realistic modelling of the count-rate measurement uncertainties and evaluate the potential of the method via a Fisher analysis. In the absence of individual cluster redshifts, the count rate and hardness ratio (CR-HR) method appears to be much more efficient than the traditional approach based on cluster counts (i.e. dn/dz, requiring redshifts). In the case where redshifts are available, our method performs similarly to the traditional mass function (dn/dM/dz) for the purely cosmological parameters, but better constrains the parameters defining the cluster scaling relations and their evolution. A further practical advantage of the CR-HR method is its simplicity: this fully top-down approach entirely bypasses the tedious steps of deriving cluster masses from X-ray temperature measurements.
NASA Astrophysics Data System (ADS)
Borgelt, Christian
In clustering we often face the situation that only a subset of the available attributes is relevant for forming clusters, even though this may not be known beforehand. In such cases it is desirable to have a clustering algorithm that automatically weights attributes or even selects a proper subset. In this paper I study such an approach for fuzzy clustering, which is based on the idea of transferring an alternative to the fuzzifier (Klawonn and Höppner, What is fuzzy about fuzzy clustering? Understanding and improving the concept of the fuzzifier, In: Proc. 5th Int. Symp. on Intelligent Data Analysis, 254-264, Springer, Berlin, 2003) to attribute-weighting fuzzy clustering (Keller and Klawonn, Int J Uncertain Fuzziness Knowl Based Syst 8:735-746, 2000). In addition, by reformulating Gustafson-Kessel fuzzy clustering, a scheme for weighting and selecting principal axes can be obtained. While in Borgelt (Feature weighting and feature selection in fuzzy clustering, In: Proc. 17th IEEE Int. Conf. on Fuzzy Systems, IEEE Press, Piscataway, NJ, 2008) I already presented such an approach for a global selection of attributes and principal axes, this paper extends it to a cluster-specific selection, thus arriving at a fuzzy subspace clustering algorithm (Parsons, Haque, and Liu, 2004).
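For readers unfamiliar with the baseline algorithm being extended, the sketch below is a plain fuzzy c-means implementation in NumPy; it does not include the alternative fuzzifier, attribute weighting, or cluster-specific axis selection developed in the paper, and the toy data are invented.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Plain fuzzy c-means: returns cluster centers and membership matrix U (n x c)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        new_U = 1.0 / (dist ** (2.0 / (m - 1)))  # standard FCM membership update
        new_U /= new_U.sum(axis=1, keepdims=True)
        if np.abs(new_U - U).max() < tol:
            U = new_U
            break
        U = new_U
    return centers, U

# Toy data: two attributes, only the first of which separates the groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([6, 0], 1, (100, 2))])
centers, U = fuzzy_c_means(X, c=2)
print(centers.round(2))
```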
Guo, Qi; Lu, Xiaoni; Gao, Ya; Zhang, Jingjing; Yan, Bin; Su, Dan; Song, Anqi; Zhao, Xi; Wang, Gang
2017-03-07
Grading of essential hypertension according to blood pressure (BP) level may not adequately reflect the clinical heterogeneity of hypertensive patients. This study was carried out to explore clinical phenotypes in essential hypertensive patients using cluster analysis. The study recruited 513 hypertensive patients and evaluated BP variations with ambulatory blood pressure monitoring. Four distinct hypertension groups were identified using cluster analysis: (1) younger male smokers with relatively high BP had the most severe carotid plaque thickness but no coronary artery disease (CAD); (2) older women with relatively low diastolic BP had more diabetes; (3) non-smokers with a low systolic BP level had neither diabetes nor CAD; (4) hypertensive patients with BP reverse dipping were most likely to have CAD but had the least severe carotid plaque thickness. In binary logistic analysis, reverse dipping was significantly associated with the prevalence of CAD. Cluster analysis was shown to be a feasible approach for investigating the heterogeneity of essential hypertension in clinical studies. BP reverse dipping might be valuable for the prediction of CAD in hypertensive patients when compared with carotid plaque thickness. However, large-scale prospective trials with more information on plaque morphology are necessary to further compare the predictive power of BP dipping pattern and carotid plaque.
Shi, Weifang; Zeng, Weihua
2013-01-01
Reducing human vulnerability to chemical hazards in the industrialized city is a matter of great urgency. Vulnerability mapping is an alternative approach for providing vulnerability-reducing interventions in a region. This study presents a method for mapping human vulnerability to chemical hazards by using clustering analysis for effective vulnerability reduction. Taking the city of Shanghai as the study area, we measure human exposure to chemical hazards by using the proximity model, additionally considering the toxicity of hazardous substances, and capture sensitivity and coping capacity with corresponding indicators. We apply an improved k-means clustering approach based on a genetic algorithm, using a 500 m × 500 m geographical grid as the basic spatial unit. The sum of squared errors and the silhouette coefficient are combined to measure the quality of clustering and to determine the optimal number of clusters. The clustering result reveals a set of six typical human vulnerability patterns that show distinct combinations of vulnerability dimensions. The vulnerability mapping of the study area reflects cluster-specific vulnerability characteristics and their spatial distribution. Finally, we suggest specific measures that can provide new insights into rationally allocating the limited funds for vulnerability reduction in each cluster. PMID:23787337
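The sketch below shows how the two clustering-quality measures named in the abstract, the within-cluster sum of squared errors and the silhouette coefficient, can be combined to guide the choice of cluster number. It uses plain scikit-learn k-means rather than the genetic-algorithm-improved variant of the study, and the grid-cell features are synthetic placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical grid-cell feature matrix: exposure, sensitivity, coping capacity.
X = rng.random((2000, 3))

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse = km.inertia_                       # within-cluster sum of squared errors
    sil = silhouette_score(X, km.labels_)   # average silhouette coefficient
    print(f"k={k}  SSE={sse:,.1f}  silhouette={sil:.3f}")
```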
van Haaften, Rachel I M; Luceri, Cristina; van Erk, Arie; Evelo, Chris T A
2009-06-01
Omics technology used for large-scale measurements of gene expression is rapidly evolving. This work pointed out the need for extensive bioinformatics analysis of array quality before and after gene expression clustering and pathway analysis. A study focused on the effect of red wine polyphenols on rat colon mucosa was used to test the impact of quality control and normalisation steps on the biological conclusions. The integration of data visualization, pathway analysis and clustering revealed an artifact problem that was solved with an adapted normalisation. We propose a possible point-to-point standard analysis procedure, based on a combination of clustering and data visualization, for the analysis of microarray data.
Multi-viewpoint clustering analysis
NASA Technical Reports Server (NTRS)
Mehrotra, Mala; Wild, Chris
1993-01-01
In this paper, we address the feasibility of partitioning rule-based systems into a number of meaningful units to enhance the comprehensibility, maintainability and reliability of expert systems software. Preliminary results have shown that no single structuring principle or abstraction hierarchy is sufficient to understand complex knowledge bases. We therefore propose the Multi View Point - Clustering Analysis (MVP-CA) methodology to provide multiple views of the same expert system. We present the results of using this approach to partition a deployed knowledge-based system that navigates the Space Shuttle's entry. We also discuss the impact of this approach on verification and validation of knowledge-based systems.
Fontes, Cristiano Hora; Budman, Hector
2017-11-01
A clustering problem involving multivariate time series (MTS) requires the selection of similarity metrics. This paper shows the limitations of the PCA similarity factor (SPCA) as a single metric in nonlinear problems where there are differences in the magnitude of the same process variables due to expected changes in operating conditions. A novel method for clustering MTS based on a combination of SPCA and the average-based Euclidean distance (AED) within a fuzzy clustering approach is proposed. Case studies involving either simulated or real industrial data collected from a large-scale gas turbine are used to illustrate that the hybrid approach enhances the ability to recognize normal and faulty operating patterns. This paper also proposes an oversampling procedure to create synthetic multivariate time series that can be useful in commonly occurring situations involving unbalanced data sets. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
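The sketch below gives generic NumPy forms of the two similarity notions combined in the paper: a Krzanowski-style PCA similarity factor between the principal subspaces of two multivariate segments, and an average-based Euclidean distance between their mean operating levels. The exact definitions, weighting, and fuzzy-clustering integration used by the authors may differ.

```python
import numpy as np

def spca(X1, X2, k=2):
    """PCA similarity factor between two multivariate time-series segments:
    mean squared cosine of the angles between their first k principal directions."""
    def pcs(X):
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[:k].T                      # columns = leading PC directions
    U1, U2 = pcs(X1), pcs(X2)
    return np.trace(U1.T @ U2 @ U2.T @ U1) / k

def aed(X1, X2):
    """Average-based Euclidean distance: distance between per-variable means."""
    return np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0))

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 4))
B = A * 3.0 + 10.0            # same correlation structure, different magnitude
print(round(spca(A, B), 3))    # close to 1: similar principal subspaces
print(round(aed(A, B), 3))     # large: different operating levels
```

The example shows why a single metric can mislead: SPCA alone treats A and B as nearly identical even though their operating levels differ, which is the limitation the hybrid SPCA/AED approach is meant to address.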
A clustering approach to segmenting users of internet-based risk calculators.
Harle, C A; Downs, J S; Padman, R
2011-01-01
Risk calculators are widely available Internet applications that deliver quantitative health risk estimates to consumers. Although these tools are known to have varying effects on risk perceptions, little is known about who will be more likely to accept objective risk estimates. The goal was to identify clusters of online health consumers that help explain variation in individual improvement in risk perceptions from web-based quantitative disease risk information. A secondary analysis was performed on data collected in a field experiment that measured people's pre-diabetes risk perceptions before and after visiting a realistic health promotion website that provided quantitative risk information. K-means clustering was performed on numerous candidate variable sets, and the different segmentations were evaluated based on between-cluster variation in risk perception improvement. Variation in responses to risk information was best explained by clustering on pre-intervention absolute pre-diabetes risk perceptions and an objective estimate of personal risk. Members of a high-risk overestimator cluster showed large improvements in their risk perceptions, but clusters of both moderate-risk and high-risk underestimators were much more muted in improving their optimistically biased perceptions. Cluster analysis provided a unique approach for segmenting health consumers and predicting their acceptance of quantitative disease risk information. These clusters suggest that health consumers were very responsive to good news but tended not to incorporate much of the bad news into their self-perceptions. These findings help to quantify variation among online health consumers and may inform the targeted marketing of, and improvements to, risk communication tools on the Internet.
Clustering stocks using partial correlation coefficients
NASA Astrophysics Data System (ADS)
Jung, Sean S.; Chang, Woojin
2016-11-01
A partial correlation analysis is performed on the Korean stock market (KOSPI). The difference between the Pearson correlation and the partial correlation is analyzed, and it is found that when conditioned on the market return, Pearson correlation coefficients are generally greater than those of the partial correlation, which implies that the market return tends to drive up the correlation between stock returns. A clustering analysis is then performed to study the market structure given by the partial correlation analysis, and the members of the clusters are compared with the Global Industry Classification Standard (GICS). The initial hypothesis is that firms in the same GICS sector are clustered together since they operate in a similar business and environment. However, the result is inconsistent with the hypothesis: most clusters are a mix of multiple sectors, suggesting that the traditional approach of using sectors to determine the proximity between stocks may not be sufficient to diversify a portfolio.
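A minimal sketch of the residual-based partial correlation described above: two stock return series are regressed on the market return and the residuals are correlated. The tickers and return series are synthetic placeholders, not KOSPI data.

```python
import numpy as np

def partial_corr_given_market(x, y, m):
    """Correlation between x and y after removing the linear effect of market return m."""
    def residual(a, m):
        slope, intercept = np.polyfit(m, a, 1)
        return a - (slope * m + intercept)
    rx, ry = residual(x, m), residual(y, m)
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
market = rng.normal(0, 0.01, 1000)
stock_a = 1.2 * market + rng.normal(0, 0.01, 1000)
stock_b = 0.8 * market + rng.normal(0, 0.01, 1000)

print(round(np.corrcoef(stock_a, stock_b)[0, 1], 3))                  # inflated by the market factor
print(round(partial_corr_given_market(stock_a, stock_b, market), 3))  # near zero once conditioned
```

A dissimilarity such as 1 minus the partial correlation can then be fed into any standard clustering routine to compare the resulting groups with GICS sectors.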
Vigre, Håkan; Domingues, Ana Rita Coutinho Calado; Pedersen, Ulrik Bo; Hald, Tine
2016-03-01
The aim of the overall project was to develop a generic structured quantitative microbiological risk assessment (QMRA) model of human salmonellosis due to pork consumption in EU member states (MSs); within that project, the objective of the cluster analysis was to group the EU MSs according to the relative contribution of different pathways of Salmonella in the farm-to-consumption chain of pork products. By selecting a case-study MS from each cluster, the model was developed to represent different aspects of pig production, pork production, and consumption of pork products across EU states. MSs were aggregated into groups of countries with similar importance of the different Salmonella pathways using available, and where possible universal, register data related to pork production and consumption in each country. Based on MS-specific information about the distribution of (i) small and large farms, (ii) small and large slaughterhouses, (iii) the amount of pork meat consumed, and (iv) the amount of sausages consumed, we used nonhierarchical and hierarchical cluster analysis to group the MSs. The cluster solutions were validated internally using statistical measures and externally by comparing the clustered MSs with an estimated human incidence of salmonellosis due to pork products in the MSs. Finally, each cluster was characterized qualitatively using the cluster centroids. © 2016 Society for Risk Analysis.
An analysis of pilot error-related aircraft accidents
NASA Technical Reports Server (NTRS)
Kowalsky, N. B.; Masters, R. L.; Stone, R. B.; Babcock, G. L.; Rypka, E. W.
1974-01-01
A multidisciplinary team approach to pilot error-related U.S. air carrier jet aircraft accident investigation records successfully reclaimed hidden human error information not shown in statistical studies. New analytic techniques were developed and applied to the data to discover and identify multiple elements of commonality and shared characteristics within this group of accidents. Three techniques of analysis were used: critical element analysis, which demonstrated the importance of a subjective qualitative approach to raw accident data and surfaced information heretofore unavailable; cluster analysis, an exploratory research tool that will lead to increased understanding and improved organization of facts, the discovery of new meaning in large data sets, and the generation of explanatory hypotheses; and pattern recognition, by which accidents can be categorized by pattern conformity after critical element identification by cluster analysis.
Clustering gene expression data based on predicted differential effects of GV interaction.
Pan, Hai-Yan; Zhu, Jun; Han, Dan-Fu
2005-02-01
Microarray has become a popular biotechnology in biological and medical research. However, systematic and stochastic variabilities in microarray data are expected and unavoidable, so that the raw measurements carry inherent "noise" within microarray experiments. Currently, logarithmic ratios are usually analyzed directly by various clustering methods, which may introduce biased interpretations when identifying groups of genes or samples. In this paper, a statistical method based on mixed model approaches was proposed for microarray data cluster analysis. The underlying rationale of this method is to partition the observed total gene expression level into the variations caused by different factors using an ANOVA model, and to predict the differential effects of GV (gene by variety) interaction using the adjusted unbiased prediction (AUP) method. The predicted GV interaction effects can then be used as the inputs of cluster analysis. We illustrated the application of our method with a gene expression dataset and elucidated the utility of our approach using an external validation.
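Under the usual notation assumptions (not spelled out in the abstract), the kind of decomposition described can be written roughly as the model below; the symbols are illustrative, and the paper's mixed-model specification and AUP predictor involve additional random-effect structure.

```latex
y_{ijk} = \mu + G_i + V_j + (GV)_{ij} + \varepsilon_{ijk}
```

Here $y_{ijk}$ denotes the observed expression level of gene $i$ in variety $j$ and replicate $k$; the predicted $(GV)_{ij}$ effects, rather than the raw log-ratios, then form the input matrix for cluster analysis.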
Tobacco, Marijuana, and Alcohol Use in University Students: A Cluster Analysis
Primack, Brian A.; Kim, Kevin H.; Shensa, Ariel; Sidani, Jaime E.; Barnett, Tracey E.; Switzer, Galen E.
2012-01-01
Objective: Segmentation of populations may facilitate development of targeted substance abuse prevention programs. We aimed to partition a national sample of university students according to profiles based on substance use. Participants: We used 2008-2009 data from the National College Health Assessment from the American College Health Association. Our sample consisted of 111,245 individuals from 158 institutions. Method: We partitioned the sample using cluster analysis according to current substance use behaviors. We examined the association of cluster membership with individual and institutional characteristics. Results: Cluster analysis yielded six distinct clusters. Three individual factors (gender, year in school, and fraternity/sorority membership) were the most strongly associated with cluster membership. Conclusions: In a large sample of university students, we were able to identify six distinct patterns of substance use. It may be valuable to target specific populations of college-aged substance users based on individual factors. However, comprehensive intervention will require a multifaceted approach. PMID:22686360
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bush, B.; Melaina, M.; Penev, M.
This report describes the development and analysis of detailed temporal and spatial scenarios for early market hydrogen fueling infrastructure clustering and fuel cell electric vehicle rollout using the Scenario Evaluation, Regionalization and Analysis (SERA) model. The report provides an overview of the SERA scenario development framework and discusses the approach used to develop the nationwide scenario.
Murray, Nicholas P; Hunfalvay, Melissa
2017-02-01
Considerable research has documented that successful performance in interceptive tasks (such as the return of serve in tennis) is based on the performer's capability to capture appropriate anticipatory information prior to the flight path of the approaching object. Athletes of higher skill tend to fixate on different locations in the playing environment prior to the initiation of a skill than their lesser-skilled counterparts. The purpose of this study was to examine the visual search behaviour strategies of elite (world-ranked) tennis players and non-ranked competitive tennis players (n = 43) utilising cluster analysis. The results of hierarchical (Ward's method) and nonhierarchical (k-means) cluster analyses revealed three different clusters. The clustering method distinguished the visual behaviour of high-, middle-, and low-ranked players. Specifically, high-ranked players demonstrated longer mean fixation duration and lower variation of visual search than middle- and low-ranked players. In conclusion, the results demonstrated that cluster analysis is a useful tool for detecting and analysing areas of interest for use in experimental analysis of expertise and for distinguishing visual search variables among participants.
Johnson, Jacqueline L; Kreidler, Sarah M; Catellier, Diane J; Murray, David M; Muller, Keith E; Glueck, Deborah H
2015-11-30
We used theoretical and simulation-based approaches to study Type I error rates for one-stage and two-stage analytic methods for cluster-randomized designs. The one-stage approach uses the observed data as outcomes and accounts for within-cluster correlation using a general linear mixed model. The two-stage model uses the cluster specific means as the outcomes in a general linear univariate model. We demonstrate analytically that both one-stage and two-stage models achieve exact Type I error rates when cluster sizes are equal. With unbalanced data, an exact size α test does not exist, and Type I error inflation may occur. Via simulation, we compare the Type I error rates for four one-stage and six two-stage hypothesis testing approaches for unbalanced data. With unbalanced data, the two-stage model, weighted by the inverse of the estimated theoretical variance of the cluster means, and with variance constrained to be positive, provided the best Type I error control for studies having at least six clusters per arm. The one-stage model with Kenward-Roger degrees of freedom and unconstrained variance performed well for studies having at least 14 clusters per arm. The popular analytic method of using a one-stage model with denominator degrees of freedom appropriate for balanced data performed poorly for small sample sizes and low intracluster correlation. Because small sample sizes and low intracluster correlation are common features of cluster-randomized trials, the Kenward-Roger method is the preferred one-stage approach. Copyright © 2015 John Wiley & Sons, Ltd.
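A minimal illustration of the two analytic routes compared above: a linear mixed model fit to individual observations (one-stage) versus a simple comparison of cluster means (two-stage). It assumes pandas, statsmodels, and SciPy, uses a balanced synthetic design, and does not reproduce the Kenward-Roger correction or the variance-constrained weighting studied in the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
clusters, size = 20, 15                          # 10 clusters per arm, equal sizes
cluster_id = np.repeat(np.arange(clusters), size)
arm = (cluster_id % 2).astype(int)               # alternate cluster assignment to arms
cluster_effect = rng.normal(0, 0.5, clusters)[cluster_id]
y = 0.0 * arm + cluster_effect + rng.normal(0, 1.0, clusters * size)   # null treatment effect
df = pd.DataFrame({"y": y, "arm": arm, "cluster": cluster_id})

# One-stage: general linear mixed model with a random intercept per cluster.
one_stage = smf.mixedlm("y ~ arm", df, groups=df["cluster"]).fit()
print(one_stage.pvalues["arm"])

# Two-stage: cluster means as outcomes in a two-sample comparison.
means = df.groupby(["cluster", "arm"], as_index=False)["y"].mean()
t, p = stats.ttest_ind(means.loc[means.arm == 1, "y"], means.loc[means.arm == 0, "y"])
print(p)
```

With equal cluster sizes, as in this toy design, both routes control the Type I error rate; the paper's comparisons concern what happens when cluster sizes are unbalanced.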
Tropical cyclone risk analysis: a decisive role of its track
NASA Astrophysics Data System (ADS)
Chelsea Nam, C.; Park, Doo-Sun R.; Ho, Chang-Hoi
2016-04-01
The tracks of 85 tropical cyclones (TCs) that made landfall in South Korea during the period 1979-2010 are classified into four clusters by using a fuzzy c-means clustering method. The four clusters are characterized as 1) east-short, 2) east-long, 3) west-long, and 4) west-short, based on the moving routes around the Korean Peninsula. We conducted a risk comparison analysis for these four clusters regarding their hazards, exposure, and damages. Here, hazard parameters are calculated from two different sources independently, one from the best-track data (BT) and the other from the 60 weather stations over the country (WS). The results show distinct characteristics of the four clusters in terms of the hazard parameters and economic losses (EL), suggesting that there is a clear track dependency in the overall TC risk. It appears that whether an "effective collision" occurred outweighs the intensity of the TC per se. The EL ranking did not agree with the BT parameters (maximum wind speed, central pressure, or storm radius), but matched the WS parameters (especially daily accumulated rainfall and TC-influenced period). The west-approaching TCs (i.e., the west-long and west-short clusters) generally recorded larger EL than the east-approaching TCs (i.e., the east-short and east-long clusters), although the east-long cluster is the strongest from the BT point of view. This can be explained through the spatial distribution of the WS parameters and the corresponding regional EL maps. West-approaching TCs brought heavy rainfall to the southern regions, helped by the topographic effect along their tracks and by their extended stay over the Korean Peninsula during extratropical transition, conditions that did not apply to the east-approaching TCs. On the other hand, some regions had EL that was not directly proportional to the hazards, which is partly attributed to spatial disparity in wealth and vulnerability. Correlation analysis also revealed the importance of rainfall; daily accumulated rainfall is the hazard parameter most correlated with EL among all BT and WS parameters for all clusters except the east-short cluster. The least correlated hazard parameter is the storm radius, which showed significant correlations with EL only for the short clusters. In conclusion, this study suggests that the TC track is essential in determining the way a TC brings damage to South Korea. Thus, damage warnings and adaptation policy need to differ for different TC tracks, even though South Korea is relatively small compared with the average TC size.
Sul, Woo Jun; Cole, James R.; Jesus, Ederson da C.; Wang, Qiong; Farris, Ryan J.; Fish, Jordan A.; Tiedje, James M.
2011-01-01
High-throughput sequencing of 16S rRNA genes has increased our understanding of microbial community structure, but now even higher-throughput methods at the Illumina scale allow the creation of much larger datasets, with more samples and orders-of-magnitude more sequences, that swamp current analytic methods. We developed a method capable of handling these larger datasets on the basis of assignment of sequences into an existing taxonomy using a supervised learning approach (taxonomy-supervised analysis). We compared this method with a commonly used clustering approach based on sequence similarity (taxonomy-unsupervised analysis). We sampled 211 different bacterial communities from various habitats and obtained ∼1.3 million 16S rRNA sequences spanning the V4 hypervariable region by pyrosequencing. Both methodologies gave similar ecological conclusions in that β-diversity measures calculated by using these two types of matrices were significantly correlated to each other, as were the ordination configurations and hierarchical clustering dendrograms. In addition, our taxonomy-supervised analyses were also highly correlated with phylogenetic methods, such as UniFrac. The taxonomy-supervised analysis has the advantages that it is not limited by the exhaustive computation required for the alignment and clustering necessary for the taxonomy-unsupervised analysis, is more tolerant of sequencing errors, and allows comparisons when sequences are from different regions of the 16S rRNA gene. With the tremendous expansion in 16S rRNA data acquisition underway, the taxonomy-supervised approach offers the potential to provide more rapid and extensive community comparisons across habitats and samples. PMID:21873204
Li, Huanjie; Nickerson, Lisa D; Nichols, Thomas E; Gao, Jia-Hong
2017-03-01
Two powerful methods for statistical inference on MRI brain images have been proposed recently: a non-stationary voxelation-corrected cluster-size test (CST) based on random field theory, and threshold-free cluster enhancement (TFCE), which calculates the level of local support for a cluster and then uses permutation testing for inference. Unlike other statistical approaches, these two methods do not rest on the assumptions of a uniform and high degree of spatial smoothness of the statistic image. Thus, they are strongly recommended for group-level fMRI analysis compared to other statistical methods. In this work, the non-stationary voxelation-corrected CST and TFCE methods for group-level analysis were evaluated for both stationary and non-stationary images under varying smoothness levels, degrees of freedom and signal-to-noise ratios. Our results suggest that both methods provide adequate control for the number of voxel-wise statistical tests being performed during inference on fMRI data and that both are superior to current CSTs implemented in popular MRI data analysis software packages. However, TFCE is more sensitive and stable for group-level analysis of VBM data. Thus, the voxelation-corrected CST approach may confer some advantages by being computationally less demanding for fMRI data analysis than TFCE with permutation testing and by also being applicable to single-subject fMRI analyses, while the TFCE approach is advantageous for VBM data. Hum Brain Mapp 38:1269-1280, 2017. © 2016 Wiley Periodicals, Inc.
Lalonde, Michel; Wells, R Glenn; Birnie, David; Ruddy, Terrence D; Wassenaar, Richard
2014-07-01
Phase analysis of single photon emission computed tomography (SPECT) radionuclide angiography (RNA) has been investigated for its potential to predict the outcome of cardiac resynchronization therapy (CRT). However, phase analysis may be limited in its potential for predicting CRT outcome, as valuable information may be lost by assuming that time-activity curves (TAC) follow a simple sinusoidal shape. A new method, cluster analysis, is proposed which directly evaluates the TACs and may lead to a better understanding of dyssynchrony patterns and CRT outcome. Cluster analysis algorithms were developed and optimized to maximize their ability to predict CRT response. Forty-nine patients (N = 27 ischemic etiology) received a SPECT RNA scan as well as positron emission tomography (PET) perfusion and viability scans prior to undergoing CRT. A semiautomated algorithm sampled the left ventricle wall to produce 568 TACs from SPECT RNA data. The TACs were then subjected to two different cluster analysis techniques, K-means and normal average, where several input metrics were also varied to determine the optimal settings for the prediction of CRT outcome. Each TAC was assigned to a cluster group based on the comparison criteria, and global and segmental cluster sizes and scores were used as measures of dyssynchrony and used to predict response to CRT. A repeated random twofold cross-validation technique was used to train and validate the cluster algorithm. Receiver operating characteristic (ROC) analysis was used to calculate the area under the curve (AUC) and compare results to those obtained for SPECT RNA phase analysis and PET scar size analysis methods. Using the normal average cluster analysis approach, the septal wall produced statistically significant results for predicting CRT results in the ischemic population (ROC AUC = 0.73; p < 0.05 vs. equal chance ROC AUC = 0.50) with an optimal operating point of 71% sensitivity and 60% specificity. Cluster analysis results were similar to SPECT RNA phase analysis (ROC AUC = 0.78, p = 0.73 vs cluster AUC; sensitivity/specificity = 59%/89%) and PET scar size analysis (ROC AUC = 0.73, p = 1.0 vs cluster AUC; sensitivity/specificity = 76%/67%). A SPECT RNA cluster analysis algorithm was developed for the prediction of CRT outcome. Cluster analysis produced results equivalent to those obtained from Fourier and scar analysis.
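With entirely synthetic numbers, the sketch below shows the general shape of such an analysis: k-means on wall-sample time-activity curves, a per-patient cluster-based dyssynchrony score, and a ROC AUC against outcome. The normal-average comparison criterion, the cross-validation scheme, and the real data of the study are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_patients, n_samples, n_frames = 49, 568, 16
# Hypothetical time-activity curves: one row per wall sample, per patient.
tacs = rng.normal(size=(n_patients, n_samples, n_frames))
response = np.arange(n_patients) % 2            # placeholder CRT outcome labels

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(tacs.reshape(-1, n_frames))
labels = km.labels_.reshape(n_patients, n_samples)

# Per-patient dyssynchrony score: fraction of wall samples in a chosen "abnormal" cluster.
scores = (labels == 2).mean(axis=1)
print(round(roc_auc_score(response, scores), 2))
```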
NASA Astrophysics Data System (ADS)
Iswandhani, N.; Muhajir, M.
2018-03-01
This research was conducted in the Department of Statistics, Islamic University of Indonesia. The data used are primary data obtained from posts of the @explorejogja Instagram account from January to December 2016. The @explorejogja account features many tourist destinations that can be visited by both domestic and foreign tourists; it is therefore useful to form clusters of these destinations based on the number of likes from Instagram users, taken here as a measure of popularity. The purpose of this research is to determine the distribution of the most popular tourist spots, to form clusters of tourist destinations, and to identify the central popularity of tourist destinations based on the @explorejogja Instagram account in 2016. The statistical analyses used are descriptive statistics, k-means clustering, and social network analysis. The results of this research include the top 10 most popular destinations in Yogyakarta, an HTML-based map of the tourist destination distribution consisting of 121 destination points, three clusters (cluster 1 with 52 destinations, cluster 2 with 9 destinations, and cluster 3 with 60 destinations), and the central popularity of tourist destinations in the Special Region of Yogyakarta by district.
Spatial Analysis of Great Lakes Regional Icing Cloud Liquid Water Content
NASA Technical Reports Server (NTRS)
Ryerson, Charles C.; Koenig, George G.; Melloh, Rae A.; Meese, Debra A.; Reehorst, Andrew L.; Miller, Dean R.
2003-01-01
Clustering of cloud microphysical conditions, such as liquid water content (LWC) and drop size, can affect the rate and shape of ice accretion and the airworthiness of aircraft. Clustering may also degrade the accuracy of cloud LWC measurements from radars and microwave radiometers being developed by the government for remotely mapping icing conditions ahead of aircraft in flight. This paper evaluates spatial clustering of LWC in icing clouds using measurements collected during NASA research flights in the Great Lakes region. We used graphical and analytical approaches to describe clustering. The analytical approach involves determining the average size of clusters and computing a clustering intensity parameter. We analyzed flight data composed of 1-s-frequency LWC measurements for 12 periods ranging from 17.4 minutes (73 km) to 45.3 minutes (190 km) in duration. Graphically some flight segments showed evidence of consistency with regard to clustering patterns. Cluster intensity varied from 0.06, indicating little clustering, to a high of 2.42. Cluster lengths ranged from 0.1 minutes (0.6 km) to 4.1 minutes (17.3 km). Additional analyses will allow us to determine if clustering climatologies can be developed to characterize cluster conditions by region, time period, or weather condition.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sreepathi, Sarat; Kumar, Jitendra; Mills, Richard T.
A proliferation of data from vast networks of remote sensing platforms (satellites, unmanned aircraft systems (UAS), airborne etc.), observational facilities (meteorological, eddy covariance etc.), state-of-the-art sensors, and simulation models offers unprecedented opportunities for scientific discovery. Unsupervised classification is a widely applied data mining approach to derive insights from such data. However, classification of very large data sets is a complex computational problem that requires efficient numerical algorithms and implementations on high performance computing (HPC) platforms. Additionally, increasing power, space, cooling and efficiency requirements have led to the deployment of hybrid supercomputing platforms with complex architectures and memory hierarchies like the Titan system at Oak Ridge National Laboratory. The advent of such accelerated computing architectures offers new challenges and opportunities for big data analytics in general and specifically, large scale cluster analysis in our case. Although there is an existing body of work on parallel cluster analysis, those approaches do not fully meet the needs imposed by the nature and size of our large data sets. Moreover, they had scaling limitations and were mostly limited to traditional distributed memory computing platforms. We present a parallel Multivariate Spatio-Temporal Clustering (MSTC) technique based on k-means cluster analysis that can target hybrid supercomputers like Titan. We developed a hybrid MPI, CUDA and OpenACC implementation that can utilize both CPU and GPU resources on computational nodes. We describe performance results on Titan that demonstrate the scalability and efficacy of our approach in processing large ecological data sets.
Structure-Based Phylogenetic Analysis of the Lipocalin Superfamily.
Lakshmi, Balasubramanian; Mishra, Madhulika; Srinivasan, Narayanaswamy; Archunan, Govindaraju
2015-01-01
Lipocalins constitute a superfamily of extracellular proteins that are found in all three kingdoms of life. Although very divergent in their sequences and functions, they show remarkable similarity in 3-D structures. Lipocalins bind and transport small hydrophobic molecules. Earlier sequence-based phylogenetic studies of lipocalins highlighted that they have a long evolutionary history. However the molecular and structural basis of their functional diversity is not completely understood. The main objective of the present study is to understand functional diversity of the lipocalins using a structure-based phylogenetic approach. The present study with 39 protein domains from the lipocalin superfamily suggests that the clusters of lipocalins obtained by structure-based phylogeny correspond well with the functional diversity. The detailed analysis on each of the clusters and sub-clusters reveals that the 39 lipocalin domains cluster based on their mode of ligand binding though the clustering was performed on the basis of gross domain structure. The outliers in the phylogenetic tree are often from single member families. Also structure-based phylogenetic approach has provided pointers to assign putative function for the domains of unknown function in lipocalin family. The approach employed in the present study can be used in the future for the functional identification of new lipocalin proteins and may be extended to other protein families where members show poor sequence similarity but high structural similarity.
Computing and Applying Atomic Regulons to Understand Gene Expression and Regulation
Faria, José P.; Davis, James J.; Edirisinghe, Janaka N.; ...
2016-11-24
Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. A multitude of technologies, abstractions, and interpretive frameworks have emerged to answer the challenges presented by genome function and regulatory network inference. Here, we propose a new approach for producing biologically meaningful clusters of coexpressed genes, called Atomic Regulons (ARs), based on expression data, gene context, and functional relationships. We demonstrate this new approach by computing ARs for Escherichia coli, which we compare with the coexpressed gene clusters predicted by two prevalent existing methods: hierarchical clustering and k-means clustering. We test the consistency of ARs predicted by all methods against expected interactions predicted by the Context Likelihood of Relatedness (CLR) mutual information based method, finding that the ARs produced by our approach show better agreement with CLR interactions. We then apply our method to compute ARs for four other genomes: Shewanella oneidensis, Pseudomonas aeruginosa, Thermus thermophilus, and Staphylococcus aureus. We compare the AR clusters from all genomes to study the similarity of coexpression among a phylogenetically diverse set of species, identifying subsystems that show remarkable similarity over wide phylogenetic distances. We also study the sensitivity of our method for computing ARs to the expression data used in the computation, showing that our new approach requires less data than competing approaches to converge to a near final configuration of ARs. We go on to use our sensitivity analysis to identify the specific experiments that lead most rapidly to the final set of ARs for E. coli. As a result, this analysis produces insights into improving the design of gene expression experiments.
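The Atomic Regulon construction itself is not reproduced here; the sketch below only illustrates the kind of comparison mentioned between k-means and hierarchical coexpression clusters, using the adjusted Rand index as an agreement measure (the paper instead assesses consistency against CLR-predicted interactions). The expression matrix is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Hypothetical expression matrix: genes x conditions (log ratios).
expr = rng.normal(size=(500, 40))

k = 25
km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(expr)
hc_labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(expr)

# Agreement between the two partitions of the gene set.
print(round(adjusted_rand_score(km_labels, hc_labels), 3))
```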
Potashev, Konstantin; Sharonova, Natalia; Breus, Irina
2014-07-01
Clustering was employed for the analysis of an experimental data set (42 plants in total) on seed germination in leached chernozem contaminated with kerosene. The investigated plants comprised 31 cultivated plants from 11 families (27 species and 20 varieties) and 11 wild plant species from 7 families, 23 annual and 19 perennial/biannual plant species, and 11 monocotyledonous and 31 dicotyledonous plants. A two-dimensional (two-parameter) clustering approach, allowing the estimation of the tolerance of germinating seeds using a pair of independent parameters (C75%, V7%), was found to be most effective. These parameters characterized the ability of seeds both to withstand high concentrations of contaminants without a significant reduction in germination and to maintain a high germination rate within certain contaminant concentrations. The performed clustering revealed a number of plant features that define the relation of a particular plant to a particular tolerance cluster; it also demonstrated the possibility of generalizing the kerosene results to n-tridecane, which is one of the typical kerosene components. In contrast to "manual" plant ranking based on the assessment of germination at discrete concentrations of the contaminant, the proposed clustering approach allowed a generalized characterization of seed tolerance/sensitivity to hydrocarbon contaminants. Copyright © 2014 Elsevier B.V. All rights reserved.
Iterative Stable Alignment and Clustering of 2D Transmission Electron Microscope Images
Yang, Zhengfan; Fang, Jia; Chittuluru, Johnathan; Asturias, Francisco J.; Penczek, Pawel A.
2012-01-01
Identification of homogeneous subsets of images in a macromolecular electron microscopy (EM) image data set is a critical step in single-particle analysis. The task is handled by iterative algorithms, whose performance is compromised by the compounded limitations of image alignment and K-means clustering. Here we describe an approach, iterative stable alignment and clustering (ISAC) that, relying on a new clustering method and on the concepts of stability and reproducibility, can extract validated, homogeneous subsets of images. ISAC requires only a small number of simple parameters and, with minimal human intervention, can eliminate bias from two-dimensional image clustering and maximize the quality of group averages that can be used for ab initio three-dimensional structural determination and analysis of macromolecular conformational variability. Repeated testing of the stability and reproducibility of a solution within ISAC eliminates heterogeneous or incorrect classes and introduces critical validation to the process of EM image clustering. PMID:22325773
Motegi, Hiromi; Tsuboi, Yuuri; Saga, Ayako; Kagami, Tomoko; Inoue, Maki; Toki, Hideaki; Minowa, Osamu; Noda, Tetsuo; Kikuchi, Jun
2015-11-04
There is an increasing need to use multivariate statistical methods for understanding biological functions, identifying the mechanisms of diseases, and exploring biomarkers. In addition to classical analyses such as hierarchical cluster analysis, principal component analysis, and partial least squares discriminant analysis, various multivariate strategies, including independent component analysis, non-negative matrix factorization, and multivariate curve resolution, have recently been proposed. However, determining the number of components is problematic. Despite the proposal of several different methods, no satisfactory approach has yet been reported. To resolve this problem, we implemented a new idea: classifying a component as "reliable" or "unreliable" based on the reproducibility of its appearance, regardless of the number of components in the calculation. Using the clustering method for classification, we applied this idea to multivariate curve resolution-alternating least squares (MCR-ALS). Comparisons between conventional and modified methods applied to proton nuclear magnetic resonance (¹H-NMR) spectral datasets derived from known standard mixtures and biological mixtures (urine and feces of mice) revealed that more plausible results are obtained by the modified method. In particular, clusters containing little information were detected with reliability. This strategy, named "cluster-aided MCR-ALS," will facilitate the attainment of more reliable results in metabolomics datasets.
Cluster Analysis Identifies 3 Phenotypes within Allergic Asthma.
Sendín-Hernández, María Paz; Ávila-Zarza, Carmelo; Sanz, Catalina; García-Sánchez, Asunción; Marcos-Vadillo, Elena; Muñoz-Bellido, Francisco J; Laffond, Elena; Domingo, Christian; Isidoro-García, María; Dávila, Ignacio
Asthma is a heterogeneous chronic disease with different clinical expressions and responses to treatment. In recent years, several unbiased approaches based on clinical, physiological, and molecular features have described several phenotypes of asthma. Some phenotypes are allergic, but little is known about whether these phenotypes can be further subdivided. We aimed to phenotype patients with allergic asthma using an unbiased approach based on multivariate classification techniques (unsupervised hierarchical cluster analysis). From a total of 54 variables of 225 patients with well-characterized allergic asthma diagnosed following American Thoracic Society (ATS) recommendation, positive skin prick test to aeroallergens, and concordant symptoms, we finally selected 19 variables by multiple correspondence analyses. Then a cluster analysis was performed. Three groups were identified. Cluster 1 was constituted by patients with intermittent or mild persistent asthma, without family antecedents of atopy, asthma, or rhinitis. This group showed the lowest total IgE levels. Cluster 2 was constituted by patients with mild asthma with a family history of atopy, asthma, or rhinitis. Total IgE levels were intermediate. Cluster 3 included patients with moderate or severe persistent asthma that needed treatment with corticosteroids and long-acting β-agonists. This group showed the highest total IgE levels. We identified 3 phenotypes of allergic asthma in our population. Furthermore, we described 2 phenotypes of mild atopic asthma mainly differentiated by a family history of allergy. Copyright © 2017 American Academy of Allergy, Asthma & Immunology. Published by Elsevier Inc. All rights reserved.
Systematic Association of Genes to Phenotypes by Genome and Literature Mining
Jensen, Lars J; Perez-Iratxeta, Carolina; Kaczanowski, Szymon; Hooper, Sean D; Andrade, Miguel A
2005-01-01
One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far no global analysis has attempted to explore those connections in the light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes we retrieve 323 clusters containing a total of 2,700 significant gene–phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases. PMID:15799710
Multivariate time series clustering on geophysical data recorded at Mt. Etna from 1996 to 2003
NASA Astrophysics Data System (ADS)
Di Salvo, Roberto; Montalto, Placido; Nunnari, Giuseppe; Neri, Marco; Puglisi, Giuseppe
2013-02-01
Time series clustering is an important task in data analysis, used to extract implicit, previously unknown, and potentially useful information from a large collection of data. Finding useful similar trends in multivariate time series represents a challenge in several areas, including geophysics and environmental research. While traditional time series analysis methods deal only with univariate time series, multivariate time series analysis is a more suitable approach in fields of research where different kinds of data are available. Moreover, conventional time series clustering techniques do not provide the desired results for geophysical datasets due to the huge amount of data, whose sampling rates differ according to the nature of each signal. In this paper, a novel approach to geophysical multivariate time series clustering is proposed using dynamic time series segmentation and Self-Organizing Map techniques. This method allows finding couplings among the trends of different geophysical data recorded by monitoring networks at Mt. Etna spanning from 1996 to 2003, when the transition from summit eruptions to flank eruptions occurred. This information can be used to carry out a more careful evaluation of the state of the volcano and to define potential hazard assessment at Mt. Etna.
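A minimal sketch of the SOM step of such a workflow, assuming the third-party minisom package; the feature vectors stand in for segmented multivariate time-series descriptors and are not the Mt. Etna data, and the dynamic segmentation stage is not shown.

```python
import numpy as np
from minisom import MiniSom   # assumed third-party SOM package

rng = np.random.default_rng(0)
# Hypothetical multivariate feature vectors, one per time-series segment
# (e.g., summary statistics of different geophysical signals).
segments = rng.normal(size=(300, 6))

som = MiniSom(4, 4, input_len=segments.shape[1], sigma=1.0,
              learning_rate=0.5, random_seed=0)
som.random_weights_init(segments)
som.train_random(segments, 2000)

# Map each segment to its best-matching unit; units act as cluster prototypes.
bmus = [som.winner(s) for s in segments]
print(bmus[:5])
```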
The Cluster Sensitivity Index: A Basic Measure of Classification Robustness
ERIC Educational Resources Information Center
Hom, Willard C.
2010-01-01
Analysts of institutional performance have occasionally used a peer grouping approach in which they compared institutions only to other institutions with similar characteristics. Because analysts historically have used cluster analysis to define peer groups (i.e., the group of comparable institutions), the author proposes and demonstrates with…
USDA-ARS's Scientific Manuscript database
Risk factors for obesity and weight gain are typically evaluated individually while "adjusting for" the influence of other confounding factors, and few studies, if any, have created risk profiles by clustering risk factors. We identified subgroups of postmenopausal women homogeneous in their cluster...
ERIC Educational Resources Information Center
Daniels, Lia M.; Haynes, Tara L.; Stupnisky, Robert H.; Perry, Raymond P.; Newall, Nancy E.; Pekrun, Reinhard
2008-01-01
Within achievement goal theory debate remains regarding the adaptiveness of certain combinations of goals. Assuming a multiple-goals perspective, we used cluster analysis to classify 1002 undergraduate students according to their mastery and performance-approach goals. Four clusters emerged, representing different goal combinations: high…
Energy Levels and Co-evolution of Product Innovation in Supply Chain Clusters
NASA Astrophysics Data System (ADS)
Ji, Guojun
In the last decade the supply chain clusters phenomenon has emerged as a new approach in product innovation studies. This article makes three contributions to the approach by addressing some open issues. The first contribution is to explicitly incorporate energy levels in the analysis; hence, the unit of analysis is widened from sectoral systems of innovation to socio-technical systems. The second contribution is to suggest an analytical distinction between different evolution methods, the actors involved in them, and the institutions which guide actors' perceptions and activities. Thirdly, the article opens up the black box of institutions, making them an integral part of the supply chain. The article provides a coherent conceptual multi-level perspective, using insights from sociology, institutional theory and innovation studies. The perspective is particularly useful for analyzing the long-term dynamics of the supply chain clusters phenomenon, shifts from one energy level to another, and the co-evolution of product innovation.
Pandini, Alessandro; Fraccalvieri, Domenico; Bonati, Laura
2013-01-01
The biological function of proteins is strictly related to their molecular flexibility and dynamics: enzymatic activity, protein-protein interactions, ligand binding and allosteric regulation are important mechanisms involving protein motions. Computational approaches, such as Molecular Dynamics (MD) simulations, are now routinely used to study the intrinsic dynamics of target proteins as well as to complement molecular docking approaches. These methods have also successfully supported the process of rational design and discovery of new drugs. Identification of functionally relevant conformations is a key step in these studies. This is generally done by cluster analysis of the ensemble of structures in the MD trajectory. Recently, Artificial Neural Network (ANN) approaches, in particular methods based on Self-Organising Maps (SOMs), have been reported to perform more accurately and provide more consistent results than traditional clustering algorithms in various data-mining problems. In the specific case of conformational analysis, SOMs have been successfully used to compare multiple ensembles of protein conformations, demonstrating their potential for efficiently detecting the dynamic signatures central to biological function. Moreover, examples of the use of SOMs to address problems relevant to other stages of the drug-design process, including clustering of docking poses, have been reported. In this contribution we review recent applications of ANN algorithms in analysing conformational and structural ensembles and we discuss their potential in computer-based approaches for medicinal chemistry.
Dupuy, Céline; Morignat, Eric; Maugey, Xavier; Vinard, Jean-Luc; Hendrikx, Pascal; Ducrot, Christian; Calavas, Didier; Gay, Emilie
2013-04-30
The slaughterhouse is a central processing point for food animals and thus a source of both demographic data (age, breed, sex) and health-related data (reason for condemnation and condemned portions) that are not available through other sources. Using these data for syndromic surveillance is therefore tempting. However, many possible reasons for condemnation and condemned portions exist, making the definition of relevant syndromes challenging. The objective of this study was to determine a typology of cattle with at least one portion of the carcass condemned in order to define syndromes. Multiple factor analysis (MFA) in combination with clustering methods was performed using both health-related data and demographic data. Analyses were performed on 381,186 cattle with at least one portion of the carcass condemned among the 1,937,917 cattle slaughtered in ten French abattoirs. Results of the MFA and clustering methods led to 12 clusters considered as stable according to year of slaughter and slaughterhouse. One cluster was specific to a disease of public health importance (cysticercosis). Two clusters were linked to the slaughtering process (fecal contamination of heart or lungs and deterioration lesions). Two clusters respectively characterized by chronic liver lesions and chronic peritonitis could be linked to diseases of economic importance to farmers. Three clusters could be linked respectively to reticulo-pericarditis, fatty liver syndrome and farmer's lung syndrome, which are related to both diseases of economic importance to farmers and herd management issues. Three clusters respectively characterized by arthritis, myopathy and Dark Firm Dry (DFD) meat could notably be linked to animal welfare issues. Finally, one cluster, characterized by bronchopneumonia, could be linked to both animal health and herd management issues. The statistical approach of combining multiple factor analysis with cluster analysis showed its relevance for the detection of syndromes using available large and complex slaughterhouse data. The advantages of this statistical approach are to i) define groups of reasons for condemnation based on meat inspection data, ii) help group reasons for condemnation among a list of various possible reasons for condemnation for which a consensus among experts could be difficult to reach, and iii) assign each animal to a single syndrome, which allows changes in syndrome trends to be detected, revealing unusual patterns in known diseases and the emergence of new diseases.
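The following rough Python sketch illustrates the MFA-then-clustering workflow under simplifying assumptions: all variables are numeric, and each group of columns is balanced by the inverse of its first singular value before a global PCA, followed by hierarchical clustering. The data and group boundaries are made up for illustration; this is not the authors' implementation.

```python
# Approximate multiple factor analysis (MFA) followed by Ward clustering.
import numpy as np
from numpy.linalg import svd
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))                      # 500 carcasses, 8 numeric variables
groups = [slice(0, 5), slice(5, 8)]                # health-related vs demographic columns

blocks = []
for g in groups:
    block = X[:, g] - X[:, g].mean(axis=0)         # centre each variable group
    s1 = svd(block, compute_uv=False)[0]           # first singular value of the group
    blocks.append(block / s1)                      # balance the groups' influence
X_mfa = np.hstack(blocks)

scores = PCA(n_components=4).fit_transform(X_mfa)  # global components
labels = AgglomerativeClustering(n_clusters=12, linkage="ward").fit_predict(scores)
print(np.bincount(labels))                         # cluster sizes
```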
Substructures in DAFT/FADA survey clusters based on XMM and optical data
NASA Astrophysics Data System (ADS)
Durret, F.; DAFT/FADA Team
2014-07-01
The DAFT/FADA survey was initiated to perform weak lensing tomography on a sample of 90 massive clusters in the redshift range [0.4,0.9] with HST imaging available. The complementary deep multiband imaging constitutes a high quality imaging data base for these clusters. In X-rays, we have analysed the XMM-Newton and/or Chandra data available for 32 clusters, and for 23 clusters we fit the X-ray emissivity with a beta-model and subtract it to search for substructures in the X-ray gas. This study was coupled with a dynamical analysis for the 18 clusters with at least 15 spectroscopic galaxy redshifts in the cluster range, based on a Serna & Gerbal (SG) analysis. We detected ten substructures in eight clusters by both methods (X-rays and SG). The percentage of mass included in substructures is found to be roughly constant with redshift, with values of 5-15%. Most of the substructures detected both in X-rays and with the SG method are found to be relatively recent infalls, probably at their first cluster pericenter approach.
NASA Astrophysics Data System (ADS)
Truong, Thanh N.; Stefanovich, Eugene V.
1997-05-01
We present a study of micro-solvation of the Cl anion by water clusters of up to seven molecules using a perturbative Monte Carlo approach with a hybrid HF/MM potential. In this approach, perturbation theory is used to avoid performing full SCF calculations at every Monte Carlo step. In this study, the anion is treated quantum mechanically at the HF/6-31G∗ level of theory, while interactions between solvent waters are represented by the TIP3P force field. Analysis of the solvent-induced dipole moment of the ion indicates that the Cl anion resides most of the time on the surface of the clusters. The accuracy of the perturbative MC approach is also discussed.
NASA Astrophysics Data System (ADS)
Nugroho, P.
2018-02-01
The existence of creative industries is inseparable from the underlying social constructs that provide sources of creativity and innovation. The working of social capital in a society facilitates information exchange, knowledge transfer and technology acquisition within the industry through social networks. As a result, a socio-spatial divide exists in directing the growth of the creative industries. This paper aims to examine how such a socio-spatial divide contributes to local creative industry development in the Semarang and Kudus batik clusters. An explanatory sequential mixed-methods approach, a quantitative phase followed by a qualitative phase, is chosen to better understand the interplay between tangible and intangible variables in the local batik clusters. Surveys of secondary data taken from government statistics and reports, previous studies, and media coverage are carried out in the former phase to identify the clustering pattern of the local batik industry and the local embeddedness factors that have shaped the existing business environment. In-depth interviews, content analysis, and field observations are used in the latter phase to explore reciprocal relationships between the elements of social capital and local batik cluster development. The results demonstrate that particular social ties have determined the forms of spatial proximity manifested in forward and backward business linkages. Trust, shared norms, and inherited traditions are the key social capital attributes that lead to such a socio-spatial divide. Therefore, the intermediating roles of bridging actors are necessary to encourage cooperation among the participating stakeholders for better cluster development.
A new approach for evaluating flexible working hours.
Giebel, Ole; Janssen, Daniela; Schomann, Carsten; Nachreiner, Friedhelm
2004-01-01
Recent studies on flexible working hours show that at least some of these working time arrangements seem to be associated with impairing effects on health and well-being. According to the available evidence, the variability of working hours seems to play an important role. The question, however, is how this variability can be assessed and used to explain or predict impairments. Based on earlier methods used to assess shift-work effects, a time series analysis approach was applied to flexible working hours. Four weeks of working-hours data from 137 respondents, derived from a survey on flexible work hours involving 15 companies from different production and service sectors in Germany, were converted to time series and analyzed by spectral analysis. A cluster analysis of the resulting power spectra yielded 5 clusters of flexible work hours. Analyzing these clusters for differences in reported impairments showed that workers whose schedules suppressed circadian and weekly rhythms experienced the most severe impairments, especially in circadian-controlled functions like sleep and digestion. The results thus indicate that analyzing the periodicity of flexible working hours is a promising approach for predicting impairments, which should be investigated further in the future.
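A minimal Python sketch of the spectral-analysis step described above, using synthetic schedules: each respondent's four-week working-hours record is converted to a power spectrum, and the spectra are then clustered. Variable names and the use of K-means are illustrative assumptions.

```python
# Convert working-hours time series to power spectra and cluster the spectra.
import numpy as np
from scipy.signal import periodogram
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
hours = 24 * 28                                      # 4 weeks at hourly resolution
workers = rng.integers(0, 2, size=(137, hours))      # 1 = at work (synthetic schedules)

spectra = []
for series in workers:
    freqs, power = periodogram(series.astype(float), fs=24.0)   # frequencies in cycles/day
    spectra.append(power)
spectra = np.array(spectra)

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(spectra)
# A suppressed peak near 1 cycle/day (circadian) or 1/7 cycle/day (weekly) in a
# cluster's mean spectrum would flag potentially impairing schedules.
print(np.bincount(labels))
```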
Visualizing Confidence in Cluster-Based Ensemble Weather Forecast Analyses.
Kumpf, Alexander; Tost, Bianca; Baumgart, Marlene; Riemer, Michael; Westermann, Rudiger; Rautenhaus, Marc
2018-01-01
In meteorology, cluster analysis is frequently used to determine representative trends in ensemble weather predictions in a selected spatio-temporal region, e.g., to reduce a set of ensemble members to simplify and improve their analysis. Identified clusters (i.e., groups of similar members), however, can be very sensitive to small changes of the selected region, so that clustering results can be misleading and bias subsequent analyses. In this article, we, a team of visualization scientists and meteorologists, deliver visual analytics solutions to analyze the sensitivity of clustering results with respect to changes of a selected region. We propose an interactive visual interface that enables simultaneous visualization of a) the variation in composition of identified clusters (i.e., their robustness), b) the variability in cluster membership for individual ensemble members, and c) the uncertainty in the spatial locations of identified trends. We demonstrate that our solution shows meteorologists how representative a clustering result is, and with respect to which changes in the selected region it becomes unstable. Furthermore, our solution helps to identify those ensemble members which stably belong to a given cluster and can thus be considered similar. In a real-world application case we show how our approach is used to analyze the clustering behavior of different regions in a forecast of "Tropical Cyclone Karl", guiding the user towards the cluster robustness information required for subsequent ensemble analysis.
Scoring clustering solutions by their biological relevance.
Gat-Viks, I; Sharan, R; Shamir, R
2003-12-12
A central step in the analysis of gene expression data is the identification of groups of genes that exhibit similar expression patterns. Clustering gene expression data into homogeneous groups was shown to be instrumental in functional annotation, tissue classification, regulatory motif identification, and other applications. Although there is a rich literature on clustering algorithms for gene expression analysis, very few works addressed the systematic comparison and evaluation of clustering results. Typically, different clustering algorithms yield different clustering solutions on the same data, and there is no agreed upon guideline for choosing among them. We developed a novel statistically based method for assessing a clustering solution according to prior biological knowledge. Our method can be used to compare different clustering solutions or to optimize the parameters of a clustering algorithm. The method is based on projecting vectors of biological attributes of the clustered elements onto the real line, such that the ratio of between-groups and within-group variance estimators is maximized. The projected data are then scored using a non-parametric analysis of variance test, and the score's confidence is evaluated. We validate our approach using simulated data and show that our scoring method outperforms several extant methods, including the separation to homogeneity ratio and the silhouette measure. We apply our method to evaluate results of several clustering methods on yeast cell-cycle gene expression data. The software is available from the authors upon request.
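The following hedged Python sketch illustrates the core scoring idea: project multivariate biological attributes of the clustered genes onto one axis that maximises the between/within-group variance ratio (here via linear discriminant analysis as a stand-in for the paper's projection), then score the separation with a non-parametric ANOVA (Kruskal-Wallis). All inputs are synthetic.

```python
# Score a clustering solution against prior biological attributes.
import numpy as np
from scipy.stats import kruskal
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
attributes = rng.normal(size=(300, 6))       # e.g. annotation-derived attribute vectors
labels = rng.integers(0, 4, size=300)        # clustering solution to be scored

projection = LinearDiscriminantAnalysis(n_components=1).fit_transform(attributes, labels)
groups = [projection[labels == k].ravel() for k in np.unique(labels)]
stat, p_value = kruskal(*groups)             # non-parametric analysis of variance
print(f"H = {stat:.2f}, p = {p_value:.3g}")  # smaller p -> better biological separation
```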
Groundwater Quality: Analysis of Its Temporal and Spatial Variability in a Karst Aquifer.
Pacheco Castro, Roger; Pacheco Ávila, Julia; Ye, Ming; Cabrera Sansores, Armando
2018-01-01
This study develops an approach based on hierarchical cluster analysis for investigating the spatial and temporal variation of the processes governing water quality. The water quality data used in this study were collected in the karst aquifer of Yucatan, Mexico, the only source of drinking water for a population of nearly two million people. Hierarchical cluster analysis was applied to the quality data of all the sampling periods lumped together. This was motivated by the observation that, if water quality does not vary significantly in time, two samples from the same sampling site will belong to the same cluster. The resulting distribution maps of clusters and box-plots of the major chemical components reveal the spatial and temporal variability of groundwater quality. Principal component analysis was used to verify the results of cluster analysis and to derive the variables that explained most of the variation of the groundwater quality data. Results of this work increase the knowledge about how precipitation and human contamination impact groundwater quality in Yucatan. Spatial variability of groundwater quality in the study area is caused by: a) seawater intrusion and sulfate-rich groundwater in the west and along the coast, b) water-rock interactions and the average annual precipitation in the middle and east zones, respectively, and c) human contamination present in two localized zones. Changes in the amount and distribution of precipitation cause temporal variation by diluting groundwater in the aquifer. This approach makes it possible to analyze the processes controlling groundwater quality variation efficiently and simultaneously. © 2017, National Ground Water Association.
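A minimal sketch of the workflow described above, assuming a table of major-ion concentrations with all sampling periods lumped together (synthetic data; column names and the number of clusters are placeholders):

```python
# Hierarchical clustering of water-quality samples, cross-checked with PCA.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
samples = rng.lognormal(mean=1.0, sigma=0.5, size=(200, 7))   # e.g. Cl, SO4, NO3, Ca, Mg, Na, HCO3

Z = StandardScaler().fit_transform(np.log(samples))           # standardise log concentrations
tree = linkage(Z, method="ward")
clusters = fcluster(tree, t=4, criterion="maxclust")          # cut the dendrogram into 4 groups

pca = PCA(n_components=2).fit(Z)                              # check which variables drive the grouping
print(pca.explained_variance_ratio_, np.bincount(clusters)[1:])
```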
Assessment of cluster yield components by image analysis.
Diago, Maria P; Tardaguila, Javier; Aleixos, Nuria; Millan, Borja; Prats-Montalban, Jose M; Cubero, Sergio; Blasco, Jose
2015-04-01
Berry weight, berry number and cluster weight are key parameters for yield estimation in the wine and table grape industry. Current yield prediction methods are destructive, labour-demanding and time-consuming. In this work, a new methodology based on image analysis was developed to determine cluster yield components in a fast and inexpensive way. Clusters of seven different red varieties of grapevine (Vitis vinifera L.) were photographed under laboratory conditions and their cluster yield components manually determined after image acquisition. Two algorithms, based on the Canny and the logarithmic image processing approaches, were tested to find the contours of the berries in the images prior to berry detection performed by means of the Hough Transform. Results were obtained in two ways: by analysing either a single image of the cluster or four images per cluster taken from different orientations. The best results (R² between 69% and 95% in berry detection and between 65% and 97% in cluster weight estimation) were achieved using four images and the Canny algorithm. The capability of the image-based model to predict berry weight was 84%. The new and low-cost methodology presented here enabled the assessment of cluster yield components, saving time and providing inexpensive information in comparison with current manual methods. © 2014 Society of Chemical Industry.
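A sketch of the image-analysis step in Python with OpenCV, assuming a single laboratory photograph of a grape cluster; the file name and all thresholds are hypothetical placeholders that would need tuning, and this is not the authors' pipeline.

```python
# Edge detection plus circular Hough transform to count berries in one image.
import cv2
import numpy as np

image = cv2.imread("cluster.jpg")                 # hypothetical file name
if image is None:
    raise FileNotFoundError("cluster.jpg not found")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)

edges = cv2.Canny(gray, 50, 150)                  # edge map for visual inspection of contours
cv2.imwrite("edges.png", edges)                   # (HoughCircles runs its own Canny internally)

circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=15,
                           param1=150, param2=30, minRadius=8, maxRadius=25)
berry_count = 0 if circles is None else circles.shape[1]
# Berry number (and a calibrated mean berry size) would then feed a regression
# model for cluster weight, as in the study.
print(f"berries detected: {berry_count}")
```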
Sputum neutrophil counts are associated with more severe asthma phenotypes using cluster analysis.
Moore, Wendy C; Hastie, Annette T; Li, Xingnan; Li, Huashi; Busse, William W; Jarjour, Nizar N; Wenzel, Sally E; Peters, Stephen P; Meyers, Deborah A; Bleecker, Eugene R
2014-06-01
Clinical cluster analysis from the Severe Asthma Research Program (SARP) identified 5 asthma subphenotypes that represent the severity spectrum of early-onset allergic asthma, late-onset severe asthma, and severe asthma with chronic obstructive pulmonary disease characteristics. Analysis of induced sputum from a subset of SARP subjects showed 4 sputum inflammatory cellular patterns. Subjects with concurrent increases in eosinophil (≥2%) and neutrophil (≥40%) percentages had characteristics of very severe asthma. To better understand interactions between inflammation and clinical subphenotypes, we integrated inflammatory cellular measures and clinical variables in a new cluster analysis. Participants in SARP who underwent sputum induction at 3 clinical sites were included in this analysis (n = 423). Fifteen variables, including clinical characteristics and blood and sputum inflammatory cell assessments, were selected using factor analysis for unsupervised cluster analysis. Four phenotypic clusters were identified. Cluster A (n = 132) and B (n = 127) subjects had mild-to-moderate early-onset allergic asthma with paucigranulocytic or eosinophilic sputum inflammatory cell patterns. In contrast, these inflammatory patterns were present in only 7% of cluster C (n = 117) and D (n = 47) subjects who had moderate-to-severe asthma with frequent health care use despite treatment with high doses of inhaled or oral corticosteroids and, in cluster D, reduced lung function. The majority of these subjects (>83%) had sputum neutrophilia either alone or with concurrent sputum eosinophilia. Baseline lung function and sputum neutrophil percentages were the most important variables determining cluster assignment. This multivariate approach identified 4 asthma subphenotypes representing the severity spectrum from mild-to-moderate allergic asthma with minimal or eosinophil-predominant sputum inflammation to moderate-to-severe asthma with neutrophil-predominant or mixed granulocytic inflammation. Published by Mosby, Inc.
Sputum neutrophils are associated with more severe asthma phenotypes using cluster analysis
Moore, Wendy C.; Hastie, Annette T.; Li, Xingnan; Li, Huashi; Busse, William W.; Jarjour, Nizar N.; Wenzel, Sally E.; Peters, Stephen P.; Meyers, Deborah A.; Bleecker, Eugene R.
2013-01-01
Background Clinical cluster analysis from the Severe Asthma Research Program (SARP) identified five asthma subphenotypes that represent the severity spectrum of early onset allergic asthma, late onset severe asthma and severe asthma with COPD characteristics. Analysis of induced sputum from a subset of SARP subjects showed four sputum inflammatory cellular patterns. Subjects with concurrent increases in eosinophils (≥2%) and neutrophils (≥40%) had characteristics of very severe asthma. Objective To better understand interactions between inflammation and clinical subphenotypes we integrated inflammatory cellular measures and clinical variables in a new cluster analysis. Methods Participants in SARP at three clinical sites who underwent sputum induction were included in this analysis (n=423). Fifteen variables including clinical characteristics and blood and sputum inflammatory cell assessments were selected by factor analysis for unsupervised cluster analysis. Results Four phenotypic clusters were identified. Cluster A (n=132) and B (n=127) subjects had mild-moderate early onset allergic asthma with paucigranulocytic or eosinophilic sputum inflammatory cell patterns. In contrast, these inflammatory patterns were present in only 7% of Cluster C (n=117) and D (n=47) subjects who had moderate-severe asthma with frequent health care utilization despite treatment with high doses of inhaled or oral corticosteroids, and in Cluster D, reduced lung function. The majority of these subjects (>83%) had sputum neutrophilia either alone or with concurrent sputum eosinophilia. Baseline lung function and sputum neutrophils were the most important variables determining cluster assignment. Conclusion This multivariate approach identified four asthma subphenotypes representing the severity spectrum from mild-moderate allergic asthma with minimal or eosinophilic predominant sputum inflammation to moderate-severe asthma with neutrophilic predominant or mixed granulocytic inflammation. PMID:24332216
Daniels, Lia M.; Radil, Amanda I.; Goegan, Lauren D.
2017-01-01
Pre-service and practicing teachers feel responsible for a range of educational activities. Four domains of personal responsibility emerging in the literature are: student achievement, student motivation, relationships with students, and responsibility for one's own teaching. To date, most research has used variable-centered approaches to examining responsibilities, even though the domains appear related. In two separate samples we used cluster analysis to explore how pre-service (n = 130) and practicing (n = 105) teachers combined personal responsibilities and their impact on three professional cognitions and their wellbeing. Both groups had low and high responsibility clusters but the third cluster differed: Pre-service teachers combined responsibilities for relationships and their own teaching in a cluster we refer to as teacher-based responsibility; whereas practicing teachers combined achievement and motivation in a cluster we refer to as student-outcome focused responsibility. These combinations affected outcomes for pre-service but not practicing teachers. Pre-service teachers in the low responsibility cluster reported less engagement, less use of mastery approaches to instruction, and more performance goal structures than the other two clusters. PMID:28620332
Clustering analysis of moving target signatures
NASA Astrophysics Data System (ADS)
Martone, Anthony; Ranney, Kenneth; Innocenti, Roberto
2010-04-01
Previously, we developed a moving target indication (MTI) processing approach to detect and track slow-moving targets inside buildings, which successfully detected moving targets (MTs) from data collected by a low-frequency, ultra-wideband radar. Our MTI algorithms include change detection, automatic target detection (ATD), clustering, and tracking. The MTI algorithms can be implemented in a real-time or near-real-time system; however, a person-in-the-loop is needed to select input parameters for the clustering algorithm. Specifically, the number of clusters to input into the cluster algorithm is unknown and requires manual selection. A critical need exists to automate all aspects of the MTI processing formulation. In this paper, we investigate two techniques that automatically determine the number of clusters: the adaptive knee-point (KP) algorithm and the recursive pixel finding (RPF) algorithm. The KP algorithm is based on a well-known heuristic approach for determining the number of clusters. The RPF algorithm is analogous to the image processing, pixel labeling procedure. Both algorithms are used to analyze the false alarm and detection rates of three operational scenarios of personnel walking inside wood and cinderblock buildings.
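A minimal Python sketch of automatic cluster-count selection via a knee point on the within-cluster sum-of-squares curve. This is one common knee-point heuristic applied to synthetic detections, not necessarily the exact KP algorithm used by the authors.

```python
# Estimate the number of clusters from the knee of the WCSS-versus-k curve.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

points, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # stand-in for MT detections

ks = np.arange(1, 11)
wcss = np.array([KMeans(n_clusters=k, n_init=10, random_state=0)
                 .fit(points).inertia_ for k in ks])

# Knee = point of maximum gap between the curve and the chord joining its endpoints
chord = np.linspace(wcss[0], wcss[-1], len(ks))
knee_k = ks[np.argmax(chord - wcss)]    # wcss decreases convexly, so the chord lies above it
print(f"estimated number of clusters: {knee_k}")
```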
Charge-controlled switchable CO adsorption on FeN4 cluster embedded in graphene
NASA Astrophysics Data System (ADS)
Omidvar, Akbar
2018-02-01
Electrical charging of an FeN4 cluster embedded in graphene (FeN4G) is proposed as an approach for electrocatalytically switchable carbon monoxide (CO) adsorption. Using density functional theory (DFT), we found that the CO molecule is strongly adsorbed on the uncharged FeN4G cluster. Our results show that the adsorption energy of a CO molecule on the FeN4G cluster is dramatically decreased by introducing extra electrons into the cluster. Once the charges are removed, the CO molecule is spontaneously adsorbed on the FeN4G adsorbent. In the framework of frontier molecular orbital (FMO) analysis, the enhanced sensitivity and reactivity of the FeN4G cluster towards the CO molecule can be interpreted in terms of the interaction between the HOMO of the CO molecule and the LUMO of the FeN4G cluster. Therefore, this approach promises both facile reversibility and tunable kinetics without the need for specific catalysts. Our study indicates that the FeN4G nanomaterial is an excellent adsorbent for the controllable and reversible capture and release of CO.
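For readers unfamiliar with how such adsorption energies are obtained, the snippet below shows the standard arithmetic from DFT total energies. The numbers are placeholders, not values from this study.

```python
# Illustrative only: adsorption energy from three DFT total energies.
E_complex = -1236.10   # total energy of FeN4G with adsorbed CO (eV), hypothetical
E_cluster = -1220.10   # total energy of the bare FeN4G cluster (eV), hypothetical
E_co = -14.80          # total energy of an isolated CO molecule (eV), hypothetical

E_ads = E_complex - (E_cluster + E_co)   # negative value => adsorption is favourable
print(f"E_ads = {E_ads:.2f} eV")
```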
Learning in First-Year Biology: Approaches of Distance and On-Campus Students
NASA Astrophysics Data System (ADS)
Quinn, Frances Catherine
2011-01-01
This paper aims to extend previous research into learning of tertiary biology, by exploring the learning approaches adopted by two groups of students studying the same first-year biology topic in either on-campus or off-campus "distance" modes. The research involved 302 participants, who responded to a topic-specific version of the Study Process Questionnaire, and in-depth interviews with 16 of these students. Several quantitative analytic techniques, including cluster analysis and Rasch differential item functioning analysis, showed that the younger, on-campus cohort made less use of deep approaches, and more use of surface approaches than the older, off-campus group. At a finer scale, clusters of students within these categories demonstrated different patterns of learning approach. Students' descriptions of their learning approaches at interview provided richer complementary descriptions of the approach they took to their study in the topic, showing how deep and surface approaches were manifested in the study context. These findings are critically analysed in terms of recent literature questioning the applicability of learning approaches theory in mass education, and their implications for teaching and research in undergraduate biology.
Fong, Allan; Clark, Lindsey; Cheng, Tianyi; Franklin, Ella; Fernandez, Nicole; Ratwani, Raj; Parker, Sarah Henrickson
2017-07-01
The objective of this paper is to identify attribute patterns of influential individuals in intensive care units using unsupervised cluster analysis. Despite the acknowledgement that culture of an organisation is critical to improving patient safety, specific methods to shift culture have not been explicitly identified. A social network analysis survey was conducted and an unsupervised cluster analysis was used. A total of 100 surveys were gathered. Unsupervised cluster analysis was used to group individuals with similar dimensions highlighting three general genres of influencers: well-rounded, knowledge and relational. Culture is created locally by individual influencers. Cluster analysis is an effective way to identify common characteristics among members of an intensive care unit team that are noted as highly influential by their peers. To change culture, identifying and then integrating the influencers in intervention development and dissemination may create more sustainable and effective culture change. Additional studies are ongoing to test the effectiveness of utilising these influencers to disseminate patient safety interventions. This study offers an approach that can be helpful in both identifying and understanding influential team members and may be an important aspect of developing methods to change organisational culture. © 2017 John Wiley & Sons Ltd.
NASA Technical Reports Server (NTRS)
Sutherland, Betsy M.; Georgakilas, Alexandros G.; Bennett, Paula V.; Laval, Jacques; Sutherland, John C.; Gewirtz, A. M. (Principal Investigator)
2003-01-01
Assessing DNA damage induction, repair and consequences of such damages requires measurement of specific DNA lesions by methods that are independent of biological responses to such lesions. Lesions affecting one DNA strand (altered bases, abasic sites, single strand breaks (SSB)) as well as damages affecting both strands (clustered damages, double strand breaks) can be quantified by direct measurement of DNA using gel electrophoresis, gel imaging and number average length analysis. Damage frequencies as low as a few sites per gigabase pair (10^9 bp) can be quantified by this approach in about 50 ng of non-radioactive DNA, and single molecule methods may allow such measurements in DNA from single cells. This review presents the theoretical basis, biochemical requirements and practical aspects of this approach, and shows examples of their applications in identification and quantitation of complex clustered damages.
Iatropoulos, Paraskevas; Daina, Erica; Curreri, Manuela; Piras, Rossella; Valoti, Elisabetta; Mele, Caterina; Bresin, Elena; Gamba, Sara; Alberti, Marta; Breno, Matteo; Perna, Annalisa; Bettoni, Serena; Sabadini, Ettore; Murer, Luisa; Vivarelli, Marina; Noris, Marina; Remuzzi, Giuseppe
2018-01-01
Membranoproliferative GN (MPGN) was recently reclassified as alternative pathway complement-mediated C3 glomerulopathy (C3G) and immune complex-mediated membranoproliferative GN (IC-MPGN). However, genetic and acquired alternative pathway abnormalities are also observed in IC-MPGN. Here, we explored the presence of distinct disease entities characterized by specific pathophysiologic mechanisms. We performed unsupervised hierarchical clustering, a data-driven statistical approach, on histologic, genetic, and clinical data and data regarding serum/plasma complement parameters from 173 patients with C3G/IC-MPGN. This approach divided patients into four clusters, indicating the existence of four different pathogenetic patterns. Specifically, this analysis separated patients with fluid-phase complement activation (clusters 1-3) who had low serum C3 levels and a high prevalence of genetic and acquired alternative pathway abnormalities from patients with solid-phase complement activation (cluster 4) who had normal or mildly altered serum C3, late disease onset, and poor renal survival. In patients with fluid-phase complement activation, those in clusters 1 and 2 had massive activation of the alternative pathway, including activation of the terminal pathway, and the highest prevalence of subendothelial deposits, but those in cluster 2 had additional activation of the classic pathway and the highest prevalence of nephrotic syndrome at disease onset. Patients in cluster 3 had prevalent activation of C3 convertase and highly electron-dense intramembranous deposits. In addition, we provide a simple algorithm to assign patients with C3G/IC-MPGN to specific clusters. These distinct clusters may facilitate clarification of disease etiology, improve risk assessment for ESRD, and pave the way for personalized treatment. Copyright © 2018 by the American Society of Nephrology.
Modularization of biochemical networks based on classification of Petri net t-invariants.
Grafahrend-Belau, Eva; Schreiber, Falk; Heiner, Monika; Sackmann, Andrea; Junker, Björn H; Grunwald, Stefanie; Speer, Astrid; Winder, Katja; Koch, Ina
2008-02-08
Structural analysis of biochemical networks is a growing field in bioinformatics and systems biology. The availability of an increasing amount of biological data from molecular biological networks promises a deeper understanding but confronts researchers with the problem of combinatorial explosion. The amount of qualitative network data is growing much faster than the amount of quantitative data, such as enzyme kinetics. In many cases it is even impossible to measure quantitative data because of limitations of experimental methods, or for ethical reasons. Thus, a huge amount of qualitative data, such as interaction data, is available, but it was not sufficiently used for modeling purposes, until now. New approaches have been developed, but the complexity of data often limits the application of many of the methods. Biochemical Petri nets make it possible to explore static and dynamic qualitative system properties. One Petri net approach is model validation based on the computation of the system's invariant properties, focusing on t-invariants. T-invariants correspond to subnetworks, which describe the basic system behavior. With increasing system complexity, the basic behavior can only be expressed by a huge number of t-invariants. According to our validation criteria for biochemical Petri nets, the necessary verification of the biological meaning, by interpreting each subnetwork (t-invariant) manually, is not possible anymore. Thus, an automated, biologically meaningful classification would be helpful in analyzing t-invariants, and supporting the understanding of the basic behavior of the considered biological system. Here, we introduce a new approach to automatically classify t-invariants to cope with network complexity. We apply clustering techniques such as UPGMA, Complete Linkage, Single Linkage, and Neighbor Joining in combination with different distance measures to get biologically meaningful clusters (t-clusters), which can be interpreted as modules. To find the optimal number of t-clusters to consider for interpretation, the cluster validity measure, Silhouette Width, is applied. We considered two different case studies as examples: a small signal transduction pathway (pheromone response pathway in Saccharomyces cerevisiae) and a medium-sized gene regulatory network (gene regulation of Duchenne muscular dystrophy). We automatically classified the t-invariants into functionally distinct t-clusters, which could be interpreted biologically as functional modules in the network. We found differences in the suitability of the various distance measures as well as the clustering methods. In terms of a biologically meaningful classification of t-invariants, the best results are obtained using the Tanimoto distance measure. Considering clustering methods, the obtained results suggest that UPGMA and Complete Linkage are suitable for clustering t-invariants with respect to the biological interpretability. We propose a new approach for the biological classification of Petri net t-invariants based on cluster analysis. Due to the biologically meaningful data reduction and structuring of network processes, large sets of t-invariants can be evaluated, allowing for model validation of qualitative biochemical Petri nets. This approach can also be applied to elementary mode analysis.
Modularization of biochemical networks based on classification of Petri net t-invariants
Grafahrend-Belau, Eva; Schreiber, Falk; Heiner, Monika; Sackmann, Andrea; Junker, Björn H; Grunwald, Stefanie; Speer, Astrid; Winder, Katja; Koch, Ina
2008-01-01
Background Structural analysis of biochemical networks is a growing field in bioinformatics and systems biology. The availability of an increasing amount of biological data from molecular biological networks promises a deeper understanding but confronts researchers with the problem of combinatorial explosion. The amount of qualitative network data is growing much faster than the amount of quantitative data, such as enzyme kinetics. In many cases it is even impossible to measure quantitative data because of limitations of experimental methods, or for ethical reasons. Thus, a huge amount of qualitative data, such as interaction data, is available, but it was not sufficiently used for modeling purposes, until now. New approaches have been developed, but the complexity of data often limits the application of many of the methods. Biochemical Petri nets make it possible to explore static and dynamic qualitative system properties. One Petri net approach is model validation based on the computation of the system's invariant properties, focusing on t-invariants. T-invariants correspond to subnetworks, which describe the basic system behavior. With increasing system complexity, the basic behavior can only be expressed by a huge number of t-invariants. According to our validation criteria for biochemical Petri nets, the necessary verification of the biological meaning, by interpreting each subnetwork (t-invariant) manually, is not possible anymore. Thus, an automated, biologically meaningful classification would be helpful in analyzing t-invariants, and supporting the understanding of the basic behavior of the considered biological system. Methods Here, we introduce a new approach to automatically classify t-invariants to cope with network complexity. We apply clustering techniques such as UPGMA, Complete Linkage, Single Linkage, and Neighbor Joining in combination with different distance measures to get biologically meaningful clusters (t-clusters), which can be interpreted as modules. To find the optimal number of t-clusters to consider for interpretation, the cluster validity measure, Silhouette Width, is applied. Results We considered two different case studies as examples: a small signal transduction pathway (pheromone response pathway in Saccharomyces cerevisiae) and a medium-sized gene regulatory network (gene regulation of Duchenne muscular dystrophy). We automatically classified the t-invariants into functionally distinct t-clusters, which could be interpreted biologically as functional modules in the network. We found differences in the suitability of the various distance measures as well as the clustering methods. In terms of a biologically meaningful classification of t-invariants, the best results are obtained using the Tanimoto distance measure. Considering clustering methods, the obtained results suggest that UPGMA and Complete Linkage are suitable for clustering t-invariants with respect to the biological interpretability. Conclusion We propose a new approach for the biological classification of Petri net t-invariants based on cluster analysis. Due to the biologically meaningful data reduction and structuring of network processes, large sets of t-invariants can be evaluated, allowing for model validation of qualitative biochemical Petri nets. This approach can also be applied to elementary mode analysis. PMID:18257938
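A hedged Python sketch of the classification step described in the two records above: binary t-invariant support vectors are clustered with the Tanimoto (Jaccard) distance and UPGMA (average linkage), and the number of t-clusters is chosen by silhouette width. The data are synthetic and the details are illustrative.

```python
# Cluster binary t-invariant vectors and pick the cluster count by silhouette.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
t_invariants = rng.integers(0, 2, size=(40, 25))            # 40 invariants over 25 transitions

dist = pdist(t_invariants.astype(bool), metric="jaccard")   # Tanimoto distance for binary data
tree = linkage(dist, method="average")                      # UPGMA

best_k, best_score = None, -1.0
for k in range(2, 10):
    labels = fcluster(tree, t=k, criterion="maxclust")
    score = silhouette_score(squareform(dist), labels, metric="precomputed")
    if score > best_score:
        best_k, best_score = k, score
print(f"optimal number of t-clusters: {best_k} (silhouette {best_score:.2f})")
```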
Zhang, Jiang; Liu, Qi; Chen, Huafu; Yuan, Zhen; Huang, Jin; Deng, Lihua; Lu, Fengmei; Zhang, Junpeng; Wang, Yuqing; Wang, Mingwen; Chen, Liangyin
2015-01-01
Clustering analysis methods have been widely applied to identifying the functional brain networks of a multitask paradigm. However, the previously used clustering analysis techniques are computationally expensive and thus impractical for clinical applications. In this study a novel method called SOM-SAPC, which combines self-organizing mapping (SOM) and supervised affinity propagation clustering (SAPC), is proposed and implemented to identify the motor execution (ME) and motor imagery (MI) networks. In SOM-SAPC, SOM is first performed to process the fMRI data and SAPC is then utilized for clustering the patterns of functional networks. As a result, SOM-SAPC is able to significantly reduce the computational cost of brain network analysis. Simulation and clinical tests involving ME and MI were conducted based on SOM-SAPC, and the analysis results indicated that functional brain networks were clearly identified with different response patterns and reduced computational cost. In particular, three activation clusters were clearly revealed, which include parts of the visual, ME and MI functional networks. These findings validated that SOM-SAPC is an effective and robust method for analyzing fMRI data with multiple tasks.
An ensemble framework for clustering protein-protein interaction networks.
Asur, Sitaram; Ucar, Duygu; Parthasarathy, Srinivasan
2007-07-01
Protein-Protein Interaction (PPI) networks are believed to be important sources of information related to biological processes and complex metabolic functions of the cell. The presence of biologically relevant functional modules in these networks has been theorized by many researchers. However, the application of traditional clustering algorithms for extracting these modules has not been successful, largely due to the presence of noisy false positive interactions as well as specific topological challenges in the network. In this article, we propose an ensemble clustering framework to address this problem. For base clustering, we introduce two topology-based distance metrics to counteract the effects of noise. We develop a PCA-based consensus clustering technique, designed to reduce the dimensionality of the consensus problem and yield informative clusters. We also develop a soft consensus clustering variant to assign multifaceted proteins to multiple functional groups. We conduct an empirical evaluation of different consensus techniques using topology-based, information theoretic and domain-specific validation metrics and show that our approaches can provide significant benefits over other state-of-the-art approaches. Our analysis of the consensus clusters obtained demonstrates that ensemble clustering can (a) produce improved biologically significant functional groupings; and (b) facilitate soft clustering by discovering multiple functional associations for proteins. Supplementary data are available at Bioinformatics online.
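The following rough Python sketch shows one generic consensus-clustering idea related to the approach above: a co-association matrix is built from several base clusterings, reduced with PCA, and then clustered. It is an illustration under stated assumptions, not the authors' exact PCA-based algorithm or their topology-based distance metrics.

```python
# Generic consensus clustering via a co-association matrix plus PCA.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)   # stand-in for protein features

n = X.shape[0]
coassoc = np.zeros((n, n))
for seed in range(10):                                        # 10 base clusterings
    labels = KMeans(n_clusters=3, n_init=5, random_state=seed).fit_predict(X)
    coassoc += (labels[:, None] == labels[None, :]).astype(float)
coassoc /= 10.0                                               # fraction of runs co-clustered

embedding = PCA(n_components=5).fit_transform(coassoc)        # reduce the consensus problem
consensus = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
print(np.bincount(consensus))
```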
Influence of birth cohort on age of onset cluster analysis in bipolar I disorder.
Bauer, M; Glenn, T; Alda, M; Andreassen, O A; Angelopoulos, E; Ardau, R; Baethge, C; Bauer, R; Bellivier, F; Belmaker, R H; Berk, M; Bjella, T D; Bossini, L; Bersudsky, Y; Cheung, E Y W; Conell, J; Del Zompo, M; Dodd, S; Etain, B; Fagiolini, A; Frye, M A; Fountoulakis, K N; Garneau-Fournier, J; Gonzalez-Pinto, A; Harima, H; Hassel, S; Henry, C; Iacovides, A; Isometsä, E T; Kapczinski, F; Kliwicki, S; König, B; Krogh, R; Kunz, M; Lafer, B; Larsen, E R; Lewitzka, U; Lopez-Jaramillo, C; MacQueen, G; Manchia, M; Marsh, W; Martinez-Cengotitabengoa, M; Melle, I; Monteith, S; Morken, G; Munoz, R; Nery, F G; O'Donovan, C; Osher, Y; Pfennig, A; Quiroz, D; Ramesar, R; Rasgon, N; Reif, A; Ritter, P; Rybakowski, J K; Sagduyu, K; Scippa, A M; Severus, E; Simhandl, C; Stein, D J; Strejilevich, S; Hatim Sulaiman, A; Suominen, K; Tagata, H; Tatebayashi, Y; Torrent, C; Vieta, E; Viswanath, B; Wanchoo, M J; Zetin, M; Whybrow, P C
2015-01-01
Two common approaches to identify subgroups of patients with bipolar disorder are clustering methodology (mixture analysis) based on the age of onset, and a birth cohort analysis. This study investigates if a birth cohort effect will influence the results of clustering on the age of onset, using a large, international database. The database includes 4037 patients with a diagnosis of bipolar I disorder, previously collected at 36 collection sites in 23 countries. Generalized estimating equations (GEE) were used to adjust the data for country median age, and in some models, birth cohort. Model-based clustering (mixture analysis) was then performed on the age of onset data using the residuals. Clinical variables in subgroups were compared. There was a strong birth cohort effect. Without adjusting for the birth cohort, three subgroups were found by clustering. After adjusting for the birth cohort or when considering only those born after 1959, two subgroups were found. With results of either two or three subgroups, the youngest subgroup was more likely to have a family history of mood disorders and a first episode with depressed polarity. However, without adjusting for birth cohort (three subgroups), family history and polarity of the first episode could not be distinguished between the middle and oldest subgroups. These results using international data confirm prior findings using single country data, that there are subgroups of bipolar I disorder based on the age of onset, and that there is a birth cohort effect. Including the birth cohort adjustment altered the number and characteristics of subgroups detected when clustering by age of onset. Further investigation is needed to determine if combining both approaches will identify subgroups that are more useful for research. Copyright © 2014 Elsevier Masson SAS. All rights reserved.
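An illustrative Python sketch of the model-based clustering (mixture analysis) step: fit Gaussian mixtures to age-of-onset residuals (after cohort and country effects have been regressed out) and choose the number of subgroups by BIC. The data are synthetic; the study used GEE-adjusted residuals from 4037 patients, not this procedure verbatim.

```python
# Mixture modelling of age-of-onset residuals with BIC-based model selection.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
residuals = np.concatenate([rng.normal(-5, 3, 1500),      # earlier-onset subgroup
                            rng.normal(6, 5, 1200)]).reshape(-1, 1)

models = {k: GaussianMixture(n_components=k, random_state=0).fit(residuals)
          for k in range(1, 5)}
bics = {k: m.bic(residuals) for k, m in models.items()}
best_k = min(bics, key=bics.get)                           # lowest BIC wins
print(f"number of subgroups suggested by BIC: {best_k}")
```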
A comparison of regional flood frequency analysis approaches in a simulation framework
NASA Astrophysics Data System (ADS)
Ganora, D.; Laio, F.
2016-07-01
Regional frequency analysis (RFA) is a well-established methodology to provide an estimate of the flood frequency curve at ungauged (or scarcely gauged) sites. Different RFA approaches exist, depending on the way the information is transferred to the site of interest, but it is not clear in the literature if a specific method systematically outperforms the others. The aim of this study is to provide a framework in which to carry out the intercomparison, by building a virtual environment based on synthetically generated data. The considered regional approaches include: (i) a unique regional curve for the whole region; (ii) a multiple-region model where homogeneous subregions are determined through cluster analysis; (iii) a Region-of-Influence model which defines a homogeneous subregion for each site; (iv) a spatially smooth estimation procedure where the parameters of the regional model vary continuously in space. Virtual environments are generated considering different patterns of heterogeneity, including step change and smooth variations. If the region is heterogeneous, with the parent distribution changing continuously within the region, the spatially smooth regional approach outperforms the others, with overall errors 10-50% lower than the other methods. In the case of a step-change, the spatially smooth and clustering procedures perform similarly if the heterogeneity is moderate, while clustering procedures work better when the step-change is severe. To extend our findings, an extensive sensitivity analysis has been performed to investigate the effect of sample length, number of virtual stations, return period of the predicted quantile, variability of the scale parameter of the parent distribution, number of predictor variables and the choice of parent distribution. Overall, the spatially smooth approach appears to be the most robust approach, as its performance is more stable across different patterns of heterogeneity, especially when short records are considered.
2016-09-30
far from uniform. The final nonuniform distribution of fibers consists of clustered regions and resin pockets. The clustered fiber regions promote...period. Approach and Results A novel procedure has been devised to create nonuniform fiber distributions from the initial fiber bundle (with...used in simulations to produce nonuniform configurations.
Understanding Teacher Users of a Digital Library Service: A Clustering Approach
ERIC Educational Resources Information Center
Xu, Beijie
2011-01-01
This research examined teachers' online behaviors while using a digital library service--the Instructional Architect (IA)--through three consecutive studies. In the first two studies, a statistical model called latent class analysis (LCA) was applied to cluster different groups of IA teachers according to their diverse online behaviors. The third…
Clusters and Correlates of Experiences with Parents and Peers in Early Adolescence
ERIC Educational Resources Information Center
Kan, Marni L.; McHale, Susan M.
2007-01-01
This study used a person-oriented approach to examine links between adolescents' experiences with parents and peers. Cluster analysis classified 361, White, working- and middle-class youth (mean age = 12.16 years) based on mothers' and fathers' reports of parental acceptance and adolescents' reports of perceived peer competence. Three patterns…
Behavioral Profiles in 4-5 Year-Old Children: Normal and Pathological Variants
ERIC Educational Resources Information Center
Larsson, Jan-Olov; Bergman, Lars R.; Earls, Felton; Rydelius, Per-Anders
2004-01-01
Normal and psychopathological patterns of behavior symptoms in preschool children were described by a classification approach using cluster analysis. The behavior of 406 children, average age 4 years 9 months, from the general population was evaluated at home visits. Seven clusters were identified based on empirically defined dimensions:…
Conformational and functional analysis of molecular dynamics trajectories by Self-Organising Maps
2011-01-01
Background Molecular dynamics (MD) simulations are powerful tools to investigate the conformational dynamics of proteins, which is often a critical element of their function. Identification of functionally relevant conformations is generally done by clustering the large ensemble of structures that are generated. Recently, Self-Organising Maps (SOMs) were reported to perform more accurately and provide more consistent results than traditional clustering algorithms in various data mining problems. We present a novel strategy to analyse and compare conformational ensembles of protein domains using a two-level approach that combines SOMs and hierarchical clustering. Results The conformational dynamics of the α-spectrin SH3 protein domain and six single mutants were analysed by MD simulations. The Cartesian coordinates of the Cα atoms of conformations sampled in the essential space were used as input data vectors for SOM training, then complete linkage clustering was performed on the SOM prototype vectors. A specific protocol to optimize a SOM for structural ensembles was proposed: the optimal SOM was selected by means of a Taguchi experimental design plan applied to different data sets, and the optimal sampling rate of the MD trajectory was selected. The proposed two-level approach was applied to single trajectories of the SH3 domain independently as well as to groups of them at the same time. The results demonstrated the potential of this approach in the analysis of large ensembles of molecular structures: the possibility of producing a topological mapping of the conformational space in a simple 2D visualisation, as well as of effectively highlighting differences in the conformational dynamics directly related to biological functions. Conclusions The use of a two-level approach combining SOMs and hierarchical clustering for conformational analysis of structural ensembles of proteins was proposed. It can easily be extended to other study cases and to conformational ensembles from other sources. PMID:21569575
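A sketch of the two-level idea under stated assumptions: a SOM is trained on conformation coordinate vectors (here random stand-ins for Cα coordinates in the essential space), and complete-linkage clustering is then applied to the SOM prototype vectors. It relies on the third-party MiniSom package and hypothetical dimensions; it is not the authors' protocol.

```python
# Two-level analysis: SOM on MD frames, then hierarchical clustering of prototypes.
import numpy as np
from minisom import MiniSom
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
frames = rng.normal(size=(2000, 30))          # 2000 MD frames, 30-dimensional vectors

som = MiniSom(8, 8, frames.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(frames, 5000)

prototypes = som.get_weights().reshape(-1, frames.shape[1])   # 64 prototype vectors
tree = linkage(prototypes, method="complete")                 # second level: hierarchical
proto_labels = fcluster(tree, t=4, criterion="maxclust")

# Map each frame to the cluster of its best-matching prototype
frame_labels = np.array([proto_labels[np.ravel_multi_index(som.winner(f), (8, 8))]
                         for f in frames])
print(np.bincount(frame_labels)[1:])
```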
Grošelj, Petra; Zadnik Stirn, Lidija
2015-09-15
Environmental management problems can be dealt with by combining participatory methods, which make it possible to include various stakeholders in a decision-making process, and multi-criteria methods, which offer a formal model for structuring and solving a problem. This paper proposes a three-phase decision making approach based on the analytic network process and SWOT (strengths, weaknesses, opportunities and threats) analysis. The approach enables inclusion of various stakeholders or groups of stakeholders in particular stages of decision making. The structure of the proposed approach is composed of a network consisting of an objective cluster, a cluster of strategic goals, a cluster of SWOT factors and a cluster of alternatives. The suggested approach is applied to a management problem of Pohorje, a mountainous area in Slovenia. Stakeholders from sectors that are important for Pohorje (forestry, agriculture, tourism and nature protection agencies) who can offer a wide range of expert knowledge were included in the decision-making process. The results identify the alternative of "sustainable development" as the most appropriate for development of Pohorje. The application in the paper offers an example of employing the new approach to an environmental management problem. This can also be applied to decision-making problems in various other fields. Copyright © 2015 Elsevier Ltd. All rights reserved.
Lin, Shih-Yen; Liu, Chih-Wei
2014-01-01
This study combines cluster analysis and LRFM (length, recency, frequency, and monetary) model in a pediatric dental clinic in Taiwan to analyze patients' values. A two-stage approach by self-organizing maps and K-means method is applied to segment 1,462 patients into twelve clusters. The average values of L, R, and F excluding monetary covered by national health insurance program are computed for each cluster. In addition, customer value matrix is used to analyze customer values of twelve clusters in terms of frequency and monetary. Customer relationship matrix considering length and recency is also applied to classify different types of customers from these twelve clusters. The results show that three clusters can be classified into loyal patients with L, R, and F values greater than the respective average L, R, and F values, while three clusters can be viewed as lost patients without any variable above the average values of L, R, and F. When different types of patients are identified, marketing strategies can be designed to meet different patients' needs. PMID:25045741
Wu, Hsin-Hung; Lin, Shih-Yen; Liu, Chih-Wei
2014-01-01
This study combines cluster analysis and LRFM (length, recency, frequency, and monetary) model in a pediatric dental clinic in Taiwan to analyze patients' values. A two-stage approach by self-organizing maps and K-means method is applied to segment 1,462 patients into twelve clusters. The average values of L, R, and F excluding monetary covered by national health insurance program are computed for each cluster. In addition, customer value matrix is used to analyze customer values of twelve clusters in terms of frequency and monetary. Customer relationship matrix considering length and recency is also applied to classify different types of customers from these twelve clusters. The results show that three clusters can be classified into loyal patients with L, R, and F values greater than the respective average L, R, and F values, while three clusters can be viewed as lost patients without any variable above the average values of L, R, and F. When different types of patients are identified, marketing strategies can be designed to meet different patients' needs.
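A minimal sketch of the segmentation idea in the two records above, assuming a per-patient table of L, R, F, M values (synthetic here). The study used a SOM + K-means two-stage procedure, whereas this illustration applies K-means directly to standardised LRFM scores and then labels clusters by comparing their means to the overall averages.

```python
# LRFM-based patient segmentation and cluster labelling.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
lrfm = np.column_stack([
    rng.exponential(400, 1462),    # L: days between first and last visit
    rng.exponential(120, 1462),    # R: recency measure
    rng.poisson(6, 1462),          # F: number of visits
    rng.gamma(2.0, 50.0, 1462),    # M: out-of-pocket spending
]).astype(float)

labels = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(lrfm))

overall = lrfm.mean(axis=0)
for k in range(12):
    mean_k = lrfm[labels == k].mean(axis=0)
    above = mean_k > overall                       # compare to average L, R, F, M
    is_loyal = above[0] and above[1] and above[2]  # L, R, F all above average
    is_lost = not (above[0] or above[1] or above[2])
    tag = "loyal" if is_loyal else ("lost" if is_lost else "other")
    print(k, np.round(mean_k, 1), tag)
```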
Entwistle, Noel; McCune, Velda
2013-06-01
A re-analysis of several university-level interview studies has suggested that some students show evidence of a deep and stable approach to learning, along with other characteristics that support the approach. This combination, it was argued, could be seen to indicate a disposition to understand for oneself. The aims were to identify a group of students who showed high and consistent scores on the deep approach, combined with equivalently high scores on effort and monitoring of studying, and to explore these students' experiences of the teaching-learning environments they had encountered. Data were re-analysed from 1,896 students on 25 undergraduate courses covering four contrasting subject areas in eleven British universities. Inventories measuring approaches to studying were given at the beginning and the end of a semester, with the second inventory also exploring students' experiences of teaching. K-means cluster analysis was used to identify groups of students with differing patterns of response on the inventory scales, with a particular focus on students showing high, stable scores. One cluster clearly showed the characteristics expected of the disposition to understand and was also fairly stable over time. Other clusters also showed deep approaches, but combined with either surface elements or lower scores on organized effort or monitoring of studying. Combining these findings with the interview studies previously reported reinforces the idea of there being a disposition to understand for oneself that could be identified from an inventory scale or through further interviews. © 2013 The British Psychological Society.
WordCluster: detecting clusters of DNA words and genomic elements
2011-01-01
Background Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds. Results We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method into a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and the outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome. Conclusions WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method into a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php, including additional features like the detection of co-localization with gene regions and an annotation enrichment tool for functional analysis of overlapped genes. PMID:21261981
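The toy sketch below illustrates the general distance-based idea described above, not the published WordCluster algorithm: consecutive occurrences of a k-mer separated by less than the median gap expected under a random (geometric) null are chained into candidate clusters. The example sequence, the chaining rule and the function name are assumptions made purely for illustration.

```python
# Toy chaining heuristic (not the published significance test).
import re
import numpy as np
from scipy.stats import geom

def word_clusters(seq, word):
    starts = [m.start() for m in re.finditer(f"(?={re.escape(word)})", seq)]
    if len(starts) < 2:
        return []
    p = len(starts) / len(seq)         # per-position occurrence rate
    crit = geom.ppf(0.5, p)            # median gap under the random null
    clusters, current = [], [starts[0]]
    for pos, gap in zip(starts[1:], np.diff(starts)):
        if gap <= crit:
            current.append(pos)
        else:
            if len(current) > 1:
                clusters.append((current[0], current[-1] + len(word)))
            current = [pos]
    if len(current) > 1:
        clusters.append((current[0], current[-1] + len(word)))
    return clusters

seq = "ATCAGCAGCAGCAG" + "T" * 20 + "CAG" + "T" * 10 + "CAGCAGCAG" + "TTTT"
print(word_clusters(seq, "CAG"))       # start/end coordinates of dense CAG runs
```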
Analyzing coastal environments by means of functional data analysis
NASA Astrophysics Data System (ADS)
Sierra, Carlos; Flor-Blanco, Germán; Ordoñez, Celestino; Flor, Germán; Gallego, José R.
2017-07-01
Here we used Functional Data Analysis (FDA) to examine particle-size distributions (PSDs) in a beach/shallow marine sedimentary environment in Gijón Bay (NW Spain). The work involved both Functional Principal Components Analysis (FPCA) and Functional Cluster Analysis (FCA). The grain size of the sand samples was characterized by means of laser dispersion spectroscopy. Within this framework, FPCA was used as a dimension-reduction technique to explore and uncover patterns in grain-size frequency curves. This procedure proved useful to describe variability in the structure of the data set. Moreover, an alternative approach, FCA, was applied to identify clusters and to interpret their spatial distribution. Results obtained with this latter technique were compared with those obtained by means of two vector approaches that combine PCA with CA (Cluster Analysis). The first method, based on the point density function (PDF), was employed after fitting a log-normal distribution to each PSD and summarizing each of the density functions by its mean, sorting, skewness and kurtosis. The second applied a centered log-ratio (clr) transformation to the original data. PCA was then applied to the transformed data, and finally CA to the retained principal component scores. The study revealed functional data analysis, specifically FPCA and FCA, to be a suitable alternative with considerable advantages over traditional vector analysis techniques in sedimentary geology studies.
COVARIATE-ADAPTIVE CLUSTERING OF EXPOSURES FOR AIR POLLUTION EPIDEMIOLOGY COHORTS
Keller, Joshua P.; Drton, Mathias; Larson, Timothy; Kaufman, Joel D.; Sandler, Dale P.; Szpiro, Adam A.
2017-01-01
Cohort studies in air pollution epidemiology aim to establish associations between health outcomes and air pollution exposures. Statistical analysis of such associations is complicated by the multivariate nature of the pollutant exposure data as well as the spatial misalignment that arises from the fact that exposure data are collected at regulatory monitoring network locations distinct from cohort locations. We present a novel clustering approach for addressing this challenge. Specifically, we present a method that uses geographic covariate information to cluster multi-pollutant observations and predict cluster membership at cohort locations. Our predictive k-means procedure identifies centers using a mixture model and is followed by multi-class spatial prediction. In simulations, we demonstrate that predictive k-means can reduce misclassification error by over 50% compared to ordinary k-means, with minimal loss in cluster representativeness. The improved prediction accuracy results in large gains of 30% or more in power for detecting effect modification by cluster in a simulated health analysis. In an analysis of the NIEHS Sister Study cohort using predictive k-means, we find that the association between systolic blood pressure (SBP) and long-term fine particulate matter (PM2.5) exposure varies significantly between different clusters of PM2.5 component profiles. Our cluster-based analysis shows that for subjects assigned to a cluster located in the Midwestern U.S., a 10 μg/m3 difference in exposure is associated with 4.37 mmHg (95% CI, 2.38, 6.35) higher SBP. PMID:28572869
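A hedged approximation of the predictive-k-means idea on synthetic arrays is sketched below: K-means clusters the multi-pollutant monitor data, and a multi-class classifier trained on geographic covariates then predicts cluster membership at cohort locations. The published method identifies centers with a mixture model; the plain KMeans-plus-LDA pipeline, array sizes and variable names here are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_monitors, n_pollutants, n_covariates = 200, 5, 8

pollutants = rng.normal(size=(n_monitors, n_pollutants))       # monitor exposures
geo_covariates = rng.normal(size=(n_monitors, n_covariates))   # e.g. land use, traffic

# Step 1: cluster the multi-pollutant profiles at monitor locations.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pollutants)

# Step 2: multi-class prediction of cluster membership from geographic covariates.
clf = LinearDiscriminantAnalysis().fit(geo_covariates, labels)

# Step 3: assign cohort locations (with no pollutant measurements) to clusters.
cohort_covariates = rng.normal(size=(50, n_covariates))
cohort_clusters = clf.predict(cohort_covariates)
print(np.bincount(cohort_clusters, minlength=4))
```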
NASA Astrophysics Data System (ADS)
Jahangoshai Rezaee, Mustafa; Jozmaleki, Mehrdad; Valipour, Mahsa
2018-01-01
One of the main features used to decide whether to invest in stock exchange companies is their financial performance. On the other hand, conventional evaluation methods such as data envelopment analysis are retrospective and are therefore incomplete and ineffective for evaluating companies' future performance. To address this problem, an expert system is needed that can evaluate organizations as online data are received from the stock exchange market. This paper presents an approach for predicting the online financial performance of companies when data are received at different time intervals. The proposed approach is based on integrating fuzzy C-means (FCM), data envelopment analysis (DEA) and an artificial neural network (ANN). The classical FCM method is unable to update the number of clusters and their members when the data are changed or new data are received; hence, the method is extended here so that the number of clusters and the cluster memberships become dynamic. Then, DEA is used to evaluate DMUs by means of financial ratios to provide targets for the neural network. Finally, the designed network is trained and prepared for predicting companies' future performance. Data on Tehran Stock Market companies for six consecutive years (2007-2012) are used to demonstrate the abilities of the proposed approach.
3D Geomarketing Segmentation: A Higher Spatial Dimension Planning Perspective
NASA Astrophysics Data System (ADS)
Suhaibah, A.; Uznir, U.; Rahman, A. A.; Anton, F.; Mioc, D.
2016-09-01
Geomarketing is a discipline which uses geographic information in the planning and implementation of marketing activities. It can be used in any aspect of marketing, such as pricing, promotion or geo-targeting. The analysis of geomarketing data draws on a huge data pool, such as the location of residential areas and topography; it also analyzes demographic information such as age, gender, annual income and lifestyle. This information can help users to develop successful promotional campaigns in order to achieve marketing goals. One of the common activities in geomarketing is market segmentation, which groups the data into several clusters based on geographic criteria. To refine the search operation during analysis, we proposed an approach to cluster the data using a clustering algorithm. However, with the huge data pool, overlap among clusters may occur and lead to inefficient analysis. Moreover, geomarketing is usually active in urban areas and requires clusters to be organized in a three-dimensional (3D) way (i.e. multi-level shop lots, residential apartments), which is a constraint within the current Geographic Information System (GIS) framework. To avoid this issue, we proposed a combination of market segmentation based on geographic criteria and a clustering algorithm for 3D geomarketing data management. The proposed approach is capable of minimizing the overlap region during market segmentation. In this paper, geomarketing in an urban area is used as a case study, in which several locations of customers and stores in 3D are used in the test. The experiments demonstrated in this paper substantiate that the proposed approach is capable of minimizing overlapping segmentation and reducing repetitive data entries. The structure is also tested for retrieving spatial records from the database. For marketing purposes, a certain radius around a point is used to analyze marketing targets. Based on the presented tests, we strongly believe that the structure is capable of handling and managing a huge pool of geomarketing data. As a future outlook, this paper also discusses the possibilities of expanding the structure.
Who are the healthy active seniors? A cluster analysis.
Lai, Claudia K Y; Chan, Engle Angela; Chin, Kenny C W
2014-12-01
This paper reports a cluster analysis of a sample recruited from a randomized controlled trial that explored the effect of using a life story work approach to improve the psychological outcomes of older people in the community. 238 subjects from community centers were included in this analysis. After statistical testing, 169 seniors were assigned to the active ageing (AG) cluster and 69 to the inactive ageing (IG) cluster. Those in the AG were younger and healthier, with fewer chronic diseases and fewer depressive symptoms than those in the IG. They were more satisfied with their lives, and had higher self-esteem. They met with their family members more frequently, they engaged in more leisure activities and were more likely to have the ability to move freely. In summary, active ageing was observed in people with better health and functional performance. Our results echoed the limited findings reported in the literature.
NASA Astrophysics Data System (ADS)
Ward, W. O. C.; Wilkinson, P. B.; Chambers, J. E.; Oxby, L. S.; Bai, L.
2014-04-01
A novel method for the effective identification of bedrock subsurface elevation from electrical resistivity tomography images is described. Identifying subsurface boundaries in the tomographic data can be difficult due to the smoothness constraints used in inversion, so a statistical population-based approach is used that extends previous work on calculating isoresistivity surfaces. The analysis framework involves a procedure for guiding a clustering approach based on the fuzzy c-means algorithm. An approximation of the resistivity distribution, found using kernel density estimation, is utilized to guide the cluster centroids used to classify the data. A fuzzy method was chosen over hard clustering because of the uncertainty in hard edges in the tomographic data, and a measure of clustering uncertainty was derived from the reciprocal of cluster membership. The algorithm was validated by direct comparison with known bedrock depths at two 3-D survey sites, using real-time GPS measurements of bedrock exposed by quarrying at one site and borehole logs at the other. Results show detection that is similarly accurate to a leading isosurface estimation method, while the proposed algorithm requires significantly less user input and prior site knowledge. Furthermore, the method is effectively dimension-independent and will scale to data of increased spatial dimensions without a significant effect on the runtime. A discussion of the results obtained by automated versus supervised analysis is also presented.
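The sketch below is an illustrative reading of the KDE-guided clustering idea, not the authors' full workflow: kernel density estimation of (log) resistivity values supplies modes that serve as cluster centres, fuzzy memberships are computed with the standard fuzzy c-means formula, and uncertainty is taken as the reciprocal of the maximum membership. The synthetic resistivity values and parameter choices are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

rng = np.random.default_rng(1)
log_rho = np.concatenate([rng.normal(1.5, 0.15, 400),   # e.g. superficial deposits
                          rng.normal(2.8, 0.20, 300)])  # e.g. bedrock

grid = np.linspace(log_rho.min(), log_rho.max(), 512)
density = gaussian_kde(log_rho)(grid)
peaks, _ = find_peaks(density)
centres = grid[peaks]                                    # KDE modes as cluster centres

m = 2.0                                                  # fuzziness exponent
d = np.abs(log_rho[:, None] - centres[None, :]) + 1e-12
u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
uncertainty = 1.0 / u.max(axis=1)                        # ~1 near a centre, larger in between
print(centres, u.shape, uncertainty.mean())
```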
Methods for sample size determination in cluster randomized trials
Rutterford, Clare; Copas, Andrew; Eldridge, Sandra
2015-01-01
Background: The use of cluster randomized trials (CRTs) is increasing, along with the variety in their design and analysis. The simplest approach for their sample size calculation is to calculate the sample size assuming individual randomization and inflate this by a design effect to account for randomization by cluster. The assumptions of a simple design effect may not always be met; alternative or more complicated approaches are required. Methods: We summarise a wide range of sample size methods available for cluster randomized trials. For those familiar with sample size calculations for individually randomized trials but with less experience in the clustered case, this manuscript provides formulae for a wide range of scenarios with associated explanation and recommendations. For those with more experience, comprehensive summaries are provided that allow quick identification of methods for a given design, outcome and analysis method. Results: We present first those methods applicable to the simplest two-arm, parallel group, completely randomized design followed by methods that incorporate deviations from this design such as: variability in cluster sizes; attrition; non-compliance; or the inclusion of baseline covariates or repeated measures. The paper concludes with methods for alternative designs. Conclusions: There is a large amount of methodology available for sample size calculations in CRTs. This paper gives the most comprehensive description of published methodology for sample size calculation and provides an important resource for those designing these trials. PMID:26174515
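For the simplest case mentioned above (a two-arm, parallel, completely randomized design with equal cluster sizes), the inflation is the design effect DEFF = 1 + (m - 1) x ICC, where m is the cluster size and ICC the intracluster correlation coefficient. The short helper below applies it; the example numbers are illustrative.

```python
import math

def crt_sample_size(n_individual, cluster_size, icc):
    """Inflate an individually randomized per-arm sample size by the design effect."""
    deff = 1 + (cluster_size - 1) * icc
    n_total = math.ceil(n_individual * deff)        # participants per arm
    n_clusters = math.ceil(n_total / cluster_size)  # clusters per arm
    return deff, n_total, n_clusters

# e.g. 128 participants per arm under individual randomization,
# clusters of 20, ICC = 0.05 -> DEFF = 1.95, 250 participants, 13 clusters per arm
print(crt_sample_size(128, 20, 0.05))
```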
NASA Astrophysics Data System (ADS)
Wright, D. J.; Raad, M.; Hoel, E.; Park, M.; Mollenkopf, A.; Trujillo, R.
2016-12-01
Introduced is a new approach for processing spatiotemporal big data by leveraging distributed analytics and storage. A suite of temporally-aware analysis tools summarizes data nearby or within variable windows, aggregates points (e.g., for various sensor observations or vessel positions), reconstructs time-enabled points into tracks (e.g., for mapping and visualizing storm tracks), joins features (e.g., to find associations between features based on attributes, spatial relationships, temporal relationships or all three simultaneously), calculates point densities, finds hot spots (e.g., in species distributions), and creates space-time slices and cubes (e.g., in microweather applications with temperature, humidity, and pressure, or within human mobility studies). These "feature geo analytics" tools run in both batch and streaming spatial analysis mode as distributed computations across a cluster of servers on typical "big" data sets, where static data exist in traditional geospatial formats (e.g., shapefile) locally on a disk or file share, attached as static spatiotemporal big data stores, or streamed in near-real-time. In other words, the approach registers large datasets or data stores with ArcGIS Server, then distributes analysis across a cluster of machines for parallel processing. Several brief use cases will be highlighted based on a 16-node server cluster at 14 Gb RAM per node, allowing, for example, the buffering of over 8 million points or thousands of polygons in 1 minute. The approach is "hybrid" in that ArcGIS Server integrates open-source big data frameworks such as Apache Hadoop and Apache Spark on the cluster in order to run the analytics. In addition, the user may devise and connect custom open-source interfaces and tools developed in Python or Python Notebooks; the common denominator being the familiar REST API.
Advanced analysis of forest fire clustering
NASA Astrophysics Data System (ADS)
Kanevski, Mikhail; Pereira, Mario; Golay, Jean
2017-04-01
Analysis of point pattern clustering is an important topic in spatial statistics and for many applications: biodiversity, epidemiology, natural hazards, geomarketing, etc. There are several fundamental approaches used to quantify spatial data clustering using topological, statistical and fractal measures. In the present research, the recently introduced multi-point Morisita index (mMI) is applied to study the spatial clustering of forest fires in Portugal. The data set consists of more than 30000 fire events covering the time period from 1975 to 2013. The distribution of forest fires is very complex and highly variable in space. mMI is a multi-point extension of the classical two-point Morisita index. In essence, mMI is estimated by covering the region under study by a grid and by computing how many times more likely it is that m points selected at random will be from the same grid cell than it would be in the case of a complete random Poisson process. By changing the number of grid cells (size of the grid cells), mMI characterizes the scaling properties of spatial clustering. From mMI, the data intrinsic dimension (fractal dimension) of the point distribution can be estimated as well. In this study, the mMI of forest fires is compared with the mMI of random patterns (RPs) generated within the validity domain defined as the forest area of Portugal. It turns out that the forest fires are highly clustered inside the validity domain in comparison with the RPs. Moreover, they demonstrate different scaling properties at different spatial scales. The results obtained from the mMI analysis are also compared with those of fractal measures of clustering - box counting and sand box counting approaches. REFERENCES Golay J., Kanevski M., Vega Orozco C., Leuenberger M., 2014: The multipoint Morisita index for the analysis of spatial patterns. Physica A, 406, 191-202. Golay J., Kanevski M. 2015: A new estimator of intrinsic dimension based on the multipoint Morisita index. Pattern Recognition, 48, 4070-4081.
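A minimal sketch of the multi-point Morisita index for a 2-D point pattern is given below, following the usual generalization I_m = Q^(m-1) Σ_i n_i(n_i-1)...(n_i-m+1) / [N(N-1)...(N-m+1)] over Q grid cells; the exact normalization should be checked against the cited references, and the synthetic point patterns are purely illustrative.

```python
import numpy as np

def multipoint_morisita(points, n_cells_per_axis, m=2):
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=n_cells_per_axis, range=[[0, 1], [0, 1]])
    n = counts.ravel()
    N, Q = n.sum(), n.size
    num = np.sum(np.prod([n - k for k in range(m)], axis=0))   # falling factorial per cell
    den = np.prod([N - k for k in range(m)])
    return Q ** (m - 1) * num / den

rng = np.random.default_rng(2)
background = rng.uniform(0, 1, size=(500, 2))
hotspot = np.clip(rng.normal([0.3, 0.7], 0.03, size=(500, 2)), 0, 1)
clustered = np.vstack([background, hotspot])          # clustered pattern
uniform = rng.uniform(0, 1, size=(1000, 2))           # Poisson-like reference

for cells in (4, 8, 16, 32):                          # changing cell size probes the scaling
    print(cells, round(multipoint_morisita(clustered, cells, m=3), 2),
          round(multipoint_morisita(uniform, cells, m=3), 2))
```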
Making the most of missing values: object clustering with partial data in astronomy
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri L.; Laidler, Victoria G.
2004-01-01
We demonstrate a clustering analysis algorithm, KSC, that a) uses all observed values and b) does not discard the partially observed objects. KSC uses soft constraints defined by the fully observed objects to assist in the grouping of objects with missing values. We present an analysis of objects taken from the Sloan Digital Sky Survey to demonstrate how imputing the values can be misleading and why the KSC approach can produce more appropriate results.
Visualizing statistical significance of disease clusters using cartograms.
Kronenfeld, Barry J; Wong, David W S
2017-05-15
Health officials and epidemiological researchers often use maps of disease rates to identify potential disease clusters. Because these maps exaggerate the prominence of low-density districts and hide potential clusters in urban (high-density) areas, many researchers have used density-equalizing maps (cartograms) as a basis for epidemiological mapping. However, no guidelines exist for the visual assessment of statistical uncertainty. To address this shortcoming, we develop techniques for visual determination of the statistical significance of clusters spanning one or more districts on a cartogram. We developed the techniques within a geovisual analytics framework that does not rely on automated significance testing, and can therefore facilitate visual analysis to detect clusters that automated techniques might miss. On a cartogram of the at-risk population, the statistical significance of a disease cluster can be determined from the rate, area and shape of the cluster under standard hypothesis testing scenarios. We develop formulae to determine, for a given rate, the area required for statistical significance of a priori and a posteriori designated regions under certain test assumptions. Uniquely, our approach enables dynamic inference of aggregate regions formed by combining individual districts. The method is implemented in interactive tools that provide choropleth mapping, automated legend construction and dynamic search to facilitate cluster detection and assessment of the validity of tested assumptions. A case study of leukemia incidence analysis in California demonstrates the ability to visually distinguish between statistically significant and insignificant regions. The proposed geovisual analytics approach enables intuitive visual assessment of the statistical significance of arbitrarily defined regions on a cartogram. Our research prompts a broader discussion of the role of geovisual exploratory analyses in disease mapping and the appropriate framework for visually assessing the statistical significance of spatial clusters.
Deckersbach, Thilo; Peters, Amy T.; Sylvia, Louisa G.; Gold, Alexandra K.; da Silva Magalhaes, Pedro Vieira; Henry, David B.; Frank, Ellen; Otto, Michael W.; Berk, Michael; Dougherty, Darin D.; Nierenberg, Andrew A.; Miklowitz, David J.
2016-01-01
Background We sought to address how predictors and moderators of psychotherapy for bipolar depression – identified individually in prior analyses – can inform the development of a metric for prospectively classifying treatment outcome in intensive psychotherapy (IP) versus collaborative care (CC) adjunctive to pharmacotherapy in the Systematic Treatment Enhancement Program (STEP-BD) study. Methods We conducted post-hoc analyses on 135 STEP-BD participants using cluster analysis to identify subsets of participants with similar clinical profiles and investigated this combined metric as a moderator and predictor of response to IP. We used agglomerative hierarchical cluster analyses and k-means clustering to determine the content of the clinical profiles. Logistic regression and Cox proportional hazard models were used to evaluate whether the resulting clusters predicted or moderated likelihood of recovery or time until recovery. Results The cluster analysis yielded a two-cluster solution: 1) “less-recurrent/severe” and 2) “chronic/recurrent.” Rates of recovery in IP were similar for less-recurrent/severe and chronic/recurrent participants. Less-recurrent/severe patients were more likely than chronic/recurrent patients to achieve recovery in CC (p = .040, OR = 4.56). IP yielded a faster recovery for chronic/recurrent participants, whereas CC led to recovery sooner in the less-recurrent/severe cluster (p = .034, OR = 2.62). Limitations Cluster analyses require list-wise deletion of cases with missing data so we were unable to conduct analyses on all STEP-BD participants. Conclusions A well-powered, parametric approach can distinguish patients based on illness history and provide clinicians with symptom profiles of patients that confer differential prognosis in CC vs. IP. PMID:27289316
Rudolf, Jeffrey D.; Yan, Xiaohui; Shen, Ben
2015-01-01
The enediynes are one of the most fascinating families of bacterial natural products given their unprecedented molecular architecture and extraordinary cytotoxicity. Enediynes are rare with only 11 structurally characterized members and four additional members isolated in their cycloaromatized form. Recent advances in DNA sequencing have resulted in an explosion of microbial genomes. A virtual survey of the GenBank and JGI genome databases revealed 87 enediyne biosynthetic gene clusters from 78 bacteria strains, implying enediynes are more common than previously thought. Here we report the construction and analysis of an enediyne genome neighborhood network (GNN) as a high-throughput approach to analyze secondary metabolite gene clusters. Analysis of the enediyne GNN facilitated rapid gene cluster annotation, revealed genetic trends in enediyne biosynthetic gene clusters resulting in a simple prediction scheme to determine 9- vs 10-membered enediyne gene clusters, and supported a genomic-based strain prioritization method for enediyne discovery. PMID:26318027
Review of Instructional Approaches in Ethics Education.
Mulhearn, Tyler J; Steele, Logan M; Watts, Logan L; Medeiros, Kelsey E; Mumford, Michael D; Connelly, Shane
2017-06-01
Increased investment in ethics education has prompted a variety of instructional objectives and frameworks. Yet, no systematic procedure to classify these varying instructional approaches has been attempted. In the present study, a quantitative clustering procedure was conducted to derive a typology of instruction in ethics education. In total, 330 ethics training programs were included in the cluster analysis. The training programs were appraised with respect to four instructional categories including instructional content, processes, delivery methods, and activities. Eight instructional approaches were identified through this clustering procedure, and these instructional approaches showed different levels of effectiveness. Instructional effectiveness was assessed based on one of nine commonly used ethics criteria. With respect to specific training types, Professional Decision Processes Training (d = 0.50) and Field-Specific Compliance Training (d = 0.46) appear to be viable approaches to ethics training based on Cohen's d effect size estimates. By contrast, two commonly used approaches, General Discussion Training (d = 0.31) and Norm Adherence Training (d = 0.37), were found to be considerably less effective. The implications for instruction in ethics training are discussed.
The cosmological analysis of X-ray cluster surveys. III. 4D X-ray observable diagrams
NASA Astrophysics Data System (ADS)
Pierre, M.; Valotti, A.; Faccioli, L.; Clerc, N.; Gastaud, R.; Koulouridis, E.; Pacaud, F.
2017-11-01
Context. Despite compelling theoretical arguments, the use of clusters as cosmological probes is, in practice, frequently questioned because of the many uncertainties surrounding cluster-mass estimates. Aims: Our aim is to develop a fully self-consistent cosmological approach to X-ray cluster surveys, exclusively based on observable quantities rather than masses. This procedure is justified by the possibility of directly deriving the cluster properties via ab initio modelling, either analytically or by using hydrodynamical simulations. In this third paper, we evaluate the method on cluster toy-catalogues. Methods: We model the population of detected clusters in the count-rate - hardness-ratio - angular size - redshift space and compare the corresponding four-dimensional diagram with theoretical predictions. The best cosmology+physics parameter configuration is determined using a simple minimisation procedure; errors on the parameters are estimated by averaging the results from ten independent survey realisations. The method allows a simultaneous fit of the cosmological parameters, of the cluster evolutionary physics and of the selection effects. Results: When using information from the X-ray survey alone plus redshifts, this approach is shown to be as accurate as the modelling of the mass function for the cosmological parameters and to perform better for the cluster physics, for a similar level of assumptions on the scaling relations. It enables the identification of degenerate combinations of parameter values. Conclusions: Given the considerably shorter computer times involved in running the minimisation procedure in the observed parameter space, this method appears to clearly outperform traditional mass-based approaches when X-ray survey data alone are available.
Clustering Genes of Common Evolutionary History
Gori, Kevin; Suchan, Tomasz; Alvarez, Nadir; Goldman, Nick; Dessimoz, Christophe
2016-01-01
Phylogenetic inference can potentially result in a more accurate tree using data from multiple loci. However, if the loci are incongruent—due to events such as incomplete lineage sorting or horizontal gene transfer—it can be misleading to infer a single tree. To address this, many previous contributions have taken a mechanistic approach, by modeling specific processes. Alternatively, one can cluster loci without assuming how these incongruencies might arise. Such “process-agnostic” approaches typically infer a tree for each locus and cluster these. There are, however, many possible combinations of tree distance and clustering methods; their comparative performance in the context of tree incongruence is largely unknown. Furthermore, because standard model selection criteria such as AIC cannot be applied to problems with a variable number of topologies, the issue of inferring the optimal number of clusters is poorly understood. Here, we perform a large-scale simulation study of phylogenetic distances and clustering methods to infer loci of common evolutionary history. We observe that the best-performing combinations are distances accounting for branch lengths followed by spectral clustering or Ward’s method. We also introduce two statistical tests to infer the optimal number of clusters and show that they strongly outperform the silhouette criterion, a general-purpose heuristic. We illustrate the usefulness of the approach by 1) identifying errors in a previous phylogenetic analysis of yeast species and 2) identifying topological incongruence among newly sequenced loci of the globeflower fly genus Chiastocheta. We release treeCl, a new program to cluster genes of common evolutionary history (http://git.io/treeCl). PMID:26893301
Bolin, Jocelyn H.; Edwards, Julianne M.; Finch, W. Holmes; Cassady, Jerrell C.
2014-01-01
Although traditional clustering methods (e.g., K-means) have been shown to be useful in the social sciences it is often difficult for such methods to handle situations where clusters in the population overlap or are ambiguous. Fuzzy clustering, a method already recognized in many disciplines, provides a more flexible alternative to these traditional clustering methods. Fuzzy clustering differs from other traditional clustering methods in that it allows for a case to belong to multiple clusters simultaneously. Unfortunately, fuzzy clustering techniques remain relatively unused in the social and behavioral sciences. The purpose of this paper is to introduce fuzzy clustering to these audiences who are currently relatively unfamiliar with the technique. In order to demonstrate the advantages associated with this method, cluster solutions of a common perfectionism measure were created using both fuzzy clustering and K-means clustering, and the results compared. Results of these analyses reveal that different cluster solutions are found by the two methods, and the similarity between the different clustering solutions depends on the amount of cluster overlap allowed for in fuzzy clustering. PMID:24795683
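To make the contrast with hard K-means concrete, here is a minimal fuzzy c-means written with NumPy only, in which every case receives a membership in every cluster rather than a single label. It is a generic textbook implementation, not the software used in the study, and the parameter values and toy data are illustrative.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.dirichlet(np.ones(n_clusters), size=len(X))       # random initial memberships
    for _ in range(n_iter):
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]           # membership-weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
    return centers, u

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(4, 1, (50, 2))])
centers, u = fuzzy_c_means(X, n_clusters=2)
print(centers)
print(u[:3].round(2))   # soft memberships; a K-means label would be u.argmax(axis=1)
```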
Variability in body size and shape of UK offshore workers: A cluster analysis approach.
Stewart, Arthur; Ledingham, Robert; Williams, Hector
2017-01-01
Male UK offshore workers have enlarged dimensions compared with UK norms, and knowledge of the specific sizes and shapes typifying their physiques will assist a range of functions related to health and ergonomics. A representative sample of the UK offshore workforce (n = 588) underwent 3D photonic scanning, from which 19 extracted dimensional measures were used in k-means cluster analysis to characterise physique groups. Of the 11 resulting clusters, four somatotype groups were expressed: one cluster was muscular and lean, four had greater muscularity than adiposity, three had equal adiposity and muscularity and three had greater adiposity than muscularity. Some clusters appeared constitutionally similar to others, differing only in absolute size. These cluster centroids represent an evidence base for future designs in apparel and other applications where body size and proportions affect functional performance. They also constitute phenotypic evidence providing insight into the 'offshore culture' which may underpin the enlarged dimensions of offshore workers. Copyright © 2016 Elsevier Ltd. All rights reserved.
Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm
Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong
2016-01-01
In single-particle cryo-electron microscopy (cryo-EM), the K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, the traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development of clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alternative to the traditional K-means algorithm in single-particle cryo-EM analysis. PMID:27959895
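The sketch below illustrates only the general idea of adding a size-related term to the K-means assignment step; the specific penalty lam * n_j is an illustrative choice and does not reproduce the authors' objective function, and the random 10-dimensional features stand in for image data.

```python
import numpy as np

def balanced_kmeans(X, k, lam=0.05, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)                                  # ordinary K-means start
    for _ in range(n_iter):
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        sizes = np.bincount(labels, minlength=k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2 + lam * sizes[None, :], axis=1)   # penalize crowded clusters
    return labels, centers

X = np.random.default_rng(3).normal(size=(300, 10))             # stand-in for image features
labels, _ = balanced_kmeans(X, k=5)
print(np.bincount(labels, minlength=5))                         # class sizes stay comparable
```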
Chapter 7. Cloning and analysis of natural product pathways.
Gust, Bertolt
2009-01-01
The identification of gene clusters of natural products has led to an enormous wealth of information about their biosynthesis and its regulation, and about self-resistance mechanisms. Well-established routine techniques are now available for the cloning and sequencing of gene clusters. The subsequent functional analysis of the complex biosynthetic machinery requires efficient genetic tools for manipulation. Until recently, techniques for the introduction of defined changes into Streptomyces chromosomes were very time-consuming. In particular, manipulation of large DNA fragments has been challenging due to the absence of suitable restriction sites for restriction- and ligation-based techniques. The homologous recombination approach called recombineering (referred to as Red/ET-mediated recombination in this chapter) has greatly facilitated targeted genetic modifications of complex biosynthetic pathways from actinomycetes by eliminating many of the time-consuming and labor-intensive steps. This chapter describes techniques for the cloning and identification of biosynthetic gene clusters, for the generation of gene replacements within such clusters, for the construction of integrative library clones and their expression in heterologous hosts, and for the assembly of entire biosynthetic gene clusters from the inserts of individual library clones. A systematic approach toward insertional mutation of a complete Streptomyces genome is shown by the use of an in vitro transposon mutagenesis procedure.
Hummel, Michelle; Wood, Nathan J.; Schweikert, Amy; Stacey, Mark T.; Jones, Jeanne; Barnard, Patrick L.; Erikson, Li H.
2018-01-01
Sea level is projected to rise over the coming decades, further increasing the extent of flooding hazards in coastal communities. Efforts to address potential impacts from climate-driven coastal hazards have called for collaboration among communities to strengthen the application of best practices. However, communities currently lack practical tools for identifying potential partner communities based on similar hazard exposure characteristics. This study uses statistical cluster analysis to identify similarities in community exposure to flooding hazards for a suite of sea level rise and storm scenarios. We demonstrate this approach using 63 jurisdictions in the San Francisco Bay region of California (USA) and compare 21 distinct exposure variables related to residents, employees, and structures for six hazard scenario combinations of sea level rise and storms. Results indicate that cluster analysis can provide an effective mechanism for identifying community groupings. Cluster compositions changed based on the selected societal variables and sea level rise scenarios, suggesting that a community could participate in multiple networks to target specific issues or policy interventions. The proposed clustering approach can serve as a data-driven foundation to help communities identify other communities with similar adaptation challenges and to enhance regional efforts that aim to facilitate adaptation planning and investment prioritization.
Efficient generation of low-energy folded states of a model protein
NASA Astrophysics Data System (ADS)
Gordon, Heather L.; Kwan, Wai Kei; Gong, Chunhang; Larrass, Stefan; Rothstein, Stuart M.
2003-01-01
A number of short simulated annealing runs are performed on a highly-frustrated 46-"residue" off-lattice model protein. We perform, in an iterative fashion, a principal component analysis of the 946 nonbonded interbead distances, followed by two varieties of cluster analyses: hierarchical and k-means clustering. We identify several distinct sets of conformations with reasonably consistent cluster membership. Nonbonded distance constraints are derived for each cluster and are employed within a distance geometry approach to generate many new conformations, previously unidentified by the simulated annealing experiments. Subsequent analyses suggest that these new conformations are members of the parent clusters from which they were generated. Furthermore, several novel, previously unobserved structures with low energy were uncovered, augmenting the ensemble of simulated annealing results, and providing a complete distribution of low-energy states. The computational cost of this approach to generating low-energy conformations is small when compared to the expense of further Monte Carlo simulated annealing runs.
Griss, Johannes; Perez-Riverol, Yasset; Lewis, Steve; Tabb, David L.; Dianes, José A.; del-Toro, Noemi; Rurik, Marc; Walzer, Mathias W.; Kohlbacher, Oliver; Hermjakob, Henning; Wang, Rui; Vizcaíno, Juan Antonio
2016-01-01
Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average 75% of spectra analysed in an MS experiment remain unidentified. We propose to use spectrum clustering at a large scale to shed light on these unidentified spectra. The PRoteomics IDEntifications database (PRIDE) Archive is one of the largest MS proteomics public data repositories worldwide. By clustering all tandem MS spectra publicly available in PRIDE Archive, coming from hundreds of datasets, we were able to consistently characterize three distinct groups of spectra: 1) incorrectly identified spectra, 2) spectra correctly identified but below the set scoring threshold, and 3) truly unidentified spectra. Using a multitude of complementary analysis approaches, we were able to identify less than 20% of the consistently unidentified spectra. The complete spectrum clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster). This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra. PMID:27493588
Py, Béatrice; Barras, Frédéric
2015-06-01
Since their discovery in the 1950s, Fe-S cluster proteins have attracted much attention from chemists, biophysicists and biochemists. However, in the 1980s they were joined by geneticists, who helped to show that in vivo maturation of Fe-S cluster-bound proteins requires the assistance of a large number of factors defining complex multi-step pathways. The question of how clusters are formed and distributed in vivo has since been the focus of much effort. Here we review how genetics, by discovering genes and investigating processes as they unfold in vivo, has provoked seminal advances in our understanding of Fe-S cluster biogenesis. The power and limitations of genetic approaches are discussed. As a final comment, we argue that the marriage of classic strategies and new high-throughput technologies should allow the genetics of Fe-S cluster biology to be even more insightful in the future. This article is part of a Special Issue entitled: Fe/S proteins: Analysis, structure, function, biogenesis and diseases. Copyright © 2015 Elsevier B.V. All rights reserved.
Applications of Some Artificial Intelligence Methods to Satellite Soundings
NASA Technical Reports Server (NTRS)
Munteanu, M. J.; Jakubowicz, O.
1985-01-01
Hard clustering of temperature profiles and regression temperature retrievals were used to refine the method, using the probabilities of membership of each pattern vector in each of the clusters derived with discriminant analysis. In hard clustering, the maximum probability is taken, the corresponding cluster is considered the correct cluster, and the rest of the probabilities are discarded. In fuzzy partitioned clustering, these probabilities are kept and the final regression retrieval is a membership-weighted combination of the retrievals from several clusters. This method was used in the clustering of brightness temperatures, where the purpose was to predict tropopause height. A further refinement is the division of temperature profiles into three major regions for classification purposes. The results are summarized in tables in which total r.m.s. errors are displayed. An approach based on fuzzy logic, which is intimately related to artificial intelligence methods, is recommended.
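A small sketch of the fuzzy-partitioned retrieval idea is given below: per-cluster regressions are combined using soft membership weights instead of keeping only the most probable cluster. The inverse-distance weights, the synthetic "brightness temperature" features and the linear regressions are assumptions made for illustration; the original work derived membership probabilities from discriminant analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 6))                      # stand-in for channel brightness temperatures
y = X @ rng.normal(size=6) + np.sin(3 * X[:, 0])   # stand-in for tropopause height

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
models = [LinearRegression().fit(X[km.labels_ == c], y[km.labels_ == c])
          for c in range(3)]

# Soft memberships from inverse distances to the cluster centres (an assumption).
d = km.transform(X) + 1e-12
w = (1.0 / d) / (1.0 / d).sum(axis=1, keepdims=True)

pred = np.column_stack([m_.predict(X) for m_ in models])     # per-cluster retrievals
y_hard = pred[np.arange(len(X)), km.labels_]                 # hard-clustering retrieval
y_soft = (w * pred).sum(axis=1)                              # membership-weighted retrieval
print(np.abs(y - y_hard).mean(), np.abs(y - y_soft).mean())
```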
Clustering and Network Analysis of Reverse Phase Protein Array Data.
Byron, Adam
2017-01-01
Molecular profiling of proteins and phosphoproteins using a reverse phase protein array (RPPA) platform, with a panel of target-specific antibodies, enables the parallel, quantitative proteomic analysis of many biological samples in a microarray format. Hence, RPPA analysis can generate a high volume of multidimensional data that must be effectively interrogated and interpreted. A range of computational techniques for data mining can be applied to detect and explore data structure and to form functional predictions from large datasets. Here, two approaches for the computational analysis of RPPA data are detailed: the identification of similar patterns of protein expression by hierarchical cluster analysis and the modeling of protein interactions and signaling relationships by network analysis. The protocols use freely available, cross-platform software, are easy to implement, and do not require any programming expertise. Serving as data-driven starting points for further in-depth analysis, validation, and biological experimentation, these and related bioinformatic approaches can accelerate the functional interpretation of RPPA data.
2013-01-01
Background The slaughterhouse is a central processing point for food animals and thus a source of both demographic data (age, breed, sex) and health-related data (reason for condemnation and condemned portions) that are not available through other sources. Using these data for syndromic surveillance is therefore tempting. However many possible reasons for condemnation and condemned portions exist, making the definition of relevant syndromes challenging. The objective of this study was to determine a typology of cattle with at least one portion of the carcass condemned in order to define syndromes. Multiple factor analysis (MFA) in combination with clustering methods was performed using both health-related data and demographic data. Results Analyses were performed on 381,186 cattle with at least one portion of the carcass condemned among the 1,937,917 cattle slaughtered in ten French abattoirs. Results of the MFA and clustering methods led to 12 clusters considered as stable according to year of slaughter and slaughterhouse. One cluster was specific to a disease of public health importance (cysticercosis). Two clusters were linked to the slaughtering process (fecal contamination of heart or lungs and deterioration lesions). Two clusters respectively characterized by chronic liver lesions and chronic peritonitis could be linked to diseases of economic importance to farmers. Three clusters could be linked respectively to reticulo-pericarditis, fatty liver syndrome and farmer’s lung syndrome, which are related to both diseases of economic importance to farmers and herd management issues. Three clusters respectively characterized by arthritis, myopathy and Dark Firm Dry (DFD) meat could notably be linked to animal welfare issues. Finally, one cluster, characterized by bronchopneumonia, could be linked to both animal health and herd management issues. Conclusion The statistical approach of combining multiple factor analysis with cluster analysis showed its relevance for the detection of syndromes using available large and complex slaughterhouse data. The advantages of this statistical approach are to i) define groups of reasons for condemnation based on meat inspection data, ii) help grouping reasons for condemnation among a list of various possible reasons for condemnation for which a consensus among experts could be difficult to reach, iii) assign each animal to a single syndrome which allows the detection of changes in trends of syndromes to detect unusual patterns in known diseases and emergence of new diseases. PMID:23628140
Novel approach to classifying patients with pulmonary arterial hypertension using cluster analysis.
Parikh, Kishan S; Rao, Youlan; Ahmad, Tariq; Shen, Kai; Felker, G Michael; Rajagopal, Sudarshan
2017-01-01
Pulmonary arterial hypertension (PAH) patients have distinct disease courses and responses to treatment, but current diagnostic and treatment schemes provide limited insight. We aimed to see if cluster analysis could distinguish clinical phenotypes in PAH. An unbiased cluster analysis was performed on 17 baseline clinical variables of PAH patients from the FREEDOM-M, FREEDOM-C, and FREEDOM-C2 randomized trials of oral treprostinil versus placebo. Participants were either treatment-naïve (FREEDOM-M) or on background therapy (FREEDOM-C, FREEDOM-C2). We tested for association of clusters with outcomes and interaction with respect to treatment. The primary outcome was change in 6-minute walking distance (6MWD). We included 966 participants with 12-week (FREEDOM-M) or 16-week (FREEDOM-C and FREEDOM-C2) follow-up. Four patient clusters were identified. Compared with Clusters 1 (n = 131) and 2 (n = 496), patients in Clusters 3 (n = 246) and 4 (n = 93) were older, heavier, had worse baseline functional class, 6MWD, and Borg Dyspnea Index, and had fewer years since PAH diagnosis. Clusters also differed by PAH etiology and background therapies, but not gender or race. The mean treatment effect of oral treprostinil increased in a monotonic fashion across Clusters 1-4 (Cluster 1: 10.9 m; Cluster 2: 13.0 m; Cluster 3: 25.0 m; Cluster 4: 50.9 m; interaction P value = 0.048). We identified four distinct clusters of PAH patients based on common patient characteristics. Patients who were older, diagnosed with PAH for a shorter period, and had worse baseline symptoms and exercise capacity had the greatest response to oral treprostinil treatment.
Carvalho, Carolina Abreu de; Fonsêca, Poliana Cristina de Almeida; Nobre, Luciana Neri; Priore, Silvia Eloiza; Franceschini, Sylvia do Carmo Castro
2016-01-01
The objective of this study is to provide guidance for identifying dietary patterns using the a posteriori approach, and to analyze the methodological aspects of the studies conducted in Brazil that identified the dietary patterns of children. Articles were selected from the Latin American and Caribbean Literature on Health Sciences, Scientific Electronic Library Online and PubMed databases. The key words were: Dietary pattern; Food pattern; Principal Components Analysis; Factor analysis; Cluster analysis; Reduced rank regression. We included studies that identified dietary patterns of children using the a posteriori approach. Seven studies published between 2007 and 2014 were selected: six were cross-sectional and one was a cohort study. Five studies used a food frequency questionnaire for dietary assessment; one used a 24-hour dietary recall and another a food list. The exploratory method used in most publications was principal components factor analysis, followed by cluster analysis. The sample size of the studies ranged from 232 to 4,231, the values of the Kaiser-Meyer-Olkin test from 0.524 to 0.873, and Cronbach's alpha from 0.51 to 0.69. Few Brazilian studies have identified dietary patterns of children using the a posteriori approach, and principal components factor analysis was the technique most used.
2007-01-01
including tree-based methods such as the unweighted pair group method of analysis (UPGMA) and Neighbour-joining (NJ) (Saitou & Nei, 1987). By ... based Bayesian approach and the tree-based UPGMA and NJ clustering methods. The results obtained suggest that far more species occur in the An... unlikely that groups that differ by more than these levels are conspecific. Genetic distances were clustered using the UPGMA and NJ algorithms in MEGA
Sun Protection Belief Clusters: Analysis of Amazon Mechanical Turk Data.
Santiago-Rivas, Marimer; Schnur, Julie B; Jandorf, Lina
2016-12-01
This study aimed (i) to determine whether people could be differentiated on the basis of their sun protection belief profiles and individual characteristics and (ii) to explore the use of a crowdsourcing web service for the assessment of sun protection beliefs. A sample of 500 adults completed an online survey of sun protection belief items using Amazon Mechanical Turk. A two-phase cluster analysis (i.e., hierarchical followed by non-hierarchical K-means) was used to determine clusters of sun protection barriers and facilitators. Results yielded three distinct clusters of sun protection barriers and three distinct clusters of sun protection facilitators. Significant associations between gender, age, sun sensitivity, and cluster membership were identified. Results also showed an association between barrier and facilitator cluster membership. The results of this study provide a potential alternative approach to developing future sun protection promotion initiatives in the population. The findings add to our knowledge regarding individuals who support, oppose, or are ambivalent toward sun protection and inform intervention research by identifying distinct subtypes that may best benefit from (or have a higher need for) skin cancer prevention efforts.
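A generic two-phase workflow of the kind described above can be sketched as follows: a hierarchical (Ward) step suggests the partition and supplies starting centroids, and non-hierarchical K-means then refines it. The choice of Ward linkage, the seeding strategy and the synthetic data are assumptions; the study's survey items are not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.random.default_rng(5).normal(size=(500, 12))     # e.g. standardized belief items

# Phase 1: hierarchical clustering; cut the tree at a chosen number of clusters.
Z = linkage(X, method="ward")
k = 3
hier_labels = fcluster(Z, t=k, criterion="maxclust")

# Phase 2: K-means seeded with the hierarchical cluster centroids.
seeds = np.vstack([X[hier_labels == c].mean(axis=0) for c in range(1, k + 1)])
final = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
print(np.bincount(final.labels_))
```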
Penalized unsupervised learning with outliers
Witten, Daniela M.
2013-01-01
We consider the problem of performing unsupervised learning in the presence of outliers – that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-means clustering will often assign each outlier to its own cluster, or alternatively may yield distorted clusters in order to accommodate the outliers. In this paper, we take a new approach to extending existing unsupervised learning techniques to accommodate outliers. Our approach is an extension of a recent proposal for outlier detection in the regression setting. We allow each observation to take on an “error” term, and we penalize the errors using a group lasso penalty in order to encourage most of the observations’ errors to exactly equal zero. We show that this approach can be used in order to develop extensions of K-means clustering and principal components analysis that result in accurate outlier detection, as well as improved performance in the presence of outliers. These methods are illustrated in a simulation study and on two gene expression data sets, and connections with M-estimation are explored. PMID:23875057
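The sketch below is a simplified reading of the error-term idea: each observation carries its own error vector, group soft-thresholding keeps most errors exactly at zero, and observations with a nonzero error are flagged as outliers. It is not the authors' exact algorithm; the penalty level, the naive initialization and the synthetic data are illustrative assumptions.

```python
import numpy as np

def robust_kmeans(X, k, lam=3.0, n_iter=30):
    centers = X[:k].copy()                                   # naive initialization (assumption)
    E = np.zeros_like(X)                                     # per-observation error vectors
    for _ in range(n_iter):
        # Assign the "cleaned" observations X - E to the nearest centre.
        d2 = (((X - E)[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = (X - E)[labels == j].mean(axis=0)
        # Group soft-thresholding of the residuals updates the errors.
        r = X - centers[labels]
        norms = np.linalg.norm(r, axis=1, keepdims=True) + 1e-12
        E = r * np.maximum(0.0, 1.0 - lam / norms)
    return labels, np.linalg.norm(E, axis=1) > 0             # cluster labels, outlier flags

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (95, 2)),                    # main body of observations
               rng.normal([8, 8], 0.3, (5, 2))])             # a few gross outliers
labels, is_outlier = robust_kmeans(X, k=2)
print(is_outlier.sum(), np.where(is_outlier)[0])             # which observations were flagged
```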
Analysis of multiplex gene expression maps obtained by voxelation.
An, Li; Xie, Hongbo; Chin, Mark H; Obradovic, Zoran; Smith, Desmond J; Megalooikonomou, Vasileios
2009-04-29
Gene expression signatures in the mammalian brain hold the key to understanding neural development and neurological disease. Researchers have previously used voxelation in combination with microarrays for acquisition of genome-wide atlases of expression patterns in the mouse brain. On the other hand, some work has been performed on studying gene functions, without taking into account the location information of a gene's expression in a mouse brain. In this paper, we present an approach for identifying the relation between gene expression maps obtained by voxelation and gene functions. To analyze the dataset, we chose typical genes as queries and aimed at discovering similar gene groups. Gene similarity was determined by using the wavelet features extracted from the left and right hemispheres averaged gene expression maps, and by the Euclidean distance between each pair of feature vectors. We also performed a multiple clustering approach on the gene expression maps, combined with hierarchical clustering. Among each group of similar genes and clusters, the gene function similarity was measured by calculating the average gene function distances in the gene ontology structure. By applying our methodology to find similar genes to certain target genes we were able to improve our understanding of gene expression patterns and gene functions. By applying the clustering analysis method, we obtained significant clusters, which have both very similar gene expression maps and very similar gene functions respectively to their corresponding gene ontologies. The cellular component ontology resulted in prominent clusters expressed in cortex and corpus callosum. The molecular function ontology gave prominent clusters in cortex, corpus callosum and hypothalamus. The biological process ontology resulted in clusters in cortex, hypothalamus and choroid plexus. Clusters from all three ontologies combined were most prominently expressed in cortex and corpus callosum. The experimental results confirm the hypothesis that genes with similar gene expression maps might have similar gene functions. The voxelation data takes into account the location information of gene expression level in mouse brain, which is novel in related research. The proposed approach can potentially be used to predict gene functions and provide helpful suggestions to biologists.
NASA Astrophysics Data System (ADS)
Takuma, Takehisa; Masugi, Masao
2009-03-01
This paper presents an approach to the assessment of IP-network traffic in terms of the time variation of self-similarity. To get a comprehensive view when analyzing the degree of long-range dependence (LRD) of IP-network traffic, we use a hierarchical clustering scheme, which provides a way to classify high-dimensional data with a tree-like structure. Also, in the LRD-based analysis, we employ detrended fluctuation analysis (DFA), which is applicable to the analysis of long-range power-law correlations or LRD in non-stationary time-series signals. Based on sequential measurements of IP-network traffic at two locations, this paper derives corresponding values for the LRD-related parameter α that reflects the degree of LRD of measured data. In performing the hierarchical clustering scheme, we use three parameters: the α value, average throughput, and the proportion of network traffic that exceeds 80% of network bandwidth for each measured data set. We visually confirm that the traffic data can be classified in accordance with the network traffic properties, showing that the combined depiction of the LRD and other factors can give an effective assessment of network conditions at different times.
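Detrended fluctuation analysis itself is compact enough to sketch: integrate the demeaned series, detrend it within windows of increasing size, and read α off the slope of log fluctuation versus log window size. The window sizes below are illustrative, and the clustering stage is not reproduced here.

# Compact DFA sketch for the LRD-related exponent alpha.
import numpy as np

def dfa_alpha(x, scales=(16, 32, 64, 128, 256)):
    y = np.cumsum(x - np.mean(x))              # integrated (profile) series
    flucts = []
    for n in scales:
        n_seg = len(y) // n
        segs = y[: n_seg * n].reshape(n_seg, n)
        t = np.arange(n)
        rms = []
        for seg in segs:                       # linear detrend per window
            coef = np.polyfit(t, seg, 1)
            rms.append(np.sqrt(np.mean((seg - np.polyval(coef, t)) ** 2)))
        flucts.append(np.mean(rms))
    # alpha is the slope of log F(n) versus log n
    alpha, _ = np.polyfit(np.log(scales), np.log(flucts), 1)
    return alpha

rng = np.random.default_rng(0)
print(round(dfa_alpha(rng.normal(size=4096)), 2))  # ~0.5 for uncorrelated noise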
Butaciu, Sinziana; Senila, Marin; Sarbu, Costel; Ponta, Michaela; Tanaselia, Claudiu; Cadar, Oana; Roman, Marius; Radu, Emil; Sima, Mihaela; Frentiu, Tiberiu
2017-04-01
The study proposes a combined model based on diagrams (Gibbs, Piper, Stuyfzand Hydrogeochemical Classification System) and unsupervised statistical approaches (Cluster Analysis, Principal Component Analysis, Fuzzy Principal Component Analysis, Fuzzy Hierarchical Cross-Clustering) to describe natural enrichment of inorganic arsenic and co-occurring species in groundwater in the Banat Plain, southwestern Romania. Speciation of inorganic As (arsenite, arsenate) and measurements of ion concentrations (Na⁺, K⁺, Ca²⁺, Mg²⁺, HCO₃⁻, Cl⁻, F⁻, SO₄²⁻, PO₄³⁻, NO₃⁻), pH, redox potential, conductivity and total dissolved substances were performed. Classical diagrams provided the hydrochemical characterization, while statistical approaches were helpful to establish (i) the natural origin of the As and F⁻ species and the anthropogenic one of NO₃⁻, SO₄²⁻, PO₄³⁻ and K⁺ and (ii) the classification of groundwater based on the content of arsenic species. The HCO₃⁻ type of local groundwater and alkaline pH (8.31-8.49) were found to be responsible for the enrichment of arsenic species and occurrence of F⁻ but by different paths. The PO₄³⁻-AsO₄³⁻ ion exchange and water-rock interaction (silicates hydrolysis and desorption from clay) were associated with arsenate enrichment in the oxidizing aquifer. Fuzzy Hierarchical Cross-Clustering was the strongest tool for the rapid simultaneous classification of groundwaters as a function of arsenic content and hydrogeochemical characteristics. The approach indicated the Na⁺-F⁻-pH cluster as a marker for groundwater with naturally elevated As and highlighted which parameters need to be monitored. A chemical conceptual model illustrating the natural and anthropogenic paths and enrichment of As and co-occurring species in the local groundwater supported by mineralogical analysis of rocks was established. Copyright © 2016 Elsevier Ltd. All rights reserved.
Nursing home care quality: a cluster analysis.
Grøndahl, Vigdis Abrahamsen; Fagerli, Liv Berit
2017-02-13
Purpose The purpose of this paper is to explore potential differences in how nursing home residents rate care quality and to explore cluster characteristics. Design/methodology/approach A cross-sectional design was used, with one questionnaire including questions on quality from the patients' perspective and the Big Five personality traits, together with questions related to socio-demographic aspects and health condition. Residents (n=103) from four Norwegian nursing homes participated (74.1 per cent response rate). Hierarchical cluster analysis identified clusters with respect to care quality perceptions. χ2 tests and one-way between-groups ANOVA were performed to characterise the clusters (p<0.05). Findings Two clusters were identified; Cluster 1 residents (28.2 per cent) had the best care quality perceptions and Cluster 2 (67.0 per cent) had the worst perceptions. The clusters were statistically significant and characterised by personal-related conditions: gender, psychological well-being, preferences, admission, satisfaction with staying in the nursing home, emotional stability and agreeableness, and by external objective care conditions: healthcare personnel and registered nurses. Research limitations/implications Residents assessed as having no cognitive impairments were included, thus excluding the largest group. By choosing questionnaire design and structured interviews, the number able to participate may increase. Practical implications Findings may provide healthcare personnel and managers with increased knowledge on which to develop strategies to improve specific care quality perceptions. Originality/value Cluster analysis can be an effective tool for differentiating between nursing home residents' care quality perceptions.
Concept mapping and network analysis: an analytic approach to measure ties among constructs.
Goldman, Alyssa W; Kane, Mary
2014-12-01
Group concept mapping is a mixed-methods approach that helps a group visually represent its ideas on a topic of interest through a series of related maps. The maps and additional graphics are useful for planning, evaluation and theory development. Group concept maps are typically described, interpreted and utilized through points, clusters and distances, and the implications of these features in understanding how constructs relate to one another. This paper focuses on the application of network analysis to group concept mapping to quantify the strength and directionality of relationships among clusters. The authors outline the steps of this analysis, and illustrate its practical use through an organizational strategic planning example. Additional benefits of this analysis to evaluation projects are also discussed, supporting the overall utility of this supplemental technique to the standard concept mapping methodology. Copyright © 2014 Elsevier Ltd. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rusek, Marian; Orlowski, Arkadiusz
2005-04-01
The dynamics of small (≤55 atoms) argon clusters ionized by an intense femtosecond laser pulse is studied using a time-dependent Thomas-Fermi model. The resulting Bloch-like hydrodynamic equations are solved numerically using the smooth particle hydrodynamics method without the necessity of grid simulations. As follows from recent experiments, absorption of radiation and subsequent ionization of clusters observed in the short-wavelength laser frequency regime (98 nm) differs considerably from that in the optical spectral range (800 nm). Our theoretical approach provides a unified framework for treating these very different frequency regimes and allows for a deeper understanding of the underlying cluster explosion mechanisms. The results of our analysis following from extensive numerical simulations presented in this paper are compared both with experimental findings and with predictions of other theoretical models.
Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data
NASA Astrophysics Data System (ADS)
Palumbo, Francesco; D'Enza, Alfonso Iodice
The attention towards binary data coding increased consistently in the last decade due to several reasons. The analysis of binary data characterizes several fields of application, such as market basket analysis, DNA microarray data, image mining, text mining and web-clickstream mining. The paper illustrates two different approaches exploiting a profitable combination of clustering and dimensionality reduction for the identification of non-trivial association structures in binary data. An application in the Association Rules framework supports the theory with the empirical evidence.
Hybrid cloud and cluster computing paradigms for life science applications
2010-01-01
Background Clouds and MapReduce have shown themselves to be a broadly useful approach to scientific computing especially for parallel data intensive applications. However they have limited applicability to some areas such as data mining because MapReduce has poor performance on problems with an iterative structure present in the linear algebra that underlies much data analysis. Such problems can be run efficiently on clusters using MPI leading to a hybrid cloud and cluster environment. This motivates the design and implementation of an open source Iterative MapReduce system Twister. Results Comparisons of Amazon, Azure, and traditional Linux and Windows environments on common applications have shown encouraging performance and usability comparisons in several important non iterative cases. These are linked to MPI applications for final stages of the data analysis. Further we have released the open source Twister Iterative MapReduce and benchmarked it against basic MapReduce (Hadoop) and MPI in information retrieval and life sciences applications. Conclusions The hybrid cloud (MapReduce) and cluster (MPI) approach offers an attractive production environment while Twister promises a uniform programming environment for many Life Sciences applications. Methods We used commercial clouds Amazon and Azure and the NSF resource FutureGrid to perform detailed comparisons and evaluations of different approaches to data intensive computing. Several applications were developed in MPI, MapReduce and Twister in these different environments. PMID:21210982
Hybrid cloud and cluster computing paradigms for life science applications.
Qiu, Judy; Ekanayake, Jaliya; Gunarathne, Thilina; Choi, Jong Youl; Bae, Seung-Hee; Li, Hui; Zhang, Bingjing; Wu, Tak-Lon; Ruan, Yang; Ekanayake, Saliya; Hughes, Adam; Fox, Geoffrey
2010-12-21
Clouds and MapReduce have shown themselves to be a broadly useful approach to scientific computing especially for parallel data intensive applications. However they have limited applicability to some areas such as data mining because MapReduce has poor performance on problems with an iterative structure present in the linear algebra that underlies much data analysis. Such problems can be run efficiently on clusters using MPI leading to a hybrid cloud and cluster environment. This motivates the design and implementation of an open source Iterative MapReduce system Twister. Comparisons of Amazon, Azure, and traditional Linux and Windows environments on common applications have shown encouraging performance and usability comparisons in several important non iterative cases. These are linked to MPI applications for final stages of the data analysis. Further we have released the open source Twister Iterative MapReduce and benchmarked it against basic MapReduce (Hadoop) and MPI in information retrieval and life sciences applications. The hybrid cloud (MapReduce) and cluster (MPI) approach offers an attractive production environment while Twister promises a uniform programming environment for many Life Sciences applications. We used commercial clouds Amazon and Azure and the NSF resource FutureGrid to perform detailed comparisons and evaluations of different approaches to data intensive computing. Several applications were developed in MPI, MapReduce and Twister in these different environments.
NASA Astrophysics Data System (ADS)
Husein, A. M.; Harahap, M.; Aisyah, S.; Purba, W.; Muhazir, A.
2018-03-01
Medication planning aims to determine the types and amounts of medicine needed and to avoid stock-outs, based on patterns of disease. In practice, medicine planning still relies on personal ability and leadership experience; it takes a long time and considerable skill, definite disease data are difficult to obtain, good record keeping and reporting are required, and dependence on the budget means that planning often does not go well, leading to frequent shortages and surpluses of medicines. In this research, we propose the Adaptive Neuro Fuzzy Inference System (ANFIS) method to predict medication needs in 2016 and 2017 based on medical data from 2015 and 2016 from two hospital sources. The analysis framework uses two approaches: in the first, ANFIS is applied directly to the data source; in the second, ANFIS is applied after clustering with the K-Means algorithm. For both approaches, Root Mean Square Error (RMSE) values are calculated for training and testing. The testing results show that the proposed method achieves better prediction rates, based on quantitative and qualitative evaluation, than existing systems; however, applying the K-Means algorithm before ANFIS affects the timing of the training process, and the classification accuracy is significantly better without clustering.
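A minimal sketch of the cluster-then-predict pipeline is given below, with an ordinary linear regressor standing in for ANFIS (which is not implemented here); the synthetic data, the three clusters, and the train/test split are illustrative assumptions.

# Cluster the training data first, fit one predictor per cluster, report test RMSE.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))                      # e.g. monthly usage features
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + rng.normal(0, 0.3, 400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)
models = {c: LinearRegression().fit(X_tr[km.labels_ == c], y_tr[km.labels_ == c])
          for c in range(3)}

test_clusters = km.predict(X_te)
y_hat = np.array([models[c].predict(x.reshape(1, -1))[0]
                  for c, x in zip(test_clusters, X_te)])
print(f"test RMSE with pre-clustering: {np.sqrt(mean_squared_error(y_te, y_hat)):.3f}")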
The Potential of Multivariate Analysis in Assessing Students' Attitude to Curriculum Subjects
ERIC Educational Resources Information Center
Gaotlhobogwe, Michael; Laugharne, Janet; Durance, Isabelle
2011-01-01
Background: Understanding student attitudes to curriculum subjects is central to providing evidence-based options to policy makers in education. Purpose: We illustrate how quantitative approaches used in the social sciences and based on multivariate analysis (categorical Principal Components Analysis, Clustering Analysis and General Linear…
On Learning Cluster Coefficient of Private Networks
Wang, Yue; Wu, Xintao; Zhu, Jun; Xiang, Yang
2013-01-01
Enabling accurate analysis of social network data while preserving differential privacy has been challenging since graph features such as clustering coefficient or modularity often have high sensitivity, which is different from traditional aggregate functions (e.g., count and sum) on tabular data. In this paper, we treat a graph statistic as a function f and develop a divide and conquer approach to enforce differential privacy. The basic procedure of this approach is to first decompose the target computation f into several less complex unit computations f1, …, fm connected by basic mathematical operations (e.g., addition, subtraction, multiplication, division), then perturb the output of each fi with Laplace noise derived from its own sensitivity value and the distributed privacy threshold εi, and finally combine those perturbed fi as the perturbed output of computation f. We examine how various operations affect the accuracy of complex computations. When unit computations have large global sensitivity values, we enforce differential privacy by calibrating noise based on the smooth sensitivity, rather than the global sensitivity. By doing this, we achieve the strict differential privacy guarantee with smaller magnitude noise. We illustrate our approach using the clustering coefficient, which is a popular statistic used in social network analysis. Empirical evaluations on five real social networks and various synthetic graphs generated from three random graph models show the developed divide and conquer approach outperforms the direct approach. PMID:24429843
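The unit-level perturbation step can be illustrated in a few lines: each unit computation is released with Laplace noise scaled to its assumed sensitivity and its share of the privacy budget, and the noisy pieces are then combined. The example ratio, the sensitivities, and the budget split below are placeholders, not values from the paper.

# Divide-and-conquer Laplace mechanism sketch: perturb unit computations, then combine.
import numpy as np

rng = np.random.default_rng(0)

def laplace_release(value, sensitivity, epsilon):
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: release a ratio f = f1 / f2 (e.g. "closed triangles / connected triples"
# for a clustering-coefficient-like statistic on a small graph).
f1_true, f2_true = 120.0, 900.0
sens_f1, sens_f2 = 3.0, 6.0                 # assumed per-unit sensitivities
eps_total = 1.0
eps1, eps2 = eps_total / 2, eps_total / 2   # split the privacy budget

f1_noisy = laplace_release(f1_true, sens_f1, eps1)
f2_noisy = laplace_release(f2_true, sens_f2, eps2)
print("noisy ratio:", f1_noisy / max(f2_noisy, 1e-9))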
Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario
2014-01-01
Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of the data, the type of variables, and the purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are discussed using several medical examples. We present two ordinal-variable clustering examples, ordinal variables being more challenging to analyze, using the Wisconsin Breast Cancer Data (WBCD). Ordinal-to-interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance is achieved. Moreover, descriptive and inferential statistics, as well as the modeling approach, must be selected based on the scale of the variables. PMID:24672565
Mitchell-Foster, Kendra; Ayala, Efraín Beltrán; Breilh, Jaime; Spiegel, Jerry; Wilches, Ana Arichabala; Leon, Tania Ordóñez; Delgado, Jefferson Adrian
2015-02-01
This project investigates the effectiveness and feasibility of scaling-up an eco-bio-social approach for implementing an integrated community-based approach for dengue prevention in comparison with existing insecticide-based and emerging biolarvicide-based programs in an endemic setting in Machala, Ecuador. An integrated intervention strategy (IIS) for dengue prevention (an elementary school-based dengue education program, and clean patio and safe container program) was implemented in 10 intervention clusters from November 2012 to November 2013 using a randomized controlled cluster trial design (20 clusters: 10 intervention, 10 control; 100 households per cluster with 1986 total households). Current existing dengue prevention programs served as the control treatment in comparison clusters. Pupa per person index (PPI) is used as the main outcome measure. Particular attention was paid to social mobilization and empowerment with IIS. Overall, IIS was successful in reducing PPI levels in intervention communities versus control clusters, with intervention clusters in the six paired clusters that followed the study design experiencing a greater reduction of PPI compared to controls (2.2 OR, 95% CI: 1.2 to 4.7). Analysis of individual cases demonstrates that consideration for contextualizing programs and strategies to local neighborhoods can be very effective in reducing PPI for dengue transmission risk reduction. In the rapidly evolving political climate for dengue control in Ecuador, integration of successful social mobilization and empowerment strategies with existing and emerging biolarvicide-based government dengue prevention and control programs is promising in reducing PPI and dengue transmission risk in southern coastal communities like Machala. However, more profound analysis of social determination of health is called for to assess sustainability prospects. © The author 2015. The World Health Organization has granted Oxford University Press permission for the reproduction of this article.
Mitchell-Foster, Kendra; Ayala, Efraín Beltrán; Breilh, Jaime; Spiegel, Jerry; Wilches, Ana Arichabala; Leon, Tania Ordóñez; Delgado, Jefferson Adrian
2015-01-01
Background This project investigates the effectiveness and feasibility of scaling-up an eco-bio-social approach for implementing an integrated community-based approach for dengue prevention in comparison with existing insecticide-based and emerging biolarvicide-based programs in an endemic setting in Machala, Ecuador. Methods An integrated intervention strategy (IIS) for dengue prevention (an elementary school-based dengue education program, and clean patio and safe container program) was implemented in 10 intervention clusters from November 2012 to November 2013 using a randomized controlled cluster trial design (20 clusters: 10 intervention, 10 control; 100 households per cluster with 1986 total households). Current existing dengue prevention programs served as the control treatment in comparison clusters. Pupa per person index (PPI) is used as the main outcome measure. Particular attention was paid to social mobilization and empowerment with IIS. Results Overall, IIS was successful in reducing PPI levels in intervention communities versus control clusters, with intervention clusters in the six paired clusters that followed the study design experiencing a greater reduction of PPI compared to controls (2.2 OR, 95% CI: 1.2 to 4.7). Analysis of individual cases demonstrates that consideration for contextualizing programs and strategies to local neighborhoods can be very effective in reducing PPI for dengue transmission risk reduction. Conclusions In the rapidly evolving political climate for dengue control in Ecuador, integration of successful social mobilization and empowerment strategies with existing and emerging biolarvicide-based government dengue prevention and control programs is promising in reducing PPI and dengue transmission risk in southern coastal communities like Machala. However, more profound analysis of social determination of health is called for to assess sustainability prospects. PMID:25604763
Cardiovascular reactivity patterns and pathways to hypertension: a multivariate cluster analysis.
Brindle, R C; Ginty, A T; Jones, A; Phillips, A C; Roseboom, T J; Carroll, D; Painter, R C; de Rooij, S R
2016-12-01
Substantial evidence links exaggerated mental stress induced blood pressure reactivity to future hypertension, but the results for heart rate reactivity are less clear. For this reason multivariate cluster analysis was carried out to examine the relationship between heart rate and blood pressure reactivity patterns and hypertension in a large prospective cohort (age range 55-60 years). Four clusters emerged with statistically different systolic and diastolic blood pressure and heart rate reactivity patterns. Cluster 1 was characterised by a relatively exaggerated blood pressure and heart rate response while the blood pressure and heart rate responses of cluster 2 were relatively modest and in line with the sample mean. Cluster 3 was characterised by blunted cardiovascular stress reactivity across all variables and cluster 4, by an exaggerated blood pressure response and modest heart rate response. Membership to cluster 4 conferred an increased risk of hypertension at 5-year follow-up (hazard ratio=2.98 (95% CI: 1.50-5.90), P<0.01) that survived adjustment for a host of potential confounding variables. These results suggest that the cardiac reactivity plays a potentially important role in the link between blood pressure reactivity and hypertension and support the use of multivariate approaches to stress psychophysiology.
Weighted graph cuts without eigenvectors a multilevel approach.
Dhillon, Inderjit S; Guan, Yuqiang; Kulis, Brian
2007-11-01
A variety of clustering algorithms have recently been proposed to handle data that is not linearly separable; spectral clustering and kernel k-means are two of the main methods. In this paper, we discuss an equivalence between the objective functions used in these seemingly different methods--in particular, a general weighted kernel k-means objective is mathematically equivalent to a weighted graph clustering objective. We exploit this equivalence to develop a fast, high-quality multilevel algorithm that directly optimizes various weighted graph clustering objectives, such as the popular ratio cut, normalized cut, and ratio association criteria. This eliminates the need for any eigenvector computation for graph clustering problems, which can be prohibitive for very large graphs. Previous multilevel graph partitioning methods, such as Metis, have suffered from the restriction of equal-sized clusters; our multilevel algorithm removes this restriction by using kernel k-means to optimize weighted graph cuts. Experimental results show that our multilevel algorithm outperforms a state-of-the-art spectral clustering algorithm in terms of speed, memory usage, and quality. We demonstrate that our algorithm is applicable to large-scale clustering tasks such as image segmentation, social network analysis and gene network analysis.
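A plain, unweighted kernel k-means is sketched below to make the objective concrete: distances to cluster means are computed entirely from the kernel matrix, so no explicit feature map or eigenvector computation is needed. The RBF kernel and the toy data are assumptions, and the paper's weighting and multilevel coarsening are not reproduced.

# Kernel k-means sketch: ||phi(x_i) - m_c||^2 computed from the kernel matrix only.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(K, k, n_iter=50, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                dist[:, c] = np.inf
                continue
            # K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl over members j, l of cluster c
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(2, 0.3, (40, 2))])
print(np.bincount(kernel_kmeans(rbf_kernel(X, gamma=2.0), k=2)))

Because only kernel evaluations are needed, the same loop applies to graph-derived kernels, which is the bridge to the weighted graph cut objectives discussed in the paper.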
Novel Approach to Classify Plants Based on Metabolite-Content Similarity.
Liu, Kang; Abdullah, Azian Azamimi; Huang, Ming; Nishioka, Takaaki; Altaf-Ul-Amin, Md; Kanaya, Shigehiko
2017-01-01
Secondary metabolites are bioactive substances with diverse chemical structures. Depending on the ecological environment within which they are living, higher plants use different combinations of secondary metabolites for adaptation (e.g., defense against attacks by herbivores or pathogenic microbes). This suggests that the similarity in metabolite content is applicable to assess phylogenic similarity of higher plants. However, such a chemical taxonomic approach has limitations of incomplete metabolomics data. We propose an approach for successfully classifying 216 plants based on their known incomplete metabolite content. Structurally similar metabolites have been clustered using the network clustering algorithm DPClus. Plants have been represented as binary vectors, implying relations with structurally similar metabolite groups, and classified using Ward's method of hierarchical clustering. Despite incomplete data, the resulting plant clusters are consistent with the known evolutional relations of plants. This finding reveals the significance of metabolite content as a taxonomic marker. We also discuss the predictive power of metabolite content in exploring nutritional and medicinal properties in plants. As a byproduct of our analysis, we could predict some currently unknown species-metabolite relations.
Novel Approach to Classify Plants Based on Metabolite-Content Similarity
Abdullah, Azian Azamimi; Huang, Ming; Nishioka, Takaaki
2017-01-01
Secondary metabolites are bioactive substances with diverse chemical structures. Depending on the ecological environment within which they are living, higher plants use different combinations of secondary metabolites for adaptation (e.g., defense against attacks by herbivores or pathogenic microbes). This suggests that the similarity in metabolite content is applicable to assess phylogenic similarity of higher plants. However, such a chemical taxonomic approach has limitations of incomplete metabolomics data. We propose an approach for successfully classifying 216 plants based on their known incomplete metabolite content. Structurally similar metabolites have been clustered using the network clustering algorithm DPClus. Plants have been represented as binary vectors, implying relations with structurally similar metabolite groups, and classified using Ward's method of hierarchical clustering. Despite incomplete data, the resulting plant clusters are consistent with the known evolutional relations of plants. This finding reveals the significance of metabolite content as a taxonomic marker. We also discuss the predictive power of metabolite content in exploring nutritional and medicinal properties in plants. As a byproduct of our analysis, we could predict some currently unknown species-metabolite relations. PMID:28164123
Replicating cluster subtypes for the prevention of adolescent smoking and alcohol use.
Babbin, Steven F; Velicer, Wayne F; Paiva, Andrea L; Brick, Leslie Ann D; Redding, Colleen A
2015-01-01
Substance abuse interventions tailored to the individual level have produced effective outcomes for a wide variety of behaviors. One approach to enhancing tailoring involves using cluster analysis to identify prevention subtypes that represent different attitudes about substance use. This study applied this approach to better understand tailored interventions for smoking and alcohol prevention. Analyses were performed on a sample of sixth graders from 20 New England middle schools involved in a 36-month tailored intervention study. Most adolescents reported being in the Acquisition Precontemplation (aPC) stage at baseline: not smoking or not drinking and not planning to start in the next six months. For smoking (N=4059) and alcohol (N=3973), each sample was randomly split into five subsamples. Cluster analysis was performed within each subsample based on three variables: Pros and Cons (from Decisional Balance Scales), and Situational Temptations. Across all subsamples for both smoking and alcohol, the following four clusters were identified: (1) Most Protected (MP; low Pros, high Cons, low Temptations); (2) Ambivalent (AM; high Pros, average Cons and Temptations); (3) Risk Denial (RD; average Pros, low Cons, average Temptations); and (4) High Risk (HR; high Pros, low Cons, and very high Temptations). Finding the same four clusters within aPC for both smoking and alcohol, replicating the results across the five subsamples, and demonstrating hypothesized relations among the clusters with additional external validity analyses provide strong evidence of the robustness of these results. These clusters demonstrate evidence of validity and can provide a basis for tailoring interventions. Copyright © 2014. Published by Elsevier Ltd.
Replicating cluster subtypes for the prevention of adolescent smoking and alcohol use
Babbin, Steven F.; Velicer, Wayne F.; Paiva, Andrea L.; Brick, Leslie Ann D.; Redding, Colleen A.
2015-01-01
Introduction Substance abuse interventions tailored to the individual level have produced effective outcomes for a wide variety of behaviors. One approach to enhancing tailoring involves using cluster analysis to identify prevention subtypes that represent different attitudes about substance use. This study applied this approach to better understand tailored interventions for smoking and alcohol prevention. Methods Analyses were performed on a sample of sixth graders from 20 New England middle schools involved in a 36-month tailored intervention study. Most adolescents reported being in the Acquisition Precontemplation (aPC) stage at baseline: not smoking or not drinking and not planning to start in the next six months. For smoking (N= 4059) and alcohol (N= 3973), each sample was randomly split into five subsamples. Cluster analysis was performed within each subsample based on three variables: Pros and Cons (from Decisional Balance Scales), and Situational Temptations. Results Across all subsamples for both smoking and alcohol, the following four clusters were identified: (1) Most Protected (MP; low Pros, high Cons, low Temptations); (2) Ambivalent (AM; high Pros, average Cons and Temptations); (3) Risk Denial (RD; average Pros, low Cons, average Temptations); and (4) High Risk (HR; high Pros, low Cons, and very high Temptations). Conclusions Finding the same four clusters within aPC for both smoking and alcohol, replicating the results across the five subsamples, and demonstrating hypothesized relations among the clusters with additional external validity analyses provide strong evidence of the robustness of these results. These clusters demonstrate evidence of validity and can provide a basis for tailoring interventions. PMID:25222849
NASA Astrophysics Data System (ADS)
Schaefer, A. M.; Daniell, J. E.; Wenzel, F.
2014-12-01
Earthquake clustering tends to be an increasingly important part of general earthquake research especially in terms of seismic hazard assessment and earthquake forecasting and prediction approaches. The distinct identification and definition of foreshocks, aftershocks, mainshocks and secondary mainshocks is taken into account using a point based spatio-temporal clustering algorithm originating from the field of classic machine learning. This can be further applied for declustering purposes to separate background seismicity from triggered seismicity. The results are interpreted and processed to assemble 3D-(x,y,t) earthquake clustering maps which are based on smoothed seismicity records in space and time. In addition, multi-dimensional Gaussian functions are used to capture clustering parameters for spatial distribution and dominant orientations. Clusters are further processed using methodologies originating from geostatistics, which have been mostly applied and developed in mining projects during the last decades. A 2.5D variogram analysis is applied to identify spatio-temporal homogeneity in terms of earthquake density and energy output. The results are mitigated using Kriging to provide an accurate mapping solution for clustering features. As a case study, seismic data of New Zealand and the United States is used, covering events since the 1950s, from which an earthquake cluster catalogue is assembled for most of the major events, including a detailed analysis of the Landers and Christchurch sequences.
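The point-based spatio-temporal clustering step can be approximated with a density-based algorithm on scaled (x, y, t) coordinates, as sketched below; the toy catalogue, the space/time scaling, and the DBSCAN parameters are illustrative assumptions rather than the calibrated settings used in the study.

# DBSCAN-style spatio-temporal clustering of a toy earthquake catalogue.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# toy catalogue: lon, lat (degrees) and time (days since catalogue start);
# the first 80 events form a tight space-time cluster, the rest are background
lon = np.concatenate([rng.normal(172.6, 0.05, 80), rng.uniform(166, 179, 40)])
lat = np.concatenate([rng.normal(-43.5, 0.05, 80), rng.uniform(-47, -34, 40)])
t   = np.concatenate([rng.normal(500, 5, 80),      rng.uniform(0, 2000, 40)])

# bring space and time onto comparable scales (assumed: 1 unit ~ 10 km ~ 10 days)
km_per_deg = 111.0
features = np.column_stack([lon * km_per_deg / 10.0,
                            lat * km_per_deg / 10.0,
                            t / 10.0])

labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(features)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0),
      "| background events:", int(np.sum(labels == -1)))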
Roca, Josep; Vargas, Claudia; Cano, Isaac; Selivanov, Vitaly; Barreiro, Esther; Maier, Dieter; Falciani, Francesco; Wagner, Peter; Cascante, Marta; Garcia-Aymerich, Judith; Kalko, Susana; De Mas, Igor; Tegnér, Jesper; Escarrabill, Joan; Agustí, Alvar; Gomez-Cabrero, David
2014-11-28
Heterogeneity in clinical manifestations and disease progression in Chronic Obstructive Pulmonary Disease (COPD) leads to consequences for patient health risk assessment, stratification and management. Implicit with the classical "spill over" hypothesis is that COPD heterogeneity is driven by the pulmonary events of the disease. Alternatively, we hypothesized that COPD heterogeneities result from the interplay of mechanisms governing three conceptually different phenomena: 1) pulmonary disease, 2) systemic effects of COPD and 3) co-morbidity clustering, each of them with their own dynamics. The aim was to explore the potential of a systems analysis of COPD heterogeneity, focused on skeletal muscle dysfunction and on co-morbidity clustering, for generating predictive modeling with impact on patient management. To this end, strategies combining deterministic modeling and network medicine analyses of the Biobridge dataset were used to investigate the mechanisms of skeletal muscle dysfunction. An independent data driven analysis of co-morbidity clustering examining associated genes and pathways was performed using a large dataset (ICD9-CM data from Medicare, 13 million people). Finally, a targeted network analysis using the outcomes of the two approaches (skeletal muscle dysfunction and co-morbidity clustering) explored shared pathways between these phenomena. (1) Evidence of abnormal regulation of skeletal muscle bioenergetics and skeletal muscle remodeling showing a significant association with nitroso-redox disequilibrium was observed in COPD; (2) COPD patients presented higher risk for co-morbidity clustering than non-COPD patients, increasing with ageing; and (3) the on-going targeted network analyses suggest shared pathways between skeletal muscle dysfunction and co-morbidity clustering. The results indicate the high potential of a systems approach to address COPD heterogeneity. Significant knowledge gaps were identified that are relevant to shape strategies aiming at fostering 4P Medicine for patients with COPD.
Zakharov, A.; Vitale, C.; Kilinc, E.; Koroleva, K.; Fayuk, D.; Shelukhina, I.; Naumenko, N.; Skorinkin, A.; Khazipov, R.; Giniatullin, R.
2015-01-01
Trigeminal nerves in meninges are implicated in generation of nociceptive firing underlying migraine pain. However, the neurochemical mechanisms of nociceptive firing in meningeal trigeminal nerves are little understood. In this study, using suction electrode recordings from peripheral branches of the trigeminal nerve in isolated rat meninges, we analyzed spontaneous and capsaicin-induced orthodromic spiking activity. In control, biphasic single spikes with variable amplitude and shapes were observed. Application of the transient receptor potential vanilloid 1 (TRPV1) agonist capsaicin to meninges dramatically increased firing whereas the amplitudes and shapes of spikes remained essentially unchanged. This effect was antagonized by the specific TRPV1 antagonist capsazepine. Using the clustering approach, several groups of uniform spikes (clusters) were identified. The clustering approach combined with capsaicin application allowed us to detect and to distinguish “responder” (65%) from “non-responder” clusters (35%). Notably, responders fired spikes at frequencies exceeding 10 Hz, high enough to provide postsynaptic temporal summation of excitation at brainstem and spinal cord level. Almost all spikes were suppressed by tetrodotoxin (TTX) suggesting an involvement of the TTX-sensitive sodium channels in nociceptive signaling at the peripheral branches of trigeminal neurons. Our analysis also identified transient (desensitizing) and long-lasting (slowly desensitizing) responses to the continuous application of capsaicin. Thus, the persistent activation of nociceptors in capsaicin-sensitive nerve fibers shown here may be involved in trigeminal pain signaling and plasticity along with the release of migraine-related neuropeptides from TRPV1 positive neurons. Furthermore, cluster analysis could be widely used to characterize the temporal and neurochemical profiles of other pain transducers likely implicated in migraine. PMID:26283923
Spot detection and image segmentation in DNA microarray data.
Qin, Li; Rueda, Luis; Ali, Adnan; Ngom, Alioune
2005-01-01
Following the invention of microarrays in 1994, the development and applications of this technology have grown exponentially. The numerous applications of microarray technology include clinical diagnosis and treatment, drug design and discovery, tumour detection, and environmental health research. One of the key issues in the experimental approaches utilising microarrays is to extract quantitative information from the spots, which represent genes in a given experiment. For this process, the initial stages are important and they influence future steps in the analysis. Identifying the spots and separating the background from the foreground is a fundamental problem in DNA microarray data analysis. In this review, we present an overview of state-of-the-art methods for microarray image segmentation. We discuss the foundations of the circle-shaped approach, adaptive shape segmentation, histogram-based methods and the recently introduced clustering-based techniques. We analytically show that clustering-based techniques are equivalent to the one-dimensional, standard k-means clustering algorithm that utilises the Euclidean distance.
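The equivalence noted above suggests a very small worked example: a one-dimensional k-means with two clusters applied to pixel intensities inside a spot window, with the brighter cluster taken as foreground. The synthetic spot image below is an illustrative stand-in for real scanner output.

# 1-D k-means segmentation of a synthetic microarray spot window.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
h = w = 24
yy, xx = np.mgrid[0:h, 0:w]
spot = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2) < 7 ** 2      # circular spot mask
image = np.where(spot, rng.normal(900, 60, (h, w)), rng.normal(120, 30, (h, w)))

# k=2 clustering of intensities (Euclidean distance in one dimension)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(image.reshape(-1, 1))
fg_cluster = np.argmax(km.cluster_centers_.ravel())           # brighter cluster
foreground = (km.labels_ == fg_cluster).reshape(h, w)

print("estimated spot pixels:", int(foreground.sum()), "| true:", int(spot.sum()))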
Burte, Emilie; Bousquet, Jean; Varraso, Raphaëlle; Gormand, Frédéric; Just, Jocelyne; Matran, Régis; Pin, Isabelle; Siroux, Valérie; Jacquemin, Bénédicte; Nadif, Rachel
2015-01-01
A classification of rhinitis in adults is lacking in epidemiological studies. The aim was to identify phenotypes of adult rhinitis using an unsupervised (data-driven) approach compared with a classical hypothesis-driven approach. 983 adults of the French Epidemiological Study on the Genetics and Environment of Asthma (EGEA) were studied. Self-reported symptoms related to rhinitis such as nasal symptoms, hay fever, sinusitis, conjunctivitis, and sensitivities to different triggers (dust, animals, hay/flowers, cold air…) were used. Allergic sensitization was defined by at least one positive skin prick test to 12 aeroallergens. A mixture model was used to cluster participants, independently in those without (Asthma-, n = 582) and with asthma (Asthma+, n = 401). Three clusters were identified in both groups: 1) Cluster A (55% in Asthma-, and 22% in Asthma+) mainly characterized by the absence of nasal symptoms, 2) Cluster B (23% in Asthma-, 36% in Asthma+) mainly characterized by nasal symptoms throughout the year, sinusitis and a low prevalence of positive skin prick tests, and 3) Cluster C (22% in Asthma-, 42% in Asthma+) mainly characterized by a peak of nasal symptoms during spring, a high prevalence of positive skin prick tests and frequent reports of hay fever, allergic rhinitis and conjunctivitis. The highest rate of polysensitization (80%) was found in participants with comorbid asthma and allergic rhinitis. This cluster analysis highlighted three clusters of rhinitis with characteristics similar to those known by clinicians but differing according to allergic sensitization, whatever the asthma status. These clusters could be easily rebuilt using a small number of variables.
Crawford, Megan R.; Chirinos, Diana A.; Iurcotta, Toni; Edinger, Jack D.; Wyatt, James K.; Manber, Rachel; Ong, Jason C.
2017-01-01
Study Objectives: This study examined empirically derived symptom cluster profiles among patients who present with insomnia using clinical data and polysomnography. Methods: Latent profile analysis was used to identify symptom cluster profiles of 175 individuals (63% female) with insomnia disorder based on total scores on validated self-report instruments of daytime and nighttime symptoms (Insomnia Severity Index, Glasgow Sleep Effort Scale, Fatigue Severity Scale, Beliefs and Attitudes about Sleep, Epworth Sleepiness Scale, Pre-Sleep Arousal Scale), mean values from a 7-day sleep diary (sleep onset latency, wake after sleep onset, and sleep efficiency), and total sleep time derived from an in-laboratory PSG. Results: The best-fitting model had three symptom cluster profiles: “High Subjective Wakefulness” (HSW), “Mild Insomnia” (MI) and “Insomnia-Related Distress” (IRD). The HSW symptom cluster profile (26.3% of the sample) reported high wake after sleep onset, high sleep onset latency, and low sleep efficiency. Despite relatively comparable PSG-derived total sleep time, they reported greater levels of daytime sleepiness. The MI symptom cluster profile (45.1%) reported the least disturbance in the sleep diary and questionnaires and had the highest sleep efficiency. The IRD symptom cluster profile (28.6%) reported the highest mean scores on the insomnia-related distress measures (eg, sleep effort and arousal) and waking correlates (fatigue). Covariates associated with symptom cluster membership were older age for the HSW profile, greater obstructive sleep apnea severity for the MI profile, and, when adjusting for obstructive sleep apnea severity, being overweight/obese for the IRD profile. Conclusions: The heterogeneous nature of insomnia disorder is captured by this data-driven approach to identify symptom cluster profiles. The adaptation of a symptom cluster-based approach could guide tailored patient-centered management of patients presenting with insomnia, and enhance patient care. Citation: Crawford MR, Chirinos DA, Iurcotta T, Edinger JD, Wyatt JK, Manber R, Ong JC. Characterization of patients who present with insomnia: is there room for a symptom cluster-based approach? J Clin Sleep Med. 2017;13(7):911–921. PMID:28633722
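Latent profile analysis of this kind is often fitted in practice as a finite Gaussian mixture over standardized indicator scores, with an information criterion choosing the number of profiles. The sketch below uses that stand-in (scikit-learn's GaussianMixture with BIC); the simulated scores and indicator count are assumptions, not the study's data.

# Latent-profile-style clustering via a Gaussian mixture with BIC model selection.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# stand-in for 175 patients x 10 indicators (questionnaires, diary means, PSG TST)
scores = np.vstack([rng.normal(-0.5, 1.0, (80, 10)),
                    rng.normal(0.0, 1.0, (50, 10)),
                    rng.normal(1.0, 1.0, (45, 10))])
Z = StandardScaler().fit_transform(scores)

fits = {k: GaussianMixture(n_components=k, covariance_type="diag",
                           random_state=0).fit(Z) for k in range(1, 6)}
best_k = min(fits, key=lambda k: fits[k].bic(Z))
profiles = fits[best_k].predict(Z)
print("profiles chosen by BIC:", best_k, "| sizes:", np.bincount(profiles))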
ERIC Educational Resources Information Center
Young, Edna Carter; Thompson, Cynthia K.
1987-01-01
The effects of treatment on errors in consonant clusters and in ambisyllabic consonants were investigated in two adults with histories of developmental phonological problems. Results indicated that treatment, consisting of a sound-referenced rebus approach, affected change in production of trained words as well as generalization to untrained words…
A method of alignment masking for refining the phylogenetic signal of multiple sequence alignments.
Rajan, Vaibhav
2013-03-01
Inaccurate inference of positional homologies in multiple sequence alignments and systematic errors introduced by alignment heuristics obfuscate phylogenetic inference. Alignment masking, the elimination of phylogenetically uninformative or misleading sites from an alignment before phylogenetic analysis, is a common practice in phylogenetic analysis. Although masking is often done manually, automated methods are necessary to handle the much larger data sets being prepared today. In this study, we introduce the concept of subsplits and demonstrate their use in extracting phylogenetic signal from alignments. We design a clustering approach for alignment masking where each cluster contains similar columns, with similarity defined on the basis of compatible subsplits; our approach then identifies noisy clusters and eliminates them. Trees inferred from the columns in the retained clusters are found to be topologically closer to the reference trees. We test our method on numerous standard benchmarks (both synthetic and biological data sets) and compare its performance with other methods of alignment masking. We find that our method can eliminate sites more accurately than other methods, particularly on divergent data, and can improve the topologies of the inferred trees in likelihood-based analyses. Software available upon request from the author.
Vasylkivska, Veronika S.; Huerta, Nicolas J.
2017-06-24
Determining the spatiotemporal characteristics of natural and induced seismic events holds the opportunity to gain new insights into why these events occur. Linking the seismicity characteristics with other geologic, geographic, natural, or anthropogenic factors could help to identify the causes and suggest mitigation strategies that reduce the risk associated with such events. The nearest-neighbor approach utilized in this work represents a practical first step toward identifying statistically correlated clusters of recorded earthquake events. Detailed study of the Oklahoma earthquake catalog’s inherent errors, empirical model parameters, and model assumptions is presented. We found that the cluster analysis results are stable with respect to empirical parameters (e.g., fractal dimension) but were sensitive to epicenter location errors and seismicity rates. Most critically, we show that the patterns in the distribution of earthquake clusters in Oklahoma are primarily defined by spatial relationships between events. This observation is a stark contrast to California (also known for induced seismicity) where a comparable cluster distribution is defined by both spatial and temporal interactions between events. These results highlight the difficulty in understanding the mechanisms and behavior of induced seismicity but provide insights for future work.
A mixture model-based approach to the clustering of microarray expression data.
McLachlan, G J; Bean, R W; Peel, D
2002-03-01
This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to effectively reduce the dimension of the feature space of genes. The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes can be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. EMMIX-GENE is available at http://www.maths.uq.edu.au/~gjm/emmix-gene/
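The gene-ranking step can be illustrated with a simplified stand-in: for each gene, fit one- and two-component mixtures across tissues and rank genes by the likelihood ratio statistic. Gaussian components replace the t components used by EMMIX-GENE, and the simulated expression matrix and the fixed top-20 selection are assumptions.

# Rank genes by a one-vs-two-component mixture likelihood ratio (Gaussian stand-in).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_genes, n_tissues = 200, 40
expr = rng.normal(size=(n_genes, n_tissues))
expr[:20, :20] += 2.5                        # first 20 genes separate two tissue groups

def lr_statistic(values):
    x = values.reshape(-1, 1)
    g1 = GaussianMixture(n_components=1, random_state=0).fit(x)
    g2 = GaussianMixture(n_components=2, n_init=3, random_state=0).fit(x)
    # -2 log likelihood ratio for the test of one versus two components
    return 2.0 * (g2.score(x) - g1.score(x)) * len(x)

stats = np.array([lr_statistic(expr[g]) for g in range(n_genes)])
top = np.argsort(stats)[::-1][:20]           # genes most supportive of two components
print("top-ranked genes:", np.sort(top))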
Kharroubi, Adel; Gargouri, Dorra; Baati, Houda; Azri, Chafai
2012-06-01
Concentrations of selected heavy metals (Cd, Pb, Zn, Cu, Mn, and Fe) in surface sediments from 66 sites in both northern and eastern Mediterranean Sea-Boughrara lagoon exchange areas (southeastern Tunisia) were studied in order to understand current metal contamination due to the urbanization and economic development of several nearby coastal regions of the Gulf of Gabès. Multiple approaches were applied for the sediment quality assessment. These approaches were based on GIS coupled with chemometric methods (enrichment factors, geoaccumulation index, principal component analysis, and cluster analysis). Enrichment factors and principal component analysis revealed two distinct groups of metals. The first group corresponded to Fe and Mn derived from natural sources, and the second group contained Cd, Pb, Zn, and Cu originating from man-made sources. For these latter metals, cluster analysis showed two distinct distributions in the selected areas. They were attributed to temporal and spatial variations of contaminant source inputs. The geoaccumulation index (Igeo) values indicated that only Cd, Pb, and Cu can be considered as moderate to extreme pollutants in the studied sediments.
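Both indices used above have standard textbook forms: the enrichment factor normalizes a metal to a conservative reference element in both sample and background, and the geoaccumulation index compares the measured concentration with 1.5 times the geochemical background. The background values in the sketch are placeholders, not the study's reference values.

# EF = (C_metal / C_Fe)_sample / (C_metal / C_Fe)_background
# Igeo = log2(C_metal / (1.5 * background_metal))
import math

def enrichment_factor(c_metal, c_fe, bg_metal, bg_fe):
    return (c_metal / c_fe) / (bg_metal / bg_fe)

def geoaccumulation_index(c_metal, bg_metal):
    return math.log2(c_metal / (1.5 * bg_metal))

# illustrative Cd values (mg/kg): measured vs. assumed background
print(round(enrichment_factor(0.9, 18000, 0.2, 35000), 2))
print(round(geoaccumulation_index(0.9, 0.2), 2))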
Ha, Ninh Thi; Harris, Mark; Preen, David; Robinson, Suzanne; Moorin, Rachael
2018-04-01
We aimed to characterise use of general practitioners (GP) simultaneously across multiple attributes in people with diabetes and examine its impact on diabetes-related potentially preventable hospitalisations (PPHs). Five years of panel data from 40,625 adults with diabetes were sourced from Western Australian administrative health records. Cluster analysis (CA) was used to group individuals with similar patterns of GP utilisation characterised by frequency and recency of services. The relationship between GP utilisation cluster and the risk of PPHs was examined using multivariable random-effects negative binomial regression. CA categorised GP utilisation into three clusters: moderate, high and very high usage, having distinct patient characteristics. After adjusting for potential confounders, the rate of PPHs was significantly lower across all GP usage clusters compared with those with no GP usage; IRR = 0.67 (95%CI: 0.62-0.71) among the moderate, IRR = 0.70 (95%CI 0.66-0.73) high and IRR = 0.76 (95%CI 0.72-0.80) very high GP usage clusters. Combination of temporal factors with measures of frequency of use of GP services revealed patterns of primary health care utilisation associated with different underlying patient characteristics. Incorporation of multiple attributes that go beyond frequency-based approaches may better characterise the complex relationship between use of GP services and diabetes-related hospitalisation. Copyright © 2018 Elsevier B.V. All rights reserved.
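The outcome model can be sketched as a negative binomial count regression of PPHs on the GP-usage cluster, with incidence rate ratios obtained by exponentiating the coefficients. The pooled GLM below omits the random effects used in the study and runs on simulated counts, so it is only a structural illustration.

# Negative binomial regression of PPH counts on GP-usage cluster (pooled, simulated).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
cluster = rng.choice(["none", "moderate", "high", "very_high"], size=n)
base_rate = {"none": 0.30, "moderate": 0.20, "high": 0.21, "very_high": 0.23}
pph = rng.poisson([base_rate[c] for c in cluster])           # toy PPH counts

df = pd.DataFrame({"pph": pph, "cluster": pd.Categorical(cluster,
                   categories=["none", "moderate", "high", "very_high"])})
X = sm.add_constant(pd.get_dummies(df["cluster"], drop_first=True, dtype=float))
model = sm.GLM(df["pph"], X, family=sm.families.NegativeBinomial()).fit()
print(np.exp(model.params))                                   # IRRs relative to "none"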
NASA Astrophysics Data System (ADS)
Farsadnia, Farhad; Ghahreman, Bijan
2016-04-01
Hydrologic homogeneous group identification is considered both fundamental and applied research in hydrology. Clustering methods are among the conventional methods to assess hydrological homogeneous regions. Recently, the Self-Organizing feature Map (SOM) method has been applied in some studies. However, the main problem with this method is the interpretation of its output map. Therefore, SOM is used as input to other clustering algorithms. The aim of this study is to apply a two-level Self-Organizing feature Map and Ward hierarchical clustering method to determine the hydrologic homogeneous regions in North and Razavi Khorasan provinces. First, we reduced the dimension of the SOM input matrix by principal component analysis; then the SOM was used to form a two-dimensional feature map. To determine homogeneous regions for flood frequency analysis, SOM output nodes were used as input into the Ward method. Generally, the regions identified by the clustering algorithms are not statistically homogeneous. Consequently, they have to be adjusted to improve their homogeneity. After adjusting the regions by L-moment homogeneity tests, five hydrologic homogeneous regions were identified. Finally, adjusted regions were created by a two-level SOM and then the best regional distribution function and associated parameters were selected by the L-moment approach. The results showed that the combination of self-organizing maps and Ward hierarchical clustering with principal components as input is more effective than the hierarchical method with principal components or standardized inputs in achieving hydrologic homogeneous regions.
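A two-level sketch along these lines is shown below: a self-organizing map is trained on PCA-reduced attributes and Ward's method then groups the SOM prototype vectors into candidate regions. MiniSom is used here purely as a convenient SOM implementation; the map size, iteration count, and simulated catchment attributes are assumptions, and the L-moment homogeneity adjustment is not reproduced.

# Two-level regionalization sketch: PCA -> SOM -> Ward clustering of SOM prototypes.
import numpy as np
from minisom import MiniSom
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
attrs = rng.normal(size=(60, 12))                 # 60 catchments x 12 attributes
X = PCA(n_components=4).fit_transform(attrs)      # first level of reduction

som = MiniSom(6, 6, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 2000)
codebook = som.get_weights().reshape(-1, X.shape[1])   # 36 prototype vectors

# second level: Ward clustering of the SOM prototypes into candidate regions
node_labels = fcluster(linkage(codebook, method="ward"), t=5, criterion="maxclust")

# map each catchment to its best-matching SOM node, then to a region
winners = [som.winner(x) for x in X]              # (row, col) of the winning node
regions = [node_labels[r * 6 + c] for r, c in winners]
print(np.bincount(regions)[1:])                   # catchments per candidate region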
Banerjee, Amit; Misra, Milind; Pai, Deepa; Shih, Liang-Yu; Woodley, Rohan; Lu, Xiang-Jun; Srinivasan, A R; Olson, Wilma K; Davé, Rajesh N; Venanzi, Carol A
2007-01-01
Six rigid-body parameters (Shift, Slide, Rise, Tilt, Roll, Twist) are commonly used to describe the relative displacement and orientation of successive base pairs in a nucleic acid structure. The present work adapts this approach to describe the relative displacement and orientation of any two planes in an arbitrary molecule-specifically, planes which contain important pharmacophore elements. Relevant code from the 3DNA software package (Nucleic Acids Res. 2003, 31, 5108-5121) was generalized to treat molecular fragments other than DNA bases as input for the calculation of the corresponding rigid-body (or "planes") parameters. These parameters were used to construct feature vectors for a fuzzy relational clustering study of over 700 conformations of a flexible analogue of the dopamine reuptake inhibitor, GBR 12909. Several cluster validity measures were used to determine the optimal number of clusters. Translational (Shift, Slide, Rise) rather than rotational (Tilt, Roll, Twist) features dominate clustering based on planes that are relatively far apart, whereas both types of features are important to clustering when the pair of planes are close by. This approach was able to classify the data set of molecular conformations into groups and to identify representative conformers for use as template conformers in future Comparative Molecular Field Analysis studies of GBR 12909 analogues. The advantage of using the planes parameters, rather than the combination of atomic coordinates and angles between molecular planes used in our previous fuzzy relational clustering of the same data set (J. Chem. Inf. Model. 2005, 45, 610-623), is that the present clustering results are independent of molecular superposition and the technique is able to identify clusters in the molecule considered as a whole. This approach is easily generalizable to any two planes in any molecule.
Xu, Min; Wang, Yemin; Zhao, Zhilong; Gao, Guixi; Huang, Sheng-Xiong; Kang, Qianjin; He, Xinyi; Lin, Shuangjun; Pang, Xiuhua; Deng, Zixin
2016-01-01
ABSTRACT Genome sequencing projects in the last decade revealed numerous cryptic biosynthetic pathways for unknown secondary metabolites in microbes, revitalizing drug discovery from microbial metabolites by approaches called genome mining. In this work, we developed a heterologous expression and functional screening approach for genome mining from genomic bacterial artificial chromosome (BAC) libraries in Streptomyces spp. We demonstrate mining from a strain of Streptomyces rochei, which is known to produce streptothricins and borrelidin, by expressing its BAC library in the surrogate host Streptomyces lividans SBT5, and screening for antimicrobial activity. In addition to the successful capture of the streptothricin and borrelidin biosynthetic gene clusters, we discovered two novel linear lipopeptides and their corresponding biosynthetic gene cluster, as well as a novel cryptic gene cluster for an unknown antibiotic from S. rochei. This high-throughput functional genome mining approach can be easily applied to other streptomycetes, and it is very suitable for the large-scale screening of genomic BAC libraries for bioactive natural products and the corresponding biosynthetic pathways. IMPORTANCE Microbial genomes encode numerous cryptic biosynthetic gene clusters for unknown small metabolites with potential biological activities. Several genome mining approaches have been developed to activate and bring these cryptic metabolites to biological tests for future drug discovery. Previous sequence-guided procedures relied on bioinformatic analysis to predict potentially interesting biosynthetic gene clusters. In this study, we describe an efficient approach based on heterologous expression and functional screening of a whole-genome library for the mining of bioactive metabolites from Streptomyces. The usefulness of this function-driven approach was demonstrated by the capture of four large biosynthetic gene clusters for metabolites of various chemical types, including streptothricins, borrelidin, two novel lipopeptides, and one unknown antibiotic from Streptomyces rochei Sal35. The transfer, expression, and screening of the library were all performed in a high-throughput way, so that this approach is scalable and adaptable to industrial automation for next-generation antibiotic discovery. PMID:27451447
de Jong, Jan A Stavenga; Wierstra, Ronny F A; Hermanussen, José
2006-03-01
Research on individual learning approaches (or learning styles) is split in two traditions, one of which is biased towards academic learning, and the other towards learning from direct experience. In the reported study, the two traditions are linked by investigating the relationships between school-based (academic) and work-based (experiential) learning approaches of students in vocational education programs. Participants were 899 students of a Dutch school for secondary vocational education; 758 provided data on school-based learning, and 407 provided data on work-based learning, resulting in an overlap of 266 students from whom data were obtained on learning in both settings. Learning approaches in school and work settings were measured with questionnaires. Using factor analysis and cluster analysis, items and students were grouped, both with respect to school- and work-based learning. The study identified two academic learning dimensions (constructive learning and reproductive learning), and three experiential learning dimensions (analysis, initiative, and immersion). Construction and analysis were correlated positively, and reproduction and initiative negatively. Cluster analysis resulted in the identification of three school-based learning orientations and three work-based learning orientations. The relation between the two types of learning orientations, expressed in Cramér's V, appeared to be weak. It is concluded that learning approaches are relatively context specific, which implies that neither theoretical tradition can claim general applicability.
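The weak association reported above is expressed as Cramér's V. A small sketch of how such a value can be computed from a cross-tabulation of the two orientation labels; the data frame and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical series
    (e.g., school-based vs work-based learning orientation labels)."""
    table = pd.crosstab(x, y).to_numpy()
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# v = cramers_v(df["school_orientation"], df["work_orientation"])
# Values near 0 indicate a weak association between the two sets of orientations.
```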
Network module detection: Affinity search technique with the multi-node topological overlap measure.
Li, Ai; Horvath, Steve
2009-07-20
Many clustering procedures only allow the user to input a pairwise dissimilarity or distance measure between objects. We propose a clustering method that can input a multi-point dissimilarity measure d(i1, i2, ..., iP) where the number of points P can be larger than 2. The work is motivated by gene network analysis where clusters correspond to modules of highly interconnected nodes. Here, we define modules as clusters of network nodes with high multi-node topological overlap. The topological overlap measure is a robust measure of interconnectedness which is based on shared network neighbors. In previous work, we have shown that the multi-node topological overlap measure yields biologically meaningful results when used as input of network neighborhood analysis. We adapt network neighborhood analysis for the use of module detection. We propose the Module Affinity Search Technique (MAST), which is a generalized version of the Cluster Affinity Search Technique (CAST). MAST can accommodate a multi-node dissimilarity measure. Clusters grow around user-defined or automatically chosen seeds (e.g. hub nodes). We propose both local and global cluster growth stopping rules. We use several simulations and a gene co-expression network application to argue that the MAST approach leads to biologically meaningful results. We compare MAST with hierarchical clustering and partitioning around medoid clustering. Our flexible module detection method is implemented in the MTOM software which can be downloaded from the following webpage: http://www.genetics.ucla.edu/labs/horvath/MTOM/
Hebbian self-organizing integrate-and-fire networks for data clustering.
Landis, Florian; Ott, Thomas; Stoop, Ruedi
2010-01-01
We propose a Hebbian learning-based data clustering algorithm using spiking neurons. The algorithm is capable of distinguishing between clusters and noisy background data and finds an arbitrary number of clusters of arbitrary shape. These properties render the approach particularly useful for visual scene segmentation into arbitrarily shaped homogeneous regions. We present several application examples, and in order to highlight the advantages and the weaknesses of our method, we systematically compare the results with those from standard methods such as k-means and Ward's linkage clustering. The analysis demonstrates that not only is the clustering ability of the proposed algorithm more powerful than that of the two competing methods, but its time complexity is also more modest than that of its strongest commonly used competitor.
Dynamic Trajectory Extraction from Stereo Vision Using Fuzzy Clustering
NASA Astrophysics Data System (ADS)
Onishi, Masaki; Yoda, Ikushi
In recent years, many human tracking methods have been proposed in order to analyze human dynamic trajectories. These are general-purpose technologies applicable to various fields, such as customer purchase analysis in a shopping environment and safety control at a (railroad) crossing. In this paper, we present a new approach for tracking human positions from stereo images. We use a framework of two-step clustering, with the k-means method and fuzzy clustering, to detect human regions. In the initial clustering step, the k-means method rapidly forms intermediate clusters from object features extracted by stereo vision. In the final clustering step, the fuzzy c-means method groups these intermediate clusters into human regions based on their attributes. Because fuzzy clustering expresses ambiguity explicitly, the proposed method clusters correctly even when many people are close to each other. The validity of the technique was evaluated by extracting the trajectories of doctors and nurses in a hospital emergency room.
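A minimal sketch of the two-step idea, with illustrative feature dimensions and cluster counts (not the paper's values); k-means compresses raw stereo-vision points into intermediate clusters, and a plain fuzzy c-means (written directly in NumPy here rather than using the authors' implementation) groups those intermediate clusters into person regions with partial memberships.

```python
import numpy as np
from sklearn.cluster import KMeans

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means: returns centers and the membership matrix U (n x c)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))           # random fuzzy memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)) *
                   np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
    return centers, U

# Step 1: k-means quickly compresses stereo-vision features into intermediate clusters.
features = np.random.rand(2000, 3)                       # placeholder (x, y, height) points
km = KMeans(n_clusters=40, n_init=10, random_state=0).fit(features)

# Step 2: fuzzy c-means groups the intermediate cluster centers into person regions,
# so ambiguous centers between two nearby people keep partial membership in both.
centers, U = fuzzy_cmeans(km.cluster_centers_, c=5)
person_of_center = U.argmax(axis=1)
```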
Stynes, Siobhán; Konstantinou, Kika; Ogollah, Reuben; Hay, Elaine M; Dunn, Kate M
2018-04-01
Traditionally, low back-related leg pain (LBLP) is diagnosed clinically as referred leg pain or sciatica (nerve root involvement). However, within the spectrum of LBLP, we hypothesised that there may be other unrecognised patient subgroups. This study aimed to identify clusters of patients with LBLP using latent class analysis and describe their clinical course. The study population was 609 LBLP primary care consulters. Variables from clinical assessment were included in the latent class analysis. Characteristics of the statistically identified clusters were compared, and their clinical course over 1 year was described. A five-cluster solution was optimal. Cluster 1 (n = 104) had mild leg pain severity and was considered to represent a referred leg pain group with no clinical signs suggesting nerve root involvement (sciatica). Cluster 2 (n = 122), cluster 3 (n = 188), and cluster 4 (n = 69) had mild, moderate, and severe pain and disability, respectively, and response to clinical assessment items suggested categories of mild, moderate, and severe sciatica. Cluster 5 (n = 126) had high pain and disability, longer pain duration, and more comorbidities and was difficult to map to a clinical diagnosis. Most improvement for pain and disability was seen in the first 4 months for all clusters. At 12 months, the proportion of patients reporting recovery ranged from 27% for cluster 5 to 45% for cluster 2 (mild sciatica). This is the first study that empirically shows the variability in profile and clinical course of patients with LBLP including sciatica. More homogeneous groups were identified, which could be considered in future clinical and research settings.
Cluster: A New Application for Spatial Analysis of Pixelated Data for Epiphytotics.
Nelson, Scot C; Corcoja, Iulian; Pethybridge, Sarah J
2017-12-01
Spatial analysis of epiphytotics is essential to develop and test hypotheses about pathogen ecology, disease dynamics, and to optimize plant disease management strategies. Data collection for spatial analysis requires substantial investment in time to depict patterns in various frames and hierarchies. We developed a new approach for spatial analysis of pixelated data in digital imagery and incorporated the method in a stand-alone desktop application called Cluster. The user isolates target entities (clusters) by designating up to 24 pixel colors as nontargets and moves a threshold slider to visualize the targets. The app calculates the percent area occupied by targeted pixels, identifies the centroids of targeted clusters, and computes the relative compass angle of orientation for each cluster. Users can deselect anomalous clusters manually and/or automatically by specifying a size threshold value to exclude smaller targets from the analysis. Up to 1,000 stochastic simulations randomly place the centroids of each cluster in ranked order of size (largest to smallest) within each matrix while preserving their calculated angles of orientation for the long axes. A two-tailed probability t test compares the mean inter-cluster distances for the observed versus the values derived from randomly simulated maps. This is the basis for statistical testing of the null hypothesis that the clusters are randomly distributed within the frame of interest. These frames can assume any shape, from natural (e.g., leaf) to arbitrary (e.g., a rectangular or polygonal field). Cluster summarizes normalized attributes of clusters, including pixel number, axis length, axis width, compass orientation, and the length/width ratio, available to the user as a downloadable spreadsheet. Each simulated map may be saved as an image and inspected. Provided examples demonstrate the utility of Cluster to analyze patterns at various spatial scales in plant pathology and ecology and highlight the limitations, trade-offs, and considerations for the sensitivities of variables and the biological interpretations of results. The Cluster app is available as a free download for Apple computers at iTunes, with a link to a user guide website.
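A small sketch of the randomization logic described above, under the simplifying assumption of a rectangular frame and uniformly placed centroids (the full app also preserves cluster sizes and orientations); the observed mean inter-centroid distance is compared against values from simulated random maps with a two-tailed one-sample t test. Function and variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import ttest_1samp

def randomization_test(centroids, frame_w, frame_h, n_sim=1000, seed=0):
    """Test whether cluster centroids are randomly distributed in a rectangular frame."""
    rng = np.random.default_rng(seed)
    observed = pdist(centroids).mean()                   # mean inter-cluster distance
    simulated = np.empty(n_sim)
    for i in range(n_sim):
        random_xy = rng.uniform([0, 0], [frame_w, frame_h], size=centroids.shape)
        simulated[i] = pdist(random_xy).mean()
    # Two-tailed test of the simulated means against the observed value.
    t_stat, p_value = ttest_1samp(simulated, observed)
    return observed, simulated.mean(), p_value

# obs, sim_mean, p = randomization_test(cluster_centroids, frame_w=1024, frame_h=768)
```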
O'Donovan, Clare B; Walsh, Marianne C; Woolhead, Clara; Forster, Hannah; Celis-Morales, Carlos; Fallaize, Rosalind; Macready, Anna L; Marsaux, Cyril F M; Navas-Carretero, Santiago; Rodrigo San-Cristobal, S; Kolossa, Silvia; Tsirigoti, Lydia; Mvrogianni, Christina; Lambrinou, Christina P; Moschonis, George; Godlewska, Magdalena; Surwillo, Agnieszka; Traczyk, Iwona; Drevon, Christian A; Daniel, Hannelore; Manios, Yannis; Martinez, J Alfredo; Saris, Wim H M; Lovegrove, Julie A; Mathers, John C; Gibney, Michael J; Gibney, Eileen R; Brennan, Lorraine
2017-10-01
Traditionally, personalised nutrition was delivered at an individual level. However, the concept of delivering tailored dietary advice at a group level through the identification of metabotypes or groups of metabolically similar individuals has emerged. Although this approach to personalised nutrition looks promising, further work is needed to examine this concept across a wider population group. Therefore, the objectives of this study are to: (1) identify metabotypes in a European population and (2) develop targeted dietary advice solutions for these metabotypes. Using data from the Food4Me study (n 1607), k-means cluster analysis revealed the presence of three metabolically distinct clusters based on twenty-seven metabolic markers including cholesterol, individual fatty acids and carotenoids. Cluster 2 was identified as a metabolically healthy metabotype as these individuals had the highest Omega-3 Index (6·56 (sd 1·29) %), carotenoids (2·15 (sd 0·71) µm) and lowest total saturated fat levels. On the basis of its fatty acid profile, cluster 1 was characterised as a metabolically unhealthy cluster. Targeted dietary advice solutions were developed per cluster using a decision tree approach. Testing of the approach was performed by comparison with the personalised dietary advice, delivered by nutritionists to Food4Me study participants (n 180). Excellent agreement was observed between the targeted and individualised approaches with an average match of 82 % at the level of delivery of the same dietary message. Future work should ascertain whether this proposed method could be utilised in a healthcare setting, for the rapid and efficient delivery of tailored dietary advice solutions.
Automated modal parameter estimation using correlation analysis and bootstrap sampling
NASA Astrophysics Data System (ADS)
Yaghoubi, Vahid; Vakilzadeh, Majid K.; Abrahamsson, Thomas J. S.
2018-02-01
The estimation of modal parameters from a set of noisy measured data is a highly judgmental task, with user expertise playing a significant role in distinguishing between estimated physical and noise modes of a test-piece. Various methods have been developed to automate this procedure. The common approach is to identify models with different orders and cluster similar modes together. However, most proposed methods based on this approach suffer from high-dimensional optimization problems in either the estimation or clustering step. To overcome this problem, this study presents an algorithm for autonomous modal parameter estimation in which the only required optimization is performed in a three-dimensional space. To this end, a subspace-based identification method is employed for the estimation and a non-iterative correlation-based method is used for the clustering. This clustering is at the heart of the paper. The keys to success are correlation metrics that are able to treat the problems of spatial eigenvector aliasing and nonunique eigenvectors of coalescent modes simultaneously. The algorithm commences by the identification of an excessively high-order model from frequency response function test data. The high number of modes of this model provides bases for two subspaces: one for likely physical modes of the tested system and one for its complement dubbed the subspace of noise modes. By employing the bootstrap resampling technique, several subsets are generated from the same basic dataset and for each of them a model is identified to form a set of models. Then, by correlation analysis with the two aforementioned subspaces, highly correlated modes of these models which appear repeatedly are clustered together and the noise modes are collected in a so-called Trashbox cluster. Stray noise modes attracted to the mode clusters are trimmed away in a second step by correlation analysis. The final step of the algorithm is a fuzzy c-means clustering procedure applied to a three-dimensional feature space to assign a degree of physicalness to each cluster. The proposed algorithm is applied to two case studies: one with synthetic data and one with real test data obtained from a hammer impact test. The results indicate that the algorithm successfully clusters similar modes and gives a reasonable quantification of the extent to which each cluster is physical.
Clustering stock market companies via chaotic map synchronization
NASA Astrophysics Data System (ADS)
Basalto, N.; Bellotti, R.; De Carlo, F.; Facchi, P.; Pascazio, S.
2005-01-01
A pairwise clustering approach is applied to the analysis of the Dow Jones index companies, in order to identify similar temporal behavior of the traded stock prices. To this end, the chaotic map clustering algorithm is used, in which a map is associated with each company and the correlation coefficients of the financial time series are mapped to the coupling strengths between maps. The simulation of the chaotic map dynamics gives rise to a natural partition of the data, with companies belonging to the same industrial branch often grouped together. The identification of clusters of companies within a given stock market index can be exploited in portfolio optimization strategies.
A comparison of heuristic and model-based clustering methods for dietary pattern analysis.
Greve, Benjamin; Pigeot, Iris; Huybrechts, Inge; Pala, Valeria; Börnhorst, Claudia
2016-02-01
Cluster analysis is widely applied to identify dietary patterns. A new method based on Gaussian mixture models (GMM) seems to be more flexible compared with the commonly applied k-means and Ward's method. In the present paper, these clustering approaches are compared to find the most appropriate one for clustering dietary data. The clustering methods were applied to simulated data sets with different cluster structures to compare their performance knowing the true cluster membership of observations. Furthermore, the three methods were applied to FFQ data assessed in 1791 children participating in the IDEFICS (Identification and Prevention of Dietary- and Lifestyle-Induced Health Effects in Children and Infants) Study to explore their performance in practice. The GMM outperformed the other methods in the simulation study in 72-100 % of cases, depending on the simulated cluster structure. Comparing the computationally less complex k-means and Ward's methods, the performance of k-means was better in 64-100 % of cases. Applied to real data, all methods identified three similar dietary patterns which may be roughly characterized as a 'non-processed' cluster with a high consumption of fruits, vegetables and wholemeal bread, a 'balanced' cluster with only slight preferences for single foods and a 'junk food' cluster. The simulation study suggests that clustering via GMM should be preferred due to its higher flexibility regarding cluster volume, shape and orientation. The k-means seems to be a good alternative, being easier to use while giving similar results when applied to real data.
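A minimal sketch of this kind of comparison on simulated data with known membership, using scikit-learn; the agreement measure (adjusted Rand index) and the artificial stretching of the clusters are illustrative choices, not the paper's exact simulation design.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Simulate elongated clusters, where flexible covariances should favour the GMM.
X, truth = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=1)
X = X @ np.array([[3.0, 0.0], [0.0, 0.5]])             # stretch to break sphericity

labels = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "Ward": AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X),
    "GMM": GaussianMixture(n_components=3, covariance_type="full",
                           random_state=0).fit_predict(X),
}
for name, lab in labels.items():
    print(f"{name}: ARI = {adjusted_rand_score(truth, lab):.2f}")
```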
Automated flow cytometric analysis across large numbers of samples and cell types.
Chen, Xiaoyi; Hasan, Milena; Libri, Valentina; Urrutia, Alejandra; Beitz, Benoît; Rouilly, Vincent; Duffy, Darragh; Patin, Étienne; Chalmond, Bernard; Rogge, Lars; Quintana-Murci, Lluis; Albert, Matthew L; Schwikowski, Benno
2015-04-01
Multi-parametric flow cytometry is a key technology for characterization of immune cell phenotypes. However, robust high-dimensional post-analytic strategies for automated data analysis in large numbers of donors are still lacking. Here, we report a computational pipeline, called FlowGM, which minimizes operator input, is insensitive to compensation settings, and can be adapted to different analytic panels. A Gaussian Mixture Model (GMM)-based approach was utilized for initial clustering, with the number of clusters determined using Bayesian Information Criterion. Meta-clustering in a reference donor permitted automated identification of 24 cell types across four panels. Cluster labels were integrated into FCS files, thus permitting comparisons to manual gating. Cell numbers and coefficient of variation (CV) were similar between FlowGM and conventional gating for lymphocyte populations, but notably FlowGM provided improved discrimination of "hard-to-gate" monocyte and dendritic cell (DC) subsets. FlowGM thus provides rapid high-dimensional analysis of cell phenotypes and is amenable to cohort studies. Copyright © 2015. Published by Elsevier Inc.
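A small sketch of the model-selection step described above: Gaussian mixtures are fitted over a range of component counts and the model with the lowest Bayesian Information Criterion is kept. The input file and column layout are hypothetical placeholders, not part of the FlowGM pipeline.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_bic(X, k_range=range(2, 16), seed=0):
    """Fit GMMs over a range of cluster counts and keep the lowest-BIC model."""
    best_model, best_bic = None, np.inf
    for k in k_range:
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              n_init=3, random_state=seed).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model

# events = np.loadtxt("panel1_compensated.csv", delimiter=",")   # hypothetical input
# model = fit_gmm_by_bic(events)
# cluster_labels = model.predict(events)
```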
Sauzet, Odile; Peacock, Janet L
2017-07-20
The analysis of perinatal outcomes often involves datasets with some multiple births. These are datasets mostly formed of independent observations and a limited number of clusters of size two (twins) and maybe of size three or more. This non-independence needs to be accounted for in the statistical analysis. Using simulated data based on a dataset of preterm infants we have previously investigated the performance of several approaches to the analysis of continuous outcomes in the presence of some clusters of size two. Mixed models have been developed for binomial outcomes but very little is known about their reliability when only a limited number of small clusters are present. Using simulated data based on a dataset of preterm infants we investigated the performance of several approaches to the analysis of binomial outcomes in the presence of some clusters of size two. Logistic models, several methods of estimation for the logistic random intercept models and generalised estimating equations were compared. The presence of even a small percentage of twins means that a logistic regression model will underestimate all parameters but a logistic random intercept model fails to estimate the correlation between siblings if the percentage of twins is too small and will provide similar estimates to logistic regression. The method which seems to provide the best balance between estimation of the standard error and the parameter for any percentage of twins is the generalised estimating equations. This study has shown that the number of covariates or the level two variance do not necessarily affect the performance of the various methods used to analyse datasets containing twins but when the percentage of small clusters is too small, mixed models cannot capture the dependence between siblings.
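A sketch of the analysis the study favours, a logistic GEE with an exchangeable working correlation for siblings, written with statsmodels; the data frame and column names (outcome, exposure, gestational_age, mother_id) are hypothetical.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_gee(df):
    """Logistic GEE with an exchangeable working correlation for twin clusters.
    df: one row per infant, with a binary outcome and a mother identifier
    shared by twins (hypothetical column names)."""
    model = smf.gee("outcome ~ exposure + gestational_age",
                    groups="mother_id",
                    data=df,
                    family=sm.families.Binomial(),
                    cov_struct=sm.cov_struct.Exchangeable())
    return model.fit()

# res = fit_gee(preterm_df)
# print(res.summary())   # robust standard errors account for the twin clusters
```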
Oberle, Michael; Wohlwend, Nadia; Jonas, Daniel; Maurer, Florian P; Jost, Geraldine; Tschudin-Sutter, Sarah; Vranckx, Katleen; Egli, Adrian
2016-01-01
The technical, biological, and inter-center reproducibility of matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI TOF MS) typing data has not yet been explored. The aim of this study is to compare typing data from multiple centers employing bioinformatics using bacterial strains from two past outbreaks and non-related strains. Participants received twelve extended-spectrum beta-lactamase-producing E. coli isolates and followed the same standard operating procedure (SOP) including a full-protein extraction protocol. All laboratories provided visually read spectra via flexAnalysis (Bruker, Germany). Raw data from each laboratory allowed calculation of the technical and biological reproducibility between centers using BioNumerics (Applied Maths NV, Belgium). Technical and biological reproducibility ranged from 96.8 to 99.4% and from 47.6 to 94.4%, respectively. The inter-center reproducibility showed comparable clustering among identical isolates. Principal component analysis indicated a higher tendency to cluster within the same center. Therefore, we used a discriminant analysis, which completely separated the clusters. Next, we defined a reference center and performed a statistical analysis to identify specific peaks that distinguish the outbreak clusters. Finally, we used a classifier algorithm, a linear support vector machine, on the determined peaks. A validation showed that within the set of the reference center, the identification of the cluster was 100% correct with a large contrast between the score with the correct cluster and the next best scoring cluster. Based on the sufficient technical and biological reproducibility of MALDI-TOF MS-based spectra, detection of specific clusters is possible from spectra obtained from different centers. However, we believe that a shared SOP and a bioinformatics approach are required to make the analysis robust and reliable.
NASA Astrophysics Data System (ADS)
Lyakh, Dmitry I.
2018-03-01
A novel reduced-scaling, general-order coupled-cluster approach is formulated by exploiting hierarchical representations of many-body tensors, combined with the recently suggested formalism of scale-adaptive tensor algebra. Inspired by the hierarchical techniques from the renormalisation group approach, H/H2-matrix algebra and fast multipole method, the computational scaling reduction in our formalism is achieved via coarsening of quantum many-body interactions at larger interaction scales, thus imposing a hierarchical structure on many-body tensors of coupled-cluster theory. In our approach, the interaction scale can be defined on any appropriate Euclidean domain (spatial domain, momentum-space domain, energy domain, etc.). We show that the hierarchically resolved many-body tensors can reduce the storage requirements to O(N), where N is the number of simulated quantum particles. Subsequently, we prove that any connected many-body diagram consisting of a finite number of arbitrary-order tensors, e.g. an arbitrary coupled-cluster diagram, can be evaluated in O(N log N) floating-point operations. On top of that, we suggest an additional approximation to further reduce the computational complexity of higher order coupled-cluster equations, i.e. equations involving higher than double excitations, which otherwise would introduce a large prefactor into formal O(N log N) scaling.
Clustering analysis of line indices for LAMOST spectra with AstroStat
NASA Astrophysics Data System (ADS)
Chen, Shu-Xin; Sun, Wei-Min; Yan, Qi
2018-06-01
The application of data mining in astronomical surveys, such as the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) survey, provides an effective approach to automatically analyze a large amount of complex survey data. Unsupervised clustering could help astronomers find the associations and outliers in a big data set. In this paper, we employ the k-means method to perform clustering of line indices of LAMOST spectra with the powerful software AstroStat. The line index approach is an effective way to extract spectral features from low-resolution astronomical spectra and can represent the main spectral characteristics of stars. A total of 144 340 line indices for A-type stars is analyzed by calculating intra-class and inter-class distances between pairs of stars. For the intra-class distance, we use the Mahalanobis distance to explore the degree of clustering for each class, while for outlier detection we define a local outlier factor for each spectrum. AstroStat furnishes a set of visualization tools for illustrating the analysis results. Checking the spectra detected as outliers, we find that most of them are problematic data and only a few correspond to rare astronomical objects. We show two examples of these outliers: a spectrum with an abnormal continuum and a spectrum with emission lines. Our work demonstrates that line index clustering is a good method for examining data quality and identifying rare objects.
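A minimal sketch of this kind of pipeline, assuming a matrix of line indices per spectrum; k-means provides the clustering and scikit-learn's LocalOutlierFactor is used as a stand-in for the per-spectrum local outlier factor defined in the paper. Array names and parameter values are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

def cluster_line_indices(line_indices, n_clusters=4):
    """Cluster spectral line indices and flag spectra that look like outliers."""
    X = StandardScaler().fit_transform(line_indices)

    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)

    # LOF as a stand-in for the paper's per-spectrum local outlier factor.
    lof = LocalOutlierFactor(n_neighbors=50)
    is_outlier = lof.fit_predict(X) == -1                # -1 marks likely outliers
    return labels, is_outlier, -lof.negative_outlier_factor_

# labels, flags, scores = cluster_line_indices(index_matrix)
# index_matrix: (n_spectra, n_indices) array of measured line indices
```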
Person mobility in the design and analysis of cluster-randomized cohort prevention trials.
Vuchinich, Sam; Flay, Brian R; Aber, Lawrence; Bickman, Leonard
2012-06-01
Person mobility is an inescapable fact of life for most cluster-randomized (e.g., schools, hospitals, clinic, cities, state) cohort prevention trials. Mobility rates are an important substantive consideration in estimating the effects of an intervention. In cluster-randomized trials, mobility rates are often correlated with ethnicity, poverty and other variables associated with disparity. This raises the possibility that estimated intervention effects may generalize to only the least mobile segments of a population and, thus, create a threat to external validity. Such mobility can also create threats to the internal validity of conclusions from randomized trials. Researchers must decide how to deal with persons who leave study clusters during a trial (dropouts), persons and clusters that do not comply with an assigned intervention, and persons who enter clusters during a trial (late entrants), in addition to the persons who remain for the duration of a trial (stayers). Statistical techniques alone cannot solve the key issues of internal and external validity raised by the phenomenon of person mobility. This commentary presents a systematic, Campbellian-type analysis of person mobility in cluster-randomized cohort prevention trials. It describes four approaches for dealing with dropouts, late entrants and stayers with respect to data collection, analysis and generalizability. The questions at issue are: 1) From whom should data be collected at each wave of data collection? 2) Which cases should be included in the analyses of an intervention effect? and 3) To what populations can trial results be generalized? The conclusions lead to recommendations for the design and analysis of future cluster-randomized cohort prevention trials.
Andridge, Rebecca. R.
2011-01-01
In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. Resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established analysis methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller ICCs lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random (MCAR), and cases in which data are missing at random (MAR) are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared. PMID:21259309
Naegle, Kristen M; Welsch, Roy E; Yaffe, Michael B; White, Forest M; Lauffenburger, Douglas A
2011-07-01
Advances in proteomic technologies continue to substantially accelerate capability for generating experimental data on protein levels, states, and activities in biological samples. For example, studies on receptor tyrosine kinase signaling networks can now capture the phosphorylation state of hundreds to thousands of proteins across multiple conditions. However, little is known about the function of many of these protein modifications, or the enzymes responsible for modifying them. To address this challenge, we have developed an approach that enhances the power of clustering techniques to infer functional and regulatory meaning of protein states in cell signaling networks. We have created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering. Multiple clustering analysis methodology ('MCAM') employs an array of diverse data transformations, distance metrics, set sizes, and clustering algorithms, in a combinatorial fashion, to create a suite of clustering sets. These sets are then evaluated based on their ability to produce biological insights through statistical enrichment of metadata relating to knowledge concerning protein functions, kinase substrates, and sequence motifs. We applied MCAM to a set of dynamic phosphorylation measurements of the ERBB network to explore the relationships between algorithmic parameters and the biological meaning that could be inferred, and we report on interesting biological predictions. Further, we applied MCAM to multiple phosphoproteomic datasets for the ERBB network, which allowed us to compare independent and incomplete overlapping measurements of phosphorylation sites in the network. We report specific and global differences of the ERBB network stimulated with different ligands and with changes in HER2 expression. Overall, we offer MCAM as a broadly applicable approach for analysis of proteomic data which may help increase the current understanding of molecular networks in a variety of biological problems. © 2011 Naegle et al.
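A much-simplified sketch of the combinatorial idea behind MCAM: sweep data transformations, algorithms, and set sizes, then score each resulting clustering. The silhouette score is used here purely as a generic stand-in for the metadata-enrichment scoring the framework actually performs, and the transformations and ranges are illustrative.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

def mcam_like_sweep(X):
    """Generate a suite of clusterings over combinatorial parameter choices."""
    transforms = {"raw": lambda A: A,
                  "log": lambda A: np.log1p(np.abs(A)),
                  "zscore": lambda A: (A - A.mean(0)) / (A.std(0) + 1e-9)}
    algorithms = {"kmeans": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
                  "ward": lambda k: AgglomerativeClustering(n_clusters=k)}
    results = []
    for (t_name, transform), (a_name, make), k in itertools.product(
            transforms.items(), algorithms.items(), range(3, 9)):
        Xt = transform(X)
        labels = make(k).fit_predict(Xt)
        # Stand-in score; MCAM instead scores statistical enrichment of metadata
        # such as protein function, kinase substrates, and sequence motifs.
        score = silhouette_score(Xt, labels)
        results.append({"transform": t_name, "algorithm": a_name, "k": k,
                        "score": score, "labels": labels})
    return sorted(results, key=lambda r: r["score"], reverse=True)

# best = mcam_like_sweep(phospho_matrix)[0]   # phospho_matrix: sites x conditions
```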
Rhodes, Scott D.; McCoy, Thomas P.
2014-01-01
This study explored correlates of condom use within a respondent-driven sample of 190 Spanish-speaking immigrant Latino sexual minorities, including gay and bisexual men, other men who have sex with men (MSM), and transgender persons, in North Carolina. Five analytic approaches for modeling data collected using respondent-driven sampling (RDS) were compared. Across most approaches, knowledge of HIV and sexually transmitted infections (STIs) and increased condom use self-efficacy predicted consistent condom use, and increased homophobia predicted decreased consistent condom use. The same correlates were not significant in all analyses but were consistent in most. Clustering due to recruitment chains was low, while clustering due to recruiter was substantial. This highlights the importance of accounting for clustering when analyzing RDS data. PMID:25646728
Hierarchical Spatio-temporal Visual Analysis of Cluster Evolution in Electrocorticography Data
Murugesan, Sugeerth; Bouchard, Kristofer; Chang, Edward; ...
2016-10-02
Here, we present ECoG ClusterFlow, a novel interactive visual analysis tool for the exploration of high-resolution Electrocorticography (ECoG) data. Our system detects and visualizes dynamic high-level structures, such as communities, using the time-varying spatial connectivity network derived from the high-resolution ECoG data. ECoG ClusterFlow provides a multi-scale visualization of the spatio-temporal patterns underlying the time-varying communities using two views: 1) an overview summarizing the evolution of clusters over time and 2) a hierarchical glyph-based technique that uses data aggregation and small multiples techniques to visualize the propagation of clusters in their spatial domain. ECoG ClusterFlow makes it possible 1) to compare the spatio-temporal evolution patterns across various time intervals, 2) to compare the temporal information at varying levels of granularity, and 3) to investigate the evolution of spatial patterns without occluding the spatial context information. Lastly, we present case studies done in collaboration with neuroscientists on our team for both simulated and real epileptic seizure data aimed at evaluating the effectiveness of our approach.
Hyde, J M; Cerezo, A; Williams, T J
2009-04-01
Statistical analysis of atom probe data has improved dramatically in the last decade and it is now possible to determine the size, the number density and the composition of individual clusters or precipitates such as those formed in reactor pressure vessel (RPV) steels during irradiation. However, the characterisation of the onset of clustering or co-segregation is more difficult and has traditionally focused on the use of composition frequency distributions (for detecting clustering) and contingency tables (for detecting co-segregation). In this work, the authors investigate the possibility of directly examining the neighbourhood of each individual solute atom as a means of identifying the onset of solute clustering and/or co-segregation. The methodology involves comparing the mean observed composition around a particular type of solute with that expected from the overall composition of the material. The methodology has been applied to atom probe data obtained from several irradiated RPV steels. The results show that the new approach is more sensitive to fine scale clustering and co-segregation than that achievable using composition frequency distribution and contingency table analyses.
Gifford, Elizabeth V; Tavakoli, Sara; Weingardt, Kenneth R; Finney, John W; Pierson, Heather M; Rosen, Craig S; Hagedorn, Hildi J; Cook, Joan M; Curran, Geoff M
2012-01-01
Evidence-based psychological treatments (EBPTs) are clusters of interventions, but it is unclear how providers actually implement these clusters in practice. A disaggregated measure of EBPTs was developed to characterize clinicians' component-level evidence-based practices and to examine relationships among these practices. Survey items captured components of evidence-based treatments based on treatment integrity measures. The Web-based survey was conducted with 75 U.S. Department of Veterans Affairs (VA) substance use disorder (SUD) practitioners and 149 non-VA community-based SUD practitioners. Clinician's self-designated treatment orientations were positively related to their endorsement of those EBPT components; however, clinicians used components from a variety of EBPTs. Hierarchical cluster analysis indicated that clinicians combined and organized interventions from cognitive-behavioral therapy, the community reinforcement approach, motivational interviewing, structured family and couples therapy, 12-step facilitation, and contingency management into clusters including empathy and support, treatment engagement and activation, abstinence initiation, and recovery maintenance. Understanding how clinicians use EBPT components may lead to improved evidence-based practice dissemination and implementation. Published by Elsevier Inc.
Ajayi, Alex A; Syed, Moin
2014-10-01
This study used a person-oriented analytic approach to identify meaningful patterns of barriers-focused racial socialization and perceived racial discrimination experiences in a sample of 295 late adolescents. Using cluster analysis, three distinct groups were identified: Low Barrier Socialization-Low Discrimination, High Barrier Socialization-Low Discrimination, and High Barrier Socialization-High Discrimination clusters. These groups were substantively distinct in terms of the frequency of racial socialization messages about bias preparation and out-group mistrust that their members received, as well as their actual perceived discrimination experiences. Further, individuals in the High Barrier Socialization-High Discrimination cluster reported significantly higher depressive symptoms than those in the Low Barrier Socialization-Low Discrimination and High Barrier Socialization-Low Discrimination clusters. However, no differences in adjustment were observed between the Low Barrier Socialization-Low Discrimination and High Barrier Socialization-Low Discrimination clusters. Overall, the findings highlight important individual differences in how young people of color experience their race and how these differences have significant implications for psychological adjustment. Copyright © 2014 The Foundation for Professionals in Services for Adolescents. Published by Elsevier Ltd. All rights reserved.
AN AGGREGATION AND EPISODE SELECTION SCHEME FOR EPA'S MODELS-3 CMAQ
The development of an episode selection and aggregation approach, designed to support distributional estimation for use with the Models-3 Community Multiscale Air Quality (CMAQ) model, is described. The approach utilized cluster analysis of the 700 hPa u and v wind field compo...
Nemmi, Federico; Saint-Aubert, Laure; Adel, Djilali; Salabert, Anne-Sophie; Pariente, Jérémie; Barbeau, Emmanuel; Payoux, Pierre; Péran, Patrice
2014-01-01
Purpose AV-45 amyloid biomarker is known to show uptake in white matter in patients with Alzheimer’s disease (AD) but also in the healthy population. This binding, thought to be of a non-specific lipophilic nature, has not yet been investigated. The aim of this study was to determine the differential pattern of AV-45 binding in white matter in healthy and pathological populations. Methods We recruited 24 patients presenting with AD at an early stage and 17 matched, healthy subjects. We used an optimized PET-MRI registration method and an approach based on intensity histograms using several indexes. We compared the results of the intensity histogram analyses with a more canonical approach based on target-to-cerebellum Standard Uptake Value (SUVr) in white and grey matter using MANOVA and discriminant analyses. A cluster analysis on white and grey matter histograms was also performed. Results White matter histogram analysis revealed significant differences between AD and healthy subjects, which were not revealed by the SUVr analysis. However, white matter histograms were not decisive for discriminating the groups, and indexes based on grey matter alone showed better discriminative power than SUVr. The cluster analysis divided our sample into two clusters, showing different uptake in grey as well as white matter. Conclusion These results demonstrate that AV-45 binding in white matter conveys subtle information not detectable using the SUVr approach. Although it is not better than standard SUVr at discriminating AD patients from healthy subjects, this information could reveal white matter modifications. PMID:24573658
Yoo, Illhoi; Hu, Xiaohua; Song, Il-Yeol
2007-11-27
A huge amount of biomedical textual information has been produced and collected in MEDLINE for decades. In order to easily utilize the biomedical information in this free text, document clustering and text summarization together are used as a solution to the text information overload problem. In this paper, we introduce a coherent graph-based semantic clustering and summarization approach for biomedical literature. Our extensive experimental results show that the approach achieves a 45% improvement in cluster quality and a 72% improvement in clustering reliability, in terms of misclassification index, over Bisecting K-means, a leading document clustering approach. In addition, our approach provides a concise but rich text summary in key concepts and sentences. Our coherent biomedical literature clustering and summarization approach, which takes advantage of ontology-enriched graphical representations, significantly improves the quality of document clusters and the understandability of documents through summaries.
NASA Astrophysics Data System (ADS)
Arimbi, Mentari Dian; Bustamam, Alhadi; Lestari, Dian
2017-03-01
Data clustering can be carried out with partitioning or hierarchical methods for many types of data, including DNA sequences. The two approaches can be combined by running a partitioning algorithm at the first level and a hierarchical algorithm at the second level, which is called hybrid clustering. In the partitioning phase, popular methods such as PAM, k-means, or fuzzy c-means can be applied. In this study we selected partitioning around medoids (PAM) for the partitioning stage. Following the partitioning stage, we applied the divisive analysis algorithm (DIANA) in the hierarchical stage in order to obtain more specific cluster and sub-cluster structures. The number of main clusters is determined using the Davies-Bouldin Index (DBI); the optimal number of clusters is the one that minimizes the DBI value. In this work, we conduct the clustering on 1252 HPV DNA sequences from GenBank. Feature extraction is performed first, followed by normalization and genetic distance calculation using the Euclidean distance. The hybrid PAM and DIANA procedure was implemented using the open-source R programming tool. Using PAM in the first stage, we obtained 3 main clusters with an average DBI value of 0.979. After executing DIANA in the second stage, we obtained 4 sub-clusters for Cluster-1, 9 sub-clusters for Cluster-2, and 2 sub-clusters for Cluster-3, with DBI values of 0.972, 0.771, and 0.768 for the three main clusters, respectively. Since the second stage produces lower DBI values than the first stage, we conclude that this hybrid approach can improve the accuracy of our clustering results.
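A minimal sketch of the hybrid idea under stated assumptions: KMedoids from the third-party scikit-learn-extra package stands in for PAM, and an agglomerative clusterer stands in for DIANA in the second stage (DIANA is divisive; R's cluster::diana is the faithful choice). The Davies-Bouldin index selects the number of clusters at each stage. Parameter ranges are illustrative.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import AgglomerativeClustering
from sklearn_extra.cluster import KMedoids            # assumed: scikit-learn-extra

def hybrid_clustering(X, k_range=range(2, 8), sub_range=range(2, 10)):
    """PAM-style first stage chosen by DBI, then a hierarchical second stage per cluster."""
    # Stage 1: keep the partitioning whose number of main clusters minimises DBI.
    best = min((KMedoids(n_clusters=k, random_state=0).fit(X) for k in k_range),
               key=lambda m: davies_bouldin_score(X, m.labels_))
    main_labels = best.labels_

    # Stage 2: refine each main cluster separately (agglomerative as a DIANA stand-in).
    sub_labels = {}
    for c in np.unique(main_labels):
        members = X[main_labels == c]
        candidates = [AgglomerativeClustering(n_clusters=k).fit_predict(members)
                      for k in sub_range if k < len(members)]
        if candidates:
            sub_labels[c] = min(candidates,
                                key=lambda lab: davies_bouldin_score(members, lab))
    return main_labels, sub_labels

# main, subs = hybrid_clustering(feature_matrix)   # feature_matrix: sequences x features
```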
Batke, Monika; Gütlein, Martin; Partosch, Falko; Gundert-Remy, Ursula; Helma, Christoph; Kramer, Stefan; Maunz, Andreas; Seeland, Madeleine; Bitsch, Annette
2016-01-01
Interest is increasing in the development of non-animal methods for toxicological evaluations. These methods are, however, particularly challenging for complex toxicological endpoints such as repeated dose toxicity. European legislation, e.g., the European Union's Cosmetic Directive and REACH, demands the use of alternative methods. Frameworks, such as the Read-across Assessment Framework or the Adverse Outcome Pathway Knowledge Base, support the development of these methods. The aim of the project presented in this publication was to develop substance categories for a read-across with complex endpoints of toxicity based on existing databases. The basic conceptual approach was to combine structural similarity with shared mechanisms of action. Substances with similar chemical structure and toxicological profile form candidate categories suitable for read-across. We combined two databases on repeated dose toxicity, the RepDose database and the ELINCS database, to form a common database for the identification of categories. The resulting database contained physicochemical, structural, and toxicological data, which were refined and curated for cluster analyses. We applied the Predictive Clustering Tree (PCT) approach for clustering chemicals based on structural and on toxicological information to detect groups of chemicals with similar toxic profiles and pathways/mechanisms of toxicity. As many of the experimental toxicity values were not available, these data were imputed by predicting them with a multi-label classification method prior to clustering. The clustering results were evaluated by assessing chemical and toxicological similarities with the aim of identifying clusters with a concordance between structural information and toxicity profiles/mechanisms. From the chosen clusters, seven were selected for a quantitative read-across, based on a small ratio (< 5) between the highest and lowest NOAEL among cluster members. We discuss the limitations of the approach. Based on this analysis we propose improvements for a follow-up approach, such as incorporation of metabolic information and more detailed mechanistic information. The software enables the user to allocate a substance to a cluster and to use this information for a possible read-across. The clustering tool is provided as a free web service, accessible at http://mlc-reach.informatik.uni-mainz.de.
MODEL-FREE MULTI-PROBE LENSING RECONSTRUCTION OF CLUSTER MASS PROFILES
DOE Office of Scientific and Technical Information (OSTI.GOV)
Umetsu, Keiichi
2013-05-20
Lens magnification by galaxy clusters induces characteristic spatial variations in the number counts of background sources, amplifying their observed fluxes and expanding the area of sky, the net effect of which, known as magnification bias, depends on the intrinsic faint-end slope of the source luminosity function. The bias is strongly negative for red galaxies, dominated by the geometric area distortion, whereas it is mildly positive for blue galaxies, enhancing the blue counts toward the cluster center. We generalize the Bayesian approach of Umetsu et al. for reconstructing projected cluster mass profiles, by incorporating multiple populations of background sources for magnification-bias measurements and combining them with complementary lens-distortion measurements, effectively breaking the mass-sheet degeneracy and improving the statistical precision of cluster mass measurements. The approach can be further extended to include strong-lensing projected mass estimates, thus allowing for non-parametric absolute mass determinations in both the weak and strong regimes. We apply this method to our recent CLASH lensing measurements of MACS J1206.2-0847, and demonstrate how combining multi-probe lensing constraints can improve the reconstruction of cluster mass profiles. This method will also be useful for a stacked lensing analysis, combining all lensing-related effects in the cluster regime, for a definitive determination of the averaged mass profile.
Application of microarray analysis on computer cluster and cloud platforms.
Bernau, C; Boulesteix, A-L; Knaus, J
2013-01-01
Analysis of recent high-dimensional biological data tends to be computationally intensive as many common approaches such as resampling or permutation tests require the basic statistical analysis to be repeated many times. A crucial advantage of these methods is that they can be easily parallelized due to the computational independence of the resampling or permutation iterations, which has induced many statistics departments to establish their own computer clusters. An alternative is to rent computing resources in the cloud, e.g. at Amazon Web Services. In this article we analyze whether a selection of statistical projects, recently implemented at our department, can be efficiently realized on these cloud resources. Moreover, we illustrate an opportunity to combine computer cluster and cloud resources. In order to compare the efficiency of computer cluster and cloud implementations and their respective parallelizations we use microarray analysis procedures and compare their runtimes on the different platforms. Amazon Web Services provide various instance types which meet the particular needs of the different statistical projects we analyzed in this paper. Moreover, the network capacity is sufficient and the parallelization is comparable in efficiency to standard computer cluster implementations. Our results suggest that many statistical projects can be efficiently realized on cloud resources. It is important to mention, however, that workflows can change substantially as a result of a shift from computer cluster to cloud computing.
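The parallelization argument above rests on the independence of resampling or permutation iterations. A minimal sketch of that idea using a two-group permutation test with Python's multiprocessing as a stand-in backend; the same pattern maps directly onto cluster or cloud workers. Variable names and the choice of test statistic are illustrative.

```python
import numpy as np
from multiprocessing import Pool

def one_permutation(args):
    """One independent permutation: shuffle group labels and recompute the statistic."""
    expression, labels, seed = args
    rng = np.random.default_rng(seed)
    perm = rng.permutation(labels)
    return np.abs(expression[perm == 1].mean(0) - expression[perm == 0].mean(0)).max()

def permutation_test(expression, labels, n_perm=1000, workers=8):
    observed = np.abs(expression[labels == 1].mean(0)
                      - expression[labels == 0].mean(0)).max()
    tasks = [(expression, labels, seed) for seed in range(n_perm)]
    with Pool(workers) as pool:                  # iterations are independent, so they
        null = np.array(pool.map(one_permutation, tasks))   # distribute trivially
    return (null >= observed).mean()             # permutation p-value

# On Windows/macOS, call this under an `if __name__ == "__main__":` guard.
# p = permutation_test(expr_matrix, group_labels)   # expr_matrix: samples x genes
```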
Cluster Analysis of Rat Olfactory Bulb Responses to Diverse Odorants
Falasconi, Matteo; Leon, Michael; Johnson, Brett A.; Marco, Santiago
2012-01-01
In an effort to deepen our understanding of mammalian olfactory coding, we have used an objective method to analyze a large set of odorant-evoked activity maps collected systematically across the rat olfactory bulb to determine whether such an approach could identify specific glomerular regions that are activated by related odorants. To that end, we combined fuzzy c-means clustering methods with a novel validity approach based on cluster stability to evaluate the significance of the fuzzy partitions on a data set of glomerular layer responses to a large diverse group of odorants. Our results confirm the existence of glomerular response clusters to similar odorants. They further indicate a partial hierarchical chemotopic organization wherein larger glomerular regions can be subdivided into smaller areas that are rather specific in their responses to particular functional groups of odorants. These clusters bear many similarities to, as well as some differences from, response domains previously proposed for the glomerular layer of the bulb. These data also provide additional support for the concept of an identity code in the mammalian olfactory system. PMID:22459165
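A simplified sketch of stability-based validity in the spirit described above: recluster bootstrap subsamples and measure agreement on the shared items, here with the adjusted Rand index and k-means for brevity where the study uses fuzzy c-means. All parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_score(X, k, n_boot=20, frac=0.8, seed=0):
    """Average pairwise agreement of clusterings over bootstrap subsamples."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
        runs.append((idx, labels))
    scores = []
    for i in range(len(runs)):
        for j in range(i + 1, len(runs)):
            # Compare the two clusterings only on the items they share.
            shared, a_pos, b_pos = np.intersect1d(runs[i][0], runs[j][0],
                                                  return_indices=True)
            if len(shared) > 1:
                scores.append(adjusted_rand_score(runs[i][1][a_pos], runs[j][1][b_pos]))
    return float(np.mean(scores))

# best_k = max(range(2, 10), key=lambda k: stability_score(odor_maps, k))
```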
Pakhomov, Serguei V.S.; Hemmy, Laura S.
2014-01-01
Generative semantic verbal fluency (SVF) tests show early and disproportionate decline relative to other abilities in individuals developing Alzheimer’s disease. Optimal performance on SVF tests depends on the efficiency of using clustered organization of semantically related items and the ability to switch between clusters. Traditional approaches to clustering and switching have relied on manual determination of clusters. We evaluated a novel automated computational linguistic approach for quantifying clustering behavior. Our approach is based on Latent Semantic Analysis (LSA) for computing strength of semantic relatedness between pairs of words produced in response to SVF test. The mean size of semantic clusters (MCS) and semantic chains (MChS) are calculated based on pairwise relatedness values between words. We evaluated the predictive validity of these measures on a set of 239 participants in the Nun Study, a longitudinal study of aging. All were cognitively intact at baseline assessment, measured with the CERAD battery, and were followed in 18 month waves for up to 20 years. The onset of either dementia or memory impairment were used as outcomes in Cox proportional hazards models adjusted for age and education and censored at follow up waves 5 (6.3 years) and 13 (16.96 years). Higher MCS was associated with 38% reduction in dementia risk at wave 5 and 26% reduction at wave 13, but not with the onset of memory impairment. Higher (+1 SD) MChS was associated with 39% dementia risk reduction at wave 5 but not wave 13, and association with memory impairment was not significant. Higher traditional SVF scores were associated with 22–29% memory impairment and 35–40% dementia risk reduction. SVF scores were not correlated with either MCS or MChS. Our study suggests that an automated approach to measuring clustering behavior can be used to estimate dementia risk in cognitively normal individuals. PMID:23845236
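One plausible, simplified reading of the clustering-behaviour measure, given as a sketch: assuming each word has an LSA-style vector, consecutive words whose cosine relatedness stays above a threshold are chained together, and the mean chain size is reported. The threshold, the vector source, and the exact chain/cluster definitions are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def mean_cluster_size(words, vectors, threshold=0.5):
    """Mean size of runs of consecutive words whose semantic relatedness
    (cosine similarity of their vectors) stays above a threshold."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    sizes, current = [], 1
    for w1, w2 in zip(words, words[1:]):
        if cos(vectors[w1], vectors[w2]) >= threshold:
            current += 1                 # extend the current semantic chain
        else:
            sizes.append(current)        # relatedness dropped: close the chain
            current = 1
    sizes.append(current)
    return float(np.mean(sizes))

# response = ["dog", "cat", "horse", "piano", "violin"]     # one SVF response
# mcs = mean_cluster_size(response, lsa_vectors)            # lsa_vectors: word -> vector
```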
Pakhomov, Serguei V S; Hemmy, Laura S
2014-06-01
Generative semantic verbal fluency (SVF) tests show early and disproportionate decline relative to other abilities in individuals developing Alzheimer's disease. Optimal performance on SVF tests depends on the efficiency of using clustered organization of semantically related items and the ability to switch between clusters. Traditional approaches to clustering and switching have relied on manual determination of clusters. We evaluated a novel automated computational linguistic approach for quantifying clustering behavior. Our approach is based on Latent Semantic Analysis (LSA) for computing strength of semantic relatedness between pairs of words produced in response to SVF test. The mean size of semantic clusters (MCS) and semantic chains (MChS) are calculated based on pairwise relatedness values between words. We evaluated the predictive validity of these measures on a set of 239 participants in the Nun Study, a longitudinal study of aging. All were cognitively intact at baseline assessment, measured with the Consortium to Establish a Registry for Alzheimer's Disease (CERAD) battery, and were followed in 18-month waves for up to 20 years. The onset of either dementia or memory impairment were used as outcomes in Cox proportional hazards models adjusted for age and education and censored at follow-up waves 5 (6.3 years) and 13 (16.96 years). Higher MCS was associated with 38% reduction in dementia risk at wave 5 and 26% reduction at wave 13, but not with the onset of memory impairment. Higher [+1 standard deviation (SD)] MChS was associated with 39% dementia risk reduction at wave 5 but not wave 13, and association with memory impairment was not significant. Higher traditional SVF scores were associated with 22-29% memory impairment and 35-40% dementia risk reduction. SVF scores were not correlated with either MCS or MChS. Our study suggests that an automated approach to measuring clustering behavior can be used to estimate dementia risk in cognitively normal individuals. Copyright © 2013 Elsevier Ltd. All rights reserved.
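A sketch of the general idea follows: derive an LSA-style word space from a reference corpus, score relatedness between consecutive fluency responses, and count a "cluster" wherever relatedness stays above a threshold. The toy corpus, the threshold and the cluster definition are illustrative assumptions, not the parameters used in the study.

```python
# Toy LSA-style relatedness and mean-cluster-size sketch; corpus, threshold
# and cluster rule are assumptions for illustration only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat chased the mouse", "dogs and cats are pets",
          "lions and tigers are wild cats", "hammers and saws are tools",
          "a saw cuts wood", "mice eat cheese"]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
svd = TruncatedSVD(n_components=3, random_state=0)
word_space = svd.fit_transform(X.T)            # one row per vocabulary word

def word_vector(w):
    idx = vec.vocabulary_.get(w)
    return None if idx is None else word_space[idx]

responses = ["cat", "dogs", "tigers", "hammers", "saw"]   # fluency responses
sims = []
for w1, w2 in zip(responses, responses[1:]):
    v1, v2 = word_vector(w1), word_vector(w2)
    sims.append(0.0 if v1 is None or v2 is None
                else cosine_similarity([v1], [v2])[0, 0])

threshold = 0.5                                 # assumed relatedness cutoff
cluster_sizes, size = [], 1
for s in sims:
    if s >= threshold:
        size += 1
    else:
        cluster_sizes.append(size)
        size = 1
cluster_sizes.append(size)
print("pairwise relatedness:", np.round(sims, 2))
print("mean cluster size:", np.mean(cluster_sizes))
```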
Seismic Data Analysis through Multi-Class Classification.
NASA Astrophysics Data System (ADS)
Anderson, P.; Kappedal, R. D.; Magana-Zook, S. A.
2017-12-01
In this research, we conducted twenty experiments of varying time and frequency bands on 5000 seismic signals with the intent of finding a method to classify signals as either an explosion or an earthquake in an automated fashion. We used a multi-class approach by clustering the data through various techniques. Dimensional reduction was examined through the use of wavelet transforms with the coiflet mother wavelet and various coefficients to explore possible computational time vs. accuracy dependencies. Three and four classes were generated from the clustering techniques and examined, with the three-class approach producing the most accurate and realistic results.
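The sketch below illustrates the general pipeline only: reduce each waveform with a coiflet wavelet decomposition and cluster the coefficient vectors into three classes. The synthetic signals, the decomposition level, and the use of k-means as a stand-in for the paper's clustering techniques are all assumptions.

```python
# Illustrative wavelet-feature-then-cluster sketch; not the study's pipeline.
import numpy as np
import pywt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_signals, n_samples = 300, 1024
signals = rng.normal(size=(n_signals, n_samples))        # placeholder waveforms

def wavelet_features(x, wavelet="coif1", level=4):
    coeffs = pywt.wavedec(x, wavelet, level=level)        # multilevel DWT
    return np.concatenate(coeffs)[:128]                   # keep leading coefficients

features = np.array([wavelet_features(s) for s in signals])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print("class sizes:", np.bincount(labels))
```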
Lukashin, A V; Fuchs, R
2001-05-01
Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm guarantees to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search of the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. The employment of this statistically rigorous test has shown that our algorithm places greater than 90% genes into correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.
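A toy simulated-annealing clustering sketch in the spirit of the described algorithm is shown below. It is not the authors' implementation and omits their iterative scheme for choosing the number of clusters; the expression profiles, cooling schedule and cost function are illustrative.

```python
# Toy simulated-annealing clustering: propose single-gene cluster reassignments
# and accept them with a temperature-dependent probability.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))                 # placeholder expression profiles
k = 3
labels = rng.integers(0, k, size=len(X))

def within_cluster_cost(X, labels, k):
    cost = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members):
            cost += ((members - members.mean(axis=0)) ** 2).sum()
    return cost

T, cooling = 5.0, 0.995
cost = within_cluster_cost(X, labels, k)
for step in range(20000):
    i = rng.integers(len(X))
    proposal = labels.copy()
    proposal[i] = rng.integers(k)
    new_cost = within_cluster_cost(X, proposal, k)
    # Always accept improvements; accept worse moves with probability exp(-dE/T).
    if new_cost < cost or rng.random() < np.exp(-(new_cost - cost) / T):
        labels, cost = proposal, new_cost
    T *= cooling
print("final within-cluster cost:", round(cost, 2))
```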
Jeon, Jihyoun; Hsu, Li; Gorfine, Malka
2012-07-01
Frailty models are useful for measuring unobserved heterogeneity in risk of failures across clusters, providing cluster-specific risk prediction. In a frailty model, the latent frailties shared by members within a cluster are assumed to act multiplicatively on the hazard function. In order to obtain parameter and frailty variate estimates, we consider the hierarchical likelihood (H-likelihood) approach (Ha, Lee and Song, 2001. Hierarchical-likelihood approach for frailty models. Biometrika 88, 233-243) in which the latent frailties are treated as "parameters" and estimated jointly with other parameters of interest. We find that the H-likelihood estimators perform well when the censoring rate is low, however, they are substantially biased when the censoring rate is moderate to high. In this paper, we propose a simple and easy-to-implement bias correction method for the H-likelihood estimators under a shared frailty model. We also extend the method to a multivariate frailty model, which incorporates complex dependence structure within clusters. We conduct an extensive simulation study and show that the proposed approach performs very well for censoring rates as high as 80%. We also illustrate the method with a breast cancer data set. Since the H-likelihood is the same as the penalized likelihood function, the proposed bias correction method is also applicable to the penalized likelihood estimators.
Sun, Liping; Luo, Yonglong; Ding, Xintao; Zhang, Ji
2014-01-01
An important component of a spatial clustering algorithm is the distance measure between sample points in object space. In this paper, the traditional Euclidean distance measure is replaced with an innovative obstacle distance measure for spatial clustering under obstacle constraints. Firstly, we present a path searching algorithm to approximate the obstacle distance between two points while dealing with obstacles and facilitators. Taking obstacle distance as the similarity metric, we subsequently propose the artificial immune clustering with obstacle entity (AICOE) algorithm for clustering spatial point data in the presence of obstacles and facilitators. Finally, the paper presents a comparative analysis of the AICOE algorithm and classical clustering algorithms. Our clustering model based on an artificial immune system is also applied to a public facility location problem in order to establish the practical applicability of our approach. By using the clonal selection principle and updating the cluster centers based on the elite antibodies, the AICOE algorithm is able to achieve the global optimum and a better clustering effect.
A QUANTITATIVE ANALYSIS OF DISTANT OPEN CLUSTERS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Janes, Kenneth A.; Hoq, Sadia
2011-03-15
The oldest open star clusters are important for tracing the history of the Galactic disk, but many of the more distant clusters are heavily reddened and projected against the rich stellar background of the Galaxy. We have undertaken an investigation of several distant clusters (Berkeley 19, Berkeley 44, King 25, NGC 6802, NGC 6827, Berkeley 52, Berkeley 56, NGC 7142, NGC 7245, and King 9) to develop procedures for separating probable cluster members from the background field. We next created a simple quantitative approach for finding approximate cluster distances, reddenings, and ages. We first conclude that with the possible exception of King 25 they are probably all physical clusters. We also find that for these distant clusters our typical errors are about ±0.07 in E(B - V), ±0.15 in log(age), and ±0.25 in (m - M)_o. The clusters range in age from 470 Myr to 7 Gyr and range from 7.1 to 16.4 kpc from the Galactic center.
Nguyen, Hien D; Ullmann, Jeremy F P; McLachlan, Geoffrey J; Voleti, Venkatakaushik; Li, Wenze; Hillman, Elizabeth M C; Reutens, David C; Janke, Andrew L
2018-02-01
Calcium is a ubiquitous messenger in neural signaling events. An increasing number of techniques are enabling visualization of neurological activity in animal models via luminescent proteins that bind to calcium ions. These techniques generate large volumes of spatially correlated time series. A model-based functional data analysis methodology, via Gaussian mixtures, is proposed for the clustering of data from such visualizations. The methodology is theoretically justified and a computationally efficient approach to estimation is suggested. An example analysis of a zebrafish imaging experiment is presented.
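As a generic stand-in for model-based clustering of fluorescence traces (not the authors' functional-data formulation), one can summarize each time series by a few principal-component scores and fit a Gaussian mixture; sizes and the number of mixture components below are illustrative.

```python
# Generic PCA-scores + Gaussian-mixture stand-in for model-based clustering
# of calcium imaging time series; all data and sizes are simulated.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
traces = rng.normal(size=(500, 200))              # 500 voxels x 200 time points
scores = PCA(n_components=5, random_state=0).fit_transform(traces)
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
labels = gmm.fit_predict(scores)
print("cluster sizes:", np.bincount(labels), "BIC:", round(gmm.bic(scores), 1))
```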
MIXOR: a computer program for mixed-effects ordinal regression analysis.
Hedeker, D; Gibbons, R D
1996-03-01
MIXOR provides maximum marginal likelihood estimates for mixed-effects ordinal probit, logistic, and complementary log-log regression models. These models can be used for analysis of dichotomous and ordinal outcomes from either a clustered or longitudinal design. For clustered data, the mixed-effects model assumes that data within clusters are dependent. The degree of dependency is jointly estimated with the usual model parameters, thus adjusting for dependence resulting from clustering of the data. Similarly, for longitudinal data, the mixed-effects approach can allow for individual-varying intercepts and slopes across time, and can estimate the degree to which these time-related effects vary in the population of individuals. MIXOR uses marginal maximum likelihood estimation, utilizing a Fisher-scoring solution. For the scoring solution, the Cholesky factor of the random-effects variance-covariance matrix is estimated, along with the effects of model covariates. Examples illustrating usage and features of MIXOR are provided.
NASA Astrophysics Data System (ADS)
Jha, S. K.; Brockman, R. A.; Hoffman, R. M.; Sinha, V.; Pilchak, A. L.; Porter, W. J.; Buchanan, D. J.; Larsen, J. M.; John, R.
2018-05-01
Principal component analysis and fuzzy c-means clustering algorithms were applied to slip-induced strain and geometric metric data in an attempt to discover unique microstructural configurations and their frequencies of occurrence in statistically representative instantiations of a titanium alloy microstructure. Grain-averaged fatigue indicator parameters were calculated for the same instantiation. The fatigue indicator parameters strongly correlated with the spatial location of the microstructural configurations in the principal components space. The fuzzy c-means clustering method identified clusters of data that varied in terms of their average fatigue indicator parameters. Furthermore, the number of points in each cluster was inversely correlated to the average fatigue indicator parameter. This analysis demonstrates that data-driven methods have significant potential for providing unbiased determination of unique microstructural configurations and their frequencies of occurrence in a given volume from the point of view of strain localization and fatigue crack initiation.
Aslan, Mikail; Davis, Jack B A; Johnston, Roy L
2016-03-07
The global optimisation of small bimetallic PdCo binary nanoalloys is systematically investigated using the Birmingham Cluster Genetic Algorithm (BCGA). The effect of size and composition on the structures, stability, magnetic and electronic properties, including the binding energies, second finite difference energies and mixing energies of Pd-Co binary nanoalloys, is discussed. A detailed analysis of Pd-Co structural motifs and segregation effects is also presented. The maximal mixing energy corresponds to Pd atom compositions for which the number of mixed Pd-Co bonds is maximised. Global minimum clusters are distinguished from transition states by vibrational frequency analysis. HOMO-LUMO gap, electric dipole moment and vibrational frequency analyses are made to enable correlation with future experiments.
Clustering cancer gene expression data by projective clustering ensemble
Yu, Xianxue; Yu, Guoxian
2017-01-01
Gene expression data analysis has paramount implications for gene treatments, cancer diagnosis and other domains. Clustering is an important and promising tool to analyze gene expression data. Gene expression data is often characterized by a large number of genes but with limited samples, thus various projective clustering techniques and ensemble techniques have been suggested to combat these challenges. However, it is rather challenging to synergize these two kinds of techniques to avoid the curse of dimensionality problem and to boost the performance of gene expression data clustering. In this paper, we employ a projective clustering ensemble (PCE) to integrate the advantages of projective clustering and ensemble clustering, and to avoid the dilemma of combining multiple projective clusterings. Our experimental results on publicly available cancer gene expression data show PCE can improve the quality of clustering gene expression data by at least 4.5% (on average) compared with other related techniques, including dimensionality reduction based single clustering and ensemble approaches. The empirical study demonstrates that, to further boost the performance of clustering cancer gene expression data, it is necessary and promising to synergize projective clustering with ensemble clustering. PCE can serve as an effective alternative technique for clustering gene expression data. PMID:28234920
A transversal approach to predict gene product networks from ontology-based similarity
Chabalier, Julie; Mosser, Jean; Burgun, Anita
2007-01-01
Background Interpretation of transcriptomic data is usually made through a "standard" approach which consists in clustering the genes according to their expression patterns and exploiting Gene Ontology (GO) annotations within each expression cluster. This approach makes it difficult to underline functional relationships between gene products that belong to different expression clusters. To address this issue, we propose a transversal analysis that aims to predict functional networks based on a combination of GO processes and data expression. Results The transversal approach presented in this paper consists in computing the semantic similarity between gene products in a Vector Space Model. Through a weighting scheme over the annotations, we take into account the representativity of the terms that annotate a gene product. Comparing annotation vectors results in a matrix of gene product similarities. Combined with expression data, the matrix is displayed as a set of functional gene networks. The transversal approach was applied to 186 genes related to the enterocyte differentiation stages. This approach resulted in 18 functional networks proved to be biologically relevant. These results were compared with those obtained through a standard approach and with an approach based on information content similarity. Conclusion Complementary to the standard approach, the transversal approach offers new insight into the cellular mechanisms and reveals new research hypotheses by combining gene product networks based on semantic similarity, and data expression. PMID:17605807
The PhytoClust tool for metabolic gene clusters discovery in plant genomes
Fuchs, Lisa-Maria
2017-01-01
The existence of Metabolic Gene Clusters (MGCs) in plant genomes has recently raised increased interest. Thus far, MGCs were commonly identified for pathways of specialized metabolism, mostly those associated with terpene type products. For efficient identification of novel MGCs, computational approaches are essential. Here, we present PhytoClust; a tool for the detection of candidate MGCs in plant genomes. The algorithm employs a collection of enzyme families related to plant specialized metabolism, translated into hidden Markov models, to mine given genome sequences for physically co-localized metabolic enzymes. Our tool accurately identifies previously characterized plant MGCs. An exhaustive search of 31 plant genomes detected 1232 and 5531 putative gene cluster types and candidates, respectively. Clustering analysis of putative MGCs types by species reflected plant taxonomy. Furthermore, enrichment analysis revealed taxa- and species-specific enrichment of certain enzyme families in MGCs. When operating through our web-interface, PhytoClust users can mine a genome either based on a list of known cluster types or by defining new cluster rules. Moreover, for selected plant species, the output can be complemented by co-expression analysis. Altogether, we envisage PhytoClust to enhance novel MGCs discovery which will in turn impact the exploration of plant metabolism. PMID:28486689
The PhytoClust tool for metabolic gene clusters discovery in plant genomes.
Töpfer, Nadine; Fuchs, Lisa-Maria; Aharoni, Asaph
2017-07-07
The existence of Metabolic Gene Clusters (MGCs) in plant genomes has recently raised increased interest. Thus far, MGCs were commonly identified for pathways of specialized metabolism, mostly those associated with terpene type products. For efficient identification of novel MGCs, computational approaches are essential. Here, we present PhytoClust; a tool for the detection of candidate MGCs in plant genomes. The algorithm employs a collection of enzyme families related to plant specialized metabolism, translated into hidden Markov models, to mine given genome sequences for physically co-localized metabolic enzymes. Our tool accurately identifies previously characterized plant MGCs. An exhaustive search of 31 plant genomes detected 1232 and 5531 putative gene cluster types and candidates, respectively. Clustering analysis of putative MGCs types by species reflected plant taxonomy. Furthermore, enrichment analysis revealed taxa- and species-specific enrichment of certain enzyme families in MGCs. When operating through our web-interface, PhytoClust users can mine a genome either based on a list of known cluster types or by defining new cluster rules. Moreover, for selected plant species, the output can be complemented by co-expression analysis. Altogether, we envisage PhytoClust to enhance novel MGCs discovery which will in turn impact the exploration of plant metabolism. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Boström, P K; Broberg, M; Bodin, L
2011-01-01
Despite previous efforts to understand temperament in children with intellectual disability (ID), and how child temperament may affect parents, the approach has so far been unidimensional. Child temperament has been considered in relation to diagnosis, with the inherent risk of overlooking individual variation of children's temperament profiles within diagnostic groups. The aim of the present study was to identify temperamental profiles of children with ID, and investigate how these may affect parents in terms of positive and negative impacts. Parent-rated temperament in children with ID was explored through a person-oriented approach (cluster analysis). Children with ID (N=49) and typically developing (TD) children (N=82) aged between 4 and 6 years were clustered separately. Variation in temperament profiles was more prominent among children with ID than in TD children. Out of the three clusters found in the ID group, the disruptive, and passive/withdrawn clusters were distinctly different from clusters found in the TD group in terms of temperament, while the cluster active and outgoing was similar in shape and level of temperament ratings of TD children. Children within the disruptive cluster were described to have more negative and less positive impacts on mothers compared to children within the other clusters in the ID group. Mothers who describe their children as having disruptive temperament may be at particular risk for experiencing higher parenting stress as they report that the child has higher negative and lower positive impacts than other parents describe. The absence of a relationship between child temperament profile and positive or negative impact on fathers may indicate that fathers are less affected by child temperament. However, this relationship needs to be further explored. Copyright © 2011 Elsevier Ltd. All rights reserved.
Integrative Data Analysis of Multi-Platform Cancer Data with a Multimodal Deep Learning Approach.
Liang, Muxuan; Li, Zhizhong; Chen, Ting; Zeng, Jianyang
2015-01-01
Identification of cancer subtypes plays an important role in revealing useful insights into disease pathogenesis and advancing personalized therapy. The recent development of high-throughput sequencing technologies has enabled the rapid collection of multi-platform genomic data (e.g., gene expression, miRNA expression, and DNA methylation) for the same set of tumor samples. Although numerous integrative clustering approaches have been developed to analyze cancer data, few of them are particularly designed to exploit both deep intrinsic statistical properties of each input modality and complex cross-modality correlations among multi-platform input data. In this paper, we propose a new machine learning model, called multimodal deep belief network (DBN), to cluster cancer patients from multi-platform observation data. In our integrative clustering framework, relationships among inherent features of each single modality are first encoded into multiple layers of hidden variables, and then a joint latent model is employed to fuse common features derived from multiple input modalities. A practical learning algorithm, called contrastive divergence (CD), is applied to infer the parameters of our multimodal DBN model in an unsupervised manner. Tests on two available cancer datasets show that our integrative data analysis approach can effectively extract a unified representation of latent features to capture both intra- and cross-modality correlations, and identify meaningful disease subtypes from multi-platform cancer data. In addition, our approach can identify key genes and miRNAs that may play distinct roles in the pathogenesis of different cancer subtypes. Among those key miRNAs, we found that the expression level of miR-29a is highly correlated with survival time in ovarian cancer patients. These results indicate that our multimodal DBN based data analysis approach may have practical applications in cancer pathogenesis studies and provide useful guidelines for personalized cancer therapy.
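The sketch below shows only one building block of such a model: a restricted Boltzmann machine fitted to a single (binarized) modality with scikit-learn. Note that scikit-learn trains its RBM with persistent contrastive divergence rather than the plain CD named above, and the stacking of per-modality RBMs and the joint fusion layer of the multimodal DBN are not shown; data and sizes are simulated placeholders.

```python
# One RBM on one binarized modality; stacking and multimodal fusion not shown.
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
expression = rng.random((200, 100))                   # 200 samples x 100 genes
binarized = (expression > expression.mean(axis=0)).astype(float)

rbm = BernoulliRBM(n_components=32, learning_rate=0.05,
                   n_iter=20, random_state=0)
hidden = rbm.fit_transform(binarized)                 # persistent-CD training inside fit
print("hidden representation:", hidden.shape)
```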
Understanding students' motivation in project work: a 2 x 2 achievement goal approach.
Liu, Woon Chia; Wang, C K John; Tan, Oon Seng; Ee, Jessie; Koh, Caroline
2009-03-01
The project work (PW) initiative was launched in 2000 by the Ministry of Education, Singapore, to encourage application of knowledge across disciplines, and to develop thinking, communication, collaboration, and metacognitive skills. Although PW has been introduced for a few years, few studies have examined the motivation of students in PW, especially with the use of the recently proposed 2 x 2 achievement goal framework. To use a cluster analytic approach to identify students' achievement goal profiles at an intra-individual level, and to examine their links to various psychological characteristics and perceived outcomes in PW. Participants were 491 Secondary 2 students (mean age = 13.78, SD = 0.77) from two government coeducational schools. Cluster analysis was performed to identify distinct subgroups of students with similar achievement goal profiles. One-way MANOVAs, followed by post hoc Tukey HSD tests for pairwise comparisons were used to determine whether there was any significant difference amongst clusters in terms of the psychological characteristics and perceived outcomes in PW. Four distinct clusters of students were identified. The cluster with high achievement goals and the cluster with moderately high goals had the most positive psychological characteristics and perceived outcomes. In contrast, the cluster with very low scores for all four achievement goals had the most maladaptive profile. The study provides support for the 2 x 2 achievement goal framework, and demonstrates that multiple goals can operate simultaneously. However, it highlights the need for cross-cultural studies to look into the approach-avoidance dimension in the 2 x 2 achievement goal framework.
Online Learning: Learner Characteristics and Their Approaches to Managing Learning
ERIC Educational Resources Information Center
del Valle, Rodrigo; Duffy, Thomas M.
2009-01-01
Using cluster analysis this study investigated the characteristics of learning strategies learners use in online courses with one-on-one mentoring. Three distinct approaches were identified: "Mastery oriented", "Task focused" and "Minimalist in effort". Despite the widespread concern that students will have difficulty managing their time in online…
Arpino, Bruno; Cannas, Massimo
2016-05-30
This article focuses on the implementation of propensity score matching for clustered data. Different approaches to reduce bias due to cluster-level confounders are considered and compared using Monte Carlo simulations. We investigated methods that exploit the clustered structure of the data in two ways: in the estimation of the propensity score model (through the inclusion of fixed or random effects) or in the implementation of the matching algorithm. In addition to a pure within-cluster matching, we also assessed the performance of a new approach, 'preferential' within-cluster matching. This approach first searches for control units to be matched to treated units within the same cluster. If matching is not possible within-cluster, then the algorithm searches in other clusters. All considered approaches successfully reduced the bias due to the omission of a cluster-level confounder. The preferential within-cluster matching approach, combining the advantages of within-cluster and between-cluster matching, showed a relatively good performance both in the presence of big and small clusters, and it was often the best method. An important advantage of this approach is that it reduces the number of unmatched units as compared with a pure within-cluster matching. We applied these methods to the estimation of the effect of caesarean section on the Apgar score using birth register data. Copyright © 2016 John Wiley & Sons, Ltd.
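A sketch of the preferential within-cluster idea is given below: match each treated unit inside its own cluster first and fall back to the pooled controls only if no within-cluster control remains. The greedy 1:1, caliper-free matching, the column names and the simulated propensity scores are illustrative simplifications, not the authors' implementation.

```python
# Greedy 1:1 "preferential" within-cluster nearest-neighbour matching sketch.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cluster": rng.integers(0, 5, 300),
    "treated": rng.integers(0, 2, 300),
    "ps": rng.random(300),                    # assumed fitted propensity scores
})

available = set(df.index[df.treated == 0])
pairs = []
for i, row in df[df.treated == 1].iterrows():
    same = [j for j in available if df.at[j, "cluster"] == row.cluster]
    pool = same if same else list(available)  # fall back to other clusters
    if not pool:
        break
    j = min(pool, key=lambda idx: abs(df.at[idx, "ps"] - row.ps))
    pairs.append((i, j))
    available.remove(j)

within = sum(df.at[i, "cluster"] == df.at[j, "cluster"] for i, j in pairs)
print("matched pairs:", len(pairs), "of which within-cluster:", within)
```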
NASA Astrophysics Data System (ADS)
Scharfenberg, Franz-Josef; Bogner, Franz X.
2011-08-01
Emphasis on improving higher level biology education continues. A new two-step approach to the experimental phases within an outreach gene technology lab, derived from cognitive load theory, is presented. We compared our approach using a quasi-experimental design with the conventional one-step mode. The difference consisted of additional focused discussions combined with students writing down their ideas (step one) prior to starting any experimental procedure (step two). We monitored students' activities during the experimental phases by continuously videotaping 20 work groups within each approach ( N = 131). Subsequent classification of students' activities yielded 10 categories (with well-fitting intra- and inter-observer scores with respect to reliability). Based on the students' individual time budgets, we evaluated students' roles during experimentation from their prevalent activities (by independently using two cluster analysis methods). Independently of the approach, two common clusters emerged, which we labeled as `all-rounders' and as `passive students', and two clusters specific to each approach: `observers' as well as `high-experimenters' were identified only within the one-step approach whereas under the two-step conditions `managers' and `scribes' were identified. Potential changes in group-leadership style during experimentation are discussed, and conclusions for optimizing science teaching are drawn.
Improving Fraud and Abuse Detection in General Physician Claims: A Data Mining Study
Joudaki, Hossein; Rashidian, Arash; Minaei-Bidgoli, Behrouz; Mahmoodi, Mahmood; Geraili, Bijan; Nasiri, Mahdi; Arab, Mohammad
2016-01-01
Background: We aimed to identify the indicators of healthcare fraud and abuse in general physicians’ drug prescription claims, and to identify a subset of general physicians that were more likely to have committed fraud and abuse. Methods: We applied data mining approach to a major health insurance organization dataset of private sector general physicians’ prescription claims. It involved 5 steps: clarifying the nature of the problem and objectives, data preparation, indicator identification and selection, cluster analysis to identify suspect physicians, and discriminant analysis to assess the validity of the clustering approach. Results: Thirteen indicators were developed in total. Over half of the general physicians (54%) were ‘suspects’ of conducting abusive behavior. The results also identified 2% of physicians as suspects of fraud. Discriminant analysis suggested that the indicators demonstrated adequate performance in the detection of physicians who were suspect of perpetrating fraud (98%) and abuse (85%) in a new sample of data. Conclusion: Our data mining approach will help health insurance organizations in low-and middle-income countries (LMICs) in streamlining auditing approaches towards the suspect groups rather than routine auditing of all physicians. PMID:26927587
Improving Fraud and Abuse Detection in General Physician Claims: A Data Mining Study.
Joudaki, Hossein; Rashidian, Arash; Minaei-Bidgoli, Behrouz; Mahmoodi, Mahmood; Geraili, Bijan; Nasiri, Mahdi; Arab, Mohammad
2015-11-10
We aimed to identify the indicators of healthcare fraud and abuse in general physicians' drug prescription claims, and to identify a subset of general physicians that were more likely to have committed fraud and abuse. We applied data mining approach to a major health insurance organization dataset of private sector general physicians' prescription claims. It involved 5 steps: clarifying the nature of the problem and objectives, data preparation, indicator identification and selection, cluster analysis to identify suspect physicians, and discriminant analysis to assess the validity of the clustering approach. Thirteen indicators were developed in total. Over half of the general physicians (54%) were 'suspects' of conducting abusive behavior. The results also identified 2% of physicians as suspects of fraud. Discriminant analysis suggested that the indicators demonstrated adequate performance in the detection of physicians who were suspect of perpetrating fraud (98%) and abuse (85%) in a new sample of data. Our data mining approach will help health insurance organizations in low-and middle-income countries (LMICs) in streamlining auditing approaches towards the suspect groups rather than routine auditing of all physicians. © 2016 by Kerman University of Medical Sciences.
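The two analytical steps described above can be illustrated generically: k-means on standardized indicators to flag suspect groups, followed by a linear discriminant analysis to check how well the indicators separate the resulting groups on held-out data. The indicator values are simulated, and the study's thirteen specific indicators are not reproduced here.

```python
# Generic clustering + discriminant-analysis validation sketch.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
indicators = rng.normal(size=(1000, 13))       # 1000 physicians x 13 indicators
Z = StandardScaler().fit_transform(indicators)
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)

Z_tr, Z_te, g_tr, g_te = train_test_split(Z, groups, test_size=0.3, random_state=0)
lda = LinearDiscriminantAnalysis().fit(Z_tr, g_tr)
print("agreement with cluster labels on held-out data:",
      round(lda.score(Z_te, g_te), 3))
```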
Stratification of co-evolving genomic groups using ranked phylogenetic profiles
Freilich, Shiri; Goldovsky, Leon; Gottlieb, Assaf; Blanc, Eric; Tsoka, Sophia; Ouzounis, Christos A
2009-01-01
Background Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact to genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database. Results The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach - a combination of sequence searches, statistical estimation and clustering - analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples. Conclusion Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples. PMID:19860884
Genetic Network Inference: From Co-Expression Clustering to Reverse Engineering
NASA Technical Reports Server (NTRS)
Dhaeseleer, Patrik; Liang, Shoudan; Somogyi, Roland
2000-01-01
Advances in molecular biological, analytical, and computational technologies are enabling us to systematically investigate the complex molecular processes underlying biological systems. In particular, using high-throughput gene expression assays, we are able to measure the output of the gene regulatory network. We aim here to review data mining and modeling approaches for conceptualizing and unraveling the functional relationships implicit in these datasets. Clustering of co-expression profiles allows us to infer shared regulatory inputs and functional pathways. We discuss various aspects of clustering, ranging from distance measures to clustering algorithms and multiple-cluster memberships. More advanced analysis aims to infer causal connections between genes directly, i.e., who is regulating whom and how. We discuss several approaches to the problem of reverse engineering of genetic networks, from discrete Boolean networks, to continuous linear and non-linear models. We conclude that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting, and bioengineering.
Cluster Analysis of Weighted Bipartite Networks: A New Copula-Based Approach
Chessa, Alessandro; Crimaldi, Irene; Riccaboni, Massimo; Trapin, Luca
2014-01-01
In this work we are interested in identifying clusters of "positionally equivalent" actors, i.e. actors who play a similar role in a system. In particular, we analyze weighted bipartite networks that describe the relationships between actors on one side and features or traits on the other, together with the intensity level to which actors show their features. We develop a methodological approach that takes into account the underlying multivariate dependence among groups of actors. The idea is that positions in a network could be defined on the basis of the similar intensity levels that the actors exhibit in expressing some features, instead of just considering the relationships that actors hold with each other. Moreover, we propose a new clustering procedure that exploits the potential of copula functions, a mathematical instrument for modelling the stochastic dependence structure. Our clustering algorithm can be applied both to binary and real-valued matrices. We validate it with simulations and applications to real-world data. PMID:25303095
The Productivity Analysis of Chennai Automotive Industry Cluster
NASA Astrophysics Data System (ADS)
Bhaskaran, E.
2014-07-01
Chennai, also called the Detroit of India, is India's second fastest growing auto market and exports auto components and vehicles to the US, Germany, Japan and Brazil. For inclusive growth and sustainable development, 250 auto component industries in the Ambattur, Thirumalisai and Thirumudivakkam Industrial Estates located in Chennai have adopted the Cluster Development Approach called the Automotive Component Cluster. The objective is to study the Value Chain, Correlation and Data Envelopment Analysis by determining the technical efficiency, peer weights, and input and output slacks of 100 auto component industries in the three estates. The methodology adopted is Data Envelopment Analysis with the output-oriented Banker-Charnes-Cooper model, taking net worth, fixed assets and employment as inputs and gross output as the output. The non-zero values represent the peer weights of the efficient clusters. The slacks obtained reveal the excess net worth, fixed assets and employment and the shortage in gross output. To conclude, the variables are highly correlated, and the inefficient industries should increase their gross output or decrease their fixed assets or employment. Moreover, for sustainable development, the cluster should strengthen infrastructure, technology, procurement, production and marketing interrelationships to decrease costs and to increase productivity and efficiency to compete in the indigenous and export markets.
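For reference, the output-oriented BCC (variable returns to scale) model can be solved as one linear program per decision-making unit. The sketch below uses simulated stand-ins for net worth, fixed assets, employment and gross output; it is not the study's dataset or software, and a unit is rated efficient when its expansion factor phi equals 1.

```python
# Output-oriented BCC DEA via linear programming; data are simulated.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(20, 3))     # 20 units x 3 inputs
Y = rng.uniform(1, 10, size=(20, 1))     # 20 units x 1 output (gross output)
n, m = X.shape
s = Y.shape[1]

def bcc_output_efficiency(o):
    # Decision variables: [phi, lambda_1 ... lambda_n]; maximize phi.
    c = np.zeros(n + 1)
    c[0] = -1.0
    A_ub, b_ub = [], []
    for i in range(m):                   # sum_j lam_j * x_ij <= x_io
        A_ub.append(np.r_[0.0, X[:, i]])
        b_ub.append(X[o, i])
    for r in range(s):                   # phi * y_ro - sum_j lam_j * y_rj <= 0
        A_ub.append(np.r_[Y[o, r], -Y[:, r]])
        b_ub.append(0.0)
    A_eq = [np.r_[0.0, np.ones(n)]]      # convexity: sum_j lam_j = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] + [(0, None)] * n, method="highs")
    return res.x[0]                      # phi >= 1; phi == 1 means efficient

phis = [bcc_output_efficiency(o) for o in range(n)]
print("efficient units:", [o for o, p in enumerate(phis) if np.isclose(p, 1.0)])
```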
Generation Y, wine and alcohol. A semantic differential approach to consumption analysis in Tuscany.
Marinelli, Nicola; Fabbrizzi, Sara; Alampi Sottini, Veronica; Sacchelli, Sandro; Bernetti, Iacopo; Menghini, Silvio
2014-04-01
The aim of the study is the elicitation of the consumer's semantic perception of different alcoholic beverages in order to provide information for the definition of communication strategies for both the private sector (and specifically the wine industry) and the public decision maker. Such information can be seen as the basis of a wider social marketing construct aimed at the promotion of responsible drinking among young consumers. The semantic differential approach was used in this study. The data collection was based on a survey of 430 consumers between 18 and 35 years old in Tuscany, Italy. The database was organized in a three-way structure, indexing the data in a multiway matrix. The data were processed using a Multiple Factor Analysis (MFA). Moreover, homogeneous clusters of consumers were identified using a Hierarchical Clustering on Principal Components (HCPC) approach. The results of the study highlight that beer and spirits are mainly perceived as "Young", "Social", "Euphoric", "Happy", "Appealing" and "Trendy" beverages, while wine is associated mostly with terms such as "Pleasure", "Quality" and "Comfortable". Furthermore, the cluster analysis allowed for the identification of three groups of individuals with different approaches to alcohol drinking. The results of the study supply a useful information framework for the elaboration of specific communication strategies that, based on the drinking habits of young consumers and their perception of different beverages, can use a language that is very close to the consumer typologies. Such information can be helpful for both private and public communication strategies. Copyright © 2013 The Authors. Published by Elsevier Ltd. All rights reserved.
2010-01-01
Background The implementation of new medical knowledge into general practice is a complex process. Blended learning may offer an effective and efficient educational intervention to reduce the knowledge-to-practice gap. The aim of this study was to compare knowledge acquisition about dementia management between a blended learning approach using online modules in addition to quality circles (QCs) and QCs alone. Methods In this cluster-randomised trial with QCs as clusters and general practitioners (GPs) as participants, 389 GPs from 26 QCs in the western part of Germany were invited to participate. Data on the GPs' knowledge were obtained at three points in time by means of a questionnaire survey. Primary outcome was the knowledge gain before and after the interventions. A subgroup analysis of the users of the online modules was performed. Results 166 GPs were available for analysis and filled out a knowledge test at least two times. A significant increase of knowledge was found in both groups that indicated positive learning effects of both approaches. However, there was no significant difference between the groups. A subgroup analysis of the GPs who self-reported that they had actually used the online modules showed that they had a significant increase in their knowledge scores. Conclusion A blended learning approach was not superior to a QCs approach for improving knowledge about dementia management. However, a subgroup of GPs who were motivated to actually use the online modules had a gain in knowledge. Trial registration Current Controlled Trials ISRCTN36550981. PMID:20047652
Information extraction from dynamic PS-InSAR time series using machine learning
NASA Astrophysics Data System (ADS)
van de Kerkhof, B.; Pankratius, V.; Chang, L.; van Swol, R.; Hanssen, R. F.
2017-12-01
Due to the increasing number of SAR satellites, with shorter repeat intervals and higher resolutions, SAR data volumes are exploding. Time series analyses of SAR data, i.e. Persistent Scatterer (PS) InSAR, enable the deformation monitoring of the built environment at an unprecedented scale, with hundreds of scatterers per km2, updated weekly. Potential hazards, e.g. due to failure of aging infrastructure, can be detected at an early stage. Yet, this requires the operational data processing of billions of measurement points, over hundreds of epochs, updating this data set dynamically as new data come in, and testing whether points (start to) behave in an anomalous way. Moreover, the quality of PS-InSAR measurements is ambiguous and heterogeneous, which will yield false positives and false negatives. Such analyses are numerically challenging. Here we extract relevant information from PS-InSAR time series using machine learning algorithms. We cluster (group together) time series with similar behaviour, even though they may not be spatially close, such that the results can be used for further analysis. First we reduce the dimensionality of the dataset in order to be able to cluster the data, since applying clustering techniques on high dimensional datasets often result in unsatisfying results. Our approach is to apply t-distributed Stochastic Neighbor Embedding (t-SNE), a machine learning algorithm for dimensionality reduction of high-dimensional data to a 2D or 3D map, and cluster this result using Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The results show that we are able to detect and cluster time series with similar behaviour, which is the starting point for more extensive analysis into the underlying driving mechanisms. The results of the methods are compared to conventional hypothesis testing as well as a Self-Organising Map (SOM) approach. Hypothesis testing is robust and takes the stochastic nature of the observations into account, but is time consuming. Therefore, we successively apply our machine learning approach with the hypothesis testing approach in order to benefit from both the reduced computation time of the machine learning approach as from the robust quality metrics of hypothesis testing. We acknowledge support from NASA AISTNNX15AG84G (PI V. Pankratius)
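A minimal sketch of the dimensionality-reduction-then-clustering step follows: embed simulated displacement time series with t-SNE and group the embedding with DBSCAN. The perplexity, eps and min_samples values are illustrative; operational parameter choices would need tuning and validation against the hypothesis-testing baseline described above.

```python
# t-SNE embedding followed by DBSCAN clustering; data are simulated stand-ins
# for PS-InSAR displacement time series.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
series = rng.normal(size=(1000, 60))            # 1000 scatterers x 60 epochs
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(series)
labels = DBSCAN(eps=2.0, min_samples=10).fit_predict(embedding)
print("clusters found (label -1 = noise):", np.unique(labels))
```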
Efficient clustering aggregation based on data fragments.
Wu, Ou; Hu, Weiming; Maybank, Stephen J; Zhu, Mingliang; Li, Bing
2012-06-01
Clustering aggregation, known as clustering ensembles, has emerged as a powerful technique for combining different clustering results to obtain a single better clustering. Existing clustering aggregation algorithms are applied directly to data points, in what is referred to as the point-based approach. The algorithms are inefficient if the number of data points is large. We define an efficient approach for clustering aggregation based on data fragments. In this fragment-based approach, a data fragment is any subset of the data that is not split by any of the clustering results. To establish the theoretical bases of the proposed approach, we prove that clustering aggregation can be performed directly on data fragments under two widely used goodness measures for clustering aggregation taken from the literature. Three new clustering aggregation algorithms are described. The experimental results obtained using several public data sets show that the new algorithms have lower computational complexity than three well-known existing point-based clustering aggregation algorithms (Agglomerative, Furthest, and LocalSearch); nevertheless, the new algorithms do not sacrifice the accuracy.
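The fragment construction step is simple to state in code: a fragment is a maximal set of points that receive the same label vector across all base clusterings, so no base clustering splits it. In the sketch below the three base clusterings are random placeholders.

```python
# Build data fragments from several base clusterings.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
n_points = 1000
base_clusterings = [rng.integers(0, k, n_points) for k in (3, 4, 5)]

fragments = defaultdict(list)
for idx in range(n_points):
    key = tuple(int(c[idx]) for c in base_clusterings)   # label vector of the point
    fragments[key].append(idx)

print("points:", n_points, "fragments:", len(fragments))
# Aggregation can now operate on the (much smaller) set of fragments
# instead of on individual points.
```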
Earthquake Declustering via a Nearest-Neighbor Approach in Space-Time-Magnitude Domain
NASA Astrophysics Data System (ADS)
Zaliapin, I. V.; Ben-Zion, Y.
2016-12-01
We propose a new method for earthquake declustering based on nearest-neighbor analysis of earthquakes in space-time-magnitude domain. The nearest-neighbor approach was recently applied to a variety of seismological problems that validate the general utility of the technique and reveal the existence of several different robust types of earthquake clusters. Notably, it was demonstrated that clustering associated with the largest earthquakes is statistically different from that of small-to-medium events. In particular, the characteristic bimodality of the nearest-neighbor distances that helps separating clustered and background events is often violated after the largest earthquakes in their vicinity, which is dominated by triggered events. This prevents using a simple threshold between the two modes of the nearest-neighbor distance distribution for declustering. The current study resolves this problem hence extending the nearest-neighbor approach to the problem of earthquake declustering. The proposed technique is applied to seismicity of different areas in California (San Jacinto, Coso, Salton Sea, Parkfield, Ventura, Mojave, etc.), as well as to the global seismicity, to demonstrate its stability and efficiency in treating various clustering types. The results are compared with those of alternative declustering methods.
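The sketch below computes a space-time-magnitude nearest-neighbour proximity of the form commonly used in this line of work, eta_ij = t_ij * r_ij^d * 10^(-b*m_i) for an earlier event i and a later event j (infinite otherwise). The synthetic catalogue, the fractal dimension d and the b-value are placeholders, and no declustering threshold is applied.

```python
# Nearest-neighbour space-time-magnitude proximity on a synthetic catalogue.
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = np.sort(rng.uniform(0, 3650, n))                      # occurrence times, days
x, y = rng.uniform(0, 100, n), rng.uniform(0, 100, n)     # epicentres, km
mag = rng.exponential(0.5, n) + 2.0                       # magnitudes
b, d = 1.0, 1.6                                           # assumed b-value, fractal dim.

nn_distance = np.full(n, np.inf)
nn_parent = np.full(n, -1)
for j in range(1, n):
    dt = t[j] - t[:j]                                     # positive by construction
    r = np.hypot(x[j] - x[:j], y[j] - y[:j]) + 1e-6
    eta = dt * r**d * 10.0**(-b * mag[:j])
    nn_parent[j] = np.argmin(eta)
    nn_distance[j] = eta[nn_parent[j]]

print("median log10 nearest-neighbour proximity:",
      round(float(np.median(np.log10(nn_distance[1:]))), 2))
```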
Cluster analysis as a prediction tool for pregnancy outcomes.
Banjari, Ines; Kenjerić, Daniela; Šolić, Krešimir; Mandić, Milena L
2015-03-01
Considering specific physiology changes during gestation and thinking of pregnancy as a "critical window", classification of pregnant women at early pregnancy can be considered as crucial. The paper demonstrates the use of a method based on an approach from intelligent data mining, cluster analysis. Cluster analysis is a statistical method which makes it possible to group individuals based on sets of identifying variables. The method was chosen in order to determine the possibility of classifying pregnant women at early pregnancy and to analyze unknown correlations between different variables so that certain outcomes could be predicted. 222 pregnant women from two general obstetric offices were recruited. The main focus was on the characteristics of these pregnant women: their age, pre-pregnancy body mass index (BMI) and haemoglobin value. Cluster analysis gained a 94.1% classification accuracy rate with three branches or groups of pregnant women showing statistically significant correlations with pregnancy outcomes. The results show that pregnant women of both older age and higher pre-pregnancy BMI have a significantly higher incidence of delivering a baby of higher birth weight but they gain significantly less weight during pregnancy. Their babies are also longer, and these women have a significantly higher probability of complications during pregnancy (gestosis) and a higher probability of induced or caesarean delivery. We can conclude that the cluster analysis method can appropriately classify pregnant women at early pregnancy to predict certain outcomes.
2013-01-01
Background Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly. Results Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies. Conclusions Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies. PMID:24564333
A system for learning statistical motion patterns.
Hu, Weiming; Xiao, Xuejuan; Fu, Zhouyu; Xie, Dan; Tan, Tieniu; Maybank, Steve
2006-09-01
Analysis of motion patterns is an effective approach for anomaly detection and behavior prediction. Current approaches for the analysis of motion patterns depend on known scenes, where objects move in predefined ways. It is highly desirable to automatically construct object motion patterns which reflect the knowledge of the scene. In this paper, we present a system for automatically learning motion patterns for anomaly detection and behavior prediction based on a proposed algorithm for robustly tracking multiple objects. In the tracking algorithm, foreground pixels are clustered using a fast accurate fuzzy K-means algorithm. Growing and prediction of the cluster centroids of foreground pixels ensure that each cluster centroid is associated with a moving object in the scene. In the algorithm for learning motion patterns, trajectories are clustered hierarchically using spatial and temporal information and then each motion pattern is represented with a chain of Gaussian distributions. Based on the learned statistical motion patterns, statistical methods are used to detect anomalies and predict behaviors. Our system is tested using image sequences acquired, respectively, from a crowded real traffic scene and a model traffic scene. Experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning motion patterns, and the encouraging performance of algorithms for anomaly detection and behavior prediction.
Udrescu, Lucreţia; Sbârcea, Laura; Topîrceanu, Alexandru; Iovanovici, Alexandru; Kurunczi, Ludovic; Bogdan, Paul; Udrescu, Mihai
2016-09-07
Analyzing drug-drug interactions may unravel previously unknown drug action patterns, leading to the development of new drug discovery tools. We present a new approach to analyzing drug-drug interaction networks, based on clustering and topological community detection techniques that are specific to complex network science. Our methodology uncovers functional drug categories along with the intricate relationships between them. Using modularity-based and energy-model layout community detection algorithms, we link the network clusters to 9 relevant pharmacological properties. Out of the 1141 drugs from the DrugBank 4.1 database, our extensive literature survey and cross-checking with other databases such as Drugs.com, RxList, and DrugBank 4.3 confirm the predicted properties for 85% of the drugs. As such, we argue that network analysis offers a high-level grasp on a wide area of pharmacological aspects, indicating possible unaccounted interactions and missing pharmacological properties that can lead to drug repositioning for the 15% drugs which seem to be inconsistent with the predicted property. Also, by using network centralities, we can rank drugs according to their interaction potential for both simple and complex multi-pathology therapies. Moreover, our clustering approach can be extended for applications such as analyzing drug-target interactions or phenotyping patients in personalized medicine applications.
Udrescu, Lucreţia; Sbârcea, Laura; Topîrceanu, Alexandru; Iovanovici, Alexandru; Kurunczi, Ludovic; Bogdan, Paul; Udrescu, Mihai
2016-01-01
Analyzing drug-drug interactions may unravel previously unknown drug action patterns, leading to the development of new drug discovery tools. We present a new approach to analyzing drug-drug interaction networks, based on clustering and topological community detection techniques that are specific to complex network science. Our methodology uncovers functional drug categories along with the intricate relationships between them. Using modularity-based and energy-model layout community detection algorithms, we link the network clusters to 9 relevant pharmacological properties. Out of the 1141 drugs from the DrugBank 4.1 database, our extensive literature survey and cross-checking with other databases such as Drugs.com, RxList, and DrugBank 4.3 confirm the predicted properties for 85% of the drugs. As such, we argue that network analysis offers a high-level grasp on a wide area of pharmacological aspects, indicating possible unaccounted interactions and missing pharmacological properties that can lead to drug repositioning for the 15% drugs which seem to be inconsistent with the predicted property. Also, by using network centralities, we can rank drugs according to their interaction potential for both simple and complex multi-pathology therapies. Moreover, our clustering approach can be extended for applications such as analyzing drug-target interactions or phenotyping patients in personalized medicine applications. PMID:27599720
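An illustrative sketch of modularity-based community detection on an interaction graph is given below using networkx. The random graph merely stands in for the DrugBank-derived network, and linking communities to pharmacological properties (as done above) is not shown; degree centrality is used here as a simple proxy for interaction potential.

```python
# Community detection and centrality ranking on a placeholder interaction graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.gnm_random_graph(200, 800, seed=0)        # placeholder interaction network
communities = greedy_modularity_communities(G)
print("communities:", len(communities),
      "largest:", max(len(c) for c in communities))

# Degree centrality as a simple proxy for a node's interaction potential.
top = sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:5]
print("most central nodes:", [node for node, _ in top])
```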
Wang, Shen-Tsu; Li, Meng-Hua
2014-01-01
When an enterprise has thousands of varieties in its inventory, the use of a single management method could not be a feasible approach. A better way to manage this problem would be to categorise inventory items into several clusters according to inventory decisions and to use different management methods for managing different clusters. The present study applies DPSO (dynamic particle swarm optimisation) to a problem of clustering of inventory items. Without the requirement of prior inventory knowledge, inventory items are automatically clustered into near optimal clustering number. The obtained clustering results should satisfy the inventory objective equation, which consists of different objectives such as total cost, backorder rate, demand relevance, and inventory turnover rate. This study integrates the above four objectives into a multiobjective equation, and inputs the actual inventory items of the enterprise into DPSO. In comparison with other clustering methods, the proposed method can consider different objectives and obtain an overall better solution to obtain better convergence results and inventory decisions.
Keshtkaran, Mohammad Reza; Yang, Zhi
2017-06-01
Spike sorting is a fundamental preprocessing step for many neuroscience studies which rely on the analysis of spike trains. Most of the feature extraction and dimensionality reduction techniques that have been used for spike sorting give a projection subspace which is not necessarily the most discriminative one. Therefore, the clusters which appear inherently separable in some discriminative subspace may overlap if projected using conventional feature extraction approaches, leading to poor sorting accuracy, especially when the noise level is high. In this paper, we propose a noise-robust and unsupervised spike sorting algorithm based on learning discriminative spike features for clustering. The proposed algorithm uses discriminative subspace learning to extract low dimensional and most discriminative features from the spike waveforms and perform clustering with automatic detection of the number of the clusters. The core part of the algorithm involves iterative subspace selection using linear discriminant analysis and clustering using Gaussian mixture model with outlier detection. A statistical test in the discriminative subspace is proposed to automatically detect the number of the clusters. Comparative results on publicly available simulated and real in vivo datasets demonstrate that our algorithm achieves substantially improved cluster distinction leading to higher sorting accuracy and more reliable detection of clusters which are highly overlapping and not detectable using conventional feature extraction techniques such as principal component analysis or wavelets. By providing more accurate information about the activity of a larger number of individual neurons with high robustness to neural noise and outliers, the proposed unsupervised spike sorting algorithm facilitates more detailed and accurate analysis of single- and multi-unit activities in neuroscience and brain machine interface studies.
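A highly simplified sketch of the core idea, alternating between a discriminative projection and mixture-model clustering, follows. It omits the paper's outlier handling and automatic selection of the number of clusters, and the waveforms are simulated.

```python
# Iterate LDA projection <-> GMM labelling on simulated spike waveforms.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
waveforms = rng.normal(size=(2000, 48))          # 2000 spikes x 48 samples each
k = 3

# Initial labels from PCA + GMM, then alternate LDA subspace and GMM labels.
feats = PCA(n_components=5, random_state=0).fit_transform(waveforms)
labels = GaussianMixture(k, random_state=0).fit_predict(feats)
for _ in range(5):
    n_classes = len(np.unique(labels))
    lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
    proj = lda.fit_transform(waveforms, labels)  # discriminative subspace
    labels = GaussianMixture(k, random_state=0).fit_predict(proj)
print("cluster sizes:", np.bincount(labels))
```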
Ligand-protected gold clusters: the structure, synthesis and applications
NASA Astrophysics Data System (ADS)
Pichugina, D. A.; Kuz'menko, N. E.; Shestakov, A. F.
2015-11-01
Modern concepts of the structure and properties of atomic gold clusters protected by thiolate, selenolate, phosphine and phenylacetylene ligands are analyzed. Within the framework of the superatom theory, the 'divide and protect' approach and the structure rule, the stability and composition of a cluster are determined by the structure of the cluster core, the type of ligands and the total number of valence electrons. Methods of selective synthesis of gold clusters in solution and on the surface of inorganic composites based, in particular, on the reaction of Aun with RS, RSe, PhC≡C, Hal ligands or functional groups of proteins, on stabilization of clusters in cavities of the α-, β and γ-cyclodextrin molecules (Au15 and Au25) and on anchorage to a support surface (Au25/SiO2, Au20/C, Au10/FeOx) are reviewed. Problems in this field are also discussed. Among the methods for cluster structure prediction, particular attention is given to the theoretical approaches based on the density functional theory (DFT). The structures of a number of synthesized clusters are described using the results obtained by X-ray diffraction analysis and DFT calculations. A possible mechanism of formation of the SR(AuSR)n 'staple' units in the cluster shell is proposed. The structure and properties of bimetallic clusters MxAunLm (M=Pd, Pt, Ag, Cu) are discussed. The Pd or Pt atom is located at the centre of the cluster, whereas Ag and Cu atoms form bimetallic compounds in which the heteroatom is located on the surface of the cluster core or in the 'staple' units. The optical properties, fluorescence and luminescence of ligand-protected gold clusters originate from the quantum effects of the Au atoms in the cluster core and in the oligomeric SR(AuSR)x units in the cluster shell. Homogeneous and heterogeneous reactions catalyzed by atomic gold clusters are discussed in the context of the reaction mechanism and the nature of the active sites. The bibliography includes 345 references.
A cost-function approach to rival penalized competitive learning (RPCL).
Ma, Jinwen; Wang, Taijun
2006-08-01
Rival penalized competitive learning (RPCL) has been shown to be a useful tool for clustering on a set of sample data in which the number of clusters is unknown. However, the RPCL algorithm was proposed heuristically and still lacks a mathematical theory describing its convergence behavior. In order to solve the convergence problem, we investigate it via a cost-function approach. By theoretical analysis, we prove that a general form of RPCL, called distance-sensitive RPCL (DSRPCL), is associated with the minimization of a cost function on the weight vectors of a competitive learning network. As a DSRPCL process decreases the cost to a local minimum, a number of weight vectors eventually fall into a hypersphere surrounding the sample data, while the other weight vectors diverge to infinity. Moreover, it is shown by the theoretical analysis and simulation experiments that if the cost is reduced to the global minimum, the correct number of weight vectors is automatically selected and these weight vectors are located around the centers of the actual clusters. Finally, we apply the DSRPCL algorithms to unsupervised color image segmentation and classification of the wine data.
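A minimal numpy sketch of the rival-penalized update rule discussed above is given below; the learning/de-learning rates, data, and number of units are illustrative assumptions rather than values from the paper.

```python
# Toy sketch of rival penalized competitive learning: the winner is pulled
# toward each sample and the rival (second-closest unit) is pushed away with
# a much smaller de-learning rate.
import numpy as np

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                  for c in ([0, 0], [3, 3], [0, 3])])

n_units = 6                      # deliberately more units than true clusters
w = rng.normal(size=(n_units, 2))
alpha, beta = 0.05, 0.002        # learning and de-learning rates

for epoch in range(50):
    for x in rng.permutation(data):
        d = np.linalg.norm(w - x, axis=1)
        winner, rival = np.argsort(d)[:2]
        w[winner] += alpha * (x - w[winner])   # attract the winner
        w[rival] -= beta * (x - w[rival])      # penalize the rival

# Extra units are driven away from the data; the remaining ones settle near
# the true cluster centres.
print(np.round(w, 2))
```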
Wavelet-based clustering of resting state MRI data in the rat.
Medda, Alessio; Hoffmann, Lukas; Magnuson, Matthew; Thompson, Garth; Pan, Wen-Ju; Keilholz, Shella
2016-01-01
While functional connectivity has typically been calculated over the entire length of the scan (5-10min), interest has been growing in dynamic analysis methods that can detect changes in connectivity on the order of cognitive processes (seconds). Previous work with sliding window correlation has shown that changes in functional connectivity can be observed on these time scales in the awake human and in anesthetized animals. This exciting advance creates a need for improved approaches to characterize dynamic functional networks in the brain. Previous studies were performed using sliding window analysis on regions of interest defined based on anatomy or obtained from traditional steady-state analysis methods. The parcellation of the brain may therefore be suboptimal, and the characteristics of the time-varying connectivity between regions are dependent upon the length of the sliding window chosen. This manuscript describes an algorithm based on wavelet decomposition that allows data-driven clustering of voxels into functional regions based on temporal and spectral properties. Previous work has shown that different networks have characteristic frequency fingerprints, and the use of wavelets ensures that both the frequency and the timing of the BOLD fluctuations are considered during the clustering process. The method was applied to resting state data acquired from anesthetized rats, and the resulting clusters agreed well with known anatomical areas. Clusters were highly reproducible across subjects. Wavelet cross-correlation values between clusters from a single scan were significantly higher than the values from randomly matched clusters that shared no temporal information, indicating that wavelet-based analysis is sensitive to the relationship between areas. Copyright © 2015 Elsevier Inc. All rights reserved.
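A rough sketch of the wavelet-based clustering idea, assuming per-voxel BOLD time series are available as rows of an array: each series is decomposed with a discrete wavelet transform (PyWavelets) and per-scale energies are clustered. The wavelet choice, decomposition level, and synthetic data are assumptions, not the authors' settings.

```python
# Sketch: wavelet energy features per voxel time series, then k-means.
import numpy as np
import pywt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_voxels, n_timepoints = 500, 256
ts = rng.normal(size=(n_voxels, n_timepoints))
# Give half of the voxels a common slow fluctuation to mimic two "networks".
ts[:250] += 2 * np.sin(2 * np.pi * 0.02 * np.arange(n_timepoints))

def wavelet_energy(x, wavelet="db4", level=4):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs])  # energy per scale

features = np.vstack([wavelet_energy(v) for v in ts])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))
```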
OMERACT-based fibromyalgia symptom subgroups: an exploratory cluster analysis.
Vincent, Ann; Hoskin, Tanya L; Whipple, Mary O; Clauw, Daniel J; Barton, Debra L; Benzo, Roberto P; Williams, David A
2014-10-16
The aim of this study was to identify subsets of patients with fibromyalgia with similar symptom profiles using the Outcome Measures in Rheumatology (OMERACT) core symptom domains. Female patients with a diagnosis of fibromyalgia and currently meeting fibromyalgia research survey criteria completed the Brief Pain Inventory, the 30-item Profile of Mood States, the Medical Outcomes Sleep Scale, the Multidimensional Fatigue Inventory, the Multiple Ability Self-Report Questionnaire, the Fibromyalgia Impact Questionnaire-Revised (FIQ-R) and the Short Form-36 between 1 June 2011 and 31 October 2011. Hierarchical agglomerative clustering was used to identify subgroups of patients with similar symptom profiles. To validate the results from this sample, hierarchical agglomerative clustering was repeated in an external sample of female patients with fibromyalgia with similar inclusion criteria. A total of 581 females with a mean age of 55.1 (range, 20.1 to 90.2) years were included. A four-cluster solution best fit the data, and each clustering variable differed significantly (P <0.0001) among the four clusters. The four clusters divided the sample into severity levels: Cluster 1 reflects the lowest average levels across all symptoms, and cluster 4 reflects the highest average levels. Clusters 2 and 3 capture moderate symptom levels. Clusters 2 and 3 differed mainly in profiles of anxiety and depression, with Cluster 2 having lower levels of depression and anxiety than Cluster 3, despite higher levels of pain. The results of the cluster analysis of the external sample (n = 478) looked very similar to those found in the original cluster analysis, except for a slight difference in sleep problems. This was despite the validation sample including patients who were significantly younger (P <0.0001) and had more severe symptoms (higher FIQ-R total scores (P = 0.0004)). In our study, we incorporated core OMERACT symptom domains, which allowed for clustering based on a comprehensive symptom profile. Although our exploratory cluster solution needs confirmation in a longitudinal study, this approach could provide a rationale to support the study of individualized clinical evaluation and intervention.
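For readers unfamiliar with the clustering step used above, a minimal sketch of hierarchical agglomerative clustering on standardized symptom scores is shown below; the data, number of symptom domains, and four-cluster cut are illustrative assumptions only.

```python
# Sketch: Ward linkage on z-scored symptom domain scores, cut into 4 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

rng = np.random.default_rng(0)
# Columns could stand for pain, fatigue, sleep, mood, cognition, function.
symptoms = rng.normal(size=(200, 6))

z = zscore(symptoms, axis=0)              # put domains on a common scale
tree = linkage(z, method="ward")          # agglomerative, Ward linkage
clusters = fcluster(tree, t=4, criterion="maxclust")  # cut into 4 subgroups
print(np.bincount(clusters)[1:])          # cluster sizes
```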
Identification of PM10 air pollution origins at a rural background site
NASA Astrophysics Data System (ADS)
Reizer, Magdalena; Orza, José A. G.
2018-01-01
Trajectory cluster analysis and the concentration weighted trajectory (CWT) approach have been applied to investigate the origins of PM10 air pollution recorded at a rural background site in North-eastern Poland (Diabla Góra). Air mass back-trajectories used in this study have been computed with the Hybrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT) model for the 10-year period 2006-2015. A cluster analysis grouped back-trajectories into 7 clusters. Most of the trajectories correspond to fast and moderately moving westerly and northerly flows (45% and 25% of the cases, respectively). However, significantly higher PM10 concentrations were observed for slow-moving easterly (11%) and southerly (20%) air masses. The CWT analysis shows that high PM10 levels are observed at the Diabla Góra site when air masses originate from and pass over the heavily industrialized areas in Central-Eastern Europe located to the south and south-east of the site.
Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient.
Yao, Jianchao; Chang, Chunqi; Salmi, Mari L; Hung, Yeung Sam; Loraine, Ann; Roux, Stanley J
2008-06-18
Currently, clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two most widely-used correlations as the similarity metrics in clustering microarray data. However, these two correlations are not optimal for analyzing replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data. In this study, we describe a novel correlation coefficient, shrinkage correlation coefficient (SCC), that fully exploits the similarity between the replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group in clustering expression data, and provides a robust statistical estimation of the error of replicated microarray data. The value of SCC is revealed by its comparison with two other correlation coefficients that are currently the most widely-used (Pearson correlation coefficient and SD-weighted correlation coefficient) using statistical measures on both synthetic expression data as well as real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to the replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns. This study shows that SCC is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient, and is particularly useful for clustering replicated microarray data. This computational approach should be generally useful for proteomic data or other high-throughput analysis methodology.
Patterns of Childhood Abuse and Neglect in a Representative German Population Sample
Schilling, Christoph; Weidner, Kerstin; Brähler, Elmar; Glaesmer, Heide; Häuser, Winfried; Pöhlmann, Karin
2016-01-01
Background Different types of childhood maltreatment, like emotional abuse, emotional neglect, physical abuse, physical neglect and sexual abuse are interrelated because of their co-occurrence. Different patterns of childhood abuse and neglect are associated with the degree of severity of mental disorders in adulthood. The purpose of this study was (a) to identify different patterns of childhood maltreatment in a representative German community sample, (b) to replicate the patterns of childhood neglect and abuse recently found in a clinical German sample, (c) to examine whether participants reporting exposure to specific patterns of child maltreatment would report different levels of psychological distress, and (d) to compare the results of the typological approach and the results of a cumulative risk model based on our data set. Methods In a cross-sectional survey conducted in 2010, a representative random sample of 2504 German participants aged between 14 and 92 years completed the Childhood Trauma Questionnaire (CTQ). General anxiety and depression were assessed by standardized questionnaires (GAD-2, PHQ-2). Cluster analysis was conducted with the CTQ-subscales to identify different patterns of childhood maltreatment. Results Three different patterns of childhood abuse and neglect could be identified by cluster analysis. Cluster one showed low values on all CTQ-scales. Cluster two showed high values in emotional and physical neglect. Only cluster three showed high values in physical and sexual abuse. The three patterns of childhood maltreatment showed different degrees of depression (PHQ-2) and anxiety (GAD-2). Cluster one showed lowest levels of psychological distress, cluster three showed highest levels of mental distress. Conclusion The results show that different types of childhood maltreatment are interrelated and can be grouped into specific patterns of childhood abuse and neglect, which are associated with differing severity of psychological distress in adulthood. The results correspond to those recently found in a German clinical sample and support a typological approach in the research of maltreatment. While cumulative risk models focus on the number of maltreatment types, the typological approach takes the number as well as the severity of the maltreatment types into account. Thus, specific patterns of maltreatment can be examined with regard to specific long-term psychological consequences. PMID:27442446
Motivational Profiles of Adult Learners
ERIC Educational Resources Information Center
Rothes, Ana; Lemos, Marina S.; Gonçalves, Teresa
2017-01-01
This study investigated profiles of autonomous and controlled motivation and their effects in a sample of 188 adult learners from two Portuguese urban areas. Using a person-centered approach, results of cluster analysis and multivariate analysis of covariance revealed four motivational groups with different effects in self-efficacy, engagement,…
Parkinson's Disease Subtypes in the Oxford Parkinson Disease Centre (OPDC) Discovery Cohort.
Lawton, Michael; Baig, Fahd; Rolinski, Michal; Ruffman, Claudio; Nithi, Kannan; May, Margaret T; Ben-Shlomo, Yoav; Hu, Michele T M
2015-01-01
Within Parkinson's there is a spectrum of clinical features at presentation which may represent sub-types of the disease. However, there is no widely accepted consensus on how best to group patients. We used a data-driven approach to unravel any heterogeneity in the Parkinson's phenotype in a well-characterised, population-based incidence cohort. A total of 769 consecutive patients, with mean disease duration of 1.3 years, were assessed using a broad range of motor, cognitive and non-motor metrics. Multiple imputation was carried out using the chained equations approach to deal with missing data. We used an exploratory and then a confirmatory factor analysis to determine suitable domains to include within our cluster analysis. K-means cluster analysis of the factor scores and all the variables not loading into a factor was used to determine phenotypic subgroups. Our factor analysis found three important factors that were characterised by: psychological well-being features; non-tremor motor features, such as posture and rigidity; and cognitive features. Our subsequent five-cluster model identified groups characterised by (1) mild motor and non-motor disease (25.4%), (2) poor posture and cognition (23.3%), (3) severe tremor (20.8%), (4) poor psychological well-being, RBD and sleep (18.9%), and (5) severe motor and non-motor disease with poor psychological well-being (11.7%). Our approach identified several Parkinson's phenotypic sub-groups driven by largely dopaminergic-resistant features (RBD, impaired cognition and posture, poor psychological well-being) that, in addition to dopaminergic-responsive motor features, may be important for studying the aetiology, progression, and medication response of early Parkinson's.
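A hedged sketch of the "factor scores, then k-means" workflow is given below on simulated data; only the counts of factors and clusters mirror the study, and everything else (data, scaling, seeds) is an assumption.

```python
# Sketch: factor analysis to extract domain scores, then k-means on the scores.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
metrics = rng.normal(size=(769, 20))          # motor, cognitive, non-motor items

scores = FactorAnalysis(n_components=3, random_state=0).fit_transform(
    StandardScaler().fit_transform(metrics))
subgroups = KMeans(n_clusters=5, n_init=20, random_state=0).fit_predict(scores)
print(np.bincount(subgroups))                 # subgroup sizes
```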
Fast Whole-Engine Stirling Analysis
NASA Technical Reports Server (NTRS)
Dyson, Rodger W.; Wilson, Scott D.; Tew, Roy C.; Demko, Rikako
2006-01-01
This presentation discusses the simulation approach to whole-engine analysis for physical consistency, REV regenerator modeling, grid layering for smoothness and quality, conjugate heat transfer method adjustment, a high-speed, low-cost parallel cluster, and debugging.
Gartstein, Maria A; Prokasky, Amanda; Bell, Martha Ann; Calkins, Susan; Bridgett, David J; Braungart-Rieker, Julia; Leerkes, Esther; Cheatham, Carol L; Eiden, Rina D; Mize, Krystal D; Jones, Nancy Aaron; Mireault, Gina; Seamon, Erich
2017-10-01
There is renewed interest in person-centered approaches to understanding the structure of temperament. However, questions concerning temperament types are not frequently framed in a developmental context, especially during infancy. In addition, the most common person-centered techniques, cluster analysis (CA) and latent profile analysis (LPA), have not been compared with respect to derived temperament types. To address these gaps, we set out to identify temperament types for younger and older infants, comparing LPA and CA techniques. Multiple data sets (N = 1,356; 672 girls, 677 boys) with maternal ratings of infant temperament obtained using the Infant Behavior Questionnaire-Revised (Gartstein & Rothbart, 2003) were combined. All infants were between 3 and 12 months of age (M = 7.85; SD = 3.00). Due to rapid development in the first year of life, LPA and CA were performed separately for younger (n = 731; 3 to 8 months of age) and older (n = 625; 9 to 12 months of age) infants. Results supported 3-profile/cluster solutions as optimal for younger infants, and 5-profile/cluster solutions for the older subsample, indicating considerable differences between early/mid and late infancy. LPA and CA solutions produced relatively comparable types for younger and older infants. Results are discussed in the context of developmental changes unique to the end of the first year of life, which likely account for the present findings. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Ng, Edmond S-W; Diaz-Ordaz, Karla; Grieve, Richard; Nixon, Richard M; Thompson, Simon G; Carpenter, James R
2016-10-01
Multilevel models provide a flexible modelling framework for cost-effectiveness analyses that use cluster randomised trial data. However, there is a lack of guidance on how to choose the most appropriate multilevel models. This paper illustrates an approach for deciding what level of model complexity is warranted; in particular how best to accommodate complex variance-covariance structures, right-skewed costs and missing data. Our proposed models differ according to whether or not they allow individual-level variances and correlations to differ across treatment arms or clusters and by the assumed cost distribution (Normal, Gamma, Inverse Gaussian). The models are fitted by Markov chain Monte Carlo methods. Our approach to model choice is based on four main criteria: the characteristics of the data, model pre-specification informed by the previous literature, diagnostic plots and assessment of model appropriateness. This is illustrated by re-analysing a previous cost-effectiveness analysis that uses data from a cluster randomised trial. We find that the most useful criterion for model choice was the deviance information criterion, which distinguishes amongst models with alternative variance-covariance structures, as well as between those with different cost distributions. This strategy for model choice can help cost-effectiveness analyses provide reliable inferences for policy-making when using cluster trials, including those with missing data. © The Author(s) 2013.
NASA Astrophysics Data System (ADS)
Besic, Nikola; Ventura, Jordi Figueras i.; Grazioli, Jacopo; Gabella, Marco; Germann, Urs; Berne, Alexis
2016-09-01
Polarimetric radar-based hydrometeor classification is the procedure of identifying different types of hydrometeors by exploiting polarimetric radar observations. The main drawback of the existing supervised classification methods, mostly based on fuzzy logic, is a significant dependency on a presumed electromagnetic behaviour of different hydrometeor types. Namely, the results of the classification largely rely upon the quality of scattering simulations. When it comes to the unsupervised approach, it lacks the constraints related to the hydrometeor microphysics. The idea of the proposed method is to compensate for these drawbacks by combining the two approaches in a way that microphysical hypotheses can, to a degree, adjust the content of the classes obtained statistically from the observations. This is done by means of an iterative approach, performed offline, which, in a statistical framework, examines clustered representative polarimetric observations by comparing them to the presumed polarimetric properties of each hydrometeor class. Aside from comparing, a routine alters the content of clusters by encouraging further statistical clustering in case of non-identification. By merging all identified clusters, the multi-dimensional polarimetric signatures of various hydrometeor types are obtained for each of the studied representative datasets, i.e. for each radar system of interest. These are depicted by sets of centroids which are then employed in operational labelling of different hydrometeors. The method has been applied on three C-band datasets, each acquired by different operational radar from the MeteoSwiss Rad4Alp network, as well as on two X-band datasets acquired by two research mobile radars. The results are discussed through a comparative analysis which includes a corresponding supervised and unsupervised approach, emphasising the operational potential of the proposed method.
Miladinia, Mojtaba; Baraz, Shahram; Shariati, Abdolali; Malehi, Amal Saki
Patients with acute leukemia usually experience pain, fatigue, and sleep disorders, which affect their quality of life. Massage therapy, as a nondrug approach, can be useful in controlling such problems. However, very few studies have been conducted on the effects of massage therapy on the complications of leukemia. The aim of this study was to examine the effects of slow-stroke back massage (SSBM) on the symptom cluster in adult patients with acute leukemia undergoing chemotherapy. In this randomized controlled trial, 60 patients with acute leukemia were allocated randomly to either the intervention or control group. The intervention group received SSBM 3 times a week (every other day for 10 minutes) for 4 weeks. The pain, fatigue, and sleep disorder intensities were measured using the numeric rating scale. Sleep quality was measured using the Pittsburgh Sleep Quality Index. The chi-square test, t test, and repeated-measures analysis of variance were used for data analysis. Results showed that the SSBM intervention significantly reduced the progression of pain, fatigue, and sleep disorders, and improved sleep quality over time. Slow-stroke back massage, as a simple, noninvasive, and cost-effective approach, along with routine nursing care, can be used to improve the symptom cluster of pain, fatigue, and sleep disorders in leukemia patients. Oncology nurses can increase their knowledge regarding this symptom cluster and work to diminish the cluster components by using SSBM in adult leukemia patients.
Coronal Mass Ejection Data Clustering and Visualization of Decision Trees
NASA Astrophysics Data System (ADS)
Ma, Ruizhe; Angryk, Rafal A.; Riley, Pete; Filali Boubrahimi, Soukaina
2018-05-01
Coronal mass ejections (CMEs) can be categorized as either “magnetic clouds” (MCs) or non-MCs. Features such as a large magnetic field, low plasma-beta, and low proton temperature suggest that a CME event is also an MC event; however, so far there is neither a definitive method nor an automatic process to distinguish the two. Human labeling is time-consuming, and results can fluctuate owing to the imprecise definition of such events. In this study, we approach the problem of MC and non-MC distinction from a time series data analysis perspective and show how clustering can shed some light on this problem. Although many algorithms exist for traditional data clustering in the Euclidean space, they are not well suited for time series data. Problems such as inadequate distance measure, inaccurate cluster center description, and lack of intuitive cluster representations need to be addressed for effective time series clustering. Our data analysis in this work is twofold: clustering and visualization. For clustering we compared the results from the popular hierarchical agglomerative clustering technique to a distance density clustering heuristic we developed previously for time series data clustering. In both cases, dynamic time warping will be used for similarity measure. For classification as well as visualization, we use decision trees to aggregate single-dimensional clustering results to form a multidimensional time series decision tree, with averaged time series to present each decision. In this study, we achieved modest accuracy and, more importantly, an intuitive interpretation of how different parameters contribute to an MC event.
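To make the time series side concrete, here is a plain-numpy dynamic time warping distance plus hierarchical clustering on the resulting distance matrix; the synthetic sine-wave series and two-cluster cut are illustrative assumptions, not CME data or the authors' distance density heuristic.

```python
# Sketch: DTW distances between time series, then average-linkage clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    # Classic O(len(a)*len(b)) dynamic-programming DTW with absolute cost.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 60)
series = [np.sin(2 * np.pi * (2 + (i % 2)) * t) + 0.1 * rng.normal(size=60)
          for i in range(10)]

n = len(series)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw(series[i], series[j])

# Average linkage on the precomputed DTW distances, cut into two clusters.
labels = fcluster(linkage(squareform(dist), method="average"),
                  t=2, criterion="maxclust")
print(labels)
```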
Electrofacies analysis for coal lithotype profiling based on high-resolution wireline log data
NASA Astrophysics Data System (ADS)
Roslin, A.; Esterle, J. S.
2016-06-01
The traditional approach to coal lithotype analysis is based on a visual characterisation of coal in core, mine or outcrop exposures. As not all wells are fully cored, the petroleum and coal mining industries increasingly use geophysical wireline logs for lithology interpretation. This study demonstrates a method for interpreting coal lithotypes from geophysical wireline logs, and in particular discriminating between bright or banded, and dull coal at similar densities to a decimetre level. The study explores the optimum combination of geophysical log suites for training the coal electrofacies interpretation, using a neural network approach, and then propagating the results to wells with fewer wireline data. This approach is objective and has a recordable reproducibility and rule set. In addition to conventional gamma ray and density logs, laterolog resistivity, microresistivity and PEF data were used in the study. Array resistivity data from a compact micro imager (CMI tool) were processed into a single microresistivity curve and integrated with the conventional resistivity data in the cluster analysis. Microresistivity data were included to test the hypothesis that the improved vertical resolution of the microresistivity curve can enhance the accuracy of the clustering analysis. The addition of the PEF log allowed discrimination between low density bright to banded coal electrofacies and low density inertinite-rich dull electrofacies. The results of the clustering analysis were validated statistically, and the electrofacies results were compared to manually derived coal lithotype logs.
The development of an episode selection and aggregation approach, designed to support distributional estimation of use with the Models-3 Community Multiscale Air Quality (CMAQ) model, is described. The approach utilized cluster analysis of the 700-hPa east-west and north-south...
ERIC Educational Resources Information Center
Corpus, Jennifer Henderlong; Wormington, Stephanie V.; Haimovitz, Kyla
2016-01-01
A person-centered, mixed-methods approach (self-report surveys, semistructured interviews, school records) was used to characterize and evaluate profiles of intrinsic and extrinsic motivations among 243 third- through eighth-grade students. Cluster analysis suggested four distinct profiles: high quantity (high intrinsic, high extrinsic), primarily…
Swarm Intelligence in Text Document Clustering
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cui, Xiaohui; Potok, Thomas E
2008-01-01
Social animals or insects in nature often exhibit a form of emergent collective behavior. The research field that attempts to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies is called Swarm Intelligence. Compared to traditional algorithms, swarm algorithms are usually flexible, robust, decentralized and self-organized. These characteristics make the swarm algorithms suitable for solving complex problems, such as document collection clustering. The major challenge in today's information society is that users are overwhelmed with information on any topic they search for. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize this overwhelming volume of information. In this chapter, we introduce three nature-inspired swarm intelligence clustering approaches for document clustering analysis. These clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools and ant food foraging.
Mixed Pattern Matching-Based Traffic Abnormal Behavior Recognition
Cui, Zhiming; Zhao, Pengpeng
2014-01-01
A motion trajectory is an intuitive representation, in the time-space domain, of the micromotion behavior of a moving target. Trajectory analysis is an important approach to recognizing abnormal behaviors of moving targets. To deal with the complexity of vehicle trajectories, this paper first proposed a trajectory pattern learning method based on dynamic time warping (DTW) and spectral clustering. It introduced the DTW distance to measure the distances between vehicle trajectories and determined the number of clusters automatically by a spectral clustering algorithm based on the distance matrix. Then, it clustered sample data points into different clusters. After the spatial patterns and direction patterns were learned from the clusters, a recognition method for detecting abnormal vehicle behaviors based on mixed pattern matching was proposed. The experimental results show that the proposed technical scheme can recognize the main types of abnormal traffic behavior effectively and has good robustness. A real-world application verified its feasibility and validity. PMID:24605045
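Assuming a pairwise DTW distance matrix has already been computed (for example with a routine like the one sketched earlier for time series), spectral clustering with a precomputed affinity can be applied roughly as below; the Gaussian conversion, bandwidth choice, and fixed cluster count are assumptions, whereas the paper determines the number of clusters automatically.

```python
# Sketch: convert a precomputed DTW distance matrix to an affinity matrix
# and run spectral clustering on it.
import numpy as np
from sklearn.cluster import SpectralClustering

def spectral_from_distances(dtw_dist, n_clusters=3):
    sigma = np.median(dtw_dist[dtw_dist > 0])        # simple bandwidth choice
    affinity = np.exp(-(dtw_dist ** 2) / (2 * sigma ** 2))
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity="precomputed",
                               random_state=0)
    return model.fit_predict(affinity)

# Example with random placeholder distances (symmetric, zero diagonal):
rng = np.random.default_rng(0)
d = rng.random((12, 12))
d = (d + d.T) / 2
np.fill_diagonal(d, 0.0)
print(spectral_from_distances(d, n_clusters=3))
```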
Ranked centroid projection: a data visualization approach with self-organizing maps.
Yen, G G; Wu, Z
2008-02-01
The self-organizing map (SOM) is an efficient tool for visualizing high-dimensional data. In this paper, the clustering and visualization capabilities of the SOM, especially in the analysis of textual data, i.e., document collections, are reviewed and further developed. A novel clustering and visualization approach based on the SOM is proposed for the task of text mining. The proposed approach first transforms the document space into a multidimensional vector space by means of document encoding. Afterwards, a growing hierarchical SOM (GHSOM) is trained and used as a baseline structure to automatically produce maps with various levels of detail. Following the GHSOM training, the new projection method, namely the ranked centroid projection (RCP), is applied to project the input vectors to a hierarchy of 2-D output maps. The RCP is used as a data analysis tool as well as a direct interface to the data. In a set of simulations, the proposed approach is applied to an illustrative data set and two real-world scientific document collections to demonstrate its applicability.
Machine-learned cluster identification in high-dimensional data.
Ultsch, Alfred; Lötsch, Jörn
2017-02-01
High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the cluster algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM). Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM, the distance structure in the high dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms including single linkage, Ward and k-means. Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. However, ESOM/U-matrix was correct in identifying clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high dimensional biomedical data. The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
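The paper's central caution can be reproduced in a few lines: run k-means and Ward linkage on uniformly random data and both will return a partition regardless of the absence of structure. This small demonstration is an illustration of the point, not part of the study.

```python
# Demonstration: classical algorithms report "clusters" in structureless data.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))          # uniformly random, no cluster structure

for name, model in [("k-means", KMeans(n_clusters=3, n_init=10, random_state=0)),
                    ("Ward", AgglomerativeClustering(n_clusters=3, linkage="ward"))]:
    labels = model.fit_predict(X)
    # Cluster sizes look plausible even though the silhouette score is low.
    print(name, np.bincount(labels), round(silhouette_score(X, labels), 3))
```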
Defining objective clusters for rabies virus sequences using affinity propagation clustering
Fischer, Susanne; Freuling, Conrad M.; Pfaff, Florian; Bodenhofer, Ulrich; Höper, Dirk; Fischer, Mareike; Marston, Denise A.; Fooks, Anthony R.; Mettenleiter, Thomas C.; Conraths, Franz J.; Homeier-Bachmann, Timo
2018-01-01
Rabies is caused by lyssaviruses, and is one of the oldest known zoonoses. In recent years, more than 21,000 nucleotide sequences of rabies viruses (RABV), from the prototype species rabies lyssavirus, have been deposited in public databases. Subsequent phylogenetic analyses in combination with metadata suggest geographic distributions of RABV. However, these analyses face technical difficulties in defining verifiable criteria for cluster allocation in phylogenetic trees, inviting a more rational approach. Therefore, we applied a relatively new mathematical clustering algorithm named 'affinity propagation clustering' (AP) to propose a standardized sub-species classification utilizing full-genome RABV sequences. Because AP is computationally fast and works with any meaningful measure of similarity between data samples, it has previously been applied successfully in bioinformatics, for example for the analysis of microarray and gene expression data; however, cluster analysis of sequences is still in its infancy. Existing (516) and original (46) full genome RABV sequences were used to demonstrate the application of AP for RABV clustering. On a global scale, AP proposed four clusters, i.e. New World, Arctic/Arctic-like, Cosmopolitan, and Asian, as previously assigned by phylogenetic studies. By combining AP with established phylogenetic analyses, it is possible to resolve phylogenetic relationships between verifiably determined clusters and sequences. This workflow will be useful in confirming cluster distributions in a uniform, transparent manner, not only for RABV, but also for other comparative sequence analyses. PMID:29357361
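A hedged sketch of affinity propagation on a precomputed similarity matrix is shown below using scikit-learn; the toy feature vectors and negative squared Euclidean similarity are assumptions, whereas a sequence application would use an alignment- or distance-based similarity between genomes.

```python
# Sketch: affinity propagation with a precomputed similarity matrix.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(30, 5)) for c in (0, 3, 6, 9)])

# Negative squared Euclidean distance as the similarity; for sequences this
# would be replaced by a pairwise sequence similarity matrix.
similarity = -pairwise_distances(X, metric="sqeuclidean")
ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(similarity)
print("clusters found:", len(ap.cluster_centers_indices_))
```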
NASA Astrophysics Data System (ADS)
Kawahara, Hajime; Reese, Erik D.; Kitayama, Tetsu; Sasaki, Shin; Suto, Yasushi
2008-11-01
Our previous analysis indicates that small-scale fluctuations in the intracluster medium (ICM) from cosmological hydrodynamic simulations follow the lognormal probability density function. In order to test the lognormal nature of the ICM directly against X-ray observations of galaxy clusters, we develop a method of extracting statistical information about the three-dimensional properties of the fluctuations from the two-dimensional X-ray surface brightness. We first create a set of synthetic clusters with lognormal fluctuations around their mean profile given by spherical isothermal β-models, later considering polytropic temperature profiles as well. Performing mock observations of these synthetic clusters, we find that the resulting X-ray surface brightness fluctuations also follow the lognormal distribution fairly well. Systematic analysis of the synthetic clusters provides an empirical relation between the three-dimensional density fluctuations and the two-dimensional X-ray surface brightness. We analyze Chandra observations of the galaxy cluster Abell 3667, and find that its X-ray surface brightness fluctuations follow the lognormal distribution. While the lognormal model was originally motivated by cosmological hydrodynamic simulations, this is the first observational confirmation of the lognormal signature in a real cluster. Finally we check the synthetic cluster results against clusters from cosmological hydrodynamic simulations. As a result of the complex structure exhibited by simulated clusters, the empirical relation between the two- and three-dimensional fluctuation properties calibrated with synthetic clusters when applied to simulated clusters shows large scatter. Nevertheless we are able to reproduce the true value of the fluctuation amplitude of simulated clusters within a factor of 2 from their two-dimensional X-ray surface brightness alone. Our current methodology combined with existing observational data is useful in describing and inferring the statistical properties of the three-dimensional inhomogeneity in galaxy clusters.
NASA Astrophysics Data System (ADS)
Li, Hui; Yu, Jun-Ling; Yu, Le-An; Sun, Jie
2014-05-01
Case-based reasoning (CBR) is one of the main methods in business forecasting; it performs well in prediction and can provide explanations for its results. In business failure prediction (BFP), the number of failed enterprises is relatively small compared with the number of non-failed ones. However, the loss is huge when an enterprise fails. Therefore, it is necessary to develop methods, trained on imbalanced samples, that forecast well for this small proportion of failed enterprises while also achieving high total accuracy. Commonly used methods built on the assumption of balanced samples do not predict minority samples well on imbalanced samples consisting of the minority/failed enterprises and the majority/non-failed ones. This article develops a new method called clustering-based CBR (CBCBR), which integrates clustering analysis, an unsupervised process, with CBR, a supervised process, to enhance the efficiency of retrieving information from both the minority and the majority in CBR. In CBCBR, various case classes are first generated through hierarchical clustering of the stored experienced cases, and class centres are calculated by integrating the information of the cases in each clustered class. When predicting the label of a target case, its nearest clustered case class is first retrieved by ranking similarities between the target case and each clustered case class centre. Then, nearest neighbours of the target case in the determined clustered case class are retrieved. Finally, labels of the nearest experienced cases are used in prediction. In the empirical experiment with two imbalanced samples from China, the performance of CBCBR was compared with the classical CBR, a support vector machine, logistic regression and multivariate discriminant analysis. The results show that compared with the other four methods, CBCBR performed significantly better in terms of sensitivity for identifying the minority samples while also generating high total accuracy. The proposed approach makes CBR useful for forecasting with imbalanced samples.
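The retrieval logic of the clustering-based CBR idea can be sketched as follows; the synthetic case base, cluster count, neighbourhood size, and majority vote are illustrative assumptions rather than the authors' exact procedure.

```python
# Sketch: cluster stored cases, keep per-cluster centroids, route a target
# case to its nearest cluster, then vote among neighbours inside it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                      # stored cases (features)
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 1).astype(int)   # imbalanced labels

labels = fcluster(linkage(X, method="ward"), t=6, criterion="maxclust")
centroids = np.vstack([X[labels == c].mean(axis=0) for c in range(1, 7)])

def predict(target, k=5):
    c = 1 + np.argmin(np.linalg.norm(centroids - target, axis=1))  # nearest class
    idx = np.where(labels == c)[0]
    nearest = idx[np.argsort(np.linalg.norm(X[idx] - target, axis=1))[:k]]
    return int(round(y[nearest].mean()))           # majority vote of neighbours

print(predict(rng.normal(size=8)))
```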
Pyne, Matthew I.; Carlisle, Daren M.; Konrad, Christopher P.; Stein, Eric D.
2017-01-01
Regional classification of streams is an early step in the Ecological Limits of Hydrologic Alteration framework. Many stream classifications are based on an inductive approach using hydrologic data from minimally disturbed basins, but this approach may underrepresent streams from heavily disturbed basins or sparsely gaged arid regions. An alternative is a deductive approach, using watershed climate, land use, and geomorphology to classify streams, but this approach may miss important hydrological characteristics of streams. We classified all stream reaches in California using both approaches. First, we used Bayesian and hierarchical clustering to classify reaches according to watershed characteristics. Streams were clustered into seven classes according to elevation, sedimentary rock, and winter precipitation. Permutation-based analysis of variance and random forest analyses were used to determine which hydrologic variables best separate streams into their respective classes. Stream typology (i.e., the class that a stream reach is assigned to) is shaped mainly by patterns of high and mean flow behavior within the stream's landscape context. Additionally, random forest was used to determine which hydrologic variables best separate minimally disturbed reference streams from non-reference streams in each of the seven classes. In contrast to stream typology, deviation from reference conditions is more difficult to detect and is largely defined by changes in low-flow variables, average daily flow, and duration of flow. Our combined deductive/inductive approach allows us to estimate flow under minimally disturbed conditions based on the deductive analysis and compare to measured flow based on the inductive analysis in order to estimate hydrologic change.
Fogel, Paul; Gaston-Mathé, Yann; Hawkins, Douglas; Fogel, Fajwel; Luta, George; Young, S. Stanley
2016-01-01
Often data can be represented as a matrix, e.g., observations as rows and variables as columns, or as a doubly classified contingency table. Researchers may be interested in clustering the observations, the variables, or both. If the data is non-negative, then Non-negative Matrix Factorization (NMF) can be used to perform the clustering. By its nature, NMF-based clustering is focused on the large values. If the data is normalized by subtracting the row/column means, it becomes of mixed signs and the original NMF cannot be used. Our idea is to split and then concatenate the positive and negative parts of the matrix, after taking the absolute value of the negative elements. NMF applied to the concatenated data, which we call PosNegNMF, offers the advantages of the original NMF approach, while giving equal weight to large and small values. We use two public health datasets to illustrate the new method and compare it with alternative clustering methods, such as K-means and clustering methods based on the Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). With the exception of situations where a reasonably accurate factorization can be achieved using the first SVD component, we recommend that the epidemiologists and environmental scientists use the new method to obtain clusters with improved quality and interpretability. PMID:27213413
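A minimal sketch of the PosNegNMF construction described above: centre the data, split it into positive and negative parts, concatenate them, and run standard NMF on the non-negative result, then assign clusters by the dominant component. The data, component count, and initialization are assumptions.

```python
# Sketch of the PosNegNMF idea on mixed-sign (centred) data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))
Xc = X - X.mean(axis=0)                      # centred, hence mixed signs

pos = np.clip(Xc, 0, None)                   # positive part
neg = np.clip(-Xc, 0, None)                  # absolute value of negative part
Xsplit = np.hstack([pos, neg])               # non-negative, twice the columns

W = NMF(n_components=3, init="nndsvda", max_iter=500,
        random_state=0).fit_transform(Xsplit)
clusters = W.argmax(axis=1)                  # assign by dominant component
print(np.bincount(clusters))
```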
Cognitive Clusters in Specific Learning Disorder.
Poletti, Michele; Carretta, Elisa; Bonvicini, Laura; Giorgi-Rossi, Paolo
The heterogeneity among children with learning disabilities still represents a barrier and a challenge in their conceptualization. Although a dimensional approach has been gaining support, the categorical approach is still the most widely adopted, as in the recent fifth edition of the Diagnostic and Statistical Manual of Mental Disorders. The introduction of the single overarching diagnostic category of specific learning disorder (SLD) could underemphasize interindividual clinical differences regarding intracategory cognitive functioning and learning proficiency, according to current models of multiple cognitive deficits at the basis of neurodevelopmental disorders. The characterization of specific cognitive profiles associated with an already manifest SLD could help identify possible early cognitive markers of SLD risk and distinct trajectories of atypical cognitive development leading to SLD. From this perspective, we applied a cluster analysis to identify groups of children with a Diagnostic and Statistical Manual-based diagnosis of SLD with similar cognitive profiles and to describe the association between clusters and SLD subtypes. A sample of 205 children with a diagnosis of SLD was enrolled. Cluster analyses (agglomerative hierarchical and nonhierarchical iterative clustering techniques) were applied successively on 10 core subtests of the Wechsler Intelligence Scale for Children-Fourth Edition. The 4-cluster solution was adopted, and external validation found differences in terms of SLD subtype frequencies and learning proficiency among clusters. Clinical implications of these findings are discussed, tracing directions for further studies.
NASA Astrophysics Data System (ADS)
Murthy, Ch; MIEEE; Mohanta, D. K.; SMIEE; Meher, Mahendra
2017-08-01
Continuous monitoring and control of the power system is essential for its healthy operation. This can be achieved by making the system observable as well as controllable. Many efforts have been made by several researchers to make the system observable by placing Phasor Measurement Units (PMUs) at optimal locations. However, the idea of controllability with PMUs has so far not been considered. This paper shows how to check whether the system is controllable and, if it is not, how to make it controllable using a clustering approach. The IEEE 14-bus system is used to illustrate the concept of controllability.
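For context, the standard linear-systems controllability test that underlies this idea can be sketched as below; the matrices are toy placeholders rather than the IEEE 14-bus model, and the paper's clustering step for restoring controllability is not reproduced.

```python
# Sketch: a linear system x' = Ax + Bu is controllable iff the controllability
# matrix [B, AB, ..., A^(n-1)B] has full rank.
import numpy as np

def is_controllable(A, B):
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    C = np.hstack(blocks)                 # controllability matrix
    return np.linalg.matrix_rank(C) == n

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
print(is_controllable(A, B))              # True for this controllable pair
```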
Koren, Omry; Knights, Dan; Gonzalez, Antonio; Waldron, Levi; Segata, Nicola; Knight, Rob; Huttenhower, Curtis; Ley, Ruth E
2013-01-01
Recent analyses of human-associated bacterial diversity have categorized individuals into 'enterotypes' or clusters based on the abundances of key bacterial genera in the gut microbiota. There is a lack of consensus, however, on the analytical basis for enterotypes and on the interpretation of these results. We tested how the following factors influenced the detection of enterotypes: clustering methodology, distance metrics, OTU-picking approaches, sequencing depth, data type (whole genome shotgun (WGS) vs.16S rRNA gene sequence data), and 16S rRNA region. We included 16S rRNA gene sequences from the Human Microbiome Project (HMP) and from 16 additional studies and WGS sequences from the HMP and MetaHIT. In most body sites, we observed smooth abundance gradients of key genera without discrete clustering of samples. Some body habitats displayed bimodal (e.g., gut) or multimodal (e.g., vagina) distributions of sample abundances, but not all clustering methods and workflows accurately highlight such clusters. Because identifying enterotypes in datasets depends not only on the structure of the data but is also sensitive to the methods applied to identifying clustering strength, we recommend that multiple approaches be used and compared when testing for enterotypes.
Unidimensional and Bidimensional Approaches to Measuring Acculturation.
Shin, Cha-Nam; Todd, Michael; An, Kyungeh; Kim, Wonsun Sunny
2017-08-01
Researchers easily overlook the complexity of measuring acculturation. This study elaborates the shortcomings of unidimensional approaches to conceptualizing acculturation and highlights the importance of using bidimensional approaches in health research. We conducted a secondary data analysis on acculturation measures and eating habits obtained from 261 Korean American adults in a Midwestern city. Bidimensional approaches better conceptualized acculturation and explained more of the variance in eating habits than did unidimensional approaches. Bidimensional acculturation measures combined with appropriate analytical methods, such as a cluster analysis, are recommended in health research because they provide a more comprehensive understanding of acculturation and its association with health behaviors than do other methods.
Fast clustering using adaptive density peak detection.
Wang, Xiao-Feng; Xu, Yifan
2017-12-01
Common limitations of clustering methods include slow algorithm convergence, instability due to the pre-specification of several intrinsic parameters, and a lack of robustness to outliers. A recent clustering approach proposed a fast search algorithm for cluster centers based on their local densities. However, the selection of the key intrinsic parameters in the algorithm was not systematically investigated. It is relatively difficult to estimate the "optimal" parameters since the original definition of the local density in the algorithm is based on a truncated counting measure. In this paper, we propose a clustering procedure with adaptive density peak detection, where the local density is estimated through nonparametric multivariate kernel estimation. The model parameter can then be calculated from equations with statistical theoretical justification. We also develop an automatic cluster centroid selection method through maximizing an average silhouette index. The advantage and flexibility of the proposed method are demonstrated through simulation studies and the analysis of a few benchmark gene expression data sets. The method needs only a single step without any iteration and is thus fast, with great potential for big data analysis. A user-friendly R package ADPclust is developed for public use.
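A compact sketch of the density-peak idea with a kernel density estimate is given below; the bandwidth rule of thumb, the fixed choice of three peaks, and the nearest-centre assignment are simplifying assumptions (the proposed method selects centroids automatically via an average silhouette index).

```python
# Sketch: density peaks via a Gaussian kernel density (rho) and the distance
# to the nearest point of higher density (delta); centres maximize rho*delta.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

D = cdist(X, X)
h = np.percentile(D, 2)                          # crude bandwidth rule of thumb
rho = np.exp(-(D / h) ** 2).sum(axis=1)          # kernel density estimate

delta = np.empty(len(X))
for i in range(len(X)):
    higher = np.where(rho > rho[i])[0]
    delta[i] = D[i].max() if higher.size == 0 else D[i, higher].min()

centers = np.argsort(rho * delta)[-3:]           # pick the 3 strongest peaks
# Nearest-centre assignment (a simplification of the original one-pass rule).
labels = np.argmin(D[:, centers], axis=1)
print(np.bincount(labels))
```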
Yang, Guang; Nawaz, Tahir; Barrick, Thomas R; Howe, Franklyn A; Slabaugh, Greg
2015-12-01
Many approaches have been considered for automatic grading of brain tumors by means of pattern recognition with magnetic resonance spectroscopy (MRS). Providing an improved technique which can assist clinicians in accurately identifying brain tumor grades is our main objective. The proposed technique, which is based on the discrete wavelet transform (DWT) of whole-spectral or subspectral information of key metabolites, combined with unsupervised learning, inspects the separability of the extracted wavelet features from the MRS signal to aid the clustering. In total, we included 134 short echo time single voxel MRS spectra (SV MRS) in our study that cover normal controls, low grade and high grade tumors. The combination of DWT-based whole-spectral or subspectral analysis and unsupervised clustering achieved an overall clustering accuracy of 94.8% and a balanced error rate of 7.8%. To the best of our knowledge, it is the first study using DWT combined with unsupervised learning to cluster brain SV MRS. Instead of dimensionality reduction on SV MRS or feature selection using model fitting, our study provides an alternative method of extracting features to obtain promising clustering results.
Yang, Mingxing; Li, Xiumin; Li, Zhibin; Ou, Zhimin; Liu, Ming; Liu, Suhuan; Li, Xuejun; Yang, Shuyu
2013-01-01
DNA microarray analysis is characterized by obtaining a large number of gene variables from a small number of observations. Cluster analysis is widely used to analyze DNA microarray data to make classification and diagnosis of disease. Because there are so many irrelevant and insignificant genes in a dataset, a feature selection approach must be employed in data analysis. The performance of cluster analysis of this high-throughput data depends on whether the feature selection approach chooses the most relevant genes associated with disease classes. Here we proposed a new method using multiple Orthogonal Partial Least Squares-Discriminant Analysis (mOPLS-DA) models and S-plots to select the most relevant genes to conduct three-class disease classification and prediction. We tested our method using Golub's leukemia microarray data. For three classes with subtypes, we proposed hierarchical orthogonal partial least squares-discriminant analysis (OPLS-DA) models and S-plots to select features for two main classes and their subtypes. For three classes in parallel, we employed three OPLS-DA models and S-plots to choose marker genes for each class. The power of feature selection to classify and predict three-class disease was evaluated using cluster analysis. Further, the general performance of our method was tested using four public datasets and compared with those of four other feature selection methods. The results revealed that our method effectively selected the most relevant features for disease classification and prediction, and its performance was better than that of the other methods.
Deep Learning Nuclei Detection in Digitized Histology Images by Superpixels.
Sornapudi, Sudhir; Stanley, Ronald Joe; Stoecker, William V; Almubarak, Haidar; Long, Rodney; Antani, Sameer; Thoma, George; Zuna, Rosemary; Frazier, Shelliane R
2018-01-01
Advances in image analysis and computational techniques have facilitated automatic detection of critical features in histopathology images. Detection of nuclei is critical for squamous epithelium cervical intraepithelial neoplasia (CIN) classification into normal, CIN1, CIN2, and CIN3 grades. In this study, a deep learning (DL)-based nuclei segmentation approach is investigated that gathers localized information through the generation of superpixels using a simple linear iterative clustering algorithm and trains a convolutional neural network. The proposed approach was evaluated on a dataset of 133 digitized histology images and achieved an overall nuclei detection (object-based) accuracy of 95.97%, with demonstrated improvement over imaging-based and clustering-based benchmark techniques. The proposed DL-based nuclei segmentation method with superpixel analysis has shown improved segmentation results in comparison to state-of-the-art methods.
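The superpixel generation step can be sketched with scikit-image's SLIC implementation as below; the stand-in image, segment count, and compactness are assumptions, and the convolutional network that classifies each superpixel patch is omitted.

```python
# Sketch: simple linear iterative clustering (SLIC) superpixels.
import numpy as np
from skimage.segmentation import slic
from skimage.data import astronaut   # stand-in image, not a histology slide

image = astronaut()
segments = slic(image, n_segments=400, compactness=10, start_label=1)
print("number of superpixels:", segments.max())

# Each superpixel could then be cropped around its centroid and passed to a
# CNN that decides whether it contains a nucleus.
centroids = [np.mean(np.argwhere(segments == s), axis=0)
             for s in range(1, segments.max() + 1)]
print(np.round(centroids[0], 1))
```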
Sani-Kast, Nicole; Scheringer, Martin; Slomberg, Danielle; Labille, Jérôme; Praetorius, Antonia; Ollivier, Patrick; Hungerbühler, Konrad
2015-12-01
Engineered nanoparticle (ENP) fate models developed to date - aimed at predicting ENP concentration in the aqueous environment - have limited applicability because they employ constant environmental conditions along the modeled system or a highly specific environmental representation; neither approach captures the effects of spatial and/or temporal variability. To address this conceptual gap, we developed a novel modeling strategy that: 1) incorporates spatial variability in environmental conditions in an existing ENP fate model; and 2) analyzes the effect of a wide range of randomly sampled environmental conditions (representing variations in water chemistry). This approach was employed to investigate the transport of nano-TiO2 in the Lower Rhône River (France) under numerous sets of environmental conditions. The predicted spatial concentration profiles of nano-TiO2 were then grouped according to their similarity by using cluster analysis. The analysis resulted in a small number of clusters representing groups of spatial concentration profiles. All clusters show nano-TiO2 accumulation in the sediment layer, supporting results from previous studies. Analysis of the characteristic features of each cluster demonstrated a strong association between the water conditions in regions close to the ENP emission source and the cluster membership of the corresponding spatial concentration profiles. In particular, water compositions favoring heteroaggregation between the ENPs and suspended particulate matter resulted in clusters of low variability. These conditions are, therefore, reliable predictors of the eventual fate of the modeled ENPs. The conclusions from this study are also valid for ENP fate in other large river systems. Our results, therefore, shift the focus of future modeling and experimental research of ENP environmental fate to the water characteristics in regions near the expected ENP emission sources. Under conditions favoring heteroaggregation in these regions, the fate of the ENPs can be readily predicted. Copyright © 2014 Elsevier B.V. All rights reserved.
KinFin: Software for Taxon-Aware Analysis of Clustered Protein Sequences.
Laetsch, Dominik R; Blaxter, Mark L
2017-10-05
The field of comparative genomics is concerned with the study of similarities and differences between the information encoded in the genomes of organisms. A common approach is to define gene families by clustering protein sequences based on sequence similarity, and analyze protein cluster presence and absence in different species groups as a guide to biology. Due to the high dimensionality of these data, downstream analysis of protein clusters inferred from large numbers of species, or species with many genes, is nontrivial, and few solutions exist for transparent, reproducible, and customizable analyses. We present KinFin, a streamlined software solution capable of integrating data from common file formats and delivering aggregative annotation of protein clusters. KinFin delivers analyses based on systematic taxonomy of the species analyzed, or on user-defined groupings of taxa, for example, sets based on attributes such as life history traits, organismal phenotypes, or competing phylogenetic hypotheses. Results are reported through graphical and detailed text output files. We illustrate the utility of the KinFin pipeline by addressing questions regarding the biology of filarial nematodes, which include parasites of veterinary and medical importance. We resolve the phylogenetic relationships between the species and explore functional annotation of proteins in clusters in key lineages and between custom taxon sets, identifying gene families of interest. KinFin can easily be integrated into existing comparative genomic workflows, and promotes transparent and reproducible analysis of clustered protein data. Copyright © 2017 Laetsch and Blaxter.
Zhang, X; Patel, L A; Beckwith, O; Schneider, R; Weeden, C J; Kindt, J T
2017-11-14
Micelle cluster distributions from molecular dynamics simulations of a solvent-free coarse-grained model of sodium octyl sulfate (SOS) were analyzed using an improved method to extract equilibrium association constants from small-system simulations containing one or two micelle clusters at equilibrium with free surfactants and counterions. The statistical-thermodynamic and mathematical foundations of this partition-enabled analysis of cluster histograms (PEACH) approach are presented. A dramatic reduction in computational time for analysis was achieved through a strategy similar to the selector variable method to circumvent the need for exhaustive enumeration of the possible partitions of surfactants and counterions into clusters. Using statistics from a set of small-system (up to 60 SOS molecules) simulations as input, equilibrium association constants for micelle clusters were obtained as a function of both number of surfactants and number of associated counterions through a global fitting procedure. The resulting free energies were able to accurately predict micelle size and charge distributions in a large (560 molecule) system. The evolution of micelle size and charge with SOS concentration as predicted by the PEACH-derived free energies and by a phenomenological four-parameter model fit, along with the sensitivity of these predictions to variations in cluster definitions, are analyzed and discussed.
Systematic detection and classification of earthquake clusters in Italy
NASA Astrophysics Data System (ADS)
Poli, P.; Ben-Zion, Y.; Zaliapin, I. V.
2017-12-01
We perform a systematic analysis of spatio-temporal clustering of 2007-2017 earthquakes in Italy with magnitudes m>3. The study employs the nearest-neighbor approach of Zaliapin and Ben-Zion [2013a, 2013b] with basic data-driven parameters. The results indicate that seismicity in Italy (an extensional tectonic regime) is dominated by clustered events, with a smaller proportion of background events than in California. Evaluation of internal cluster properties allows separation of swarm-like from burst-like seismicity. This classification highlights a strong geographical coherence of cluster properties. Swarm-like seismicity is dominant in regions characterized by relatively slow deformation with possible elevated temperature and/or fluids (e.g. Alto Tiberina, Pollino), while burst-like seismicity is observed in crystalline tectonic regions (Alps and Calabrian Arc) and in Central Italy where moderate to large earthquakes are frequent (e.g. L'Aquila, Amatrice). To better assess the variation of seismicity style across Italy, we also perform a clustering analysis with region-specific parameters. This analysis highlights clear spatial changes of the threshold separating background and clustered seismicity, and permits better resolution of different clusters in specific geological regions. For example, a large proportion of repeaters is found in the Etna region as expected for volcanic-induced seismicity. A similar behavior is observed in the northern Apennines with high pore pressure associated with mantle degassing. The observed variations of earthquake properties highlight shortcomings of practices using large-scale average seismic properties, and point to connections between seismicity and local properties of the lithosphere. The observations help to improve the understanding of the physics governing the occurrence of earthquakes in different regions.
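The nearest-neighbor proximity underlying this approach can be sketched as follows; the parameter values (b = 1.0, fractal dimension d_f = 1.6), the flat-earth distance, and the toy catalogue are illustrative assumptions rather than the study's calibrated settings.

```python
# Sketch of the nearest-neighbor proximity eta_ij = t_ij * r_ij**d_f * 10**(-b*m_i),
# where event i is the earlier (candidate parent) event and j the later one.
import numpy as np

def nearest_neighbor_proximity(t, lon, lat, mag, b=1.0, d_f=1.6):
    """t in years, lon/lat in degrees, mag magnitudes; returns, for each event
    after the first, the index of its nearest-neighbor parent and eta."""
    n = len(t)
    parent = np.full(n, -1)
    eta = np.full(n, np.inf)
    for j in range(1, n):
        dt = t[j] - t[:j]                                     # interevent times
        dx = (lon[j] - lon[:j]) * 111.3 * np.cos(np.radians(lat[j]))
        dy = (lat[j] - lat[:j]) * 111.3                       # km (flat earth)
        r = np.maximum(np.hypot(dx, dy), 1e-3)
        prox = dt * r ** d_f * 10.0 ** (-b * mag[:j])
        parent[j] = int(np.argmin(prox))
        eta[j] = prox[parent[j]]
    return parent, eta

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 10.0, 300))
lon, lat = rng.uniform(12, 14, 300), rng.uniform(42, 44, 300)
mag = rng.uniform(3.0, 5.0, 300)
parent, eta = nearest_neighbor_proximity(t, lon, lat, mag)
# Thresholding log10(eta) separates background from clustered events.
print(np.median(np.log10(eta[1:])))
```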
Tse, Herman; Chen, Jonathan H.K.; Tang, Ying; Lau, Susanna K.P.; Woo, Patrick C.Y.
2014-01-01
Streptococcus sinensis is a recently discovered human pathogen isolated from blood cultures of patients with infective endocarditis. Its phylogenetic position, as well as those of its closely related species, remains inconclusive when single genes were used for phylogenetic analysis. For example, S. sinensis branched out from members of the anginosus, mitis, and sanguinis groups in the 16S ribosomal RNA gene phylogenetic tree, but it was clustered with members of the anginosus and sanguinis groups when groEL gene sequences were used for analysis. In this study, we sequenced the draft genome of S. sinensis and used a polyphasic approach, including concatenated genes, whole genomes, and matrix-assisted laser desorption ionization-time of flight mass spectrometry to analyze the phylogeny of S. sinensis. The size of the S. sinensis draft genome is 2.06 Mb, with GC content of 42.2%. Phylogenetic analysis using 50 concatenated genes or whole genomes revealed that S. sinensis formed a distinct cluster with Streptococcus oligofermentans and Streptococcus cristatus, and these three streptococci were clustered with the “sanguinis group.” As for phylogenetic analysis using hierarchical cluster analysis of the mass spectra of streptococci, S. sinensis also formed a distinct cluster with S. oligofermentans and S. cristatus, but these three streptococci were clustered with the “mitis group.” On the basis of the findings, we propose a novel group, named “sinensis group,” to include S. sinensis, S. oligofermentans, and S. cristatus, in the Streptococcus genus. Our study also illustrates the power of phylogenomic analyses for resolving ambiguities in bacterial taxonomy. PMID:25331233
Teng, Jade L L; Huang, Yi; Tse, Herman; Chen, Jonathan H K; Tang, Ying; Lau, Susanna K P; Woo, Patrick C Y
2014-10-20
Streptococcus sinensis is a recently discovered human pathogen isolated from blood cultures of patients with infective endocarditis. Its phylogenetic position, as well as those of its closely related species, remains inconclusive when single genes were used for phylogenetic analysis. For example, S. sinensis branched out from members of the anginosus, mitis, and sanguinis groups in the 16S ribosomal RNA gene phylogenetic tree, but it was clustered with members of the anginosus and sanguinis groups when groEL gene sequences were used for analysis. In this study, we sequenced the draft genome of S. sinensis and used a polyphasic approach, including concatenated genes, whole genomes, and matrix-assisted laser desorption ionization-time of flight mass spectrometry to analyze the phylogeny of S. sinensis. The size of the S. sinensis draft genome is 2.06 Mb, with GC content of 42.2%. Phylogenetic analysis using 50 concatenated genes or whole genomes revealed that S. sinensis formed a distinct cluster with Streptococcus oligofermentans and Streptococcus cristatus, and these three streptococci were clustered with the "sanguinis group." As for phylogenetic analysis using hierarchical cluster analysis of the mass spectra of streptococci, S. sinensis also formed a distinct cluster with S. oligofermentans and S. cristatus, but these three streptococci were clustered with the "mitis group." On the basis of the findings, we propose a novel group, named "sinensis group," to include S. sinensis, S. oligofermentans, and S. cristatus, in the Streptococcus genus. Our study also illustrates the power of phylogenomic analyses for resolving ambiguities in bacterial taxonomy. © The Author(s) 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
NASA Technical Reports Server (NTRS)
Dasarathy, B. V.
1976-01-01
An algorithm is proposed for dimensionality reduction in the context of clustering techniques based on histogram analysis. The approach is based on an evaluation of the hills and valleys in the unidimensional histograms along the different features and provides an economical means of assessing the significance of the features in a nonparametric unsupervised data environment. The method has relevance to remote sensing applications.
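As a loose modern illustration of the idea (not the original 1976 procedure), one can score each feature by the number of distinct "hills" in its one-dimensional histogram and keep only multimodal features; the bin count and peak-prominence threshold below are assumptions.

```python
# Rank features by histogram multimodality and keep those with more than one hill.
import numpy as np
from scipy.signal import find_peaks

def histogram_mode_count(x, bins=20, prominence=0.2):
    counts, _ = np.histogram(x, bins=bins, density=True)
    counts = counts / counts.max()
    peaks, _ = find_peaks(counts, prominence=prominence)
    return len(peaks)

rng = np.random.default_rng(2)
unimodal = rng.normal(size=600)
bimodal = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
X = np.column_stack([unimodal, bimodal])

scores = [histogram_mode_count(X[:, j]) for j in range(X.shape[1])]
keep = [j for j, s in enumerate(scores) if s > 1]
print(scores, keep)     # expect the bimodal feature to be retained
```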
Characterization and analysis of a transcriptome from the boreal spider crab Hyas araneus.
Harms, Lars; Frickenhaus, Stephan; Schiffer, Melanie; Mark, Felix C; Storch, Daniela; Pörtner, Hans-Otto; Held, Christoph; Lucassen, Magnus
2013-12-01
Research investigating the genetic basis of physiological responses has significantly broadened our understanding of the mechanisms underlying organismic response to environmental change. However, genomic data are currently available for few taxa only, thus excluding physiological model species from this approach. In this study we report the transcriptome of the model organism Hyas araneus from Spitsbergen (Arctic). We generated 20,479 transcripts, using the 454 GS FLX sequencing technology in combination with an Illumina HiSeq sequencing approach. Annotation by Blastx revealed 7159 blast hits in the NCBI non-redundant protein database. The comparison between the spider crab H. araneus transcriptome and EST libraries of the European lobster Homarus americanus and the porcelain crab Petrolisthes cinctipes yielded 3229/2581 sequences with a significant hit, respectively. The clustering by the Markov Clustering Algorithm (MCL) revealed a common core of 1710 clusters present in all three species and 5903 unique clusters for H. araneus. The combined sequencing approaches generated transcripts that will greatly expand the limited genomic data available for crustaceans. We introduce the MCL clustering for transcriptome comparisons as a simple approach to estimate similarities between transcriptomic libraries of different size and quality and to analyze homologies within the selected group of species. In particular, we identified a large variety of reverse transcriptase (RT) sequences not only in the H. araneus transcriptome and other decapod crustaceans, but also sea urchin, supporting the hypothesis of a heritable, anti-viral immunity and the proposed viral fragment integration by host-derived RTs in marine invertebrates. © 2013.
Profiles of Intrinsic and Extrinsic Motivations in Elementary School: A Longitudinal Analysis
ERIC Educational Resources Information Center
Corpus, Jennifer Henderlong; Wormington, Stephanie V.
2014-01-01
The authors used a person-centered, longitudinal approach to identify and evaluate naturally occurring combinations of intrinsic and extrinsic motivations among 490 third- through fifth-grade students. Cluster analysis revealed 3 groups, characterized by high levels of both motivations ("high quantity"): high intrinsic motivation but low…
ERIC Educational Resources Information Center
Hubert, Lawrence J.; Baker, Frank B.
1978-01-01
The "Traveling Salesman" and similar combinatorial programming tasks encountered in operations research are discussed as possible data analysis models in psychology, for example, in developmental scaling, Guttman scaling, profile smoothing, and data array clustering. A short overview of various computational approaches from this area of…
ERIC Educational Resources Information Center
Ullrich-French, Sarah; Cox, Anne E.; Cooper, Brittany Rhoades
2016-01-01
Previous research has used cluster analysis to examine how social physique anxiety (SPA) combines with motivation in physical education. This study utilized a more advanced analytic approach, latent profile analysis (LPA), to identify profiles of SPA and motivation regulations. Students in grades 9-12 (N = 298) completed questionnaires at two time…
Knutson, Stacy T.; Westwood, Brian M.; Leuthaeuser, Janelle B.; Turner, Brandon E.; Nguyendac, Don; Shea, Gabrielle; Kumar, Kiran; Hayden, Julia D.; Harper, Angela F.; Brown, Shoshana D.; Morris, John H.; Ferrin, Thomas E.; Babbitt, Patricia C.
2017-01-01
Abstract Protein function identification remains a significant problem. Solving this problem at the molecular functional level would allow mechanistic determinant identification—amino acids that distinguish details between functional families within a superfamily. Active site profiling was developed to identify mechanistic determinants. DASP and DASP2 were developed as tools to search sequence databases using active site profiling. Here, TuLIP (Two‐Level Iterative clustering Process) is introduced as an iterative, divisive clustering process that utilizes active site profiling to separate structurally characterized superfamily members into functionally relevant clusters. Underlying TuLIP is the observation that functionally relevant families (curated by Structure‐Function Linkage Database, SFLD) self‐identify in DASP2 searches; clusters containing multiple functional families do not. Each TuLIP iteration produces candidate clusters, each evaluated to determine if it self‐identifies using DASP2. If so, it is deemed a functionally relevant group. Divisive clustering continues until each structure is either a functionally relevant group member or a singlet. TuLIP is validated on enolase and glutathione transferase structures, superfamilies well‐curated by SFLD. Correlation is strong; small numbers of structures prevent statistically significant analysis. TuLIP‐identified enolase clusters are used in DASP2 GenBank searches to identify sequences sharing functional site features. Analysis shows a true positive rate of 96%, false negative rate of 4%, and maximum false positive rate of 4%. F‐measure and performance analysis on the enolase search results and comparison to GEMMA and SCI‐PHY demonstrate that TuLIP avoids the over‐division problem of these methods. Mechanistic determinants for enolase families are evaluated and shown to correlate well with literature results. PMID:28054422
Knutson, Stacy T; Westwood, Brian M; Leuthaeuser, Janelle B; Turner, Brandon E; Nguyendac, Don; Shea, Gabrielle; Kumar, Kiran; Hayden, Julia D; Harper, Angela F; Brown, Shoshana D; Morris, John H; Ferrin, Thomas E; Babbitt, Patricia C; Fetrow, Jacquelyn S
2017-04-01
Protein function identification remains a significant problem. Solving this problem at the molecular functional level would allow mechanistic determinant identification-amino acids that distinguish details between functional families within a superfamily. Active site profiling was developed to identify mechanistic determinants. DASP and DASP2 were developed as tools to search sequence databases using active site profiling. Here, TuLIP (Two-Level Iterative clustering Process) is introduced as an iterative, divisive clustering process that utilizes active site profiling to separate structurally characterized superfamily members into functionally relevant clusters. Underlying TuLIP is the observation that functionally relevant families (curated by Structure-Function Linkage Database, SFLD) self-identify in DASP2 searches; clusters containing multiple functional families do not. Each TuLIP iteration produces candidate clusters, each evaluated to determine if it self-identifies using DASP2. If so, it is deemed a functionally relevant group. Divisive clustering continues until each structure is either a functionally relevant group member or a singlet. TuLIP is validated on enolase and glutathione transferase structures, superfamilies well-curated by SFLD. Correlation is strong; small numbers of structures prevent statistically significant analysis. TuLIP-identified enolase clusters are used in DASP2 GenBank searches to identify sequences sharing functional site features. Analysis shows a true positive rate of 96%, false negative rate of 4%, and maximum false positive rate of 4%. F-measure and performance analysis on the enolase search results and comparison to GEMMA and SCI-PHY demonstrate that TuLIP avoids the over-division problem of these methods. Mechanistic determinants for enolase families are evaluated and shown to correlate well with literature results. © 2017 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dietrich, J.P.; et al.
Uncertainty in the mass-observable scaling relations is currently the limiting factor for galaxy cluster based cosmology. Weak gravitational lensing can provide a direct mass calibration and reduce the mass uncertainty. We present new ground-based weak lensing observations of 19 South Pole Telescope (SPT) selected clusters and combine them with previously reported space-based observations of 13 galaxy clusters to constrain the cluster mass scaling relations with the Sunyaev-Zel'dovich effect (SZE), the cluster gas mass $M_\mathrm{gas}$, and $Y_\mathrm{X}$, the product of $M_\mathrm{gas}$ and X-ray temperature. We extend a previously used framework for the analysis of scaling relations and cosmological constraints obtained from SPT-selected clusters to make use of weak lensing information. We introduce a new approach to estimate the effective average redshift distribution of background galaxies and quantify a number of systematic errors affecting the weak lensing modelling. These errors include a calibration of the bias incurred by fitting a Navarro-Frenk-White profile to the reduced shear using $N$-body simulations. We blind the analysis to avoid confirmation bias. We are able to limit the systematic uncertainties to 6.4% in cluster mass (68% confidence). Our constraints on the mass-X-ray observable scaling relations parameters are consistent with those obtained by earlier studies, and our constraints for the mass-SZE scaling relation are consistent with the simulation-based prior used in the most recent SPT-SZ cosmology analysis. We can now replace the external mass calibration priors used in previous SPT-SZ cosmology studies with a direct, internal calibration obtained on the same clusters.
Oberle, Michael; Wohlwend, Nadia; Jonas, Daniel; Maurer, Florian P.; Jost, Geraldine; Tschudin-Sutter, Sarah; Vranckx, Katleen; Egli, Adrian
2016-01-01
Background The technical, biological, and inter-center reproducibility of matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI TOF MS) typing data has not yet been explored. The aim of this study is to compare typing data from multiple centers employing bioinformatics using bacterial strains from two past outbreaks and non-related strains. Material/Methods Participants received twelve extended spectrum betalactamase-producing E. coli isolates and followed the same standard operating procedure (SOP) including a full-protein extraction protocol. All laboratories provided visually read spectra via flexAnalysis (Bruker, Germany). Raw data from each laboratory allowed calculating the technical and biological reproducibility between centers using BioNumerics (Applied Maths NV, Belgium). Results Technical and biological reproducibility ranged between 96.8–99.4% and 47.6–94.4%, respectively. The inter-center reproducibility showed a comparable clustering among identical isolates. Principal component analysis indicated a higher tendency to cluster within the same center. Therefore, we used a discriminant analysis, which completely separated the clusters. Next, we defined a reference center and performed a statistical analysis to identify specific peaks to identify the outbreak clusters. Finally, we used a classifier algorithm and a linear support vector machine on the determined peaks as classifier. A validation showed that within the set of the reference center, the identification of the cluster was 100% correct with a large contrast between the score with the correct cluster and the next best scoring cluster. Conclusions Based on the sufficient technical and biological reproducibility of MALDI-TOF MS based spectra, detection of specific clusters is possible from spectra obtained from different centers. However, we believe that a shared SOP and a bioinformatics approach are required to make the analysis robust and reliable. PMID:27798637
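A minimal stand-in for the final peak-based classification step, using a linear SVM from scikit-learn on selected peak intensities, might look like the sketch below; the peak matrix, outbreak labels, and cross-validation settings are hypothetical.

```python
# Sketch: linear SVM on discriminative MALDI-TOF peak intensities.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
peaks = rng.normal(size=(48, 25))        # 48 spectra x 25 selected peaks (placeholder)
outbreak = rng.integers(0, 3, size=48)   # cluster membership (placeholder labels)

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
print(cross_val_score(clf, peaks, outbreak, cv=5).mean())
```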
Musmeci, Nicoló; Aste, Tomaso; Di Matteo, T
2015-01-01
We quantify the amount of information filtered by different hierarchical clustering methods on correlations between stock returns comparing the clustering structure with the underlying industrial activity classification. We apply, for the first time to financial data, a novel hierarchical clustering approach, the Directed Bubble Hierarchical Tree and we compare it with other methods including the Linkage and k-medoids. By taking the industrial sector classification of stocks as a benchmark partition, we evaluate how the different methods retrieve this classification. The results show that the Directed Bubble Hierarchical Tree can outperform other methods, being able to retrieve more information with fewer clusters. Moreover, we show that the economic information is hidden at different levels of the hierarchical structures depending on the clustering method. The dynamical analysis on a rolling window also reveals that the different methods show different degrees of sensitivity to events affecting financial markets, like crises. These results can be of interest for all the applications of clustering methods to portfolio optimization and risk hedging.
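A generic version of this kind of benchmark comparison (hierarchical clustering of a correlation-based distance, scored against a sector partition with the adjusted Rand index) is sketched below; the return series, sector labels, linkage method, and number of clusters are placeholders.

```python
# Sketch: hierarchical clustering of stock-return correlations vs. a sector partition.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(4)
returns = rng.normal(size=(250, 40))            # 250 days x 40 stocks (placeholder)
sectors = np.repeat(np.arange(4), 10)           # benchmark industry partition

corr = np.corrcoef(returns.T)
dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))   # correlation distance
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")
print(adjusted_rand_score(sectors, labels))     # how well the sectors are retrieved
```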
Novel approaches to pin cluster synchronization on complex dynamical networks in Lur'e forms
NASA Astrophysics Data System (ADS)
Tang, Ze; Park, Ju H.; Feng, Jianwen
2018-04-01
This paper investigates the cluster synchronization of complex dynamical networks consisting of identical or nonidentical Lur'e systems. Due to the special topology structure of the complex networks and the existence of stochastic perturbations, a randomly occurring pinning controller is designed which not only synchronizes all Lur'e systems in the same cluster but also decreases the negative influence among different clusters. Firstly, based on an extended integral inequality, the convex combination theorem and the S-procedure, the conditions for cluster synchronization of identical Lur'e networks are derived in a convex domain. Secondly, randomly occurring adaptive pinning controllers with two independent Bernoulli stochastic variables are designed and then sufficient conditions are obtained for the cluster synchronization on complex networks consisting of nonidentical Lur'e systems. In addition, suitable control gains for successful cluster synchronization of nonidentical Lur'e networks are acquired by designing some adaptive updating laws. Finally, we present two numerical examples to demonstrate the validity of the control scheme and the theoretical analysis.
Shah, Sohil Atul
2017-01-01
Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank. PMID:28851838
Musmeci, Nicoló; Aste, Tomaso; Di Matteo, T.
2015-01-01
We quantify the amount of information filtered by different hierarchical clustering methods on correlations between stock returns comparing the clustering structure with the underlying industrial activity classification. We apply, for the first time to financial data, a novel hierarchical clustering approach, the Directed Bubble Hierarchical Tree and we compare it with other methods including the Linkage and k-medoids. By taking the industrial sector classification of stocks as a benchmark partition, we evaluate how the different methods retrieve this classification. The results show that the Directed Bubble Hierarchical Tree can outperform other methods, being able to retrieve more information with fewer clusters. Moreover, we show that the economic information is hidden at different levels of the hierarchical structures depending on the clustering method. The dynamical analysis on a rolling window also reveals that the different methods show different degrees of sensitivity to events affecting financial markets, like crises. These results can be of interest for all the applications of clustering methods to portfolio optimization and risk hedging. PMID:25786703
Combinatoric analysis of heterogeneous stochastic self-assembly.
D'Orsogna, Maria R; Zhao, Bingyu; Berenji, Bijan; Chou, Tom
2013-09-28
We analyze a fully stochastic model of heterogeneous nucleation and self-assembly in a closed system with a fixed total particle number M, and a fixed number of seeds Ns. Each seed can bind a maximum of N particles. A discrete master equation for the probability distribution of the cluster sizes is derived and the corresponding cluster concentrations are found using kinetic Monte-Carlo simulations in terms of the density of seeds, the total mass, and the maximum cluster size. In the limit of slow detachment, we also find new analytic expressions and recursion relations for the cluster densities at intermediate times and at equilibrium. Our analytic and numerical findings are compared with those obtained from classical mass-action equations and the discrepancies between the two approaches analyzed.
Coping profiles, perceived stress and health-related behaviors: a cluster analysis approach.
Doron, Julie; Trouillet, Raphael; Maneveau, Anaïs; Ninot, Grégory; Neveu, Dorine
2015-03-01
Using a cluster analytic procedure, this study aimed (i) to determine whether people could be differentiated on the basis of coping profiles (or unique combinations of coping strategies); and (ii) to examine the relationships between these profiles and perceived stress and health-related behaviors. A sample of 578 French students (345 females, 233 males; mean age = 21.78 years, SD = 2.21) completed the Perceived Stress Scale-14 (Bruchon-Schweitzer, 2002), the Brief COPE (Muller and Spitz, 2003) and a series of items measuring health-related behaviors. A two-phase cluster analytic procedure (i.e., hierarchical followed by non-hierarchical k-means) was employed to derive clusters of coping strategy profiles. The results yielded four distinctive coping profiles: High Copers, Adaptive Copers, Avoidant Copers and Low Copers. The results showed that clusters differed significantly in perceived stress and health-related behaviors. High Copers and Avoidant Copers displayed higher levels of perceived stress and engaged more in unhealthy behavior, compared with Adaptive Copers and Low Copers who reported lower levels of stress and engaged more in healthy behaviors. These findings suggested that individuals' relative reliance on some strategies and de-emphasis on others may be a more advantageous way of understanding the manner in which individuals cope with stress. Therefore, a cluster analysis approach may provide an advantage over more traditional statistical techniques by identifying distinct coping profiles that might best benefit from interventions. Future research should consider coping profiles to provide a deeper understanding of the relationships between coping strategies and health outcomes and to identify risk groups. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
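The two-phase procedure itself is generic and can be sketched as follows with scipy and scikit-learn: a Ward hierarchy suggests the number of clusters and initial centroids, and k-means then refines the partition. The coping-score matrix and the choice of k = 4 are placeholders.

```python
# Sketch: hierarchical (Ward) clustering to choose k and seeds, then k-means refinement.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
coping = StandardScaler().fit_transform(rng.normal(size=(578, 14)))  # placeholder scores

Z = linkage(coping, method="ward")
k = 4                                               # read off the dendrogram in practice
hier_labels = fcluster(Z, t=k, criterion="maxclust")

# Seed k-means with the hierarchical cluster centroids, then refine.
seeds = np.vstack([coping[hier_labels == c].mean(axis=0) for c in range(1, k + 1)])
profiles = KMeans(n_clusters=k, init=seeds, n_init=1).fit_predict(coping)
print(np.bincount(profiles))
```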
Hybrid approach of selecting hyperparameters of support vector machine for regression.
Jeng, Jin-Tsong
2006-06-01
To select the hyperparameters of the support vector machine for regression (SVR), a hybrid approach is proposed to determine the kernel parameter of the Gaussian kernel function and the epsilon value of Vapnik's epsilon-insensitive loss function. The proposed hybrid approach includes a competitive agglomeration (CA) clustering algorithm and a repeated SVR (RSVR) approach. Since the CA clustering algorithm is used to find the nearly "optimal" number of clusters and the centers of clusters in the clustering process, the CA clustering algorithm is applied to select the Gaussian kernel parameter. Additionally, an RSVR approach that relies on the standard deviation of the training error is proposed to obtain the epsilon in the loss function. Finally, two functions, one real data set (i.e., a time series of the quarterly unemployment rate for West Germany), and the identification of a nonlinear plant are used to verify the usefulness of the hybrid approach.
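A hedged sketch of the two-stage idea follows: a clustering step suggests the Gaussian kernel width, and repeated SVR fits set epsilon from the spread of the training residuals. KMeans stands in for competitive agglomeration (which scikit-learn does not provide), and the synthetic data and the center-spacing rule for the kernel width are assumptions.

```python
# Sketch: cluster-derived Gaussian kernel width plus repeated-SVR epsilon selection.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y = np.sinc(X.ravel()) + rng.normal(scale=0.1, size=200)

# Stage 1: kernel width from the typical spacing of cluster centers.
centers = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X).cluster_centers_
spacing = np.mean([np.min(np.linalg.norm(np.delete(centers, i, axis=0) - c, axis=1))
                   for i, c in enumerate(centers)])
gamma = 1.0 / (2.0 * spacing ** 2)

# Stage 2: repeated SVR, updating epsilon from the training-residual spread.
epsilon = 0.1
for _ in range(3):
    model = SVR(kernel="rbf", gamma=gamma, epsilon=epsilon).fit(X, y)
    epsilon = float(np.std(y - model.predict(X)))
print(gamma, epsilon)
```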
Transcriptional Regulatory Network Analysis of MYB Transcription Factor Family Genes in Rice.
Smita, Shuchi; Katiyar, Amit; Chinnusamy, Viswanathan; Pandey, Dev M; Bansal, Kailash C
2015-01-01
The MYB transcription factor (TF) family is one of the largest TF families and regulates defense responses to various stresses, hormone signaling, and many metabolic and developmental processes in plants. Understanding these regulatory hierarchies of gene expression networks in response to developmental and environmental cues is a major challenge due to the complex interactions between the genetic elements. Correlation analyses are useful to unravel co-regulated gene pairs governing biological processes and to identify new candidate hub genes involved in these complex processes. High-throughput expression profiling data are highly useful for construction of co-expression networks. In the present study, we utilized transcriptome data for comprehensive regulatory network studies of MYB TFs by "top-down" and "guide-gene" approaches. More than 50% of OsMYBs were strongly correlated under 50 experimental conditions with 51 hub genes via the "top-down" approach. Further, clusters were identified using Markov Clustering (MCL). To maximize the clustering performance, parameter evaluation of the MCL inflation score (I) was performed in terms of enriched GO categories by measuring the F-score. Comparison of the co-expressed clusters with clades from the phylogenetic analysis signifies their evolutionarily conserved co-regulatory role. We utilized a compendium of known interactions and biological roles, together with Gene Ontology enrichment analysis, to hypothesize functions of co-expressed OsMYBs. In the second part, transcriptional regulatory network analysis by the "guide-gene" approach revealed 40 putative targets of 26 OsMYB TF hubs with high correlation values, utilizing 815 microarray datasets. Enrichment of MYB-binding cis-elements in the promoter regions of the putative targets, together with functional co-occurrence and nuclear localization, supports our findings. Notably, enrichment of MYB-binding regions involved in drought inducibility implies a regulatory role in the drought response in rice. Thus, the co-regulatory network analysis facilitated the identification of complex OsMYB regulatory networks and candidate target regulon genes of selected guide MYB genes. The results contribute to candidate gene screening and to experimentally testable hypotheses for potential regulatory MYB TFs and their targets under stress conditions.
Modified multidimensional scaling approach to analyze financial markets.
Yin, Yi; Shang, Pengjian
2014-06-01
Detrended cross-correlation coefficient (σDCCA) and dynamic time warping (DTW) are introduced as dissimilarity measures, while multidimensional scaling (MDS) is employed to translate the dissimilarities between daily price returns of 24 stock markets. We first propose MDS based on σDCCA dissimilarity and MDS based on DTW dissimilarity, while MDS based on Euclidean dissimilarity is also employed as a reference for comparison. We apply these methods in order to further visualize the clustering between stock markets. Moreover, we compare MDS with an alternative visualization method, the "Unweighed Average" clustering method, applied to the same dissimilarity. We find that MDS gives us a more intuitive mapping for observing stable or emerging clusters of stock markets with similar behavior, while the MDS analysis based on σDCCA dissimilarity provides clearer, more detailed, and more accurate information on the classification of the stock markets than the MDS analysis based on Euclidean dissimilarity. The MDS analysis based on DTW dissimilarity reveals particularly interesting additional information about the correlations between stock markets. It also yields richer results on the clustering of stock markets than the MDS analysis based on Euclidean dissimilarity. In addition, the maps obtained from MDS based on σDCCA and DTW dissimilarities may also guide the construction of multivariate econometric models.
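A bare-bones version of the DTW-plus-MDS pipeline is sketched below, with a plain dynamic-programming DTW distance and scikit-learn's MDS on a precomputed dissimilarity matrix; the return series are synthetic placeholders for the 24 markets and the σDCCA variant is omitted.

```python
# Sketch: pairwise DTW dissimilarities between return series, embedded by MDS.
import numpy as np
from sklearn.manifold import MDS

def dtw(a, b):
    """Classical O(len(a)*len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(7)
returns = rng.normal(size=(24, 60))          # 24 markets x 60 daily returns (placeholder)

n = len(returns)
diss = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        diss[i, j] = diss[j, i] = dtw(returns[i], returns[j])

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(diss)
print(coords[:3])                            # 2-D map coordinates of the first markets
```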
Three-dimensional x-ray diffraction nanoscopy
NASA Astrophysics Data System (ADS)
Nikulin, Andrei Y.; Dilanian, Ruben A.; Zatsepin, Nadia A.; Muddle, Barry C.
2008-08-01
A novel approach to x-ray diffraction data analysis for non-destructive determination of the shape of nanoscale particles and clusters in three-dimensions is illustrated with representative examples of composite nanostructures. The technique is insensitive to the x-rays coherence, which allows 3D reconstruction of a modal image without tomographic synthesis and in-situ analysis of large (over a several cubic millimeters) volume of material with a spatial resolution of few nanometers, rendering the approach suitable for laboratory facilities.
A Cross-Cultural Comparison of Symptom Reporting and Symptom Clusters in Heart Failure.
Park, Jumin; Johantgen, Mary E
2017-07-01
An understanding of symptoms in heart failure (HF) among different cultural groups has become increasingly important. The purpose of this study was to compare symptom reporting and symptom clusters in HF patients between a Western (the United States) and an Eastern Asian sample (China and Taiwan). A secondary analysis of a cross-sectional observational study was conducted. The data were obtained from a matched HF patient sample from the United States and China/Taiwan ( N = 240 in each). Eight selective items related to HF symptoms from the Minnesota Living with Heart Failure Questionnaire were analyzed. Compared with the U.S. sample, HF patients from China/Taiwan reported a lower level of symptom distress. Analysis of two different regional groups did not result in the same number of clusters using latent class approach: the United States (four classes) and China/Taiwan (three classes). The study demonstrated that symptom reporting and identification of symptom clusters might be influenced by cultural factors.
Chang, Hsien-Tsung; Mishra, Nilamadhab; Lin, Chung-Chih
2015-01-01
The current rapid growth of Internet of Things (IoT) in various commercial and non-commercial sectors has led to the deposition of large-scale IoT data, of which the time-critical analytic and clustering of knowledge granules represent highly thought-provoking application possibilities. The objective of the present work is to inspect the structural analysis and clustering of complex knowledge granules in an IoT big-data environment. In this work, we propose a knowledge granule analytic and clustering (KGAC) framework that explores and assembles knowledge granules from IoT big-data arrays for a business intelligence (BI) application. Our work implements neuro-fuzzy analytic architecture rather than a standard fuzzified approach to discover the complex knowledge granules. Furthermore, we implement an enhanced knowledge granule clustering (e-KGC) mechanism that is more elastic than previous techniques when assembling the tactical and explicit complex knowledge granules from IoT big-data arrays. The analysis and discussion presented here show that the proposed framework and mechanism can be implemented to extract knowledge granules from an IoT big-data array in such a way as to present knowledge of strategic value to executives and enable knowledge users to perform further BI actions. PMID:26600156
Adamek, Martina; Alanjary, Mohammad; Sales-Ortells, Helena; Goodfellow, Michael; Bull, Alan T; Winkler, Anika; Wibberg, Daniel; Kalinowski, Jörn; Ziemert, Nadine
2018-06-01
Genome mining tools have enabled us to predict biosynthetic gene clusters that might encode compounds with valuable functions for industrial and medical applications. With the continuously increasing number of genomes sequenced, we are confronted with an overwhelming number of predicted clusters. In order to guide the effective prioritization of biosynthetic gene clusters towards finding the most promising compounds, knowledge about diversity, phylogenetic relationships and distribution patterns of biosynthetic gene clusters is necessary. Here, we provide a comprehensive analysis of the model actinobacterial genus Amycolatopsis and its potential for the production of secondary metabolites. A phylogenetic characterization, together with a pan-genome analysis showed that within this highly diverse genus, four major lineages could be distinguished which differed in their potential to produce secondary metabolites. Furthermore, we were able to distinguish gene cluster families whose distribution correlated with phylogeny, indicating that vertical gene transfer plays a major role in the evolution of secondary metabolite gene clusters. Still, the vast majority of the diverse biosynthetic gene clusters were derived from clusters unique to the genus, and also unique in comparison to a database of known compounds. Our study on the locations of biosynthetic gene clusters in the genomes of Amycolatopsis' strains showed that clusters acquired by horizontal gene transfer tend to be incorporated into non-conserved regions of the genome thereby allowing us to distinguish core and hypervariable regions in Amycolatopsis genomes. Using a comparative genomics approach, it was possible to determine the potential of the genus Amycolatopsis to produce a huge diversity of secondary metabolites. Furthermore, the analysis demonstrates that horizontal and vertical gene transfer play an important role in the acquisition and maintenance of valuable secondary metabolites. Our results cast light on the interconnections between secondary metabolite gene clusters and provide a way to prioritize biosynthetic pathways in the search and discovery of novel compounds.
Paternal age related schizophrenia (PARS): Latent subgroups detected by k-means clustering analysis.
Lee, Hyejoo; Malaspina, Dolores; Ahn, Hongshik; Perrin, Mary; Opler, Mark G; Kleinhaus, Karine; Harlap, Susan; Goetz, Raymond; Antonius, Daniel
2011-05-01
Paternal age related schizophrenia (PARS) has been proposed as a subgroup of schizophrenia with distinct etiology, pathophysiology and symptoms. This study uses a k-means clustering analysis approach to generate hypotheses about differences between PARS and other cases of schizophrenia. We studied PARS (operationally defined as not having any family history of schizophrenia among first and second-degree relatives and fathers' age at birth ≥ 35 years) in a series of schizophrenia cases recruited from a research unit. Data were available on demographic variables, symptoms (Positive and Negative Syndrome Scale; PANSS), cognitive tests (Wechsler Adult Intelligence Scale-Revised; WAIS-R) and olfaction (University of Pennsylvania Smell Identification Test; UPSIT). We conducted a series of k-means clustering analyses to identify clusters of cases containing high concentrations of PARS. Two analyses generated clusters with high concentrations of PARS cases. The first analysis (N=136; PARS=34) revealed a cluster containing 83% PARS cases, in which the patients showed a significant discrepancy between verbal and performance intelligence. The mean paternal and maternal ages were 41 and 33, respectively. The second analysis (N=123; PARS=30) revealed a cluster containing 71% PARS cases, of which 93% were females; the mean age of onset of psychosis, at 17.2, was significantly early. These results strengthen the evidence that PARS cases differ from other patients with schizophrenia. Hypothesis-generating findings suggest that features of PARS may include a discrepancy between verbal and performance intelligence, and in females, an early age of onset. These findings provide a rationale for separating these phenotypes from others in future clinical, genetic and pathophysiologic studies of schizophrenia and in considering responses to treatment. Copyright © 2011 Elsevier B.V. All rights reserved.
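The hypothesis-generating clustering step could be sketched generically as below with scikit-learn; the feature matrix, PARS indicator, and number of clusters are synthetic placeholders rather than the study's variables.

```python
# Sketch: k-means on standardized clinical features, then the PARS share per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
features = rng.normal(size=(136, 6))     # e.g., symptom, cognitive and olfaction summaries
is_pars = rng.random(136) < 0.25         # placeholder PARS indicator

X = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=4, n_init=20, random_state=0).fit_predict(X)

for c in range(4):
    members = labels == c
    print(f"cluster {c}: n={members.sum()}, PARS share={is_pars[members].mean():.2f}")
```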
NASA Astrophysics Data System (ADS)
Dekkers, M. J.; Heslop, D.; Herrero-Bervera, E.; Acton, G.; Krasa, D.
2014-12-01
Ocean Drilling Program (ODP)/Integrated ODP (IODP) Hole 1256D (6°44.1′ N, 91°56.1′ W) on the Cocos Plate occurs in 15.2 Ma oceanic crust generated by superfast seafloor spreading. Presently, it is the only drill hole that has sampled all three oceanic crust layers in a tectonically undisturbed setting. Here we interpret down-hole trends in several rock-magnetic parameters with fuzzy c-means cluster analysis, a multivariate statistical technique. The parameters include the magnetization ratio, the coercivity ratio, the coercive force, the low-field susceptibility, and the Curie temperature. Their combined, multivariate analysis allows the effects of magmatic and hydrothermal processes to be evaluated. The optimal number of clusters - a key point in the analysis because there is no a priori information on this - was determined through a combination of approaches: by calculation of several cluster validity indices, by testing for coherent cluster distributions on non-linear-map plots, and importantly by testing for stability of the cluster solution from all possible starting points. Here, we consider a solution robust if the cluster allocation is independent of the starting configuration. The five-cluster solution appeared to be robust. Three clusters are distinguished in the extrusive segment of the Hole that express increasing hydrothermal alteration of the lavas. The sheeted dike and gabbro portions are characterized by two clusters, both with higher coercivities than in lava samples. Extensive alteration, however, can obliterate magnetic property differences between lavas, dikes, and gabbros. The imprint of thermochemical alteration on the iron-titanium oxides is only partially related to the porosity of the rocks. All clusters display rock magnetic characteristics in line with a stable NRM. This implies that the entire sampled sequence of ocean crust can contribute to marine magnetic anomalies. Determination of the absolute paleointensity with thermal techniques is not straightforward because of the propensity of oxyexsolution during laboratory heating and/or the presence of intergrowths. The upper part of the extrusive sequence, the granoblastic portion of the dikes, and moderately altered gabbros may contain a comparatively uncontaminated thermoremanent magnetization.
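A compact numpy fuzzy c-means, with the fuzzy partition coefficient as one validity index and a crude stability check across random starting points, is sketched below; it is a generic illustration on synthetic data, not the down-hole analysis itself.

```python
# Minimal fuzzy c-means with a stability check over several random initializations.
import numpy as np
from sklearn.metrics import adjusted_rand_score

def fuzzy_cmeans(X, c, m=2.0, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)
    for _ in range(n_iter):
        Um = U ** m
        centers = Um @ X / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=0)
    fpc = (U ** 2).sum() / U.shape[1]        # fuzzy partition coefficient (validity index)
    return centers, U, fpc

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(4, 1, (100, 5))])

runs = [fuzzy_cmeans(X, c=2, seed=s) for s in range(5)]
labels = [U.argmax(axis=0) for _, U, _ in runs]
stability = [adjusted_rand_score(labels[0], l) for l in labels[1:]]
print(stability, runs[0][2])                 # identical partitions give ARI = 1
```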
NASA Astrophysics Data System (ADS)
Lespinats, Sylvain; Pinker-Domenig, Katja; Wengert, Georg; Houben, Ivo; Lobbes, Marc; Stadlbauer, Andreas; Meyer-Bäse, Anke
2016-05-01
Glioma-derived cancer stem cells (GSCs) are tumor-initiating cells and may be refractory to radiation and chemotherapy and thus have important implications for tumor biology and therapeutics. The analysis and interpretation of large proteomic data sets requires the development of new data mining and visualization approaches. Traditional techniques are insufficient to interpret and visualize these resulting experimental data. The emphasis of this paper lies in the application of novel approaches for visualization, clustering, and projection representation to unveil hidden data structures relevant for the accurate interpretation of biological experiments. These qualitative and quantitative methods are applied to the proteomic analysis of data sets derived from the GSCs. The achieved clustering and visualization results provide a more detailed insight into the protein-level fold changes and putative upstream regulators for the GSCs. However, the extracted molecular information is insufficient for classifying GSCs and for paving the way toward improved therapeutics for this heterogeneous glioma.
ERIC Educational Resources Information Center
John, Robert; Kerby, Dave S.; Hennessy, Catherine Hagan
2003-01-01
Purpose: The purpose of this study is to suggest a new approach to identifying patterns of comorbidity and multimorbidity. Design and Methods: A random sample of 1,039 rural community-resident American Indian elders aged 60 years and older was surveyed. Comorbidity was investigated with four standard approaches, and with cluster analysis. Results:…
2016-07-01
Flexible approach walls are being considered for retrofits, replacements, or upgrades to Corps lock structures that have exceeded their ... In the case of Lock and Dam 3, the peak reaction force for any individual pile group was 11% of the peak impact load.
Zhang, Bo; Liu, Wei; Zhang, Zhiwei; Qu, Yanping; Chen, Zhen; Albert, Paul S
2017-08-01
Joint modeling and within-cluster resampling are two approaches that are used for analyzing correlated data with informative cluster sizes. Motivated by a developmental toxicity study, we examined the performances and validity of these two approaches in testing covariate effects in generalized linear mixed-effects models. We show that the joint modeling approach is robust to the misspecification of cluster size models in terms of Type I and Type II errors when the corresponding covariates are not included in the random effects structure; otherwise, statistical tests may be affected. We also evaluate the performance of the within-cluster resampling procedure and thoroughly investigate the validity of it in modeling correlated data with informative cluster sizes. We show that within-cluster resampling is a valid alternative to joint modeling for cluster-specific covariates, but it is invalid for time-dependent covariates. The two methods are applied to a developmental toxicity study that investigated the effect of exposure to diethylene glycol dimethyl ether.
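Within-cluster resampling itself is straightforward to sketch: repeatedly draw one observation per cluster, fit an ordinary GLM to each resampled data set, and average the estimates. The toy data below (using statsmodels) are placeholders and do not reproduce the toxicity study.

```python
# Sketch of within-cluster resampling (WCR) for clustered binary outcomes.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n_clusters = 100
sizes = rng.integers(2, 12, size=n_clusters)            # cluster (litter) sizes vary
cluster_id = np.repeat(np.arange(n_clusters), sizes)
dose = np.repeat(rng.uniform(0, 1, n_clusters), sizes)  # cluster-specific covariate
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * dose))))

estimates = []
for _ in range(200):
    idx = [rng.choice(np.where(cluster_id == c)[0]) for c in range(n_clusters)]
    X = sm.add_constant(dose[idx])
    fit = sm.GLM(y[idx], X, family=sm.families.Binomial()).fit()
    estimates.append(fit.params)
print(np.mean(estimates, axis=0))                       # WCR point estimate (intercept, slope)
```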
Identification of sea ice types in spaceborne synthetic aperture radar data
NASA Technical Reports Server (NTRS)
Kwok, Ronald; Rignot, Eric; Holt, Benjamin; Onstott, R.
1992-01-01
This study presents an approach for identification of sea ice types in spaceborne SAR image data. The unsupervised classification approach involves cluster analysis for segmentation of the image data followed by cluster labeling based on previously defined look-up tables containing the expected backscatter signatures of different ice types measured by a land-based scatterometer. Extensive scatterometer observations and experience accumulated in field campaigns during the last 10 yr were used to construct these look-up tables. The classification approach, its expected performance, the dependence of this performance on radar system performance, and expected ice scattering characteristics are discussed. Results using both aircraft and simulated ERS-1 SAR data are presented and compared to limited field ice property measurements and coincident passive microwave imagery. The importance of an integrated postlaunch program for the validation and improvement of this approach is discussed.
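A toy version of the cluster-then-label scheme is sketched below; the single-channel k-means and the lookup-table backscatter values (in dB) are illustrative assumptions, not the published signatures or the full multichannel procedure.

```python
# Sketch: unsupervised segmentation of backscatter followed by lookup-table labeling.
import numpy as np
from sklearn.cluster import KMeans

lookup = {"open water": -18.0, "first-year ice": -12.0, "multiyear ice": -8.0}  # assumed dB values

rng = np.random.default_rng(11)
sigma0 = rng.normal(-12.0, 4.0, size=(128, 128))        # synthetic backscatter image (dB)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sigma0.reshape(-1, 1))
cluster_means = km.cluster_centers_.ravel()

names = list(lookup)
cluster_to_type = {c: names[int(np.argmin([abs(mu - lookup[t]) for t in names]))]
                   for c, mu in enumerate(cluster_means)}
ice_map = np.vectorize(cluster_to_type.get)(km.labels_).reshape(sigma0.shape)
print({t: float((ice_map == t).mean()) for t in names})
```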
Berry, Jack W; Schwebel, David C
2009-10-01
This study used two configural approaches to understand how temperament factors (surgency/extraversion, negative affect, and effortful control) might predict child injury risk. In the first approach, clustering procedures were applied to trait dimensions to identify discrete personality prototypes. In the second approach, two- and three-way trait interactions were considered dimensionally in regression models predicting injury outcomes. Injury risk was assessed through four measures: lifetime prevalence of injuries requiring professional medical attention, scores on the Injury Behavior Checklist, and frequency and severity of injuries reported in a 2-week injury diary. In the prototype analysis, three temperament clusters were obtained, which resembled resilient, overcontrolled, and undercontrolled types found in previous research. Undercontrolled children had greater risk of injury than children in the other groups. In the dimensional interaction analyses, an interaction between surgency/extraversion and negative affect tended to predict injury, especially when children lacked capacity for effortful control.
Meisel, Matthew K; Clifton, Allan D; MacKillop, James; Goodie, Adam S
2015-12-01
The current study applied egocentric social network analysis (SNA) to investigate the prevalence of addictive behavior and co-occurring substance use in college students' networks. Specifically, we examined individuals' perceptions of the frequency of network members' co-occurring addictive behavior and investigated whether co-occurring addictive behavior is spread evenly throughout networks or is more localized in clusters. We also examined differences in network composition between individuals with varying levels of alcohol use. The study utilized an egocentric SNA approach in which respondents ("egos") enumerated 30 of their closest friends, family members, co-workers, and significant others ("alters") and the relations among alters listed. Participants were 281 undergraduates at a large university in the Southeastern United States. Robust associations were observed among the frequencies of gambling, smoking, drinking, and using marijuana by network members. We also found that alters tended to cluster together into two distinct groups: one cluster moderate-to-high on co-occurring addictive behavior and the other low on co-occurring addictive behavior. Lastly, significant differences were present when examining egos' perceptions of alters' substance use between the networks of at-risk, light, and nondrinkers. These findings provide empirical evidence of distinct clustering of addictive behavior among young adults and suggest the promise of social network-based interventions for this cohort. Copyright © 2015. Published by Elsevier Ltd.
Li, Meng-Hua
2014-01-01
When an enterprise has thousands of varieties in its inventory, the use of a single management method could not be a feasible approach. A better way to manage this problem would be to categorise inventory items into several clusters according to inventory decisions and to use different management methods for managing different clusters. The present study applies DPSO (dynamic particle swarm optimisation) to a problem of clustering of inventory items. Without the requirement of prior inventory knowledge, inventory items are automatically clustered into near optimal clustering number. The obtained clustering results should satisfy the inventory objective equation, which consists of different objectives such as total cost, backorder rate, demand relevance, and inventory turnover rate. This study integrates the above four objectives into a multiobjective equation, and inputs the actual inventory items of the enterprise into DPSO. In comparison with other clustering methods, the proposed method can consider different objectives and obtain an overall better solution to obtain better convergence results and inventory decisions. PMID:25197713
Coherent clusters of inertial particles in homogeneous turbulence
NASA Astrophysics Data System (ADS)
Baker, Lucia; Frankel, Ari; Mani, Ali; Coletti, Filippo
2016-11-01
Clustering of heavy particles in turbulent flows manifests itself in a broad spectrum of physical phenomena, including sediment transport, cloud formation, and spray combustion. However, a clear topological definition of particle clusters has been lacking, limiting our ability to describe their features and dynamics. Here we introduce a definition of coherent cluster based on self-similarity, and apply it to the distribution of heavy particles in direct numerical simulations of homogeneous isotropic turbulence. We consider a range of particle Stokes numbers, with and without the effect of gravity. Clusters show self-similarity at length scales larger than twice the Kolmogorov length, with a specific fractal dimension. In the absence of gravity, clusters demonstrate a tendency to sample regions of the flow where strain is dominant over vorticity, and to align themselves with the local vorticity vector; when gravity is present, the clusters tend to align themselves with gravity, and their fall speed is different from the average settling velocity. This approach yields observations which are consistent with findings obtained from previous studies while opening new avenues for analysis of the topology and evolution of particle clusters in a wealth of applications.
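Box counting is one standard way to estimate a fractal dimension for a set of particle positions; the sketch below is a generic illustration of that idea on synthetic points, not the paper's specific self-similarity-based cluster definition.

```python
# Box-counting estimate of a fractal dimension for a 3-D set of particle
# positions; a generic sketch on synthetic data, not the paper's method.
import numpy as np

def box_counting_dimension(points, box_sizes):
    """points: (N, 3) array in a unit cube; returns slope of log N(eps) vs log(1/eps)."""
    counts = []
    for eps in box_sizes:
        idx = np.floor(points / eps).astype(int)     # box index of each particle
        counts.append(len(np.unique(idx, axis=0)))   # number of occupied boxes
    coeffs = np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)
    return coeffs[0]

rng = np.random.default_rng(0)
pts = rng.random((5000, 3))                          # placeholder particle positions
print(box_counting_dimension(pts, [0.2, 0.1, 0.05, 0.025]))
```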
Scalable cluster administration - Chiba City I approach and lessons learned.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Navarro, J. P.; Evard, R.; Nurmi, D.
2002-07-01
Systems administrators of large clusters often need to perform the same administrative activity hundreds or thousands of times. Often such activities are time-consuming, especially the tasks of installing and maintaining software. By combining network services such as DHCP, TFTP, FTP, HTTP, and NFS with remote hardware control, cluster administrators can automate all administrative tasks. Scalable cluster administration addresses the following challenge: What systems design techniques can cluster builders use to automate cluster administration on very large clusters? We describe the approach used in the Mathematics and Computer Science Division of Argonne National Laboratory on Chiba City I, a 314-node Linux cluster; and we analyze the scalability, flexibility, and reliability benefits and limitations from that approach.
De Brún, Aoife; McAuliffe, Eilish
2018-03-13
Health systems research recognizes the complexity of healthcare, and the interacting and interdependent nature of components of a health system. To better understand such systems, innovative methods are required to depict and analyze their structures. This paper describes social network analysis as a methodology to depict, diagnose, and evaluate health systems and networks therein. Social network analysis is a set of techniques to map, measure, and analyze social relationships between people, teams, and organizations. Through use of a case study exploring support relationships among senior managers in a newly established hospital group, this paper illustrates some of the commonly used network- and node-level metrics in social network analysis, and demonstrates the value of these maps and metrics to understand systems. Network analysis offers a valuable approach to health systems and services researchers as it offers a means to depict activity relevant to network questions of interest, to identify opinion leaders, influencers, clusters in the network, and those individuals serving as bridgers across clusters. The strengths and limitations inherent in the method are discussed, and the applications of social network analysis in health services research are explored.
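The node- and network-level metrics mentioned above (influencers, bridgers, clusters) are readily computed with standard network libraries. The sketch below uses networkx on a toy support network with made-up edges; it illustrates the kinds of metrics discussed, not the case study's actual data.

```python
# Illustrative node- and network-level SNA metrics on a toy support network;
# the nodes and edges are hypothetical, not the hospital-group case study data.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

degree = dict(G.degree())                          # activity of each node
betweenness = nx.betweenness_centrality(G)         # potential bridgers across clusters
density = nx.density(G)                            # network-level cohesion
communities = nx.algorithms.community.greedy_modularity_communities(G)

print(degree, betweenness, density, list(communities))
```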
Multidimensional assessment of awareness in early-stage dementia: a cluster analytic approach.
Clare, Linda; Whitaker, Christopher J; Nelis, Sharon M; Martyr, Anthony; Markova, Ivana S; Roth, Ilona; Woods, Robert T; Morris, Robin G
2011-01-01
Research on awareness in dementia has yielded variable and inconsistent associations between awareness and other factors. This study examined awareness using a multidimensional approach and applied cluster analytic techniques to identify associations between the level of awareness and other variables. Participants were 101 individuals with early-stage dementia (PwD) and their carers. Explicit awareness was assessed at 3 levels: performance monitoring in relation to memory, evaluative judgement in relation to memory, everyday activities and socio-emotional functioning, and metacognitive reflection in relation to the experience and impact of the condition. Implicit awareness was assessed with an emotional Stroop task. Different measures of explicit awareness scores were related only to a limited extent. Cluster analysis yielded 3 groups with differing degrees of explicit awareness. These groups showed no differences in implicit awareness. Lower explicit awareness was associated with greater age, lower MMSE scores, poorer recall and naming scores, lower anxiety and greater carer stress. Multidimensional assessment offers a more robust approach to classifying PwD according to level of awareness and hence to examining correlates and predictors of awareness. Copyright © 2011 S. Karger AG, Basel.
Multiscale Embedded Gene Co-expression Network Analysis
Song, Won-Min; Zhang, Bin
2015-01-01
Gene co-expression network analysis has been shown effective in identifying functional co-expressed gene modules associated with complex human diseases. However, existing techniques to construct co-expression networks require some critical prior information such as predefined number of clusters, numerical thresholds for defining co-expression/interaction, or do not naturally reproduce the hallmarks of complex systems such as the scale-free degree distribution or small-worldness. Previously, a graph filtering technique called Planar Maximally Filtered Graph (PMFG) has been applied to many real-world data sets such as financial stock prices and gene expression to extract meaningful and relevant interactions. However, PMFG is not suitable for large-scale genomic data due to several drawbacks, such as the high computation complexity O(|V|^3), the presence of false-positives due to the maximal planarity constraint, and the inadequacy of the clustering framework. Here, we developed a new co-expression network analysis framework called Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) by: i) introducing quality control of co-expression similarities, ii) parallelizing embedded network construction, and iii) developing a novel clustering technique to identify multi-scale clustering structures in Planar Filtered Networks (PFNs). We applied MEGENA to a series of simulated data and the gene expression data in breast carcinoma and lung adenocarcinoma from The Cancer Genome Atlas (TCGA). MEGENA showed improved performance over well-established clustering methods and co-expression network construction approaches. MEGENA revealed not only meaningful multi-scale organizations of co-expressed gene clusters but also novel targets in breast carcinoma and lung adenocarcinoma. PMID:26618778
Analysis of the nutritional status of algae by Fourier transform infrared chemical imaging
NASA Astrophysics Data System (ADS)
Hirschmugl, Carol J.; Bayarri, Zuheir-El; Bunta, Maria; Holt, Justin B.; Giordano, Mario
2006-09-01
A new non-destructive method to study the nutritional status of algal cells and their environments is demonstrated. This approach allows rapid examination of whole cells with little or no pre-treatment, providing a large amount of information on the biochemical composition of cells and growth medium. The method is based on the analysis of a collection of infrared (IR) spectra for individual cells; each spectrum describes the biochemical composition of a portion of a cell; a complete set of spectra is used to reconstruct an image of the entire cell. To obtain spatially resolved information, synchrotron radiation was used as a bright IR source. We tested this method on the green flagellate Euglena gracilis; a comparison was conducted between cells grown in nutrient-replete conditions (Type 1) and cells allowed to deplete their medium (Type 2). Complete sets of spectra for individual cells of both types were analyzed with agglomerative hierarchical clustering, leading to distinct clusters representative of the two types of cells. The average spectra for the clusters confirmed the similarities between the clusters and the types of cells. The clustering analysis, therefore, allows the distinction of cells of the same species, but with different nutritional histories. In order to facilitate the application of the method and reduce manipulation (washing), we analyzed the cells in the presence of residual medium. The results obtained showed that even with residual medium the outcome of the clustering analysis is reliable. Our results demonstrate the applicability of FTIR microspectroscopy for ecological and ecophysiological studies.
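A generic agglomerative (hierarchical) clustering of per-pixel spectra, in the spirit of the analysis described above; the "spectra" here are synthetic stand-ins, and the two-cluster cut simply mimics the two cell types.

```python
# Generic agglomerative clustering of IR spectra; synthetic data, not the
# Euglena gracilis measurements described in the abstract.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
spectra = np.vstack([rng.normal(0.0, 1.0, (50, 200)),   # e.g. nutrient-replete cells
                     rng.normal(0.8, 1.0, (50, 200))])  # e.g. nutrient-depleted cells

Z = linkage(spectra, method="ward")            # agglomerative cluster tree
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels))                     # sizes of the two spectral clusters
```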
Temporary disaster debris management site identification using binomial cluster analysis and GIS.
Grzeda, Stanislaw; Mazzuchi, Thomas A; Sarkani, Shahram
2014-04-01
An essential component of disaster planning and preparation is the identification and selection of temporary disaster debris management sites (DMS). However, since DMS identification is a complex process involving numerous variable constraints, many regional, county and municipal jurisdictions initiate this process during the post-disaster response and recovery phases, typically a period of severely stressed resources. Hence, a pre-disaster approach in identifying the most likely sites based on the number of locational constraints would significantly contribute to disaster debris management planning. As disasters vary in their nature, location and extent, an effective approach must facilitate scalability, flexibility and adaptability to variable local requirements, while also being generalisable to other regions and geographical extents. This study demonstrates the use of binomial cluster analysis in potential DMS identification in a case study conducted in Hamilton County, Indiana. © 2014 The Author(s). Disasters © Overseas Development Institute, 2014.
Deep Learning Nuclei Detection in Digitized Histology Images by Superpixels
Sornapudi, Sudhir; Stanley, Ronald Joe; Stoecker, William V.; Almubarak, Haidar; Long, Rodney; Antani, Sameer; Thoma, George; Zuna, Rosemary; Frazier, Shelliane R.
2018-01-01
Background: Advances in image analysis and computational techniques have facilitated automatic detection of critical features in histopathology images. Detection of nuclei is critical for squamous epithelium cervical intraepithelial neoplasia (CIN) classification into normal, CIN1, CIN2, and CIN3 grades. Methods: In this study, a deep learning (DL)-based nuclei segmentation approach is investigated based on gathering localized information through the generation of superpixels using a simple linear iterative clustering algorithm and training with a convolutional neural network. Results: The proposed approach was evaluated on a dataset of 133 digitized histology images and achieved an overall nuclei detection (object-based) accuracy of 95.97%, with demonstrated improvement over imaging-based and clustering-based benchmark techniques. Conclusions: The proposed DL-based nuclei segmentation method with superpixel analysis has shown improved segmentation results in comparison to state-of-the-art methods. PMID:29619277
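The superpixel localization step named above can be reproduced with an off-the-shelf SLIC implementation. The sketch below uses a sample image from scikit-image rather than the study's histology data, and the segment count is an arbitrary illustrative choice.

```python
# Superpixel generation with simple linear iterative clustering (SLIC);
# shown on a sample image, not the digitized histology data of the study.
from skimage import data, segmentation

image = data.astronaut()                               # placeholder RGB image
superpixels = segmentation.slic(image, n_segments=400, compactness=10,
                                start_label=1)
print(superpixels.max(), "superpixels generated")
# Each superpixel region could then be cropped and passed to a CNN classifier.
```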
ERIC Educational Resources Information Center
Janson, Harald; Mathiesen, Kristin S.
2008-01-01
The authors applied I-States as Objects Analysis (ISOA), a recently proposed person-oriented analytic approach, to the study of temperament development in 921 Norwegian children from a population-based sample. A 5-profile classification based on cluster analysis of standardized mother reports of activity, sociability, emotionality, and shyness at…
Field of Study Choice: Using Conjoint Analysis and Clustering
ERIC Educational Resources Information Center
Shtudiner, Ze'ev; Zwilling, Moti; Kantor, Jeffrey
2017-01-01
Purpose: The purpose of this paper is to measure student's preferences regarding various attributes that affect their decision process while choosing a higher education area of study. Design/ Methodology/Approach: The paper exhibits two different models which shed light on the perceived value of each examined area of study: conjoint analysis and…
Jiang, Jheng Jie; Lee, Chon Lin; Fang, Meng Der; Boyd, Kenneth G.; Gibb, Stuart W.
2015-01-01
This paper presents a methodology based on multivariate data analysis for characterizing potential source contributions of emerging contaminants (ECs) detected in 26 river water samples across multi-scape regions during dry and wet seasons. Based on this methodology, we unveil an approach toward potential source contributions of ECs, a concept we refer to as the “Pharmaco-signature.” Exploratory analysis of data points has been carried out by unsupervised pattern recognition (hierarchical cluster analysis, HCA) and a receptor model (principal component analysis-multiple linear regression, PCA-MLR) in an attempt to demonstrate significant source contributions of ECs in different land-use zones. Robust cluster solutions grouped the database according to different EC profiles. PCA-MLR identified that 58.9% of the mean summed ECs were contributed by domestic impact, 9.7% by antibiotics application, and 31.4% by drug abuse. Diclofenac, ibuprofen, codeine, ampicillin, tetracycline, and erythromycin-H2O have significant pollution risk quotients (RQ>1), indicating potentially high risk to aquatic organisms in Taiwan. PMID:25874375
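A simplified illustration of the PCA-MLR chain referenced above: principal component scores of standardized contaminant concentrations are used as predictors of the summed EC concentration. The data are synthetic placeholders, and published APCS-MLR variants additionally rescale the scores against a zero-concentration reference before regression.

```python
# Simplified PCA-MLR sketch on synthetic data; not the study's dataset or the
# full APCS-MLR receptor-modeling procedure.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.lognormal(size=(26, 12))              # 26 samples x 12 emerging contaminants
total = X.sum(axis=1)                         # summed EC concentration per sample

Xz = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize before PCA
scores = PCA(n_components=3).fit_transform(Xz)

mlr = LinearRegression().fit(scores, total)
print(mlr.coef_, mlr.score(scores, total))    # factor coefficients and R^2
```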
Comparative genomic analysis by microbial COGs self-attraction rate.
Santoni, Daniele; Romano-Spica, Vincenzo
2009-06-21
Whole genome analysis provides new perspectives to determine phylogenetic relationships among microorganisms. The availability of whole nucleotide sequences allows different levels of comparison among genomes by several approaches. In this work, self-attraction rates were considered for each class of clusters of orthologous groups of proteins (COGs) in order to analyse gene aggregation levels in physical maps. Phylogenetic relationships among microorganisms were obtained by comparing self-attraction coefficients. Eighteen-dimensional vectors were computed for a set of 168 completely sequenced microbial genomes (19 archaea, 149 bacteria). The components of the vector represent the aggregation rate of the genes belonging to each of the 18 COG classes. Genes involved in nonessential functions or related to environmental conditions showed the highest aggregation rates. On the contrary, genes involved in basic cellular tasks showed a more uniform distribution along the genome, except for translation genes. The self-attraction clustering approach allowed classification of Proteobacteria, Bacilli and other species belonging to Firmicutes. Rearrangement and Lateral Gene Transfer events may influence divergences from classical taxonomy. Each set of COG class aggregation values represents an intrinsic property of the microbial genome. This novel approach provides a new point of view for whole genome analysis and bacterial characterization.
`Inter-Arrival Time' Inspired Algorithm and its Application in Clustering and Molecular Phylogeny
NASA Astrophysics Data System (ADS)
Kolekar, Pandurang S.; Kale, Mohan M.; Kulkarni-Kale, Urmila
2010-10-01
Bioinformatics, being a multidisciplinary field, involves applications of various methods from allied areas of science for data mining using computational approaches. Clustering and molecular phylogeny are key areas in Bioinformatics that help in the study of the classification and evolution of organisms. Molecular phylogeny algorithms can be divided into distance-based and character-based methods. But most of these methods depend on pre-alignment of sequences and become computationally intensive with increasing size of data, and hence demand alternative efficient approaches. `Inter arrival time distribution' (IATD) is a popular concept in the theory of stochastic system modeling, but its potential in molecular data analysis has not been fully explored. The present study reports an application of IATD in Bioinformatics for clustering and molecular phylogeny. The proposed method provides IATDs of nucleotides in genomic sequences. A distance function based on statistical parameters of IATDs is proposed, and the distance matrix thus obtained is used for the purpose of clustering and molecular phylogeny. The method is applied on a dataset of 3' non-coding region (NCR) sequences of Dengue virus type 3 (DENV-3), subtype III, reported in 2008. The phylogram thus obtained revealed the geographical distribution of DENV-3 isolates. Sri Lankan DENV-3 isolates were further observed to be clustered in two sub-clades corresponding to pre and post Dengue hemorrhagic fever emergence groups. These results are consistent with those reported earlier, which were obtained using pre-aligned sequence data as an input. These findings encourage applications of the IATD-based method in molecular phylogenetic analysis in particular and data mining in general.
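A toy sketch of the inter-arrival idea: gaps between successive occurrences of each base are summarized by simple statistics, and pairwise distances between the resulting feature vectors form a distance matrix suitable for clustering or tree building. The summary statistics and distance used here are illustrative assumptions, not the authors' exact metric.

```python
# Illustrative inter-arrival time features for nucleotide sequences and a
# pairwise distance matrix; toy sequences and simplified statistics only.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def iatd_features(seq):
    feats = []
    for base in "ACGT":
        pos = np.array([i for i, c in enumerate(seq) if c == base])
        gaps = np.diff(pos) if len(pos) > 1 else np.array([len(seq)])
        feats += [gaps.mean(), gaps.std()]       # mean and spread of inter-arrivals
    return np.array(feats)

seqs = ["ACGTACGTGGCA", "ACGGGTTACGTA", "TTTTACGCAGGA"]   # toy sequences
F = np.vstack([iatd_features(s) for s in seqs])
D = squareform(pdist(F))                                  # distance matrix for clustering
print(D)
```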
Synchronized changes to relative neuron populations in postnatal human neocortical development
Cooper, David L.; Gentle, James E.; Barreto, Ernest
2010-01-01
Mammalian prenatal neocortical development is dominated by the synchronized formation of the laminae and migration of neurons. Postnatal development likewise contains “sensitive periods” during which functions such as ocular dominance emerge. Here we introduce a novel neuroinformatics approach to identify and study these periods of active development. Although many aspects of the approach can be used in other studies, some specific techniques were chosen because of a legacy dataset of human histological data (Conel in The postnatal development of the human cerebral cortex, vol 1–8. Harvard University Press, Cambridge, 1939–1967). Our method calculates normalized change vectors from the raw histological data, and then employs k-means cluster analysis of the change vectors to explore the population dynamics of neurons from 37 neocortical areas across eight postnatal developmental stages from birth to 72 months in 54 subjects. We show that the cortical “address” (Brodmann area/sub-area and layer) provides the necessary resolution to segregate neuron population changes into seven correlated “k-clusters” in k-means cluster analysis. The members in each k-cluster share a single change interval where the relative share of the cortex by the members undergoes its maximum change. The maximum change occurs in a different change interval for each k-cluster. Each k-cluster has at least one totally connected maximal “clique” which appears to correspond to cortical function. Electronic supplementary material The online version of this article (doi:10.1007/s11571-010-9103-3) contains supplementary material, which is available to authorized users. PMID:21629587
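A minimal version of the pipeline sketched above: normalized change vectors between consecutive developmental stages, grouped with k-means into seven clusters. The data here are random stand-ins for the histological counts, and the normalization is a simple unit-norm scaling assumed for illustration.

```python
# Normalized change vectors + k-means, mirroring the described workflow on
# synthetic data rather than the Conel histological dataset.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
counts = rng.random((200, 8))                       # 200 cortical "addresses" x 8 stages

changes = np.diff(counts, axis=1)                   # change between adjacent stages
changes /= np.linalg.norm(changes, axis=1, keepdims=True)   # normalize each change vector

labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(changes)
print(np.bincount(labels))                          # sizes of the seven k-clusters
```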
Relative efficiency and sample size for cluster randomized trials with variable cluster sizes.
You, Zhiying; Williams, O Dale; Aban, Inmaculada; Kabagambe, Edmond Kato; Tiwari, Hemant K; Cutter, Gary
2011-02-01
The statistical power of cluster randomized trials depends on two sample size components, the number of clusters per group and the numbers of individuals within clusters (cluster size). Variable cluster sizes are common and this variation alone may have significant impact on study power. Previous approaches have taken this into account by either adjusting total sample size using a designated design effect or adjusting the number of clusters according to an assessment of the relative efficiency of unequal versus equal cluster sizes. This article defines a relative efficiency of unequal versus equal cluster sizes using noncentrality parameters, investigates properties of this measure, and proposes an approach for adjusting the required sample size accordingly. We focus on comparing two groups with normally distributed outcomes using the t-test, and use the noncentrality parameter to define the relative efficiency of unequal versus equal cluster sizes and show that statistical power depends only on this parameter for a given number of clusters. We calculate the sample size required for an unequal cluster sizes trial to have the same power as one with equal cluster sizes. Relative efficiency based on the noncentrality parameter is straightforward to calculate and easy to interpret. It connects the required mean cluster size directly to the required sample size with equal cluster sizes. Consequently, our approach first determines the sample size requirements with equal cluster sizes for a pre-specified study power and then calculates the required mean cluster size while keeping the number of clusters unchanged. Our approach allows adjustment in mean cluster size alone or simultaneous adjustment in mean cluster size and number of clusters, and is a flexible alternative to and a useful complement to existing methods. Comparisons indicated that the relative efficiency we define is greater than the relative efficiency reported in the literature under some conditions; under other conditions it may be less than the literature measure, underestimating the relative efficiency. The relative efficiency of unequal versus equal cluster sizes defined using the noncentrality parameter suggests a sample size approach that is a flexible alternative and a useful complement to existing methods.
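As a numerical illustration of why unequal cluster sizes lose efficiency, the sketch below uses the standard effective-sample-size form for a cluster mean, n_eff = sum over clusters of m_i / (1 + (m_i - 1) * rho). This is a generic textbook-style calculation assumed for illustration, not the noncentrality-parameter measure defined in the paper.

```python
# Generic effective-sample-size comparison of equal vs unequal cluster sizes;
# an illustrative assumption, not the paper's noncentrality-based measure.
import numpy as np

def effective_n(cluster_sizes, rho):
    m = np.asarray(cluster_sizes, dtype=float)
    return np.sum(m / (1.0 + (m - 1.0) * rho))   # information contributed per cluster

rho = 0.05                                       # assumed intracluster correlation
equal = [20] * 10                                # 10 clusters of 20
unequal = [5, 10, 15, 20, 20, 20, 25, 25, 30, 30]   # same total of 200, variable sizes

re = effective_n(unequal, rho) / effective_n(equal, rho)
print(f"relative efficiency (unequal vs equal): {re:.3f}")
```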
NASA Astrophysics Data System (ADS)
Alexandroni, Guy; Zimmerman Moreno, Gali; Sochen, Nir; Greenspan, Hayit
2016-03-01
Recent advances in Diffusion Weighted Magnetic Resonance Imaging (DW-MRI) of white matter in conjunction with improved tractography produce impressive reconstructions of White Matter (WM) pathways. These pathways (fiber sets) often contain hundreds of thousands of fibers, or more. In order to make fiber-based analysis more practical, the fiber set needs to be preprocessed to eliminate redundancies and to keep only essential representative fibers. In this paper we demonstrate and compare two distinctive frameworks for selecting this reduced set of fibers. The first framework entails pre-clustering the fibers using k-means, followed by hierarchical clustering and replacing each cluster with one representative. For the second clustering stage seven distance metrics were evaluated. The second framework is based on an efficient geometric approximation paradigm named coresets. Coresets present a new approach to optimization and have had considerable success, especially in tasks requiring large computation time and/or memory. We propose a modified version of the coresets algorithm, Density Coreset. It is used for extracting the main fibers from dense datasets, leaving a small set that represents the main structures and connectivity of the brain. A novel approach, based on a 3D indicator structure, is used for comparing the frameworks. This comparison was applied to High Angular Resolution Diffusion Imaging (HARDI) scans of 4 healthy individuals. We show that, among the clustering-based methods, the cosine distance gives the best performance. In comparing the clustering schemes with coresets, the Density Coreset method achieves the best performance.
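A sketch of the first framework described above: k-means pre-clustering of fiber feature vectors, hierarchical clustering of the centroids with a cosine metric, and one representative kept per final cluster. The features here are random stand-ins for resampled fiber coordinates, and the cluster counts are arbitrary illustrative choices.

```python
# Two-stage fiber reduction sketch: k-means pre-clustering, then hierarchical
# clustering with cosine distance; synthetic features, not HARDI data.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(4)
fibers = rng.normal(size=(10000, 60))                  # e.g. 20 points x 3 coords per fiber

# Stage 1: coarse reduction with k-means centroids.
km = KMeans(n_clusters=500, n_init=4, random_state=0).fit(fibers)

# Stage 2: hierarchical clustering of the centroids with cosine distance
# (older scikit-learn versions use affinity= instead of metric=).
hc = AgglomerativeClustering(n_clusters=50, metric="cosine", linkage="average")
labels = hc.fit_predict(km.cluster_centers_)

# Keep one centroid per final cluster as the representative fiber.
representatives = np.array([km.cluster_centers_[labels == k][0] for k in range(50)])
print(representatives.shape)
```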
Clustering and variable selection in the presence of mixed variable types and missing data.
Storlie, C B; Myers, S M; Katusic, S K; Weaver, A L; Voigt, R G; Croarkin, P E; Stoeckel, R E; Port, J D
2018-05-17
We consider the problem of model-based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines. Copyright © 2018 John Wiley & Sons, Ltd.
A Multivariate Analysis of Galaxy Cluster Properties
NASA Astrophysics Data System (ADS)
Ogle, P. M.; Djorgovski, S.
1993-05-01
We have assembled from the literature a database of 394 clusters of galaxies, with up to 16 parameters per cluster. They include optical and x-ray luminosities, x-ray temperatures, galaxy velocity dispersions, central galaxy and particle densities, optical and x-ray core radii and ellipticities, etc. In addition, derived quantities, such as the mass-to-light ratios and x-ray gas masses, are included. Doubtful measurements have been identified and deleted from the database. Our goal is to explore the correlations between these parameters, and interpret them in the framework of our understanding of the evolution of clusters and large-scale structure, such as the Gott-Rees scaling hierarchy. Among the simple, monovariate correlations we found, the most significant include those between the optical and x-ray luminosities, x-ray temperatures, cluster velocity dispersions, and central galaxy densities, in various mutual combinations. While some of these correlations have been discussed previously in the literature, generally smaller samples of objects have been used. We will also present the results of a multivariate statistical analysis of the data, including a principal component analysis (PCA). Such an approach has not been used previously for studies of cluster properties, even though it is much more powerful and complete than the simple monovariate techniques which are commonly employed. The observed correlations may lead to powerful constraints for theoretical models of the formation and evolution of galaxy clusters. P.M.O. was supported by a Caltech graduate fellowship. S.D. acknowledges partial support from the NASA contract NAS5-31348 and the NSF PYI award AST-9157412.
Clustering multilayer omics data using MuNCut.
Teran Hidalgo, Sebastian J; Ma, Shuangge
2018-03-14
Omics profiling is now a routine component of biomedical studies. In the analysis of omics data, clustering is an essential step and serves multiple purposes, including, for example, revealing the unknown functionalities of omics units, assisting dimension reduction in outcome model building, and others. In the most recent omics studies, a prominent trend is to conduct multilayer profiling, which collects multiple types of genetic, genomic, epigenetic and other measurements on the same subjects. In the literature, clustering methods tailored to multilayer omics data are still limited. Directly applying the existing clustering methods to multilayer omics data and clustering each layer first and then combining across layers are both "suboptimal" in that they do not accommodate the interconnections within layers and across layers in an informative way. In this study, we develop the MuNCut (Multilayer NCut) clustering approach. It is tailored to multilayer omics data and sufficiently accounts for both across- and within-layer connections. It is based on the novel NCut technique and also takes advantage of regularized sparse estimation. It has an intuitive formulation and is computationally very feasible. To facilitate implementation, we develop the function muncut in the R package NcutYX. Under a wide spectrum of simulation settings, it outperforms competitors. The analysis of TCGA (The Cancer Genome Atlas) data on breast cancer and cervical cancer shows that MuNCut generates biologically meaningful results which differ from those using the alternatives. We propose a more effective clustering analysis of multiple omics data. It provides a new venue for jointly analyzing genetic, genomic, epigenetic and other measurements.
Jiang, L Crystal; Wang, Zhen-Zhen; Peng, Tai-Quan; Zhu, Jonathan J H
2015-01-01
The social scientific approach has become important in e-Health studies over the past decade. However, there has been little systematic examination of what aspects of e-Health social scientists have studied and how relevant and informative knowledge has been produced and diffused by this line of inquiry. This study performed a systematic review of the body of e-Health literature in mainstream social science journals over the past decade by testing the applicability of a 5A categorization (i.e., access, availability, appropriateness, acceptability, and applicability), proposed by the U.S. Department of Health and Human Services, as a framework for understanding social scientific research in e-Health. This study used a quantitative, bottom-up approach to review the e-Health literature in social sciences published from 2000 to 2009. A total of 3005 e-Health studies identified from two social sciences databases (i.e., Social Sciences Citation Index and Arts & Humanities Citation Index) were analyzed with text topic modeling and structural analysis of co-word networks, co-citation networks, and the scientific food web. There have been dramatic increases in the scale of e-Health studies in social sciences over the past decade in terms of the numbers of publications, journal outlets and participating disciplines. The results empirically confirm the presence of the 5A clusters in e-Health research, with the cluster of applicability as the dominant research area and the cluster of availability as the major knowledge producer for other clusters. The network analysis also reveals that the five distinctive clusters share much more in common in research concerns than e-Health scholars appear to recognize. It is time to explicate and, more importantly, tap into the shared concerns cutting across the seemingly divided scholarly communities. In particular, more synergy exercises are needed to promote adherence of the field. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Sarparandeh, Mohammadali; Hezarkhani, Ardeshir
2017-12-01
The use of efficient methods for data processing has always been of interest to researchers in the field of earth sciences. Pattern recognition techniques are appropriate methods for high-dimensional data such as geochemical data. Evaluation of the geochemical distribution of rare earth elements (REEs) requires the use of such methods. In particular, the multivariate nature of REE data makes them a good target for numerical analysis. The main subject of this paper is the application of unsupervised pattern recognition approaches in evaluating the geochemical distribution of REEs in the Kiruna-type magnetite-apatite deposit of Se-Chahun. For this purpose, 42 bulk lithology samples were collected from the Se-Chahun iron ore deposit. In this study, 14 rare earth elements were measured with inductively coupled plasma mass spectrometry (ICP-MS). Pattern recognition makes it possible to evaluate the relations between the samples based on all 14 features simultaneously. In addition to providing easy solutions, discovery of the hidden information and relations among data samples is an advantage of these methods. Therefore, four clustering methods (unsupervised pattern recognition) - including a modified basic sequential algorithmic scheme (MBSAS), hierarchical (agglomerative) clustering, k-means clustering and self-organizing map (SOM) - were applied and the results were evaluated using the silhouette criterion. Samples were clustered into four types. Finally, the results of this study were validated against geological facts and analysis results from, for example, scanning electron microscopy (SEM), X-ray diffraction (XRD), ICP-MS and optical mineralogy. The results of the k-means clustering and SOM methods have the best matches with reality, with experimental studies of the samples and with field surveys. Since only the rare earth elements are used in this division, the good agreement of the results with lithology is notable. It is concluded that the combination of the proposed methods and geological studies leads to finding hidden information, and this approach gives better results than using either alone.
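A sketch of comparing alternative clusterings of REE concentration data with the silhouette criterion, as in the workflow described above. Only two of the four methods (k-means and agglomerative clustering) are shown, on synthetic data with the same nominal dimensions as the study (42 samples, 14 elements).

```python
# Comparing clusterings with the silhouette criterion on synthetic REE-like
# data; illustrative only, not the Se-Chahun dataset or all four methods.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = StandardScaler().fit_transform(rng.lognormal(size=(42, 14)))  # 42 samples x 14 REEs

for name, model in [("k-means", KMeans(n_clusters=4, n_init=10, random_state=0)),
                    ("agglomerative", AgglomerativeClustering(n_clusters=4))]:
    labels = model.fit_predict(X)
    print(name, silhouette_score(X, labels))
```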
Pérot, Stéphanie; Regad, Leslie; Reynès, Christelle; Spérandio, Olivier; Miteva, Maria A; Villoutreix, Bruno O; Camproux, Anne-Claude
2013-01-01
Pockets are today at the cornerstones of modern drug discovery projects and at the crossroad of several research fields, from structural biology to mathematical modeling. Being able to predict whether a small molecule could bind to one or more protein targets or whether a protein could bind to some given ligands is very useful for drug discovery endeavors and for anticipating binding to off- and anti-targets. To date, several studies have explored such questions, from chemogenomic approaches to reverse docking methods. Most of these studies have been performed either from the viewpoint of ligands or of targets. However, it seems valuable to use information from both ligands and target binding pockets. Hence, we present a multivariate approach relating ligand properties with protein pocket properties from the analysis of known ligand-protein interactions. We explored and optimized the pocket-ligand pair space by combining pocket and ligand descriptors using Principal Component Analysis and developed a classification engine on this paired space, revealing five main clusters of pocket-ligand pairs sharing specific and similar structural or physico-chemical properties. These pocket-ligand pair clusters highlight correspondences between pocket and ligand topological and physico-chemical properties and capture relevant information with respect to protein-ligand interactions. Based on these pocket-ligand correspondences, a protocol for predicting the clusters sharing similarity in terms of recognition characteristics is developed for a given pocket-ligand complex and gives high performance. It is then extended to cluster prediction for a given pocket, in order to acquire knowledge about its expected ligand profile, or for a given ligand, in order to acquire knowledge about its expected pocket profile. This prediction approach shows promising results and could contribute to predicting some ligand properties critical for binding to a given pocket and, conversely, some key pocket properties for ligand binding.
Population-based 3D genome structure analysis reveals driving forces in spatial genome organization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tjong, Harianto; Li, Wenyuan; Kalhor, Reza
Conformation capture technologies (e.g., Hi-C) chart physical interactions between chromatin regions on a genome-wide scale. However, the structural variability of the genome between cells poses a great challenge to interpreting ensemble-averaged Hi-C data, particularly for long-range and interchromosomal interactions. Here, we present a probabilistic approach for deconvoluting Hi-C data into a model population of distinct diploid 3D genome structures, which facilitates the detection of chromatin interactions likely to co-occur in individual cells. Our approach incorporates the stochastic nature of chromosome conformations and allows a detailed analysis of alternative chromatin structure states. For example, we predict and experimentally confirm the presence of large centromere clusters with distinct chromosome compositions varying between individual cells. The stability of these clusters varies greatly with their chromosome identities. We show that these chromosome-specific clusters can play a key role in the overall chromosome positioning in the nucleus and in stabilizing specific chromatin interactions. By explicitly considering genome structural variability, our population-based method provides an important tool for revealing novel insights into the key factors shaping the spatial genome organization.
NASA Astrophysics Data System (ADS)
Quang Nguyen, Sang; Kong, Hyung Yun
2016-11-01
In this article, the presence of multi-hop relaying, an eavesdropper and co-channel interference (CCI) in the same system model is investigated. Specifically, the effect of CCI on a secured multi-hop relaying network is studied, in which the source communicates with the destination via multi-relay hopping under the presence of an eavesdropper and CCI at each node. The optimal relay at each cluster is selected to help forward the message from the source to the destination. We apply two relay selection approaches to this system model, i.e. the optimal relay is chosen based on (1) the maximum channel gain from the transmitter to all relays in the desired cluster and (2) the minimum channel gain from the eavesdropper to all relays in each cluster. For the performance evaluation and comparison, we derived the exact closed form of the secrecy outage probability for the two approaches. The analysis is verified by Monte Carlo simulation. Finally, the effects of the number of hops, the transmit power at the source, relays and the external sources, the distance between the external sources and each node in the system, and the location of the eavesdropper are presented and discussed.
Managing distance and covariate information with point-based clustering.
Whigham, Peter A; de Graaf, Brandon; Srivastava, Rashmi; Glue, Paul
2016-09-01
Geographic perspectives of disease and the human condition often involve point-based observations and questions of clustering or dispersion within a spatial context. These problems involve a finite set of point observations and are constrained by a larger, but finite, set of locations where the observations could occur. Developing a rigorous method for pattern analysis in this context requires handling spatial covariates, a method for constrained finite spatial clustering, and addressing bias in geographic distance measures. An approach, based on Ripley's K and applied to the problem of clustering with deliberate self-harm (DSH), is presented. Point-based Monte-Carlo simulation of Ripley's K, accounting for socio-economic deprivation and sources of distance measurement bias, was developed to estimate clustering of DSH at a range of spatial scales. A rotated Minkowski L1 distance metric allowed variation in physical distance and clustering to be assessed. Self-harm data was derived from an audit of 2 years' emergency hospital presentations (n = 136) in a New Zealand town (population ~50,000). Study area was defined by residential (housing) land parcels representing a finite set of possible point addresses. Area-based deprivation was spatially correlated. Accounting for deprivation and distance bias showed evidence for clustering of DSH for spatial scales up to 500 m with a one-sided 95 % CI, suggesting that social contagion may be present for this urban cohort. Many problems involve finite locations in geographic space that require estimates of distance-based clustering at many scales. A Monte-Carlo approach to Ripley's K, incorporating covariates and models for distance bias, are crucial when assessing health-related clustering. The case study showed that social network structure defined at the neighbourhood level may account for aspects of neighbourhood clustering of DSH. Accounting for covariate measures that exhibit spatial clustering, such as deprivation, are crucial when assessing point-based clustering.
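A simplified Monte-Carlo sketch of the constrained Ripley's K comparison described above: observed cases are a subset of a finite set of candidate addresses, and random relabelings of the same addresses provide the null envelope. The K statistic here is unnormalized and omits edge correction, covariates, and the distance-bias model; all locations are synthetic.

```python
# Monte-Carlo envelope for a constrained, point-based clustering test; a
# simplified illustration, not the full covariate-adjusted Ripley's K analysis.
import numpy as np

def k_function(points, r):
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Average number of other cases within r of each case (no edge correction).
    return (d[np.triu_indices(n, 1)] < r).sum() * 2.0 / n

rng = np.random.default_rng(6)
addresses = rng.random((2000, 2)) * 5000            # candidate residential locations (m)
cases = addresses[rng.choice(2000, 136, replace=False)]

r = 500.0
observed = k_function(cases, r)
null = [k_function(addresses[rng.choice(2000, 136, replace=False)], r)
        for _ in range(199)]
print(observed, np.percentile(null, 95))            # one-sided 95% envelope
```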
NASA Astrophysics Data System (ADS)
Mokhtar, Nurkhairany Amyra; Zubairi, Yong Zulina; Hussin, Abdul Ghapor
2017-05-01
Outlier detection has been used extensively in data analysis to detect anomalous observations in data and has important applications in fraud detection and robust analysis. In this paper, we propose a method for detecting multiple outliers for circular variables in the linear functional relationship model. Using the residual values of the Caires and Wyatt model, we applied a hierarchical clustering procedure. Using a tree diagram, we illustrate the graphical approach to outlier detection. A simulation study is done to verify the accuracy of the proposed method. An illustration with a real data set is also given to show its practical applicability.
Roguev, Assen; Ryan, Colm J; Xu, Jiewei; Colson, Isabelle; Hartsuiker, Edgar; Krogan, Nevan
2018-02-01
This protocol describes computational analysis of genetic interaction screens, ranging from data capture (plate imaging) to downstream analyses. Plate imaging approaches using both digital camera and office flatbed scanners are included, along with a protocol for the extraction of colony size measurements from the resulting images. A commonly used genetic interaction scoring method, calculation of the S-score, is discussed. These methods require minimal computer skills, but some familiarity with MATLAB and Linux/Unix is a plus. Finally, an outline for using clustering and visualization software for analysis of resulting data sets is provided. © 2018 Cold Spring Harbor Laboratory Press.
Kennedy, Angie C; Adams, Adrienne E
2016-04-01
Using a cluster analysis approach with a sample of 205 young mothers recruited from community sites in an urban Midwestern setting, we examined the effects of cumulative violence exposure (community violence exposure, witnessing intimate partner violence, physical abuse by a caregiver, and sexual victimization, all with onset prior to age 13) on school participation, as mediated by attention and behavior problems in school. We identified five clusters of cumulative exposure, and found that the HiAll cluster (high levels of exposure to all four types) consistently fared the worst, with significantly higher attention and behavior problems, and lower school participation, in comparison with the LoAll cluster (low levels of exposure to all types). Behavior problems were a significant mediator of the effects of cumulative violence exposure on school participation, but attention problems were not. © The Author(s) 2014.
Liu, Chao; Abu-Jamous, Basel; Brattico, Elvira; Nandi, Asoke K
2017-03-01
In the past decades, neuroimaging of humans has gained a position of status within neuroscience, and data-driven approaches and functional connectivity analyses of functional magnetic resonance imaging (fMRI) data are increasingly favored to depict the complex architecture of human brains. However, the reliability of these findings is jeopardized by too many analysis methods and sometimes too few samples used, which leads to discord among researchers. We propose a tunable consensus clustering paradigm that aims at overcoming the clustering methods selection problem as well as reliability issues in neuroimaging by means of first applying several analysis methods (three in this study) on multiple datasets and then integrating the clustering results. To validate the method, we applied it to a complex fMRI experiment involving affective processing of hundreds of music clips. We found that brain structures related to visual, reward, and auditory processing have intrinsic spatial patterns of coherent neuroactivity during affective processing. The comparisons between the results obtained from our method and those from each individual clustering algorithm demonstrate that our paradigm has notable advantages over traditional single clustering algorithms in being able to evidence robust connectivity patterns even with complex neuroimaging data involving a variety of stimuli and affective evaluations of them. The consensus clustering method is implemented in the R package "UNCLES" available on http://cran.r-project.org/web/packages/UNCLES/index.html .
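A minimal co-association (consensus) matrix built from several base clusterings, illustrating the integration step described above. This is a generic sketch on synthetic data with three common scikit-learn algorithms, not the UNCLES implementation or its tunable weighting.

```python
# Co-association consensus over multiple base clusterings; generic sketch,
# not the UNCLES package's tunable consensus method.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 30))                 # e.g. voxels x features

base = [KMeans(4, n_init=10, random_state=0).fit_predict(X),
        AgglomerativeClustering(4).fit_predict(X),
        SpectralClustering(4, random_state=0).fit_predict(X)]

# Co-association: fraction of base clusterings that place each pair together.
C = np.mean([np.equal.outer(lab, lab).astype(float) for lab in base], axis=0)

# Final consensus partition from the co-association similarity.
Z = linkage(squareform(1.0 - C, checks=False), method="average")
print(np.bincount(fcluster(Z, t=4, criterion="maxclust")))
```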
Jung, Inuk; Jo, Kyuri; Kang, Hyejin; Ahn, Hongryul; Yu, Youngjae; Kim, Sun
2017-12-01
Identifying biologically meaningful gene expression patterns from time series gene expression data is important to understand the underlying biological mechanisms. To identify significantly perturbed gene sets between different phenotypes, analysis of time series transcriptome data requires consideration of time and sample dimensions. Thus, the analysis of such time series data seeks to search gene sets that exhibit similar or different expression patterns between two or more sample conditions, constituting the three-dimensional data, i.e. gene-time-condition. Computational complexity for analyzing such data is very high, compared to the already difficult NP-hard two dimensional biclustering algorithms. Because of this challenge, traditional time series clustering algorithms are designed to capture co-expressed genes with similar expression pattern in two sample conditions. We present a triclustering algorithm, TimesVector, specifically designed for clustering three-dimensional time series data to capture distinctively similar or different gene expression patterns between two or more sample conditions. TimesVector identifies clusters with distinctive expression patterns in three steps: (i) dimension reduction and clustering of time-condition concatenated vectors, (ii) post-processing clusters for detecting similar and distinct expression patterns and (iii) rescuing genes from unclassified clusters. Using four sets of time series gene expression data, generated by both microarray and high throughput sequencing platforms, we demonstrated that TimesVector successfully detected biologically meaningful clusters of high quality. TimesVector improved the clustering quality compared to existing triclustering tools and only TimesVector detected clusters with differential expression patterns across conditions successfully. The TimesVector software is available at http://biohealth.snu.ac.kr/software/TimesVector/. sunkim.bioinfo@snu.ac.kr. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Real Time Intelligent Target Detection and Analysis with Machine Vision
NASA Technical Reports Server (NTRS)
Howard, Ayanna; Padgett, Curtis; Brown, Kenneth
2000-01-01
We present an algorithm for detecting a specified set of targets for an Automatic Target Recognition (ATR) application. ATR involves processing images for detecting, classifying, and tracking targets embedded in a background scene. We address the problem of discriminating between targets and nontarget objects in a scene by evaluating 40x40 image blocks belonging to an image. Each image block is first projected onto a set of templates specifically designed to separate images of targets embedded in a typical background scene from those background images without targets. These filters are found using directed principal component analysis which maximally separates the two groups. The projected images are then clustered into one of n classes based on a minimum distance to a set of n cluster prototypes. These cluster prototypes have previously been identified using a modified clustering algorithm based on prior sensed data. Each projected image pattern is then fed into the associated cluster's trained neural network for classification. A detailed description of our algorithm will be given in this paper. We outline our methodology for designing the templates, describe our modified clustering algorithm, and provide details on the neural network classifiers. Evaluation of the overall algorithm demonstrates that our detection rates approach 96% with a false positive rate of less than 0.03%.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhao, Wenyang; Wang, Zhao; Malonzo, Camille D.
The process of nanocasting in metal-organic frameworks (MOFs) is a versatile approach to modify these porous materials by introducing supporting scaffolds. The nanocast scaffolds can stabilize metal-oxo clusters in MOFs at high temperatures and modulate their chemical environments. Here we demonstrate a range of nanocasting approaches in the MOF NU-1000, which contains hexanuclear oxozirconium clusters (denoted as Zr6 clusters) that are suitable for modification with other metals. We developed methods for introducing SiO2, TiO2, polymeric, and carbon scaffolds into the NU-1000 structure. The responses of NU-1000 towards different scaffold precursors were studied, including the effects on morphology, precursor distribution, and porosity after nanocasting. Upon removal of organic linkers in the MOF by calcination/pyrolysis at 500 °C or above, the Zr6 clusters remained accessible and maintained their Lewis acidity in SiO2 nanocast samples, whereas additional treatment was necessary for Zr6 clusters to become accessible in carbon nanocast samples. Aggregation of Zr6 clusters was largely prevented with SiO2 or carbon scaffolds even after thermal treatment at 500 °C or above. In the case of titania nanocasting, NU-1000 crystals underwent a pseudomorphic transformation, in which Zr6 clusters reacted with titania to form small aggregates of a Zr/Ti mixed oxide with a local structure resembling that of ZrTi2O6. The ability to maintain high densities of discrete Lewis acidic Zr6 clusters on SiO2 or carbon supports at high temperatures provides a starting point for designing new thermally stable catalysts.
MODEL-BASED CLUSTERING FOR CLASSIFICATION OF AQUATIC SYSTEMS AND DIAGNOSIS OF ECOLOGICAL STRESS
Clustering approaches were developed using the classification likelihood, the mixture likelihood, and also using a randomization approach with a model index. Using a clustering approach based on the mixture and classification likelihoods, we have developed an algorithm that...
University students' achievement goals and approaches to learning in mathematics.
Cano, Francisco; Berbén, A B G
2009-03-01
Achievement goals (AG) and students' approaches to learning (SAL) are two research perspectives on student motivation and learning in higher education that have until now been pursued quite independently. This study sets out: (a) to explore the relationship between the most representative variables of SAL and AG; (b) to identify subgroups (clusters) of students with multiple AG; and (c) to examine the differences between these clusters with respect to various SAL and AG characteristics. The participants were 680 male and female 1st year university students studying different subjects (e.g. mathematics, physics, economics) but all enrolled on mathematics courses (e.g. algebra, calculus). Participants completed a series of questionnaires that measured their conceptions of mathematics, approaches to learning, course experience, personal 2 x 2 AG, and perceived AG. SAL and AG variables were moderately associated and related to both the way students perceived their academic environment and the way they conceived of the nature of mathematics (i.e. the perceptual-cognitive framework). Four clusters of students with distinctive multiple AG were identified and when the differences between clusters were analysed, we were able to attribute them to various constructs including perceptual-cognitive framework, learning approaches, and academic performance. This study reveals a consistent pattern of relationships between SAL and AG perspectives across different methods of analysis, supports the relevance of the 2 x 2 AG framework in a mathematics learning context and suggests that AG and SAL may be intertwined aspects of students' experience of learning mathematics at university.
Graph-based analysis of kinetics on multidimensional potential-energy surfaces.
Okushima, T; Niiyama, T; Ikeda, K S; Shimizu, Y
2009-09-01
This paper has two aims: to give a detailed description of an alternative graph-based analysis method, which we call the saddle connectivity graph, for analyzing the global topography and dynamical properties of many-dimensional potential-energy landscapes, and to illustrate its application to the kinetics of realistic systems. A Dijkstra-type shortest-path algorithm is proposed to extract dynamically dominant transition pathways by defining transition costs kinetically. The applicability of this approach is first confirmed with an illustrative example of a low-dimensional random potential. We then show that a coarse-graining procedure tailored for saddle connectivity graphs can be used to obtain the kinetic properties of 13- and 38-atom Lennard-Jones clusters. The coarse-graining method not only reduces the complexity of the graphs but also, with iterative use, reveals a self-similar hierarchical structure in these clusters. We also propose that this self-similarity is common to many-atom Lennard-Jones clusters.
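The following sketch illustrates the shortest-path idea on a toy minima/saddle network: each saddle becomes an edge whose cost grows with its barrier height, and Dijkstra's algorithm returns the kinetically cheapest pathway. The Arrhenius-style cost and the toy network are assumptions, not the authors' exact definition of transition cost.

```python
# Sketch of extracting a dominant transition pathway on a minima/saddle network
# with Dijkstra's algorithm. Minima are nodes; each saddle becomes an edge whose
# cost penalizes high barriers. The exp(barrier/kT) cost is an assumed stand-in
# for the paper's kinetically defined transition cost.
import math
import networkx as nx

kT = 0.1  # reduced temperature (assumed units)
# (minimum_a, minimum_b, barrier height above the lower minimum) -- toy data
saddles = [("A", "B", 0.3), ("B", "C", 0.8), ("A", "D", 0.5),
           ("D", "C", 0.4), ("B", "D", 0.9)]

G = nx.Graph()
for a, b, barrier in saddles:
    G.add_edge(a, b, cost=math.exp(barrier / kT))   # slower transitions cost more

path = nx.dijkstra_path(G, "A", "C", weight="cost")
cost = nx.dijkstra_path_length(G, "A", "C", weight="cost")
print("dominant pathway:", " -> ".join(path), "total cost:", round(cost, 2))
```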
Copy-number analysis and inference of subclonal populations in cancer genomes using Sclust.
Cun, Yupeng; Yang, Tsun-Po; Achter, Viktor; Lang, Ulrich; Peifer, Martin
2018-06-01
The genomes of cancer cells constantly change during pathogenesis. This evolutionary process can lead to the emergence of drug-resistant mutations in subclonal populations, which can hinder therapeutic intervention in patients. Data derived from massively parallel sequencing can be used to infer these subclonal populations using tumor-specific point mutations. The accurate determination of copy-number changes and tumor impurity is necessary to reliably infer subclonal populations by mutational clustering. This protocol describes how to use Sclust, a copy-number analysis method with a recently developed mutational clustering approach. In a series of simulations and comparisons with alternative methods, we have previously shown that Sclust accurately determines copy-number states and subclonal populations. Performance tests show that the method is computationally efficient, with copy-number analysis and mutational clustering taking <10 min. Sclust is designed such that even non-experts in computational biology or bioinformatics with basic knowledge of the Linux/Unix command-line syntax should be able to carry out analyses of subclonal populations.
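The sketch below is not Sclust; it is a toy illustration of the mutational-clustering step, converting simulated variant allele fractions to approximate cancer cell fractions (assuming diploid copy number and known tumor purity) and grouping them with a one-dimensional Gaussian mixture.

```python
# Toy illustration of mutational clustering into subclonal populations.
# This is NOT Sclust: it groups mutations by an approximate cancer cell
# fraction (CCF) with a 1-D Gaussian mixture. The CCF conversion assumes
# diploid copy number and a known tumor purity.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
purity = 0.6
true_ccf = rng.choice([1.0, 0.4], size=300, p=[0.7, 0.3])     # clonal vs subclonal
depth = rng.integers(80, 200, size=300)
vaf = rng.binomial(depth, true_ccf * purity / 2) / depth       # observed allele fraction

ccf = np.clip(2 * vaf / purity, 0, 1.2).reshape(-1, 1)         # naive CCF estimate
gmm = GaussianMixture(n_components=2, random_state=0).fit(ccf)
labels = gmm.predict(ccf)
for k in range(2):
    print(f"cluster {k}: mean CCF = {gmm.means_[k, 0]:.2f}, n = {(labels == k).sum()}")
```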
Cuthbertson, Daniel; Andrews, Preston K.; Reganold, John P.; Davies, Neal M.; Lange, B. Markus
2012-01-01
A gas chromatography–mass spectrometry approach was employed to evaluate the use of metabolite patterns to differentiate fruit from six commercially grown apple cultivars harvested in 2008. Principal component analysis (PCA) of apple fruit peel and flesh data indicated that individual cultivar replicates clustered together and were separated from all other cultivar samples. An independent metabolomics investigation with fruit harvested in 2003 confirmed the separate clustering of fruit from different cultivars. Further evidence for cultivar separation was obtained using a hierarchical clustering analysis. An evaluation of PCA component loadings revealed specific metabolite classes that contributed the most to each principal component, whereas a correlation analysis demonstrated that specific metabolites correlate directly with quality traits such as antioxidant activity, total phenolics, and total anthocyanins, which are important parameters in the selection of breeding germplasm. These data sets lay the foundation for elucidating the metabolic basis of commercially important fruit quality traits. PMID:22881116
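A minimal sketch of the PCA plus hierarchical clustering workflow described above, run on a hypothetical metabolite-intensity matrix; the matrix dimensions, the log/autoscaling preprocessing, and Ward linkage are illustrative assumptions.

```python
# Sketch of the PCA + hierarchical clustering workflow on a hypothetical
# metabolite-intensity matrix (samples x metabolites). Cultivar structure
# and preprocessing choices are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.lognormal(mean=2.0, sigma=0.5, size=(36, 120))   # e.g., 6 cultivars x 6 replicates

Xz = StandardScaler().fit_transform(np.log(X))           # log-transform, autoscale
pca = PCA(n_components=2).fit(Xz)
scores = pca.transform(Xz)
print("explained variance:", pca.explained_variance_ratio_.round(2))
print("top loading (metabolite index) on PC1:", np.abs(pca.components_[0]).argmax())

Z = linkage(scores, method="ward")                        # hierarchical clustering
print("cluster assignments:", fcluster(Z, t=6, criterion="maxclust"))
```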
A possibilistic approach to clustering
NASA Technical Reports Server (NTRS)
Krishnapuram, Raghu; Keller, James M.
1993-01-01
Fuzzy clustering has been shown to be advantageous over crisp (or traditional) clustering methods in that total commitment of a vector to a given class is not required at each image pattern recognition iteration. Recently fuzzy clustering methods have shown spectacular ability to detect not only hypervolume clusters, but also clusters which are actually 'thin shells', i.e., curves and surfaces. Most analytic fuzzy clustering approaches are derived from the 'Fuzzy C-Means' (FCM) algorithm. The FCM uses the probabilistic constraint that the memberships of a data point across classes sum to one. This constraint was used to generate the membership update equations for an iterative algorithm. Recently, we cast the clustering problem into the framework of possibility theory using an approach in which the resulting partition of the data can be interpreted as a possibilistic partition, and the membership values may be interpreted as degrees of possibility of the points belonging to the classes. We show the ability of this approach to detect linear and quartic curves in the presence of considerable noise.
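A minimal numpy sketch of the possibilistic membership computation: unlike FCM, memberships for a point are not constrained to sum to one across clusters, so an outlier receives low typicality everywhere. The fixed prototypes and bandwidths below are assumptions made to keep the example short; a full possibilistic c-means would update them iteratively.

```python
# Minimal sketch of possibilistic memberships: typicalities are computed per
# cluster and are NOT forced to sum to one across clusters, so outliers end up
# with low membership everywhere. Prototypes and bandwidths are fixed here for
# clarity; a full possibilistic c-means would re-estimate them iteratively.
import numpy as np

m = 2.0                                             # fuzzifier
prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])     # two assumed cluster centers
eta = np.array([1.0, 1.0])                          # reference distances (bandwidths)

points = np.array([[0.1, -0.2], [5.2, 4.9], [20.0, 20.0]])   # last point is an outlier

d2 = ((points[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)   # squared distances
u = 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))                       # typicalities

print(u.round(3))   # the outlier's memberships are small in both clusters
```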
Xu, Xin; Huang, Zhenhua; Graves, Daniel; Pedrycz, Witold
2014-12-01
In order to deal with the sequential decision problems with large or continuous state spaces, feature representation and function approximation have been a major research topic in reinforcement learning (RL). In this paper, a clustering-based graph Laplacian framework is presented for feature representation and value function approximation (VFA) in RL. By making use of clustering-based techniques, that is, K-means clustering or fuzzy C-means clustering, a graph Laplacian is constructed by subsampling in Markov decision processes (MDPs) with continuous state spaces. The basis functions for VFA can be automatically generated from spectral analysis of the graph Laplacian. The clustering-based graph Laplacian is integrated with a class of approximation policy iteration algorithms called representation policy iteration (RPI) for RL in MDPs with continuous state spaces. Simulation and experimental results show that, compared with previous RPI methods, the proposed approach needs fewer sample points to compute an efficient set of basis functions and the learning control performance can be improved for a variety of parameter settings.
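The sketch below illustrates the general construction: subsample representative states with k-means, connect the resulting centers in a k-nearest-neighbor graph, and use the smoothest eigenvectors of the graph Laplacian as basis functions for value function approximation. The state distribution, the number of centers, the neighbor count, and the basis size are assumptions.

```python
# Sketch of clustering-based graph Laplacian basis construction: k-means
# subsampling of states, k-NN graph over the centers, and the smoothest
# Laplacian eigenvectors as basis functions for value function approximation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(3)
states = rng.uniform(-1, 1, size=(2000, 2))                    # sampled continuous states

centers = KMeans(n_clusters=50, n_init=10, random_state=0).fit(states).cluster_centers_
W = kneighbors_graph(centers, n_neighbors=5, mode="connectivity")
W = 0.5 * (W + W.T).toarray()                                   # symmetrize adjacency

D = np.diag(W.sum(axis=1))
L = D - W                                                       # combinatorial graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
basis = eigvecs[:, :10]                                          # 10 smoothest basis functions
print("basis shape (centers x features):", basis.shape)
```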
ERIC Educational Resources Information Center
Quinnell, Rosanne; May, Elizabeth; Peat, Mary
2012-01-01
We surveyed first year students at the start and at the end of their first semester of university biology (n = 285) as to their approaches to study ("surface", "deep") and their conceptions of biology ("fragmented", "cohesive"). Hierarchical cluster analysis was used to group students who responded similarly…
Quantifying site-specific physical heterogeneity within an estuarine seascape
Kennedy, Cristina G.; Mather, Martha E.; Smith, Joseph M.
2017-01-01
Quantifying physical heterogeneity is essential for meaningful ecological research and effective resource management. Spatial patterns of multiple, co-occurring physical features are rarely quantified across a seascape because of methodological challenges. Here, we identified approaches that measured total site-specific heterogeneity, an often overlooked aspect of estuarine ecosystems. Specifically, we examined 23 metrics that quantified four types of common physical features: (1) river and creek confluences, (2) bathymetric variation including underwater drop-offs, (3) land features such as islands/sandbars, and (4) major underwater channel networks. Our research at 40 sites throughout Plum Island Estuary (PIE) provided solutions to two problems. The first problem was that individual metrics that measured heterogeneity of a single physical feature showed different regional patterns. We solved this first problem by combining multiple metrics for a single feature using a within-physical feature cluster analysis. With this approach, we identified sites with four different types of confluences and three different types of underwater drop-offs. The second problem was that when multiple physical features co-occurred, new patterns of total site-specific heterogeneity were created across the seascape. This pattern of total heterogeneity has potential ecological relevance to structure-oriented predators. To address this second problem, we identified sites with similar types of total physical heterogeneity using an across-physical feature cluster analysis. Then, we calculated an additive heterogeneity index, which integrated all physical features at a site. Finally, we tested if site-specific additive heterogeneity index values differed for across-physical feature clusters. In PIE, the sites with the highest additive heterogeneity index values were clustered together and corresponded to sites where a fish predator, adult striped bass (Morone saxatilis), aggregated in a related acoustic tracking study. In summary, we have shown general approaches to quantifying site-specific heterogeneity.
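The following schematic mirrors the two-stage approach at a high level: within-feature clustering of standardized metrics, an additive index formed by summing standardized metrics across all features, and an across-feature clustering of sites. The metric counts, the z-score-based index, and the cluster numbers are illustrative assumptions rather than the study's exact formulas.

```python
# Schematic of the two-stage approach: cluster sites within each physical
# feature, then combine standardized metrics into an additive heterogeneity
# index and cluster sites across features. Metric counts and k values are
# illustrative assumptions.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
n_sites = 40
features = {                          # hypothetical metrics per physical feature
    "confluence": rng.random((n_sites, 3)),
    "dropoff":    rng.random((n_sites, 2)),
    "land":       rng.random((n_sites, 2)),
    "channel":    rng.random((n_sites, 2)),
}

# Within-feature clustering: one typology per physical feature.
within = {name: KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
              StandardScaler().fit_transform(m)) for name, m in features.items()}

# Additive heterogeneity index: sum of standardized metrics across all features.
all_z = np.hstack([StandardScaler().fit_transform(m) for m in features.values()])
additive_index = all_z.sum(axis=1)

# Across-feature clustering groups sites by their total physical character.
across = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(all_z)
top = np.argsort(additive_index)[-5:]
print("highest-index sites:", top)
print("their across-feature clusters:", across[top])
```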
Higher-order neural networks, Pólya polynomials, and Fermi cluster diagrams
NASA Astrophysics Data System (ADS)
Kürten, K. E.; Clark, J. W.
2003-09-01
The problem of controlling higher-order interactions in neural networks is addressed with techniques commonly applied in the cluster analysis of quantum many-particle systems. For multineuron synaptic weights chosen according to a straightforward extension of the standard Hebbian learning rule, we show that higher-order contributions to the stimulus felt by a given neuron can be readily evaluated via Pólya’s combinatoric group-theoretical approach or equivalently by exploiting a precise formal analogy with fermion diagrammatics.
Comprehensive assessment of cancer missense mutation clustering in protein structures.
Kamburov, Atanas; Lawrence, Michael S; Polak, Paz; Leshchiner, Ignaty; Lage, Kasper; Golub, Todd R; Lander, Eric S; Getz, Gad
2015-10-06
Large-scale tumor sequencing projects enabled the identification of many new cancer gene candidates through computational approaches. Here, we describe a general method to detect cancer genes based on significant 3D clustering of mutations relative to the structure of the encoded protein products. The approach can also be used to search for proteins with an enrichment of mutations at binding interfaces with a protein, nucleic acid, or small molecule partner. We applied this approach to systematically analyze the PanCancer compendium of somatic mutations from 4,742 tumors relative to all known 3D structures of human proteins in the Protein Data Bank. We detected significant 3D clustering of missense mutations in several previously known oncoproteins including HRAS, EGFR, and PIK3CA. Although clustering of missense mutations is often regarded as a hallmark of oncoproteins, we observed that a number of tumor suppressors, including FBXW7, VHL, and STK11, also showed such clustering. Besides these known cases, we also identified significant 3D clustering of missense mutations in NUF2, which encodes a component of the kinetochore; such mutations could affect chromosome segregation and lead to aneuploidy. Analysis of interaction interfaces revealed enrichment of mutations in the interfaces between FBXW7-CCNE1, HRAS-RASA1, CUL4B-CAND1, OGT-HCFC1, PPP2R1A-PPP2R5C/PPP2R2A, DICER1-Mg2+, MAX-DNA, SRSF2-RNA, and others. Together, our results indicate that systematic consideration of 3D structure can assist in the identification of cancer genes and in the understanding of the functional role of their mutations. PMID:26392535
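A toy sketch of one way to test for 3D clustering of mutated residues: compare the mean pairwise distance between mutated residues' C-alpha coordinates with the same statistic for randomly drawn residue sets. The simulated coordinates, the statistic, and the permutation null are simplifications, not the paper's published method.

```python
# Toy sketch of testing for 3-D clustering of missense mutations in a protein
# structure: compare the mean distance between mutated residues' C-alpha
# coordinates with the same statistic for randomly drawn residues.
# Coordinates here are simulated, not taken from the PDB.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
coords = rng.normal(size=(300, 3)) * 15.0          # fake C-alpha coordinates (angstroms)
mutated = rng.choice(300, size=12, replace=False)  # residues carrying missense mutations

obs = pdist(coords[mutated]).mean()                # observed mean pairwise distance

null = np.array([pdist(coords[rng.choice(300, size=12, replace=False)]).mean()
                 for _ in range(2000)])
p_value = (null <= obs).mean()                     # smaller distances = tighter clustering
print(f"observed mean distance {obs:.1f} A, permutation p = {p_value:.3f}")
```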
Characterization of the CPAP-treated patient population in Catalonia
Gavaldá, Ricard; Teixidó, Ivan; Woehrle, Holger; Rué, Montserrat; Solsona, Francesc; Escarrabill, Joan; Colls, Cristina; García-Altés, Anna; de Batlle, Jordi; Sánchez de-la-Torre, Manuel
2017-01-01
There are different phenotypes of obstructive sleep apnoea (OSA), many of which have not been characterised. Identification of these different phenotypes is important in defining prognosis and guiding the therapeutic strategy. The aim of this study was to characterise the entire population of continuous positive airway pressure (CPAP)-treated patients in Catalonia and identify specific patient profiles using cluster analysis. A total of 72,217 CPAP-treated patients who contacted the Catalan Health System (CatSalut) during the years 2012 and 2013 were included. Six clusters were identified, classified as “Neoplastic patients” (Cluster 1, 10.4%), “Metabolic syndrome patients” (Cluster 2, 27.7%), “Asthmatic patients” (Cluster 3, 5.8%), “Musculoskeletal and joint disorder patients” (Cluster 4, 10.3%), “Patients with few comorbidities” (Cluster 5, 35.6%) and “Oldest and cardiac disease patients” (Cluster 6, 10.2%). Healthcare facility use and mortality were highest in patients from Clusters 1 and 6. Conversely, patients in Clusters 2 and 4 had low morbidity, mortality and healthcare resource use. Our findings highlight the heterogeneity of CPAP-treated patients, and suggest that OSA is associated with a different prognosis in the clusters identified. These results suggest the need for a comprehensive and individualised approach to CPAP treatment of OSA. PMID:28934303
Lee, Yii-Ching; Huang, Shian-Chang; Huang, Chih-Hsuan; Wu, Hsin-Hung
2016-01-01
This study uses kernel k-means cluster analysis to identify medical staff with high burnout. The data, collected in October and November 2014, come from the emotional exhaustion dimension of the Chinese version of the Safety Attitudes Questionnaire in a regional teaching hospital in Taiwan. There were 680 effective questionnaires, covering staff of all types, including physicians, nurses, technicians, pharmacists, medical administrators, and respiratory therapists. The results show that 8 clusters are generated by the kernel k-means method. Employees in clusters 1, 4, and 5 are in relatively good condition, whereas employees in clusters 2, 3, 6, 7, and 8 need to be closely monitored from time to time because they have a relatively higher degree of burnout. When employees with a higher degree of burnout are identified, the hospital management can take actions to improve resilience, reduce potential medical errors, and, eventually, enhance patient safety. This study also suggests that the hospital management needs to keep track of medical staff members' fatigue and provide timely assistance for burnout recovery through employee assistance programs, mindfulness-based stress reduction programs, positivity currency buildup, and forming appreciative inquiry groups. PMID:27895218
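Since the abstract names kernel k-means explicitly, the sketch below gives a small self-contained implementation operating on a precomputed RBF kernel matrix; the simulated survey scores, the kernel bandwidth, and the initialization scheme are assumptions rather than the paper's settings.

```python
# Minimal numpy implementation of kernel k-means (RBF kernel). The survey
# items here are simulated; the hospital's questionnaire data and the paper's
# exact kernel settings are not reproduced.
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Cluster points given a precomputed kernel matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.empty((n, n_clusters))
        for c in range(n_clusters):
            mask = labels == c
            if not mask.any():                       # keep empty clusters alive
                mask = rng.integers(n, size=1) == np.arange(n)
            Kc = K[:, mask]
            # squared distance to the cluster mean in kernel feature space
            dist[:, c] = (np.diag(K) - 2 * Kc.mean(axis=1)
                          + K[np.ix_(mask, mask)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (300, 9)), rng.normal(3, 1, (380, 9))])  # fake item scores
gamma = 1.0 / X.shape[1]
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                              # RBF kernel matrix
labels = kernel_kmeans(K, n_clusters=8)
print("cluster sizes:", np.bincount(labels, minlength=8))
```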
NASA Astrophysics Data System (ADS)
Adams, John E.; Stratt, Richard M.
1990-08-01
For the instantaneous normal mode analysis method to be generally useful in studying the dynamics of clusters of arbitrary size, it ought to yield values of atomic self-diffusion constants which agree with those derived directly from molecular dynamics calculations. The present study proposes that such agreement indeed can be obtained if a sufficiently sophisticated formalism for computing the diffusion constant is adopted, such as the one suggested by Madan, Keyes, and Seeley [J. Chem. Phys. 92, 7565 (1990)]. In order to implement this particular formalism, however, we have found it necessary to pay particular attention to the removal from the computed spectra of spurious rotational contributions. The utility of the formalism is demonstrated via a study of small argon clusters, for which numerous results generated using other approaches are available. We find the same temperature dependence of the Ar13 self-diffusion constant that Beck and Marchioro [J. Chem. Phys. 93, 1347 (1990)] do from their direct calculation of the velocity autocorrelation function: The diffusion constant rises quickly from zero to a liquid-like value as the cluster goes through (the finite-size equivalent of) the melting transition.
Density-cluster NMA: A new protein decomposition technique for coarse-grained normal mode analysis.
Demerdash, Omar N A; Mitchell, Julie C
2012-07-01
Normal mode analysis has emerged as a useful technique for investigating protein motions on long time scales. This is largely due to the advent of coarse-graining techniques, particularly Hooke's Law-based potentials and the rotational-translational blocking (RTB) method for reducing the size of the force-constant matrix, the Hessian. Here we present a new method for domain decomposition for use in RTB that is based on hierarchical clustering of atomic density gradients, which we call Density-Cluster RTB (DCRTB). The method reduces the number of degrees of freedom by 85-90% compared with the standard blocking approaches. We compared the normal modes from DCRTB against standard RTB using 1-4 residues in sequence in a single block, with good agreement between the two methods. We also show that Density-Cluster RTB and standard RTB perform well in capturing the experimentally determined direction of conformational change. Significantly, we report superior correlation of DCRTB with B-factors compared with 1-4 residue per block RTB. Finally, we show significant reduction in computational cost for Density-Cluster RTB that is nearly 100-fold for many examples. Copyright © 2012 Wiley Periodicals, Inc.
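A generic sketch of the decomposition step, hierarchically clustering per-atom feature vectors (here, fake coordinates concatenated with fake density-gradient components) and cutting the tree into blocks; the feature construction and block count are assumptions, not the published DCRTB recipe.

```python
# Generic sketch: hierarchically cluster per-atom feature vectors and cut the
# tree into blocks for a blocked normal-mode analysis. The features and the
# number of blocks are assumptions, not the published DCRTB procedure.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
positions = rng.normal(size=(500, 3)) * 10.0        # fake atomic coordinates
gradients = rng.normal(size=(500, 3))               # fake density-gradient vectors

features = np.hstack([positions, gradients])
Z = linkage(features, method="average")
blocks = fcluster(Z, t=40, criterion="maxclust")     # ~40 rigid blocks instead of 500 atoms
print("number of blocks:", blocks.max(), "largest block size:", np.bincount(blocks).max())
```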
Krawczyk, Christopher; Gradziel, Pat; Geraghty, Estella M.
2014-01-01
Objectives. We used a geographic information system and cluster analyses to determine locations in need of enhanced Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) Program services. Methods. We linked documented births in the 2010 California Birth Statistical Master File with the 2010 data from the WIC Integrated Statewide Information System. Analyses focused on the density of pregnant women who were eligible for but not receiving WIC services in California’s 7049 census tracts. We used incremental spatial autocorrelation and hot spot analyses to identify clusters of WIC-eligible nonparticipants. Results. We detected clusters of census tracts with higher-than-expected densities, compared with the state mean density of WIC-eligible nonparticipants, in 21 of 58 (36.2%) California counties (P < .05). In subsequent county-level analyses, we located neighborhood-level clusters of higher-than-expected densities of eligible nonparticipants in Sacramento, San Francisco, Fresno, and Los Angeles Counties (P < .05). Conclusions. Hot spot analyses provided a rigorous and objective approach to determine the locations of statistically significant clusters of WIC-eligible nonparticipants. Results helped inform WIC program and funding decisions, including the opening of new WIC centers, and offered a novel approach for targeting public health services. PMID:24354821
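Hot spot analysis of this kind is commonly based on the Getis-Ord Gi* statistic; the plain-numpy sketch below computes Gi* z-scores for simulated tract densities under a toy neighbor structure. In practice the spatial weights would come from tract geometries (for example via PySAL) rather than being generated at random.

```python
# Plain-numpy sketch of the Getis-Ord Gi* hot spot statistic. Tract densities
# and the binary contiguity weights are simulated stand-ins.
import numpy as np

def gi_star(x, W):
    """Gi* z-scores; W is an n x n spatial weights matrix that includes self-weights."""
    n = x.size
    xbar = x.mean()
    s = np.sqrt((x ** 2).mean() - xbar ** 2)
    wsum = W.sum(axis=1)
    num = W @ x - xbar * wsum
    den = s * np.sqrt((n * (W ** 2).sum(axis=1) - wsum ** 2) / (n - 1))
    return num / den

rng = np.random.default_rng(8)
n = 200
density = rng.gamma(2.0, 10.0, size=n)               # eligible nonparticipants per tract
W = (rng.random((n, n)) < 0.03).astype(float)        # toy neighbor structure
W = np.maximum(W, W.T)
np.fill_diagonal(W, 1.0)                              # Gi* includes the focal tract itself

z = gi_star(density, W)
print("hot spot tracts (z > 1.96):", np.flatnonzero(z > 1.96)[:10])
```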
NASA Technical Reports Server (NTRS)
Gurgiolo, Chris; Vinas, Adolfo F.
2009-01-01
This paper presents a spherical harmonic analysis of the plasma velocity distribution function using high-angular, energy, and time resolution Cluster data obtained from the PEACE spectrometer instrument to demonstrate how this analysis models the particle distribution function and its moments and anisotropies. The results show that spherical harmonic analysis produced a robust physical representation model of the velocity distribution function, resolving the main features of the measured distributions. From the spherical harmonic analysis, a minimum set of nine spectral coefficients was obtained from which the moment (up to the heat flux), anisotropy, and asymmetry calculations of the velocity distribution function were obtained. The spherical harmonic method provides a potentially effective "compression" technique that can be easily carried out onboard a spacecraft to determine the moments and anisotropies of the particle velocity distribution function for any species. These calculations were implemented using three different approaches, namely, the standard traditional integration, the spherical harmonic (SPH) spectral coefficients integration, and the singular value decomposition (SVD) on the spherical harmonic methods. A comparison among the various methods shows that both SPH and SVD approaches provide remarkable agreement with the standard moment integration method.
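The sketch below illustrates the general fitting step: build a design matrix of spherical harmonics at the sampled look directions and solve for the coefficients by least squares (np.linalg.lstsq is SVD-based). An expansion up to degree l = 2 has 1 + 3 + 5 = 9 coefficients, matching the minimum set quoted above; the synthetic distribution and angular sampling are assumptions, not PEACE data.

```python
# Sketch of fitting spherical-harmonic coefficients to angular samples of a
# distribution function by SVD-based linear least squares. Synthetic data only.
import numpy as np
from scipy.special import sph_harm   # sph_harm(m, l, azimuth, polar) in SciPy's convention

rng = np.random.default_rng(9)
n_dirs = 88                                   # look directions (e.g., detector bins)
az = rng.uniform(0, 2 * np.pi, n_dirs)        # azimuthal angles
pol = np.arccos(rng.uniform(-1, 1, n_dirs))   # polar angles, uniform on the sphere

# Synthetic distribution: isotropic core plus a field-aligned anisotropy.
f = 1.0 + 0.4 * np.cos(pol) ** 2 + 0.05 * rng.normal(size=n_dirs)

lm = [(l, m) for l in range(3) for m in range(-l, l + 1)]          # 9 (l, m) pairs
A = np.column_stack([sph_harm(m, l, az, pol) for l, m in lm])      # design matrix
coeffs, *_ = np.linalg.lstsq(A, f.astype(complex), rcond=None)

for (l, m), c in zip(lm, coeffs):
    print(f"l={l} m={m:+d}  |c|={abs(c):.3f}")
```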
Nevo, Daniel; Zucker, David M.; Tamimi, Rulla M.; Wang, Molin
2017-01-01
A common paradigm in dealing with heterogeneity across tumors in cancer analysis is to cluster the tumors into subtypes using marker data on the tumor, and then to analyze each of the clusters separately. A more specific target is to investigate the association between risk factors and specific subtypes and to use the results for personalized preventive treatment. This task is usually carried out in two steps: clustering and risk factor assessment. However, two sources of measurement error arise in these problems. The first is the measurement error in the biomarker values. The second is the misclassification error when assigning observations to clusters. We consider the case with a specified set of relevant markers and propose a unified single-likelihood approach for normally distributed biomarkers. As an alternative, we consider a two-step procedure with the tumor type misclassification error taken into account in the second-step risk factor analysis. We describe our method for binary data and also for survival analysis data using a modified version of the Cox model. We present asymptotic theory for the proposed estimators. Simulation results indicate that our methods significantly lower the bias with a small price being paid in terms of variance. We present an analysis of breast cancer data from the Nurses’ Health Study to demonstrate the utility of our method. PMID:27558651
Optimization Techniques for Clustering, Connectivity, and Flow Problems in Complex Networks
2012-10-01
discrete optimization and for analysis of performance of algorithm portfolios; introducing a metaheuristic framework of variable objective search that... The results of empirical evaluation of the proposed algorithm are also included. Theoretical analysis of heuristics and designing new metaheuristic... analysis of heuristics for inapproximable problems and designing new metaheuristic approaches for the problems of interest; (IV) Developing new models